New Collaborative Research Center: Easier and Reusable Data Analysis for the Natural Sciences
The recently established Collaborative Research Center “FONDA – Foundations of Workflows for Large Scale Scientific Data Analysis” researches new methods for analyzing large data sets.
Such data emerge from experiments in all fields of the modern natural sciences, and their timely analysis requires the use of complex computational infrastructures that are difficult to program. The main aim of the CRC is to reduce the effort required for developing such programs. To this end, it has gathered a highly interdisciplinary team of researchers in Computer Science, the Life Sciences, Material Sciences, and Remote Sensing, spanning all universities of Berlin and a number of further research institutions from Berlin and Brandenburg.
One exemplary problem FONDA will study concerns workflows for the analysis of very large satellite image sets. Prof. Hostert, an expert in remote sensing at Humboldt-Universität zu Berlin, states: “Modern satellites continuously acquire high-resolution images across the entire globe. We analyze large series of such images to detect, for instance, hot spots of deforestation or desertification around the world, by programming complex workflows composed of multiple steps of image preprocessing, registration, filtering, and classification using methods from Machine Learning.” However, running these workflows for larger geographic regions, like countries or continents, is only feasible on large compute clusters, which adds a further layer of complexity to the code. He teamed up with Prof. Ulf Leser, also at Humboldt-Universität zu Berlin, to research methods for reducing the complexity of programming these workflows. Prof. Leser, who is also the speaker of FONDA, adds: “A unique feature of FONDA is this focus on reducing development times. We observed that scientists often need weeks or months just to adapt a workflow intended for a single machine to be also executable on a compute cluster. This is a much stronger impediment to scientific advances than the actual runtime of the workflows.”
In another project, Prof. Kerstin Ritter from Charité Berlin studies related problems occurring in biomedical image analysis. “We work on the prediction of Alzheimer’s disease from brain scans, which is a highly exploratory research field requiring interactive methods for data analysis – while at the same time having to consider very large image collections for training modern Machine Learning methods,” Prof. Ritter explains. “Our image analysis workflows are continuously adapted to new brain regions, new scanning devices, or new patient cohorts. This currently involves a lot of time-consuming low-level programming.” Together with Dr. Dagmar Kainmüller from MDC Berlin, she aims within FONDA to develop a novel, intuitive programming language for specifying such image analysis workflows. “Our dream is to enable medical consultants or researchers who are not experts in image computing to easily and interactively adapt a workflow to their data and their needs while it is already running on a large compute cluster,” adds Dr. Kainmüller.
Clearly, such problems can only be approached through close interaction between computer scientists and researchers from the natural sciences. Accordingly, 50% of the projects within FONDA are made up of such teams, studying not only image analysis problems but also workflows for genome data analysis or for material science. The other 50% are pure computer science projects, ranging from theoretical investigations of the properties of workflow systems to distributed file systems and new scheduling algorithms. Prof. Matthias Weidlich, deputy speaker of the CRC, states: “One of our most ambitious research projects,” headed by Prof. Christoph Koch (a physicist) and Prof. Peter Eisert (a computer scientist), “is concerned with real-time analysis of streams of high-density measurements from electron microscopes. Currently, such analysis can only be performed offline, with a strict separation of measurement and analysis phases. We aim at developing new methods for removing this restriction, which would allow adapting the course of a measurement in real time based on ongoing observations.” Addressing such demanding problems requires input from many more experts, such as Prof. Volker Markl from Technische Universität Berlin, who is an expert in workflows over streaming data, or Prof. Tilmann Rabl from the Hasso-Plattner-Institute at Universität Potsdam, an expert in distributed systems.
Berlin is the ideal place to pursue such research. Its high density of universities and research institutes covering all scientific disciplines forms the basis on which collaborative projects like FONDA become possible. Unifying these strengths is also the focus of the Berlin University Alliance (BUA), whose members are all participating in FONDA. “We consider FONDA with its integrative approach of cross-institutional and cross-discipline research, its seamless resource sharing, and its focus on high-profile reproducible science almost as a blueprint for the BUA,” says Prof. Peter Frensch, vice president for research of Humboldt-Universität zu Berlin. “Of course, we envision that the success of FONDA will foster other cross-university initiatives.”
Prof. Dr. Ulf Leser
Institut für Informatik
Humboldt-Universität zu Berlin
FONDA is a joint research project of Humboldt-Universität zu Berlin, Technische Universität Berlin, Freie Universität Berlin, Universität Potsdam, Charité Berlin, Max-Delbrück-Center for Molecular Medicine, and the Zuse-Institut Berlin.