Researching rare diseases using artificially generated data

Rare diseases affect millions in Germany. The available data are often insufficient to allow early identification, research, or treatment. Experts in computer science and medicine have now laid the foundations for the development of a public database at a sandpit event funded by Wübben Stiftung Wissenschaft.

Arrhythmogenic right ventricular cardiomyopathy (ARVC), an inherited disease of the heart muscle, is one of around 7,000 rare diseases that affect only one in several thousand people. In ARVC, fat and connective tissue grow in the right ventricle in place of muscle cells. This leads to arrhythmia, which carries the risk of sudden cardiac death.

“Patients with rare diseases like ARVC are often not given a correct diagnosis for years. It usually takes five to seven years for their illness to be identified,” says Jannik Schaaf from Goethe University Frankfurt/University Hospital Frankfurt. As a medical information scientist, he is committed to improving data availability so diseases that are hard to identify can be diagnosed faster and studied. Treatments are currently only available for around five to ten percent of these diseases. “Usually, patients are just given treatment to alleviate their symptoms, with the aim of improving their quality of life.”

Most rare diseases are genetic in origin and occur in childhood, while around 20 percent appear later in life, for instance as a result of a viral infection. Taken together, the people affected by these diseases are by no means rare: there are around four million such patients in Germany alone. “These case numbers are approaching those of widespread diseases and are placing a great strain on the health system,” says Schaaf. “It is essential that we press ahead with research. And an important basis for this research is data.”

Synthetic data can avoid legal pitfalls

In June 2025, 13 medical, computer science, and economics experts joined patient representatives at a Wübben Stiftung Wissenschaft sandpit workshop to develop a digital platform called SHARE that will make data on rare diseases available to researchers worldwide. SHARE, which stands for “Synthetic Health dAta Repository,” uses synthetic data instead of real patient data.

Synthetic data are generated from real data with the help of AI, and medical staff check and improve the output. This creates artificial datasets that are statistically very similar to real patient data – in terms of symptoms, lab test results, family history, etc. – but do not allow direct inferences to be drawn about individuals. Researchers can use the data to develop AI models to help diagnose rare diseases, for example, or to develop treatments. “With synthetic data you can generate artificial patients with different characteristics and test how they respond to a drug,” says Schaaf.
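The basic principle can be illustrated with a minimal Python sketch. It is not the method used by SHARE, which the article describes only as AI-based with review by medical staff. As a deliberately simple stand-in, the sketch estimates the mean and spread of each column in a small table and then samples new, artificial records from those statistics. The records, the function names `fit_marginals` and `sample_synthetic`, and the choice of independent normal distributions are all assumptions made for illustration; a realistic generator would also need to preserve the relationships between columns.

```python
import random
import statistics

# Toy "real" records: (age in years, lab value).
# These numbers are invented for illustration and are not patient data.
real_records = [
    (34, 1.2), (41, 1.5), (29, 1.1), (52, 1.9), (47, 1.7), (38, 1.4),
]


def fit_marginals(records):
    """Estimate the mean and sample standard deviation of each column."""
    columns = list(zip(*records))
    return [(statistics.mean(col), statistics.stdev(col)) for col in columns]


def sample_synthetic(params, n, seed=0):
    """Draw n artificial records, treating each column as an independent
    normal distribution (a simplifying assumption, see the note above)."""
    rng = random.Random(seed)
    return [
        tuple(rng.gauss(mu, sigma) for mu, sigma in params)
        for _ in range(n)
    ]


params = fit_marginals(real_records)
synthetic_records = sample_synthetic(params, n=1000)

# Compare the average age in the real and the synthetic cohort.
real_mean_age = statistics.mean(r[0] for r in real_records)
synthetic_mean_age = statistics.mean(r[0] for r in synthetic_records)
print(f"real mean age: {real_mean_age:.1f}")
print(f"synthetic mean age: {synthetic_mean_age:.1f}")
```

The synthetic cohort reproduces the column averages of the original table without copying any individual row, which is the property the paragraph above describes. Because the columns are sampled independently here, correlations such as the one between age and lab value are lost; that is one reason real systems rely on more capable models and expert review.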

Obtaining data on rare diseases as a researcher is often a laborious process. Although there are now data integration centers at all university hospitals, provisioning the data takes a long time because of data protection regulations, and there is often a lack of standardization. By contrast, the artificial data on which SHARE is based will be publicly accessible, anonymized, and standardized for ease of use.

Synthetic data are particularly important for generating statistically relevant cohorts for rare diseases with low case numbers. Patient data from different hospitals and EU countries can be added without the need to spend considerable amounts of time on requests and data processing for every use case. “Synthetic data are ‘enablers’ of research and development and can dramatically speed up our understanding of rare diseases as well as their diagnosis and treatment,” says Benedikt Langenberger of the Digital Health Cluster at the Hasso Plattner Institute in Potsdam, who is helping to develop SHARE.

From ideas to concrete solutions in just three days

Wübben Stiftung Wissenschaft’s interdisciplinary sandpit format offered a chance to think through the SHARE idea from many different angles and produce a design. “Where else do you get the opportunity to concentrate on a topic for three days without distractions and with a wide range of different experts from numerous countries?” Jannik Schaaf asks.

Day one kicked off with two keynote speeches on the topic. Health economist and medical information scientist Andreas Goldschmidt brought the participants up to speed with the current situation. Ruth Biller from self-help association ARVC-Selbsthilfe, whose daughter died suddenly from ARVC at the age of 14, contributed the perspective of patients and their families. “I am campaigning so that other families can be spared our fate,” she says. “Good data are incredibly important because without them and without patient registers for rare diseases, you can’t have evidence-based medicine.”

Following the keynote speeches, the participants took part in a brainstorming session to identify the main challenges involved in creating a functioning SHARE platform. They then ranked these challenges in order of priority during a World Café session in small groups at separate tables. At the end of this process, four central challenges emerged: defining the aims of SHARE, ensuring data quality, ensuring a user focus, and standardizing the data.

On the second day of the workshop, the participants used interactive design thinking methods to formulate ideas for solutions, flesh them out, and rank them in order of priority. The result was a concrete roadmap for the next steps. In addition to a technical paper, which has since been published, the main aim is to develop a prototype of the SHARE data repository. The idea is to involve a broad network of experts from research, clinical practice, and ethics, and to consult with patient representatives. Initially, SHARE will only contain data relating to a few rare diseases, including ARVC.

Jannik Schaaf and his team face a mammoth task. “In everyday life, you wouldn’t dare tackle such a complex topic because it’s simply too involved but, thanks to the sandpit, we managed to lay a solid foundation,” says Schaaf. As a medical information scientist, he has already worked on the development of an AI model that helps family physicians reach diagnoses in cases with non-specific symptoms, as part of a research project funded by Germany’s Federal Ministry of Health. A prototype of that model has already been produced.

A core group from the sandpit is now applying for an EU grant to develop a prototype of the SHARE repository. Once all the technical and legal issues have been resolved, they could, for example, set up a company or an alternative organization to maintain the platform and obtain valuable basic data in the long term. “We need to motivate researchers to make their data available so we can use that data to generate synthetic datasets,” says Schaaf. He hopes that a large community will take shape whose members recognize the value of the initiative and upload data of their own accord. “If we succeed in making synthetic data commonplace, there will be noticeable benefits for patients.”

Jannik Schaaf is Professor of Digital Health, focusing on chronic and rare diseases, and Deputy Director of the Institute for Medical Informatics at Goethe University Frankfurt/University Hospital Frankfurt. He heads the research group on Digital Health & Artificial Intelligence.
