Recap: CAFE University’s Introduction to Synthetic Health Data

December 10, 2025

Thank you to all who joined CAFE University’s Introduction to Synthetic Health Data webinar with Sherya Nalluri, a CAFE Data Management Team member and a programmer at the Harvard T.H. Chan School of Public Health! If you missed it or would like to review the key points, here’s a recap, with links to relevant moments in the recording.

The Challenge of Real-World Health Data

Real-world health data is ideal for analyzing trends, guiding public health strategies, and reducing care disparities. However, working with it comes with significant challenges. Strict privacy regulations create high barriers to data access, often requiring Institutional Review Board approvals and data use agreements. The burden of obtaining access can significantly slow research.

What is Synthetic Health Data?

Synthetic data offers a solution to many of these challenges. Created using generative AI techniques, synthetic data mirrors the features and patterns of real-world health information while protecting patient privacy.

Synthetic data can reduce the time and cost of data access, simplify the regulatory process, and allow data and code to be shared more openly. This can enable more reproducible research and expand collaboration across institutions.

Types of Synthetic Health Data:

  1. Fully Synthetic: Generated entirely by a model; the dataset contains no real-world records.
  2. Partially Synthetic: Real data with sensitive values replaced by synthetic ones.
  3. Hybrid Synthetic: A blend of real and synthetic data.
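
To make the three categories concrete, here is a minimal Python sketch. It is an illustration only, not how any of the resources described in this recap are built: the toy records, field names, placeholder ZIP codes, and the simple random sampling that stands in for a generative model are all assumptions made for this example.

```python
import random

# Toy "real" records, invented for illustration only (not from any dataset).
REAL_RECORDS = [
    {"age": 67, "sex": "F", "zip": "02115", "diagnosis": "asthma"},
    {"age": 72, "sex": "M", "zip": "02139", "diagnosis": "copd"},
    {"age": 58, "sex": "F", "zip": "02118", "diagnosis": "asthma"},
]

# Hypothetical value pools that stand in for a fitted generative model.
DIAGNOSES = ["asthma", "copd", "heat_stroke"]
ZIPS = ["00001", "00002", "00003"]  # placeholder codes, not real ZIPs


def fully_synthetic(n, rng):
    """Every field is sampled; no real row appears in the output."""
    return [
        {
            "age": rng.randint(50, 90),
            "sex": rng.choice(["F", "M"]),
            "zip": rng.choice(ZIPS),
            "diagnosis": rng.choice(DIAGNOSES),
            "source": "synthetic",
        }
        for _ in range(n)
    ]


def partially_synthetic(records, rng, sensitive=("zip", "diagnosis")):
    """Keep real rows but overwrite the sensitive fields with sampled values."""
    pools = {"zip": ZIPS, "diagnosis": DIAGNOSES}
    out = []
    for rec in records:
        new = dict(rec)  # copy so the input rows are not modified
        for field in sensitive:
            new[field] = rng.choice(pools[field])
        new["source"] = "partially_synthetic"
        out.append(new)
    return out


def hybrid(records, n_synthetic, rng):
    """Blend real rows with fully synthetic rows in one dataset."""
    real = [dict(rec, source="real") for rec in records]
    combined = real + fully_synthetic(n_synthetic, rng)
    rng.shuffle(combined)
    return combined


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so runs are repeatable
    print(fully_synthetic(2, rng))
    print(partially_synthetic(REAL_RECORDS, rng))
    print(hybrid(REAL_RECORDS, 2, rng))
```

In practice, a generator would typically learn its distributions from source data rather than use hand-picked pools like the ones assumed here.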

Three Major Synthetic Health Data Resources:

  1. The CMS Data Entrepreneurs Synthetic Public Use File (DE-SynPUF): Provides fully synthetic Medicare claims data covering 2008-2010. While not suitable for inferential research, it is useful for software development and training researchers on the complexity of claims data.
  2. Synthetic Mass: Generated with the open-source platform Synthea, this dataset offers fully synthetic electronic health records for one million simulated Massachusetts residents from 2019-2022. It includes 18 different files covering patient demographics, clinical encounters, diagnoses, prescriptions, and provider information.
  3. The AHRQ Synthetic Health Database: A partially synthetic resource covering all Medicare, Medicaid, and commercial insurance payers for 2016. 

Key Takeaways: 

Sherya emphasized that:

  • While synthetic health data offers tremendous opportunities, important caveats remain. Bias in the source data can be replicated or even amplified in synthetic datasets, potentially perpetuating health inequalities.

  • Synthetic data isn’t a replacement for real-world data, but rather a complementary tool that expands research possibilities while maintaining privacy protections. As NIH data sharing requirements expand, these resources will play a growing role in enabling researchers to develop methods, train students, and facilitate collaboration in extreme weather and health research.
