Frequently Asked Questions - synthetichealth/synthea GitHub Wiki

I am not a programmer, how do I get data?
How does this differ from real data and why should I trust it?
When is it better to get real or deidentified data?
Can Synthea produce anonymized data?
How can I use Synthea to run "what-if" scenarios?
Does Synthea have data about ___ ?
Does Synthea have data in format ___ ?
Does Synthea have data for region ___ ?
How can I generate data for the entire US?
How do I use Synthea to produce data for my FHIR Implementation Guide?
How do I get only data for a certain condition of interest?
I created a module, why isn't the condition showing up in output?

I am not a programmer, how do I get data?

The easiest way to get sample data is to use our pre-generated data. A number of general-purpose datasets are available for download at https://synthea.mitre.org/downloads .

Even if you have little or no development experience, the Basic Setup and Running of Synthea is intended to be easy to follow. Customization options are limited, but if you don't have very specific needs, this is a good place to get started.

How does this differ from real data and why should I trust it?

Data generated by Synthea is synthetic; realistic, but not real. It is built up based on models of disease progression and treatment. No individual-level data was used in creating it, so there is no risk of any real individual's data being present.

This means that every aspect of Synthea data is fully explainable and reproducible; however, it also means that it is limited. Statistical accuracy of Synthea data can vary significantly depending on the condition of interest, and Synthea data has limited diversity. (For example, in the real world patients may have wildly different disease progressions for complex conditions, whereas in Synthea patients with the same condition will generally follow very similar disease progressions.) Synthea creates complete longitudinal patient records, with no missing or incorrect information, whereas real-world patient data can have gaps or inconsistencies for reasons such as data entry errors or patients changing healthcare providers.

When is it better to get real or deidentified data?

There are a couple of situations in which using Synthea data is likely not appropriate and you should use real or deidentified data instead.

Machine learning / Data-driven discovery: Synthea data is based on statistical models which are highly simplified compared to real diseases. Real-world data can reveal unexpected patterns, correlations, or insights that just do not exist in the synthetic data.
Rare conditions and complex scenarios: Synthea-generated data may not adequately capture rare diseases or complex patient scenarios, as it relies on statistical models and medical rules. Real data can provide more accurate insights into the prevalence, characteristics, and relationships among various factors associated with those conditions.

In situations like these, you can start by using Synthea data to get familiar with how concepts are represented, but for real insights you will need real data.

Can Synthea produce anonymized data?

No. Synthea generates new data using models of disease progression and treatment. It cannot take an existing dataset and anonymize it.

How can I use Synthea to run "what-if" scenarios?

Synthea can be used to model impacts of various scenarios. The basic process is as follows:

Run Synthea once as a "control" group
Modify Synthea with the intervention of interest
Re-run the same population with the modifications to see the impact. See Recreating a Dataset for specifics on how to re-run the same population.
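
As a rough sketch of the control/intervention process above, the following Python builds two Synthea command lines that share the same seeds and reference date, so both runs start from the same population. The flags used (-s, -cs, -r, -p, -d, and the --exporter.baseDirectory override) are assumptions to be checked against the Recreating a Dataset page; the seed values, output directories, and the ./intervention_modules directory are hypothetical.

```python
import shlex


def build_run(output_dir, seed, clinician_seed, reference_date, population,
              state, modules_dir=None):
    """Return the argument list for one Synthea run (flags are assumptions)."""
    cmd = [
        "./run_synthea",
        "-s", str(seed),             # population seed
        "-cs", str(clinician_seed),  # clinician seed
        "-r", reference_date,        # reference date, YYYYMMDD
        "-p", str(population),       # number of patients
    ]
    if modules_dir is not None:
        # Hypothetical directory holding the modified (intervention) modules.
        cmd += ["-d", modules_dir]
    # Write each run to its own directory so the outputs can be compared.
    cmd.append(f"--exporter.baseDirectory={output_dir}")
    cmd.append(state)
    return cmd


# Settings shared by both runs; the values are placeholders.
shared = dict(seed=12345, clinician_seed=12345, reference_date="20240101",
              population=1000, state="Massachusetts")

control = build_run("./output_control/", **shared)
intervention = build_run("./output_intervention/",
                         modules_dir="./intervention_modules", **shared)

# Print the commands rather than executing them.
print(shlex.join(control))
print(shlex.join(intervention))
```

Only the module directory and output location differ between the two commands, which is what makes the resulting populations comparable.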

This approach is best used when the low-level impacts of an intervention are well-defined and you want to see how those impacts bubble up at the population level.

Does Synthea have data about ___ ?

Synthea includes data representing the progression and treatment of the most common diseases and conditions, but by no means does Synthea include every known disease.

Synthea currently has over 90 different modules which have expanded beyond the original "Two Top Tens" that the project started with (see the table below).

Top 10 Reasons Patients Visit PCP	Top 10 Years of Life Lost
Routine infant/child health check	Ischemic Heart Disease
Essential Hypertension	Lung Cancer
Diabetes Mellitus	Alzheimer’s Disease
Normal Pregnancy	COPD
Respiratory Infections (Pharyngitis, Bronchitis, Sinusitis)	Cerebrovascular Disease
General Adult Medical Examination	Road Injuries
Disorders of Lipoid Metabolism	Self-Harm
Ear Infections (Otitis Media)	Diabetes Mellitus
Asthma	Colorectal Cancer
Urinary Tract Infections	Drug Use Disorders (limited to Opioids)

The best place to start is by reviewing the modules available in the Module Builder: https://synthetichealth.github.io/module-builder/

Alternatively, you can run the concepts gradle task to create a CSV file at ./output/concepts.csv with all the concepts (codes from SNOMED, LOINC, RxNorm, etc) that Synthea can use to populate records:

$ ./gradlew concepts
Starting a Gradle Daemon (subsequent builds will be faster)

> Task :concepts
Performing an inventory of concepts...
Scanned 76 modules and 90 submodules.
Catalogued 1412 concepts in file `./output/concepts.csv`.
Done.

BUILD SUCCESSFUL in 8s

Note that this requires the Developer setup of Synthea - see Developer Setup and Running for setup instructions. On Windows run gradlew.bat concepts instead.
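
Once the concepts file exists, a few lines of Python can search it for a term of interest. This is a minimal sketch that makes no assumption about the column layout of concepts.csv: it keeps any row in which some cell contains the search term. The sample rows below are illustrative stand-ins, not actual output of the task.

```python
import csv
import io


def search_concepts(lines, term):
    """Return rows in which any cell contains `term`, ignoring case."""
    needle = term.lower()
    return [row for row in csv.reader(lines)
            if any(needle in cell.lower() for cell in row)]


# Illustrative in-memory stand-in for the concepts file.
sample = io.StringIO(
    "SNOMED-CT,44054006,Diabetes mellitus type 2\n"
    "LOINC,4548-4,Hemoglobin A1c/Hemoglobin.total in Blood\n"
)
for row in search_concepts(sample, "diabetes"):
    print(row)

# Against the real file, something like:
#   with open("output/concepts.csv", newline="") as f:
#       matches = search_concepts(f, "diabetes")
```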

Does Synthea have data in format ___ ?

Synthea by default supports the following formats:

HL7® FHIR®
- Versions: R4, STU3, DSTU2
C-CDA
CPCDS
BFD RIF
CSV
JSON
Text

If your format of interest is not already included, you have a few options:

Convert data from one of the other formats.
- If the concept of "translation from one format to another" is relevant to your use case, you may want to start with one of the industry standard formats: FHIR, C-CDA, etc.
- Otherwise, the CSV and JSON formats are Synthea-specific and intended to be broadly applicable to most use cases. The JSON export in particular includes additional pieces of data used in the simulation which are not exported in the industry standard formats (FHIR, C-CDA, etc).
Fork the repository and add your own exporter class.
New: write your exporter as a standalone project with the appropriate structure, and create a JAR to use with Synthea. See Custom Exporters for more information.

Does Synthea have data for region ___ ?

Synthea by default includes demographic data for the entire USA, and if not otherwise specified will generate synthetic patients within the state of Massachusetts.

The Synthea-International repository contains metadata and configuration files for international locations, and instructions on how to apply them to a developer instance of Synthea. Currently this contains primarily European countries, but we welcome community efforts to add support for any country.

Instructions for creating your own data files to support other regions are on the Other Areas page and its subpages.

How can I generate data for the entire US?

Synthea only supports generating data for one US state at a time.

You can specify a different state and/or city on the command line (see Basic Setup and Running). You can also replace the demographics (see Other Areas, Demographics for Other Areas, and related wiki articles).

If you want to run multiple states, we recommend writing a script that simply executes Synthea once for each state desired. For example, generating a population at 1/1000 scale of the entire US:

./run_synthea -p 5024 Alabama
./run_synthea -p 733 Alaska
./run_synthea -p 7151 Arizona
./run_synthea -p 3011 Arkansas
...
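
The same idea can be written as a small Python script. This is only a sketch: the per-state counts are the four shown above, and a full run would need an entry for every state. Passing each command as an argument list means state names containing spaces (for example "New York") need no extra quoting. By default the script only prints the commands; dry_run=False would execute them.

```python
import shlex
import subprocess

# Per-state patient counts at 1/1000 scale, copied from the commands above.
# A full run would list every state here.
STATE_POPULATIONS = {
    "Alabama": 5024,
    "Alaska": 733,
    "Arizona": 7151,
    "Arkansas": 3011,
}


def build_commands(populations):
    """Return one run_synthea argument list per state."""
    return [["./run_synthea", "-p", str(count), state]
            for state, count in populations.items()]


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True stops the script if any Synthea run fails.
            subprocess.run(cmd, check=True)


run_all(build_commands(STATE_POPULATIONS))
```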

How do I use Synthea to produce data for my FHIR Implementation Guide?

There is no automated way to "load an IG" into Synthea such that exported data will automatically conform to it. (Barring significant advances in knowledge representation or AI, we don't expect there ever will be. See notes below in Why can't data be automatically generated for any given IG? )

There are a few possible approaches to building data for arbitrary IGs with Synthea:

Modify the Java code of the Synthea engine and FHIR Exporter to support the needs of the IG of interest.
Post-process data generated by Synthea using other tools.
Use the "Flexible Exporter" aka Flexporter. This experimental feature is designed to allow users to define transformations that are applied to Synthea data. This is meant to support a number of use cases but making the data conform to an IG of interest is one of the most important. The Flexporter is most useful when the data is very close to what is needed, for example maybe only a couple of fields are missing or a new resource type needs to be added. See the Flexporter page for more details.

Note that the relevance of the data to the IG is also important to consider. For example, an IG with a clinical focus on a disease not already included in Synthea will also require an additional module to produce relevant data in the first place.

Why can't data be automatically generated for any given IG?

There is no automated way to "load an IG" into Synthea such that exported data will automatically conform to it. Creating data that is structurally conformant to any given IG is fairly straightforward, but creating data that is relevant, meaningful, and realistic is nearly impossible.

Consider a profile that makes a field required that is not already exported by Synthea, for example Procedure.complication.

Should this profile be applied to all instances of the Procedure resource or just some?
What value should be put into the field? In this case this field is a CodeableConcept and the profile may define a value set binding, but there's no way to know which values would make the resource and overall record look realistic, and which values would be irrelevant or contradictory. This is even more difficult for other data types such as numbers, dates, or strings.

For these reasons, we decided the Flexporter concept would require a human to create the mapping. Going forward we hope to improve both the user-friendliness and power of the tool to make it easier for people to generate data for their IG.

How do I get only data for a certain condition of interest?

There are three basic approaches to getting data focused on a single condition (or other criteria) of interest:

Use a Keep Module
- A Keep Module is a specialized type of Synthea module, used to filter patients out of the simulation. ("Only keep patients matching these criteria.") Keep Modules are constructed using the Module Builder, which allows for arbitrarily complex logic. E.g. "Keep patients with an active condition of diabetes", "Keep patients with an observed HbA1c > 6.5", "Keep patients with one of the following 3 allergies", etc.
Post-process the data to filter to only data of interest
- Synthea does not provide a built-in capability for filtering exported data, but scripting languages such as Python make it easy to read in patient data and filter (or otherwise post-process) it however you like.
Only run the module of interest.
- Do not do this. We mention this option only to point out that we strongly recommend against it for many reasons.
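
For the post-processing approach, a minimal Python sketch over the CSV export might look like the following. It assumes a conditions file with PATIENT and DESCRIPTION columns and a patients file with an Id column; these column names are assumptions to verify against your own output, and the in-memory rows are made-up stand-ins for the real files.

```python
import csv
import io


def patients_with_condition(conditions_file, description):
    """Return the set of patient IDs with a matching condition description."""
    wanted = description.lower()
    return {row["PATIENT"] for row in csv.DictReader(conditions_file)
            if wanted in row["DESCRIPTION"].lower()}


def filter_rows(csv_file, id_column, keep_ids):
    """Return only the rows whose id_column value is in keep_ids."""
    return [row for row in csv.DictReader(csv_file)
            if row[id_column] in keep_ids]


# Made-up stand-ins for the exported conditions and patients files.
conditions = io.StringIO(
    "PATIENT,DESCRIPTION\n"
    "p1,Diabetes\n"
    "p2,Asthma\n"
)
patients = io.StringIO("Id,LAST\np1,Example1\np2,Example2\n")

keep = patients_with_condition(conditions, "diabetes")
print(filter_rows(patients, "Id", keep))
```

The same filter_rows call can be repeated for the other exported files, using whichever column holds the patient ID in each.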

I created a module, why isn't the condition showing up in output?

First, make sure the module is actually loaded by confirming it is in the src/main/resources/modules directory, not a subdirectory, and that the filename shows up in the list of modules loaded at startup.

Assuming it is loaded and you still don't see the results in output, this is a common issue for first-time module creators. There are a few possible reasons the entry might not appear:

1. The Condition was never diagnosed

Consider a simple module that looks like this:

This Examplitis condition will never appear on a record because it is not diagnosed. Every ConditionOnset must either be between an Encounter and EncounterEnd state, or specify a "Target Encounter" where it will be diagnosed.
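
As a rough illustration of this rule, the sketch below represents a module of this shape as a Python dictionary mirroring the module JSON, and flags ConditionOnset states that can never be diagnosed. The check is a deliberately crude heuristic: it only reports an onset when it has no target_encounter and the module contains no Encounter state at all. The condition code "123", the state name Diagnosis_Encounter, and the exact set of Encounter fields are placeholders, not a complete or validated module.

```python
import copy


def undiagnosed_onsets(module):
    """Names of ConditionOnset states that cannot be diagnosed (crude check)."""
    states = module["states"]
    has_encounter = any(s.get("type") == "Encounter" for s in states.values())
    return [name for name, s in states.items()
            if s.get("type") == "ConditionOnset"
            and "target_encounter" not in s
            and not has_encounter]


# The onset happens outside any encounter and names no target encounter.
broken = {
    "name": "Examplitis",
    "states": {
        "Initial": {"type": "Initial", "direct_transition": "Examplitis"},
        "Examplitis": {
            "type": "ConditionOnset",
            # Placeholder code, not a real SNOMED-CT concept.
            "codes": [{"system": "SNOMED-CT", "code": "123",
                       "display": "Examplitis"}],
            "direct_transition": "Terminal",
        },
        "Terminal": {"type": "Terminal"},
    },
}

# One possible fix: point the onset at a later encounter that diagnoses it.
fixed = copy.deepcopy(broken)
fixed["states"]["Examplitis"]["target_encounter"] = "Diagnosis_Encounter"
fixed["states"]["Examplitis"]["direct_transition"] = "Diagnosis_Encounter"
fixed["states"]["Diagnosis_Encounter"] = {
    "type": "Encounter",
    "encounter_class": "ambulatory",
    "codes": [{"system": "SNOMED-CT", "code": "185345009",
               "display": "Encounter for symptom"}],
    "direct_transition": "Terminal",
}

print(undiagnosed_onsets(broken))
print(undiagnosed_onsets(fixed))
```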

2. The entry is old and not "active" so was filtered out

Synthea simulates and generates each patient's entire life history, but by default only active conditions and the last 10 years of history are kept in the output. This number is configurable via the exporter.years_of_history setting; to keep the entire record, change this to 0, though note that this will likely result in extremely large files. See the Common Configuration page for instructions on changing settings.

The other important factor to note is that patients start at the Initial state in every module at birth and immediately progress to the next state, so if the first state after Initial is the ConditionOnset, that condition is being onset instantly at birth. Example:

Unless that's intentional, you will likely want to use a combination of Guard and Delay states and distributed or conditional transitions to control the ages at which patients acquire the condition and how prevalent it is.

3. The entry is rare or not hit until later in life

Another possibility, if you are using guards, delays, or distributions to control when patients hit the state of interest, is that everything is working correctly but you haven't run enough patients to hit it yet. By default Synthea generates a population matching the age distribution of the selected region, so if your condition can only be acquired at a high age, most generated patients will not have gone through that state yet.

In cases like this, consider specifying a target age range with the -a command line argument (for example -a 18-55 will produce only patients in the age range 18-55), or by specifying a Keep Module defining the output criteria you are looking for.