Recreating a Dataset

Synthea can exactly recreate a dataset when the same input parameters are provided to the same version of Synthea. In other words, if the exact same input is provided to Synthea twice, it will produce the exact same output twice. However, certain parameters are not immediately apparent and must be provided explicitly to ensure that the input is consistent.

Requirements

The bare minimum required to recreate a dataset is to provide 3 arguments when running Synthea:

  1. -s seed - The random seed used to produce the population. If not set, this defaults to the result of System.currentTimeMillis(), i.e., the current system time in milliseconds since January 1, 1970 (UTC).
  2. -r date - The reference date used to calculate ages from. If not set, this defaults to the current system date. (For example, to generate a 50-year-old patient, the birthdate is selected by starting from the reference date and subtracting 50 years.)
  3. -cs seed - The clinician seed used to produce the
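As a rough illustration of why both the seed and the reference date must be pinned, the sketch below derives a birthdate from a seeded java.util.Random and a reference date. The class and method names are hypothetical and this is not Synthea's actual generation code. It only assumes that an age is drawn from a seeded generator and subtracted from the reference date, as described above.

```java
import java.time.LocalDate;
import java.util.Random;

class SeedAndReferenceDateSketch {
  // Hypothetical helper, not part of Synthea: draws an age from a generator
  // seeded with `seed`, then subtracts that age from the reference date.
  static LocalDate birthdate(long seed, LocalDate referenceDate) {
    Random random = new Random(seed);
    int ageInYears = random.nextInt(100); // an age between 0 and 99
    return referenceDate.minusYears(ageInYears);
  }

  public static void main(String[] args) {
    LocalDate reference = LocalDate.of(2020, 1, 1);
    // Same seed and same reference date: the two results are equal.
    System.out.println(
        birthdate(12345L, reference).equals(birthdate(12345L, reference)));
    // Same seed, reference date one year later: the birthdate shifts too.
    System.out.println(birthdate(12345L, reference.plusYears(1)));
    // Leaving the seed unset behaves like this: the value varies per run.
    System.out.println(birthdate(System.currentTimeMillis(), reference));
  }
}
```

If either input is left at its default, the result depends on when the program is run, which is why both must be supplied explicitly.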

The -e yyyyMMdd flag sets the end date of the simulation. If not set, this defaults to the current date. If -e is identical across two runs (along with the arguments above), the runs will be exactly identical. If it differs, the data may change in two ways:

  • The simulation that runs longer may have additional data not present in the shorter run.
  • Certain fields may differ between the two runs. For instance, a condition may be "active" in the shorter run and "resolved" in the longer run.
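The second effect can be sketched with a small, hypothetical model of a condition that has an onset date and an optional resolution date. The names below are illustrative only and do not correspond to Synthea classes; the point is that the reported status depends on where the simulation stops.

```java
import java.time.LocalDate;

class EndDateSketch {
  // Hypothetical model: `resolved` is null if the condition never resolves.
  static String statusAsOf(LocalDate onset, LocalDate resolved, LocalDate endDate) {
    if (endDate.isBefore(onset)) {
      return "not yet present"; // the run ends before the condition starts
    }
    if (resolved != null && !resolved.isAfter(endDate)) {
      return "resolved"; // the run reaches the resolution date
    }
    return "active"; // the run ends while the condition is ongoing
  }

  public static void main(String[] args) {
    LocalDate onset = LocalDate.of(2019, 3, 1);
    LocalDate resolved = LocalDate.of(2019, 9, 1);
    // Shorter run: ends between onset and resolution.
    System.out.println(statusAsOf(onset, resolved, LocalDate.of(2019, 6, 1)));
    // Longer run: ends after the resolution date.
    System.out.println(statusAsOf(onset, resolved, LocalDate.of(2020, 1, 1)));
  }
}
```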

In combination with the simulation end date, the exporter.years_of_history config setting may affect which data is exported. Again, if both settings match across runs, the results will be exactly identical. If they do not match, the output may be filtered such that different entries are present in each run. When an entry is present in both outputs, it will have the same UUID.
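A minimal sketch of this interaction follows, assuming the filter is a simple cutoff of "end date minus N years". That is a simplification for illustration; the real exporter may apply additional rules. The entry IDs and class name are hypothetical.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class YearsOfHistorySketch {
  // Returns the IDs of entries dated within the last `yearsOfHistory` years
  // before `endDate`. Assumed cutoff rule, not Synthea's actual filter.
  static List<String> exportedIds(
      Map<String, LocalDate> entries, LocalDate endDate, int yearsOfHistory) {
    LocalDate cutoff = endDate.minusYears(yearsOfHistory);
    List<String> kept = new ArrayList<>();
    for (Map.Entry<String, LocalDate> entry : entries.entrySet()) {
      LocalDate date = entry.getValue();
      if (!date.isBefore(cutoff) && !date.isAfter(endDate)) {
        kept.add(entry.getKey());
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    // LinkedHashMap keeps insertion order, so iteration is predictable.
    Map<String, LocalDate> entries = new LinkedHashMap<>();
    entries.put("entry-a", LocalDate.of(2005, 6, 1));
    entries.put("entry-b", LocalDate.of(2015, 6, 1));
    entries.put("entry-c", LocalDate.of(2019, 6, 1));
    LocalDate endDate = LocalDate.of(2020, 1, 1);
    System.out.println(exportedIds(entries, endDate, 10)); // [entry-b, entry-c]
    System.out.println(exportedIds(entries, endDate, 3));  // [entry-c]
  }
}
```

Note that entry-c appears in both outputs under the same ID; only the set of entries that pass the filter changes.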

Expectations

  • All exporters, other than the limitations noted below, should produce exactly identical UUIDs when referring to the same

  • Content should be identical when running the same version of Synthea

    • "version of Synthea" means the specific commit
    • JDK version should not matter (assuming both versions are supported, currently that means >= 11)
    • Operating System should not matter
  • Systems running in different time zones will produce different timestamps in export formats where timestamps are converted to the system's local time
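The time zone caveat can be seen with plain java.time calls: the same instant renders as different text depending on the zone it is converted to. This is a sketch of the general effect, not of any particular Synthea exporter, and the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.format.DateTimeFormatter;

class TimeZoneSketch {
  // Formats one instant (epoch milliseconds) as a timestamp in the given zone.
  static String localTimestamp(long epochMillis, ZoneId zone) {
    return DateTimeFormatter.ISO_OFFSET_DATE_TIME.format(
        Instant.ofEpochMilli(epochMillis).atZone(zone));
  }

  public static void main(String[] args) {
    long sameInstant = 0L;
    // The two strings differ even though they describe the same moment.
    System.out.println(localTimestamp(sameInstant, ZoneId.of("UTC")));
    System.out.println(localTimestamp(sameInstant, ZoneId.of("America/New_York")));
    // What an export using system local time would see on this machine.
    System.out.println(localTimestamp(sameInstant, ZoneId.systemDefault()));
  }
}
```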

Limitations

Certain export formats have known limitations, and results across runs may not be exactly identical unless additional criteria are met.

  • CSV Exporter, claim_transactions.csv
    • The CSV exporter uses a monotonically increasing value for "ChargeID" that is shared across all patients. If Synthea is run with only a single thread and all the same settings, two runs will be exactly identical. If the run uses multithreading, the order of exports will differ between runs, and entries in this file will receive different IDs and ChargeIDs.
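The ChargeID behavior can be sketched with a shared counter and a thread pool. This is a hypothetical model of the general pattern, not the CSV exporter's code: with one worker thread, patients receive IDs in submission order; with several, the same set of IDs is handed out but which patient gets which ID depends on thread scheduling.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

class SharedCounterSketch {
  // Assigns each patient the next value of a counter shared by all threads.
  static Map<String, Long> export(List<String> patients, int threads) {
    AtomicLong nextChargeId = new AtomicLong(1);
    Map<String, Long> assigned = new ConcurrentHashMap<>();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (String patient : patients) {
      pool.execute(() -> {
        assigned.put(patient, nextChargeId.getAndIncrement());
      });
    }
    pool.shutdown();
    try {
      pool.awaitTermination(1, TimeUnit.MINUTES);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while exporting", e);
    }
    return new TreeMap<>(assigned); // sorted by patient for stable printing
  }

  public static void main(String[] args) {
    List<String> patients = Arrays.asList("p1", "p2", "p3", "p4");
    // One thread: tasks run in submission order, so the mapping is stable.
    System.out.println(export(patients, 1));
    // Four threads: the mapping may vary from run to run.
    System.out.println(export(patients, 4));
  }
}
```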
  • CCDA Exporter
    • Any instance where a Synthea HealthRecord.Entry maps cleanly to a single CCDA template will be fine, but instances where one Entry maps to multiple templates still use the randUUID() method. For example, see conditions.ftl - the Entry UUID is assigned to the "problem act template" (line 17), but there is another UUID in the nested "Problem observation template" below (line 33) which is left as randUUID(). (randUUID() will still evaluate to the same value given the exact same settings, but this is not guaranteed in all cases.)

Risk of UUID collision

The risk of UUID collision within a single run of Synthea of arbitrary size is designed to be acceptably low. However, if the same seed is used across multiple populations intended to be unique (for example, running different US states with the same seed), then the risk of UUID collision increases significantly. For this reason, it is recommended to use different seeds when combining multiple runs of Synthea into a single population.
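To see why a shared seed raises the collision risk, consider a simplified sketch in which UUIDs are built from the output of a seeded java.util.Random. This is an assumption made for illustration and not a description of how Synthea derives its UUIDs. Two "populations" generated with the same seed receive the same sequence of UUIDs regardless of what else differs between them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.UUID;

class SeededUuidSketch {
  // Builds `count` UUIDs from a seeded generator. The values are not
  // standards-compliant version 4 UUIDs; they only illustrate determinism.
  static List<UUID> uuidsFor(long seed, int count) {
    Random random = new Random(seed);
    List<UUID> ids = new ArrayList<>();
    for (int i = 0; i < count; i++) {
      ids.add(new UUID(random.nextLong(), random.nextLong()));
    }
    return ids;
  }

  public static void main(String[] args) {
    // Two populations, same seed: every UUID collides.
    System.out.println(uuidsFor(42L, 3).equals(uuidsFor(42L, 3)));
    // Two populations, different seeds: the sequences differ.
    System.out.println(uuidsFor(42L, 3).equals(uuidsFor(43L, 3)));
  }
}
```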

Technical Details

There are two general principles involved in ensuring consistency:

  1. Use the Person object as the sole source of randomness. Do not add other sources of randomness (e.g., do not create a new instance of java.util.Random).

  2. Do not use randomness in the exporters unless there is no clean alternative.

Other things to avoid:

  • Iterating over or selecting from non-sorted collections, for example HashMaps. HashMaps with the same contents may appear to have consistent order across runs, but this is not guaranteed (in particular across different operating systems).
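One way to follow this advice is to sort the keys before making a random selection, so the choice depends only on the map's contents and the random draw, not on iteration order. The sketch below is illustrative: the class and method are hypothetical, and the Random argument stands in for the single per-person source of randomness described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

class SortedSelectionSketch {
  // Picks one key at random after sorting, so the result does not depend on
  // the map implementation or on the order in which keys were inserted.
  static String pick(Map<String, Double> options, Random random) {
    List<String> keys = new ArrayList<>(options.keySet());
    Collections.sort(keys);
    return keys.get(random.nextInt(keys.size()));
  }

  public static void main(String[] args) {
    Map<String, Double> hashed = new HashMap<>();
    hashed.put("asthma", 0.1);
    hashed.put("diabetes", 0.2);
    hashed.put("hypertension", 0.3);

    // Same contents, different implementation and insertion order.
    Map<String, Double> linked = new LinkedHashMap<>();
    linked.put("hypertension", 0.3);
    linked.put("asthma", 0.1);
    linked.put("diabetes", 0.2);

    // With the same seed, both maps yield the same selection.
    System.out.println(
        pick(hashed, new Random(7L)).equals(pick(linked, new Random(7L))));
  }
}
```

Using a sorted collection such as TreeMap from the start achieves the same result without the extra copy.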

Relevant Historical PRs

https://github.com/synthetichealth/synthea/pull/756
https://github.com/synthetichealth/synthea/pull/760
https://github.com/synthetichealth/synthea/pull/1237
https://github.com/synthetichealth/synthea/pull/1249