Fixed Records - synthetichealth/synthea GitHub Wiki

Synthea Fixed Records

Synthea supports fixed demographic information, sometimes referred to as fixed records. Users can supply demographic information that Synthea will use within the simulation and export in the generated records. Users can also supply multiple sets of demographic information for the same person. Combined with Synthea's split record feature, this simulates the kinds of variation in demographic information typically found across information systems. For example, a person's first name may be listed as "William" in the electronic health record system of their primary care provider but as "Bill" in the system of an urgent care provider. When both versions are provided, Synthea exports separate records, one listing the first name as "William" and another listing it as "Bill".

How it works

There are three main concepts to understand when using fixed records. They are:

  • Entity - This is a single person. A person will live in different places throughout their life. They may also change their name or other contact information over time.
  • Seed - An entity will have one or more seeds. A seed is the "ground truth" for the entity. It is the correct location, date of birth, etc. for an entity for a particular time range. A series of seeds for an entity shows how their demographic information changes over time.
  • Variant - A seed may have variants. A variant represents some deviation from the information in the seed. It could be a typo in a name or address. It could show a previous address. It can be wrong in all sorts of ways. The intent is to use this to emulate the types of errors typically seen in demographic information stored in real-world systems.

Users supply a file containing entities to Synthea using the -f flag. When using the fixed records feature, you cannot specify a Synthea population size, because the population is determined by the number of entities in the provided file. Similarly, the age range and gender command line flags cannot be used, because that information also comes from the file.
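
The flag restrictions above can be sketched as a small guard around the command line. This is only an illustration: the launcher name ./run_synthea and the -p, -a and -g flags (population, age range, gender) are assumptions not stated on this page, and build_command is a hypothetical helper, not part of Synthea.

```python
# Flags that should not be combined with a fixed records file.
# Assumption: "-p", "-a" and "-g" are the population, age-range and
# gender flags; check your Synthea version before relying on this.
CONFLICTING_FLAGS = {"-p", "-a", "-g"}


def build_command(fixed_records_path, extra_args=()):
    """Build an argument list for a fixed records run (hypothetical helper)."""
    clashes = [arg for arg in extra_args if arg in CONFLICTING_FLAGS]
    if clashes:
        raise ValueError(f"not allowed together with -f: {clashes}")
    # "./run_synthea" is an assumed launcher name; adjust for your checkout.
    return ["./run_synthea", "-f", fixed_records_path, *extra_args]


print(build_command("entities.json"))
```

The resulting list could be handed to subprocess.run; the sketch only builds it so that the check can be tried without Synthea installed.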

An example file with a single entity is as follows:

{
  "records": [
    {
      "individualId": "1234",
      "dateOfBirth": "1944-03-17",
      "gender": "F",
      "seeds": [
        {
          "seedId": "5678",
          "period": {
            "start": "1944-03-17",
            "end": "1958-03-17"
          },
          "givenName": "Rita Ebony",
          "familyName": "Noble",
          "phone": "9194819376",
          "addressLines": [
            " 865 WHITT TOWN RD"
          ],
          "city": "ROXBORO",
          "state": "NC",
          "zipCode": "27574",
          "variants": [
            {
              "variantId": "91011",
              "dateOfBirth": "1944-03-17",
              "gender": "F",
              "givenName": "Margarita Ebony",
              "familyName": "Noble",
              "phone": "9194819376",
              "addressLines": [
                " 865 WHITT TOWN RD"
              ],
              "city": "ROXBORO",
              "state": "NC",
              "zipCode": "27574"
            },
            {
              "variantId": "1213",
              "dateOfBirth": "1944-04-17",
              "gender": "F",
              "givenName": "Margarita Ebony",
              "familyName": "Noble",
              "phone": "9194819376",
              "addressLines": [
                " 865 WHITT TOWN RD"
              ],
              "city": "ROXBORO",
              "state": "NC",
              "zipCode": "27574"
            }
          ]
        },
        {
          "seedId": "1415",
          "period": {
            "start": "1958-03-18",
            "end": "1970-03-17"
          },
          "givenName": "Rita Ebony",
          "familyName": "Noble",
          "phone": "9194819376",
          "addressLines": [
            "7217 BRIGHTON HILL LN"
          ],
          "city": "RALEIGH",
          "state": "NC",
          "zipCode": "27616",
          "variants": [
            {
              "variantId": "1617",
              "dateOfBirth": "1944-04-17",
              "gender": "F",
              "givenName": "Margarita Ebony",
              "familyName": "Noble",
              "phone": "9194819376",
              "addressLines": [
                "7217 BRIGHTON-HILL"
              ],
              "city": "RAHLEY",
              "state": "NC",
              "zipCode": "27616"
            }
          ]
        },
        {
          "seedId": "1416",
          "period": {
            "start": "1970-03-18"
          },
          "givenName": "Rita Ebony",
          "familyName": "Noble",
          "phone": "9194819376",
          "addressLines": [
            "7217 BRIGHTON HILL LN"
          ],
          "city": "RALEIGH",
          "state": "NC",
          "zipCode": "27616",
          "variants": [
            {
              "variantId": "1618",
              "dateOfBirth": "1944-04-17",
              "gender": "F",
              "givenName": "M E",
              "familyName": "Noble",
              "phone": "9194819376",
              "addressLines": [
                "7217 BRIGHTON-HILL"
              ],
              "city": "RAHLEY",
              "state": "NC",
              "zipCode": "27616"
            }
          ]
        }
      ]
    }
  ]
}

This example has a single entity with three seeds. Each seed has a period over which it is active. The period of an entity's last seed should not have an end value, because the last seed is used until the end of the simulation for that person.
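
The period rules can be checked before a file is handed to Synthea. The following is a minimal sketch, not Synthea's own validation. check_seed_periods is a hypothetical helper, and the requirement that each seed starts the day after the previous one ends is an assumption taken from the example file.

```python
from datetime import date, timedelta


def check_seed_periods(seeds):
    """Return a list of problems found in one entity's seed periods."""
    problems = []
    ordered = sorted(seeds, key=lambda s: date.fromisoformat(s["period"]["start"]))
    for i, seed in enumerate(ordered):
        period = seed["period"]
        if i == len(ordered) - 1:
            # The last seed stays active until the end of the simulation.
            if "end" in period:
                problems.append(f"seed {seed['seedId']}: last seed should have no end")
            continue
        if "end" not in period:
            problems.append(f"seed {seed['seedId']}: only the last seed may omit end")
            continue
        # Assumption: the next seed starts the day after this one ends.
        expected = date.fromisoformat(period["end"]) + timedelta(days=1)
        actual = date.fromisoformat(ordered[i + 1]["period"]["start"])
        if actual != expected:
            problems.append(
                f"seed {seed['seedId']}: next seed starts {actual}, expected {expected}"
            )
    return problems


# Periods copied from the example entity.
seeds = [
    {"seedId": "5678", "period": {"start": "1944-03-17", "end": "1958-03-17"}},
    {"seedId": "1415", "period": {"start": "1958-03-18", "end": "1970-03-17"}},
    {"seedId": "1416", "period": {"start": "1970-03-18"}},
]
print(check_seed_periods(seeds))  # prints []
```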

The first seed has two variants. Both variants have a different givenName than the seed. The second variant also has a date of birth that is off by one month.
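
To see how a variant deviates from the ground truth, it can be compared field by field against the seed. The sketch below assumes that the ground truth for dateOfBirth and gender comes from the entity, since the example file lists those fields on the entity rather than on the seed. variant_differences is a hypothetical helper, and the records are trimmed copies of the example.

```python
def variant_differences(entity, seed, variant):
    """Return {field: (ground_truth, variant_value)} for fields that differ."""
    # Ground truth combines entity-level fields with the seed's own fields.
    truth = {"dateOfBirth": entity["dateOfBirth"], "gender": entity["gender"]}
    truth.update(
        {k: v for k, v in seed.items() if k not in ("seedId", "period", "variants")}
    )
    return {
        field: (truth.get(field), value)
        for field, value in variant.items()
        if field != "variantId" and truth.get(field) != value
    }


entity = {"individualId": "1234", "dateOfBirth": "1944-03-17", "gender": "F"}
seed = {"seedId": "5678", "givenName": "Rita Ebony", "familyName": "Noble"}
variants = [
    {"variantId": "91011", "dateOfBirth": "1944-03-17", "gender": "F",
     "givenName": "Margarita Ebony", "familyName": "Noble"},
    {"variantId": "1213", "dateOfBirth": "1944-04-17", "gender": "F",
     "givenName": "Margarita Ebony", "familyName": "Noble"},
]
for variant in variants:
    print(variant["variantId"], variant_differences(entity, seed, variant))
```

For these records the first variant differs only in givenName, while the second differs in both givenName and dateOfBirth.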

As the simulation for a person progresses, Synthea detects when the time range for one seed has ended and switches to the next. When this happens, all of the person's preferred providers in Synthea are reset. This allows the person to move and select new providers in their new location.

When using this feature, the configuration property exporter.split_records will be set to true. This means that when a person moves in the simulation, they will likely have a separate set of health records generated in the simulation.

The clinical simulation in Synthea will proceed normally when using this feature. This means that a variable number of records may be exported for an entity depending on the randomness of the simulation. If a person does not seek care in the simulation while a particular seed is active, then there will be no records with variants associated with that seed in the output.

When exporting a record for the individual, Synthea finds the seed whose period includes the date of the last encounter in that record. It then selects one of that seed's variants and populates the individual's demographic information from the variant.
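
The export-time lookup can be sketched as follows. This is an illustration of the idea rather than Synthea's implementation. seed_for_date and pick_variant are hypothetical names, treating the end date as inclusive is an assumption based on the example periods, and choosing a variant uniformly at random is also an assumption, since this page does not say how the variant is selected.

```python
import random
from datetime import date


def seed_for_date(seeds, when):
    """Return the seed whose period contains `when`, or None."""
    for seed in seeds:
        start = date.fromisoformat(seed["period"]["start"])
        end_text = seed["period"].get("end")
        # A missing end means the seed is active to the end of the simulation.
        end = date.fromisoformat(end_text) if end_text else date.max
        if start <= when <= end:  # assumption: the end date is inclusive
            return seed
    return None


def pick_variant(seed, rng=random):
    """Pick one variant of the seed (uniform choice is an assumption)."""
    variants = seed.get("variants", [])
    return rng.choice(variants) if variants else None


# Periods and ids from the example entity; other fields omitted.
seeds = [
    {"seedId": "5678", "period": {"start": "1944-03-17", "end": "1958-03-17"},
     "variants": [{"variantId": "91011"}, {"variantId": "1213"}]},
    {"seedId": "1415", "period": {"start": "1958-03-18", "end": "1970-03-17"},
     "variants": [{"variantId": "1617"}]},
    {"seedId": "1416", "period": {"start": "1970-03-18"},
     "variants": [{"variantId": "1618"}]},
]

last_encounter = date(1965, 6, 1)  # made-up date for illustration
seed = seed_for_date(seeds, last_encounter)
variant = pick_variant(seed)
print(seed["seedId"], variant["variantId"])  # prints: 1415 1617
```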

Caveats

Seeds are used as "ground truth" in the clinical simulation, so they must contain information that Synthea understands. Cities must be present in the Synthea demographics file. Not all cities or towns are present in that file, so verify that each city used is one that Synthea has information for.

The current format allows for the creation of invalid dates, such as February 30, since date of birth is just a JSON string. Some users may wish to use such values to mimic data systems that allow invalid dates to be stored, but the Synthea tools are not capable of handling invalid dates. Synthea can handle valid dates that are meant to represent typos but are completely unreasonable for people alive today, such as "February 11, 22" (the actual year 22) or "February 11, 2202". Dates like these are fine to use in variants, but should not be used in seeds.
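
Because dates are plain JSON strings, a quick pre-check can catch impossible dates before running Synthea. The sketch below uses Python's date.fromisoformat, which rejects February 30 but accepts real calendar dates in unreasonable years such as 22 or 2202. find_invalid_dates is a hypothetical helper, and the entity shown is made up for illustration.

```python
from datetime import date


def is_valid_date(text):
    """True if `text` is an ISO date that exists on the calendar."""
    try:
        date.fromisoformat(text)
        return True
    except (TypeError, ValueError):
        return False


def find_invalid_dates(entity):
    """Return (location, value) pairs for date strings that are not real dates."""
    found = []

    def check(where, value):
        if value is not None and not is_valid_date(value):
            found.append((where, value))

    check("entity dateOfBirth", entity.get("dateOfBirth"))
    for seed in entity.get("seeds", []):
        for key in ("start", "end"):
            check(f"seed {seed.get('seedId')} period.{key}",
                  seed.get("period", {}).get(key))
        for variant in seed.get("variants", []):
            check(f"variant {variant.get('variantId')} dateOfBirth",
                  variant.get("dateOfBirth"))
    return found


# Made-up entity: v1 is an impossible date, v2 and v3 are valid but implausible.
entity = {
    "individualId": "demo",
    "dateOfBirth": "1944-03-17",
    "seeds": [{
        "seedId": "s1",
        "period": {"start": "1944-03-17"},
        "variants": [
            {"variantId": "v1", "dateOfBirth": "1944-02-30"},
            {"variantId": "v2", "dateOfBirth": "0022-02-11"},
            {"variantId": "v3", "dateOfBirth": "2202-02-11"},
        ],
    }],
}
print(find_invalid_dates(entity))
```

Only the February 30 value is reported here. The year 22 and year 2202 dates pass, which matches the distinction drawn above between invalid dates and valid but unreasonable ones.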