BFD RIF Export - synthetichealth/synthea GitHub Wiki

The BFD RIF exporter produces files that conform to the following specifications:

RIF Layout and FHIR Mapping defines each file type and the fields contained within it
CODEBOOK: Medicare Beneficiary Summary File (MBSF) Base with Medicare Part A, B, C, and D defines each of the data dictionaries and the included code values used in the beneficiary file
CODEBOOK: Medicare Fee For Service (FFS) Claims defines each of the data dictionaries and the included code values used in the claim files
CODEBOOK: Medicare Part D Event (PDE)/Drug Characteristics defines each of the data dictionaries and the included code values used in the Part D claim file

Configuration

The exporter is configured via a set of properties as shown below with their default values:

exporter.bfd.bene_id_start = -1000000 defines the start value of BENE_ID. The first exported patient gets this value; subsequent IDs decrement monotonically from it.
exporter.bfd.clm_id_start = -100000000 defines the start value of CLM_ID. The first exported claim gets this value; subsequent IDs decrement monotonically from it.
exporter.bfd.clm_grp_id_start = -100000000 defines the start value of CLM_GRP_ID. The first exported group gets this value; subsequent IDs decrement monotonically from it.
exporter.bfd.pde_id_start = -100000000 defines the start value of PDE_ID. The first exported PDE claim gets this value; subsequent IDs decrement monotonically from it.
exporter.bfd.mbi_start = 1S00-E00-AA00 defines the start value of MBI_NUM. The first exported patient uses this value; subsequent IDs increase monotonically from it.
exporter.bfd.hicn_start = T01000000A defines the start value of BENE_CRNT_HIC_NUM. The first exported record uses this value; subsequent IDs increase monotonically from it.
exporter.bfd.partc_contract_start = Y0001 defines the start value of the Part C contract IDs used in PTC_CNTRCT_JAN_ID to PTC_CNTRCT_DEC_ID. The first contract uses this ID; subsequent IDs increase monotonically from it.
exporter.bfd.partc_contract_count = 10 defines the number of Part C contracts that Synthea will use in exports. Each year, each patient is randomly assigned to one of the contracts (or to no contract).
exporter.bfd.partd_contract_start = Z0001 defines the start value of the Part D contract IDs used in PLAN_CNTRCT_REC_ID. The first contract uses this ID; subsequent IDs increase monotonically from it.
exporter.bfd.partd_contract_count = 10 defines the number of Part D contracts that Synthea will use in exports. Each year, each patient is randomly assigned to one of the contracts (or to no contract).
exporter.bfd.plan_benefit_package_start = 800 defines the starting value of plan benefit package identifiers.
exporter.bfd.plan_benefit_package_count = 5 defines the number of plan benefit package identifiers. Every Part C and Part D plan shares the same set of plan benefit package identifiers.
exporter.bfd.clia_labs_start = 00A0000000 defines the start number of the CLIA lab numbers used to populate CARR_LINE_CLIA_LAB_NUM.
exporter.bfd.clia_labs_count = 10 defines the number of CLIA lab numbers that will be used.
exporter.bfd.cutoff_date = 20140529 defines the earliest date for any exported claims.
generate.thread_pool_size = -1 defines the number of threads used by the generator. A value of -1 (the default) matches the number of available processor cores, as reported by Runtime.getRuntime().availableProcessors().
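
The decrementing ID scheme described for BENE_ID, CLM_ID, CLM_GRP_ID and PDE_ID can be sketched as follows. This is a minimal illustration and not the exporter's actual code: the class DecrementingIdSequence and its methods are hypothetical, and the use of AtomicLong is an assumption, chosen so that the counter would stay consistent if several generator threads shared it.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical sketch of a monotonically decrementing ID source.
// It does not reproduce the Synthea implementation.
class DecrementingIdSequence {
  private final AtomicLong counter;

  // start corresponds to a configured value such as exporter.bfd.bene_id_start.
  DecrementingIdSequence(long start) {
    this.counter = new AtomicLong(start);
  }

  // Returns the current ID, then moves the counter down by one.
  long next() {
    return counter.getAndDecrement();
  }

  // Returns the value the next call to next() would hand out.
  long peek() {
    return counter.get();
  }

  public static void main(String[] args) {
    DecrementingIdSequence beneIds = new DecrementingIdSequence(-1000000L);
    System.out.println(beneIds.next()); // first exported patient
    System.out.println(beneIds.next()); // second exported patient
    System.out.println(beneIds.next()); // third exported patient
  }
}
```

Starting from the default bene_id_start, the three calls in main print -1000000, -1000001 and -1000002.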

Synthea Output Files

The BFD output files will be found at output/bfd:

beneficiary_YYYY.csv beneficiary information, one file per year where YYYY will be the year
carrier.csv carrier claims
dme.csv durable medical equipment claims
end_state.properties see below
export_summary.csv summarizes the number of claims of each type per beneficiary
hha.csv home health claims
hospice.csv hospice claims
inpatient.csv inpatient claims
manifest.xml an XML list of generated files
missing_codes.csv list of Synthea codes that could not be mapped to HCPCS or CPT
npi.tsv synthetic provider list
outpatient.csv outpatient claims
pde.csv Part D prescription claims
snf.csv skilled nursing facility claims

The end_state.properties file captures the final value of any of the above listed configuration options that require a monotonically increasing or decreasing value per beneficiary or claim. The values in this file can be used (via the -c command line switch) to override the configured values to permit subsequent runs of Synthea to start where the prior run ended. An example file is shown below.

exporter.bfd.hicn_start=T01000020A
exporter.bfd.mbi_start=1S00E00AA20
exporter.bfd.clm_grp_id_start=-100003266
exporter.bfd.pde_id_start=-100000996
exporter.bfd.fi_doc_cntl_num_start=-100000575
exporter.bfd.bene_id_start=-1000020
exporter.bfd.carr_clm_cntl_num_start=-100001695
exporter.bfd.clm_id_start=-100002270

Generating a National Set of Records

The following shell script generates records for a set of beneficiaries across all 50 states and Washington, DC. The desired total population size is supplied as a command line argument; the number of beneficiaries in each location is proportional to that location's share of residents aged 62 or more (based on census data). An optional second integer argument specifies the number of months of future medical history. For example, to generate 1000 patients with 24 months of future claims the script would be run as: ./national_bfd.sh 1000 24.

#!/bin/bash

if [[ $# -eq 0 || $# -gt 2 ]]; then
  echo "Usage: $0 size [months]"
  echo "where 'size' is an integer specifying the target population size and 'months' is an integer specifying the number of months of future medical history"
  exit 1
fi

if [[ $# -eq 1 ]]; then
  end_date=
else
  case "$(uname -s)" in
      Darwin*)    date_args="-v+${2}m +%Y%m%d";;
      *)          date_args="-d ${2}months +%Y%m%d"
  esac

  future_date=`date $date_args`
  end_date="-e ${future_date}"
fi

# Weights are based on 2019 census data:
#
# https://data.census.gov/cedsci/table?q=Total%20Population&g=0400000US01,02,04,05,06,08,09,10,11,12,13,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,44,45,46,47,48,49,50,51,53,54,55,56&tid=ACSDP1Y2019.DP05&hidePreview=true&moe=false
#
# Each value represents the number of state residents aged 62 or more divided by the
# total number of USA state residents aged 62 or more expressed as a percentage.
#
states=( ); weights=( )
states+=( "Alabama" ); weights+=( "1.578" )
states+=( "Alaska" ); weights+=( "0.178" )
states+=( "Arizona" ); weights+=( "2.357" )
states+=( "Arkansas" ); weights+=( "0.958" )
# states+=( "California" ); weights+=( "10.801" ) # California is handled separately at the end and is used to absorb any rounding errors
states+=( "Colorado" ); weights+=( "1.586" )
states+=( "Connecticut" ); weights+=( "1.170" )
states+=( "Delaware" ); weights+=( "0.351" )
states+=( "District of Columbia" ); weights+=( "0.161" )
states+=( "Florida" ); weights+=( "8.044" )
states+=( "Georgia" ); weights+=( "2.836" )
states+=( "Hawaii" ); weights+=( "0.492" )
states+=( "Idaho" ); weights+=( "0.536" )
states+=( "Illinois" ); weights+=( "3.796" )
states+=( "Indiana" ); weights+=( "2.016" )
states+=( "Iowa" ); weights+=( "1.016" )
states+=( "Kansas" ); weights+=( "0.891" )
states+=( "Kentucky" ); weights+=( "1.401" )
states+=( "Louisiana" ); weights+=( "1.399" )
states+=( "Maine" ); weights+=( "0.530" )
states+=( "Maryland" ); weights+=( "1.801" )
states+=( "Massachusetts" ); weights+=( "2.179" )
states+=( "Michigan" ); weights+=( "3.288" )
states+=( "Minnesota" ); weights+=( "1.712" )
states+=( "Mississippi" ); weights+=( "0.905" )
states+=( "Missouri" ); weights+=( "1.963" )
states+=( "Montana" ); weights+=( "0.382" )
states+=( "Nebraska" ); weights+=( "0.580" )
states+=( "Nevada" ); weights+=( "0.916" )
states+=( "New Hampshire" ); weights+=( "0.472" )
states+=( "New Jersey" ); weights+=( "2.753" )
states+=( "New Mexico" ); weights+=( "0.698" )
states+=( "New York" ); weights+=( "6.092" )
states+=( "North Carolina" ); weights+=( "3.210" )
states+=( "North Dakota" ); weights+=( "0.220" )
states+=( "Ohio" ); weights+=( "3.804" )
states+=( "Oklahoma" ); weights+=( "1.175" )
states+=( "Oregon" ); weights+=( "1.406" )
states+=( "Pennsylvania" ); weights+=( "4.413" )
states+=( "Rhode Island" ); weights+=( "0.351" )
states+=( "South Carolina" ); weights+=( "1.713" )
states+=( "South Dakota" ); weights+=( "0.285" )
states+=( "Tennessee" ); weights+=( "2.098" )
states+=( "Texas" ); weights+=( "7.031" )
states+=( "Utah" ); weights+=( "0.686" )
states+=( "Vermont" ); weights+=( "0.234" )
states+=( "Virginia" ); weights+=( "2.523" )
states+=( "Washington" ); weights+=( "2.247" )
states+=( "West Virginia" ); weights+=( "0.679" )
states+=( "Wisconsin" ); weights+=( "1.903" )
states+=( "Wyoming" ); weights+=( "0.185" )

END_STATE_PROPS_FILE="./output/bfd/end_state.properties"

total_generated=0
for i in "${!states[@]}"
do 
  state=${states[$i]}
  weight=${weights[$i]}
  count=`echo "${1}*${weight}/100" | bc`
  total_generated=`echo "${total_generated}+${count}" | bc`
  
  if [[ $count -eq "0" ]]
  then
    echo "Skipping generating ${state}, requested patients is ${count} "
    continue
  fi

  if [[ -f "${END_STATE_PROPS_FILE}" ]]
  then
    load_props="-c ${END_STATE_PROPS_FILE}"
  else
    load_props=
  fi

  echo "Generating ${count} patients for ${state}"
  ./run_synthea -s ${i} -cs ${i} -r 20230224 ${end_date} ${load_props} -p ${count} --exporter.fhir.export=false --exporter.fhir.transaction_bundle=false --exporter.hospital.fhir.export=false --exporter.practitioner.fhir.export=false --exporter.bfd.export=true --exporter.years_of_history=10 --generate.only_alive_patients=true --generate.providers.selection_behavior=medicare "${state}"
done

# Generate remaining requested population for California to handle any rounding errors
if [[ -f "${END_STATE_PROPS_FILE}" ]]
then
  load_props="-c ${END_STATE_PROPS_FILE}"
else
  load_props=
fi

remaining=`echo "${1}-${total_generated}" | bc`
echo "Generating ${remaining} patients for California"
total_generated=`echo "${total_generated}+${remaining}" | bc`
./run_synthea -s 51 -cs 51 -r 20230224 ${end_date} ${load_props} -p ${remaining} --exporter.fhir.export=false --exporter.fhir.transaction_bundle=false --exporter.hospital.fhir.export=false --exporter.practitioner.fhir.export=false --exporter.bfd.export=true --exporter.years_of_history=10 --generate.only_alive_patients=true --generate.providers.selection_behavior=medicare California
echo "Finished generating ${total_generated} of ${1} requested patients"

The number of patients generated for each state is based on 2019 Census data. The target population for each state is calculated as:

                          target_total_population * census_state_population
target_state_population = -------------------------------------------------
                                    census_all_states_population

where:

target_total_population is the target population specified on the command line, e.g. 1000 in the example above,
census_state_population is the number of state residents aged 62 or more, and
census_all_states_population is the total number of USA state residents aged 62 or more
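
As a worked example of this calculation, the sketch below reproduces the per-state arithmetic in Java. It assumes that bc runs with its default scale of 0, so the division truncates rather than rounds; that truncation is why the script assigns the remaining patients to California. The class and method names are hypothetical, and the weights are copied from the script above.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical re-implementation of the script's `size * weight / 100` step.
class StateAllocationSketch {
  // weightPercent is the state's share of residents aged 62 or more, as a percentage.
  static long stateCount(long targetTotal, String weightPercent) {
    return new BigDecimal(targetTotal)
        .multiply(new BigDecimal(weightPercent))
        // Scale 0 with RoundingMode.DOWN truncates the fractional part.
        .divide(BigDecimal.valueOf(100), 0, RoundingMode.DOWN)
        .longValueExact();
  }

  public static void main(String[] args) {
    System.out.println("Florida: " + stateCount(1000, "8.044"));
    System.out.println("Texas: " + stateCount(1000, "7.031"));
    System.out.println("Wyoming: " + stateCount(1000, "0.185"));
    // With a small target a state can truncate to zero and be skipped.
    System.out.println("Wyoming (size 100): " + stateCount(100, "0.185"));
  }
}
```

With a target of 1000 this yields 80 patients for Florida, 70 for Texas and 1 for Wyoming. With a target of 100, Wyoming truncates to 0, which is the case the script reports as skipped.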

Note that the script fixes the values of several Synthea command line arguments. Depending on requirements, it may be desirable to edit these values, e.g. the reference date (-r 20230224).

Adding a New Field to a Synthea BFD Export File

Fork the Synthea Repository and Create a New Branch

Before making any changes, fork the Synthea repository and create a new branch for your work.

Update the Enumeration

The src/main/java/org/mitre/synthea/export/rif/BB2RIFStructure.java file contains an enum for each of the output files: BENEFICIARY, CARRIER, DME, HHA, HOSPICE, INPATIENT, OUTPATIENT, PDE and SNF. Each value in these enumerations defines the name of a field in the corresponding output file (the column header in the CSV file). The order of the values in the enumeration defines the order that the fields will be output in the file.

The above file also contains:

Additional enumerations that are used to define other file structures, e.g. EXPORT_SUMMARY to define the structure of the export_summary.csv file.
Static arrays of enumeration values that are used to group and loop over related fields, e.g. beneficiaryMedicareStatusFields includes each of the Medicare status field names (MDCR_STUS_JAN_CD...MDCR_STUS_DEC_CD, one field for each calendar month).

These additional enumerations and arrays can be ignored for the purposes of this document.

To add a new field to a BFD file, first edit the corresponding enumeration to add a new value for the field in the desired location. E.g. to add an EYE_COLOR field to the beneficiary file following the AGE field, edit the BENEFICIARY enumeration as shown below:

public enum BENEFICIARY {
  DML_IND,
  BENE_ID,
  ...
  CRNT_BIC,
  AGE,
  EYE_COLOR, // new field added for this example
  COVSTART,
  ...
}

Once rebuilt, Synthea will output the new field with a blank value.

Adding Fixed or Random Field Values

The src/main/resources/export/bfd_field_values.tsv tab-separated value (TSV) file contains fixed or random values for fields in each of the BFD output files. The columns of this file are:

Line a unique index for the line in the file used for reporting errors when processing the file
Field the BFD field name, values must match a value in an output file enumeration
BENEFICIARY...SNF one column per BFD file enumeration (names must match), each column entry specifies the desired value for that BFD file for the corresponding field value
Optional specifies whether the BFD field is optional (TRUE) or required (FALSE)
Comment provides any comments related to the field

The order of rows in the TSV file is not significant. The table below shows a small extract from the file.

Line	Field	BENEFICIARY	INPATIENT	OUTPATIENT	CARRIER	...
0	ADJSTMT_DLTN_CD	N/A	N/A	N/A	N/A	...
1	ADMTG_DGNS_CD	N/A	Coded	N/A	N/A	...
2	ADMTG_DGNS_VRSN_CD	N/A	0	N/A	N/A	...
3	AGE	Coded	N/A	N/A	N/A	...
4	AT_PHYSN_NPI	N/A	Coded	Coded	N/A	...
5	AT_PHYSN_UPIN	N/A	[Blank]	[Blank]	N/A	...

Each cell at the intersection of a field and an output file can contain one of the following:

N/A means that the field is not included in the file
Coded means that the field value is set dynamically in Java code - see next section
[Blank] means that the field value is explicitly set to be blank
A single value (e.g. 0 for the ADMTG_DGNS_VRSN_CD field of INPATIENT in the table above). Values can be numbers or strings and they are copied literally after removing any leading or trailing whitespace.
Multiple values separated by commas, e.g. 1,2,3, where one value will be selected at random from the list with equal weight applied to all values

Comments can be added to cells using parentheses, e.g. 1,2,3 (this is a comment) is functionally equivalent to 1,2,3. These can provide a helpful reminder of the meaning of coded values.
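
The comment and multiple-value rules can be illustrated with a short sketch. This is not the parser Synthea uses: the class FieldValueCellSketch and its methods are hypothetical, the special keywords (N/A, Coded, [Blank]) are deliberately not handled, and java.util.Random stands in for whatever random source the exporter actually uses.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;
import java.util.stream.Collectors;

// Hypothetical sketch of interpreting one cell of bfd_field_values.tsv.
class FieldValueCellSketch {
  // Removes any "(comment)" text, then splits the remainder on commas.
  static List<String> candidateValues(String cell) {
    String withoutComment = cell.replaceAll("\\([^)]*\\)", "");
    return Arrays.stream(withoutComment.split(","))
        .map(String::trim)            // drop leading and trailing whitespace
        .filter(s -> !s.isEmpty())
        .collect(Collectors.toList());
  }

  // Picks one candidate with equal weight for every value.
  static String pick(String cell, Random random) {
    List<String> values = candidateValues(cell);
    return values.get(random.nextInt(values.size()));
  }

  public static void main(String[] args) {
    System.out.println(candidateValues("1,2,3 (this is a comment)"));
    System.out.println(candidateValues("0"));
    System.out.println(pick("brown,blue,green", new Random()));
  }
}
```

The first two lines of output are [1, 2, 3] and [0]; the third is one of brown, blue or green, chosen at random.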

To assign random values to the example EYE_COLOR field above, add a new row to the TSV file with the next Line number, EYE_COLOR in the Field column, the desired values in the BENEFICIARY column, and N/A in every other column. See below for an example:

Line	Field	BENEFICIARY	INPATIENT	OUTPATIENT	CARRIER	...
...	...	...	...	...	...	...
125	EYE_COLOR	brown,blue,green	N/A	N/A	N/A	...

Adding Computed Field Values

If the field value needs to be computed you will need to edit the Java source code. Each BFD file is written by a separate Java class in the org.mitre.synthea.export.rif package. The class is named after the BFD file, e.g., the beneficiary BFD file is written by the BeneficiaryExporter class. Each of these classes implements an export method that is responsible for exporting all rows for a given synthetic patient at the end of the simulation. An example of adding a statistically weighted value for eye color is shown below:

// org.mitre.synthea.export.rif.BeneficiaryExporter

private static final RandomCollection<String> eyeColors = new RandomCollection<>();
static {
  eyeColors.add(45, "brown");
  eyeColors.add(27, "blue");
  eyeColors.add(18, "hazel");
  eyeColors.add(9, "green");
}

public String export(Person person, long startTime, long stopTime) throws IOException {
  ...
  fieldValues.put(BB2RIFStructure.BENEFICIARY.EYE_COLOR, eyeColors.next(person));
  ...
}

Submit a Pull Request

Once all changes have been made and tested:

Run the Synthea test suite via ./gradlew clean check
Fix any test failures
Submit a pull request via GitHub to request review of your changes and merging into the Synthea master branch