Get Started with FHIR Data Pipes Pipelines

The pipelines directory contains code to transform data from a FHIR server to either Apache Parquet files for analysis or another FHIR store for data integration.

See this page if you are using OpenMRS.

There are two options for reading the source FHIR server (input):

  • FHIR Search: Fetches resources through the FHIR Search API. This is the default.
  • JDBC: Reads resources directly from the source server's database; enabled by setting jdbcModeEnabled=true.

There are two options for transforming the data (output):

  • Parquet: Outputs the FHIR resources as Parquet files, using the SQL-on-FHIR schema.
  • FHIR: Copies the FHIR resources to another FHIR server using FHIR APIs.

Setup

  1. Clone the FHIR Data Pipes project to your machine.
  2. Set the utils directory to world-readable: chmod -R 755 ./utils.
  3. Build binaries by running mvn clean install from the root directory of the repository.
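
Putting these steps together, a first-time setup might look like the following sketch (the repository URL is the project's public GitHub location; adjust paths to your environment):

# Clone the repository, make the utils directory readable, and build.
git clone https://github.com/google/fhir-data-pipes.git
cd fhir-data-pipes
chmod -R 755 ./utils
mvn clean install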

Run the pipeline

Run the pipeline directly using the java command:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --[see additional parameters below]

Add the necessary parameters for your use case. The input and output methods depend on which parameters you pass. You can output to both Parquet files and a FHIR server by including the required parameters for both, as in the sketch below.
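
For example, a run that writes Parquet files and copies resources to a second FHIR server at the same time might look like this (the URLs and paths are placeholders):

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --outputParquetPath=/tmp/parquet/ \
    --fhirSinkPath=http://example.org/sink-fhir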

Parameters

This section documents the parameters used by the various pipelines. For more information on parameters, see FhirEtlOptions.

Common parameters

These parameters are used regardless of other pipeline options.

  • fhirServerUrl - The base URL of the source FHIR server. Required.
  • fhirServerUserName - The HTTP Basic Auth username to access the FHIR server APIs. Default: admin
  • fhirServerPassword - The HTTP Basic Auth password to access the FHIR server APIs. Default: Admin123
  • resourceList - A comma-separated list of FHIR resources to include in the pipeline. Default: Patient,Encounter,Observation
  • runner - The Apache Beam runner to use. The pipelines support DirectRunner and FlinkRunner by default; other runners, such as DataflowRunner, can be enabled via Maven profiles. Default: DirectRunner
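
As an illustration, the following sketch overrides the default credentials and resource list; the username and password values are placeholders:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --fhirServerUserName=myuser --fhirServerPassword=mypassword \
    --resourceList=Patient,Observation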

FHIR-Search input parameters

The pipeline will use FHIR-Search to fetch data as long as jdbcModeEnabled is unset or false.

  • batchSize - The number of resources to fetch in each API call. Default: 100
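
For instance, a FHIR-Search run that fetches 200 resources per call (leaving jdbcModeEnabled unset) might look like this sketch:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --batchSize=200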

JDBC input parameters

JDBC mode is used if jdbcModeEnabled=true.

To use JDBC mode, first create a copy of hapi-postgres-config.json and edit the values to match your database server.
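
A minimal sketch of that step (the destination path is arbitrary; point fhirDatabaseConfigPath at wherever you put the copy):

# Copy the sample config, then edit the values to match your
# database server.
cp ./utils/hapi-postgres-config.json ./config.json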

Next, include the following parameters:

  • jdbcModeEnabled=true
  • fhirDatabaseConfigPath=./path/to/config.json

If you are using a HAPI FHIR server, also include:

  • jdbcModeHapi=true
  • jdbcDriverClass=org.postgresql.Driver

All JDBC parameters:

  • jdbcModeEnabled - If true, uses JDBC mode. Default: false
  • fhirDatabaseConfigPath - Path to the FHIR database config for JDBC mode. Default: ../utils/hapi-postgres-config.json
  • jdbcModeHapi - If true (with jdbcModeEnabled), uses JDBC mode for a HAPI source server. Default: false
  • jdbcFetchSize - The fetch size of each JDBC database query. Default: 10000
  • jdbcMaxPoolSize - The maximum number of database connections. Default: 50
  • jdbcDriverClass - The JDBC driver to use. Should be set to org.postgresql.Driver for HAPI FHIR Postgres access. Default: com.mysql.cj.jdbc.Driver

Parquet output parameters

Parquet files are output when outputParquetPath is set.

  • outputParquetPath - The file path to write Parquet files to, e.g., ./tmp/parquet/. Default: empty string, which does not output Parquet files.
  • secondsToFlushParquetFiles - The number of seconds to wait before flushing all Parquet writers with non-empty content to files. Use 0 to disable. Default: 3600.
  • rowGroupSizeForParquetFiles - The approximate size in bytes of the row-groups in Parquet files. When this size is reached, the content is flushed to disk. This is not used if there are fewer than 100 records. Use 0 to use the default Parquet row-group size. Default: 0.
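
As a sketch, a run that flushes Parquet writers every 10 minutes and targets roughly 32 MB row-groups might look like this (the values are illustrative, not recommendations):

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --outputParquetPath=/tmp/parquet/ \
    --secondsToFlushParquetFiles=600 \
    --rowGroupSizeForParquetFiles=33554432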

FHIR output parameters

Resources will be copied to the FHIR server specified in fhirSinkPath if that field is set.

  • fhirSinkPath - The base URL of a target FHIR server, or the relative path of a GCP FHIR store; e.g., http://localhost:8091/fhir for a FHIR server, or projects/PROJECT/locations/LOCATION/datasets/DATASET/fhirStores/FHIR-STORE-NAME for a GCP FHIR store. If you are using a GCP FHIR store, see here for setup information. Default: none, which means resources are not copied.
  • sinkUserName - The HTTP Basic Auth username to access the FHIR sink. Not used for GCP FHIR stores.
  • sinkPassword - The HTTP Basic Auth password to access the FHIR sink. Not used for GCP FHIR stores.
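
For a GCP FHIR store sink, a sketch might look like the following; PROJECT, LOCATION, DATASET, and FHIR-STORE-NAME are placeholders, and the sink username/password are omitted because they are not used for GCP FHIR stores:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --fhirSinkPath=projects/PROJECT/locations/LOCATION/datasets/DATASET/fhirStores/FHIR-STORE-NAME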

A note about Beam runners

If the pipeline is run on a single machine (i.e., not on a distributed cluster), consider using a production-grade runner like Flink for large datasets. To do so, add the parameter --runner=FlinkRunner (use --maxParallelism and --parallelism to control parallelism). This may not significantly improve run time, but it can avoid some of the memory issues of DirectRunner.
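
For example, a single-machine Flink run might look like this sketch (the parallelism values are illustrative):

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://example.org/fhir \
    --outputParquetPath=/tmp/parquet/ \
    --runner=FlinkRunner \
    --maxParallelism=4 --parallelism=4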

Example configurations

These examples are set up to work with local test servers.

FHIR Search to Parquet files

Example run:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://localhost:8091/fhir \
    --outputParquetPath=/tmp/TEST/ \
    --resourceList=Patient,Encounter,Observation

HAPI FHIR JDBC to a FHIR server

Example run:

java -cp ./pipelines/batch/target/batch-bundled-0.1.0-SNAPSHOT.jar \
    org.openmrs.analytics.FhirEtl \
    --fhirServerUrl=http://localhost:8091/fhir \
    --resourceList=Patient,Encounter,Observation \
    --fhirDatabaseConfigPath=./utils/hapi-postgres-config.json \
    --jdbcModeEnabled=true --jdbcModeHapi=true \
    --jdbcMaxPoolSize=50 --jdbcFetchSize=1000 \
    --jdbcDriverClass=org.postgresql.Driver \
    --fhirSinkPath=http://localhost:8099/fhir \
    --sinkUserName=hapi --sinkPassword=hapi

How to query the data warehouse

To query Parquet files, load them into a compatible data engine such as Apache Spark. The single-machine Docker Compose configuration runs the pipeline and loads the data into an Apache Spark Thrift server for you.
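
For example, with a local Spark installation you can point spark-sql at the Parquet output; this sketch assumes the pipeline wrote Patient files under the /tmp/TEST/ path from the earlier example, so adjust the path to your layout:

spark-sql -e "CREATE TEMPORARY VIEW patient USING parquet OPTIONS (path '/tmp/TEST/Patient/'); SELECT COUNT(*) FROM patient;"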