Data Architecture - department-of-veterans-affairs/caseflow GitHub Wiki

Caseflow Database Dataflow Architecture

graphviz source

Prior Caseflow Data Architecture

Definitions

Caseflow transactional DB

The data store of record for Caseflow. Normalized schema (lots of tables, few columns, store values once).

Caseflow transactional replica

An RDS real-time replica of the Caseflow transactional DB.

Caseflow ODS DB

Caseflow Operational Data Store (ODS) is for reporting. Denormalized schema (few tables, lots of columns, store values multiple times). See the ETL section for details on how ODS is populated. ODS is an industry term.

An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.

VACOLS transactional DB

Oracle data store for legacy appeals. Hosted in VA data center.

VACOLS copy DB

A replica of the production Oracle DB, hosted in AWS GovCloud. Our copy is synced via AWS DMS. Our copy is for reporting only.

Redshift

A single instance of AWS Redshift, with schemas for each of Caseflow, VACOLS, and ODS. The Redshift product is a NoSQL architecture for column-oriented "big data" reporting. Our Redshift instance is sometimes referred to colloquially as a "data warehouse" but strictly speaking it does not follow the data warehouse architecture pattern.

Data is synced via AWS DMS.

NOTE that we currently take advantage of two Redshift features, namely that it integrates with DMS and allows for original source data stores of heterogenous types (in our case, PostgreSQL and Oracle). We do not, however, take advantage of its large data volume scaling since we do not have large volumes of data, nor do we use any of the sorting and distribution key features, nor have we performed any custom schema/table design to take advantage of Redshift's architecture. We are using it primarily because it allows for bringing together different schemas (Caseflow and VACOLS) into one query interface.

DMS tasks

dsva-appeals-vacols-spoke-prod-dms-task: copies from VACOLS to VACOLS copy
appeals-caseflow-prodtest-redshift-full-load: copies Caseflow's DB to Redshift prodtest env
appeals-caseflow-redshift-onetable-fix: manually run to copy one or a few tables to update tables that were locked when another DMS job ran
dsva-appeals-certification-redshift-onetime: copies from Caseflow's DB to Redshift (prod env)
dsva-appeals-ods-redshift-onetime: copies from Caseflow's ODS DB to Redshift (prod env)
dsva-appeals-vacols-redshift-onetime: copies from VACOLS copy to Redshift (prod env)

See https://console.amazonaws-us-gov.com/dms/v2/home?region=us-gov-west-1#tasks for when each DMS task was last Started and Stopped. (Ignore the misleading Status column, which indicates the lambda job that initiated the DMS job.)

Tableau

The Tableau product provides visualizations and reporting to BVA.

Metabase

Web application for running ad hoc queries against one or more data stores. Our instance is currently in development. When ready, it will provide access to Redshift, Caseflow transactional replica, and Caseflow ODS.

ETL

The Extract, Transform, Load (ETL) process in Caseflow is used to prepare transactional data for reporting. Currently we use an ETL process to populate the ODS data store. We Extract from the Caseflow transactional DB, Transform the data, and Load into the ODS schema. The Transform step allows for pre-calculating some algorithmically complex values and reducing the number of SQL JOINs required by denormalizing how the data is stored. The Transform step makes the Tableau reporting SQL used to retrieve those complex values easier to maintain and perform faster.

Some Transform examples can illustrate. These examples look at common reporting needs and show the transactional SQL vs the ODS SQL.

Note the ODS schema ods. prefix in these examples is specific to Redshift. If you run the SQL in the Caseflow ODS PostgreSQL db, it uses the public. (default) schema prefix.

Determine the Claimant DOB for an Appeal

Transactional:

select people.date_of_birth as claimant_dob
from people,claimants,appeals 
where claimants.participant_id=people.participant_id 
  and claimants.decision_review_id=appeals.id 
  and claimants.decision_review_type='Appeal' 
  and appeals.id=123

ODS:

select claimant_dob 
from ods.appeals 
where ods.appeals.appeal_id=123

Determine the VACOLS user name for the person assigned a Task

Transactional:

select cached_user_attributes.sattyid
from cached_user_attributes, users, tasks
where cached_user_attributes.sdomainid=users.css_id 
  and users.id=tasks.assigned_to_id 
  and tasks.id=123

ODS:

select assigned_to_user_sattyid
from ods.tasks
where ods.tasks.task_id=123

Count the number of Appeals for each possible decision status

BVA has defined a series of status categories. We derive these from the tasks associated with an Appeal.

Transactional:

WITH undistributed_appeals AS (select '1. Not distributed'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('DistributionTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN  ('on_hold', 'assigned', 'in_progress')
              AND tasks.appeal_id NOT IN (SELECT appeal_id FROM tasks WHERE tasks.appeal_type='Appeal' AND tasks.type = 'TimedHoldTask' AND tasks.status IN ('on_hold', 'assigned', 'in_progress'))
-- The Appeal has been distributed to the judge, but not yet assigned to an attorney for working.
      ), distributed_to_judge AS (select '2. Distributed to judge'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('JudgeAssignTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN  ('assigned', 'in_progress')
-- The Appeal has been assigned to the Attorney, but is not yet being worked.
      ), assigned_to_attorney AS (select '3. Assigned to attorney'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('AttorneyTask', 'AttorneyRewriteTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN  ('assigned')
-- The Appeal has been sent to VLJ Support
      ), assigned_to_colocated AS (select '4. Assigned to colocated'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('AddressVerificationColocatedTask',
                                 'AojColocatedTask',
                                 'ArnesonColocatedTask',
                                 'ExtensionColocatedTask',
                                 'FoiaColocatedTask',
                                 'HearingClarificationColocatedTask',
                                 'IhpColocatedTask',
                                 'MissingHearingTranscriptsColocatedTask',
                                 'MissingRecordsColocatedTask',
                                 'NewRepArgumentsColocatedTask',
                                 'OtherColocatedTask',
                                 'PendingScanningVbmsColocatedTask',
                                 'PoaClarificationColocatedTask',
                                 'PreRoutingColocatedTask',
                                 'RetiredVljColocatedTask',
                                 'ScheduleHearingColocatedTask',
                                 'StayedAppealColocatedTask',
                                 'TranslationColocatedTask',
                                 'UnaccreditedRepColocatedTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal has been assigned to the Attorney and is being worked.
      ), decision_in_progress AS (select '5. Decision in progress'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('AttorneyTask', 'AttorneyRewriteTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN  ('in_progress')
-- The Appeal decision has been written and is with the judge for sign off
      ), decision_ready_for_sign AS (select '6. Decision ready for signature'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('JudgeDecisionReviewTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal has been signed by the judge and is in Quality Review or BVA Dispatch
      ), decision_signed AS (select '7. Decision signed'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('BvaDispatchTask', 'QualityReviewTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal dispatch is completed. Case is complete with no open tasks.
      ), decision_dispatched AS (select '8. Decision dispatched'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            JOIN public.appeals AS appeals ON tasks.appeal_id = appeals.id
            WHERE tasks.type IN ('BvaDispatchTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('completed')
              AND tasks.appeal_id NOT IN (SELECT appeal_id FROM tasks WHERE tasks.appeal_type='Appeal' AND tasks.status IN ('on_hold', 'assigned', 'in_progress'))
-- The Appeal is currently on a timed hold
      ), appeal_on_hold AS (select 'ON HOLD'::text as decision_status, count(1) as num
            FROM tasks
            WHERE tasks.type IN ('TimedHoldTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('assigned', 'in_progress')
-- An appeal can be considered cancelled if the RootTask is cancelled.
      ), appeal_cancelled AS (select 'CANCELLED'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('RootTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('cancelled')
-- Otherwise Unspecified Miscellaneous Status
-- The Appeal is currently returned from Quality Review (and with the Judge or Attorney) or returned from BVA Disptch (and with the Judge or Attorney)
-- Active task for MISC is one of: 'JudgeQualityReviewTask', 'JudgeDecisionReviewTask', 'AttorneyQualityReviewTask', 'AttorneyDecisionReviewTask'
      ), appeal_misc AS (select 'MISC'::text as decision_status, count(DISTINCT(appeal_id)) as num
            FROM tasks
            WHERE tasks.type IN ('JudgeQualityReviewTask', 'JudgeDispatchReturnTask', 'AttorneyQualityReviewTask', 'AttorneyDispatchReturnTask')
              AND tasks.appeal_type='Appeal'
              AND tasks.status IN ('assigned', 'in_progress')

      )

---- Stitch together the temp tables into the format the Board would like

SELECT decision_status, num FROM undistributed_appeals
UNION ALL
SELECT decision_status, num FROM distributed_to_judge
UNION ALL
SELECT decision_status, num FROM assigned_to_attorney
UNION ALL
SELECT decision_status, num FROM decision_in_progress
UNION ALL
SELECT decision_status, num FROM assigned_to_colocated
UNION ALL
SELECT decision_status, num FROM decision_ready_for_sign
UNION ALL
SELECT decision_status, num FROM decision_signed
UNION ALL
SELECT decision_status, num FROM decision_dispatched
UNION ALL
SELECT decision_status, num FROM appeal_on_hold
UNION ALL
SELECT decision_status, num FROM appeal_misc
UNION ALL
SELECT decision_status, num FROM appeal_cancelled
ORDER BY decision_status

ODS:

SELECT status, count(*) 
FROM ods.appeals 
GROUP BY status, decision_status_sort_key 
ORDER BY decision_status_sort_key

Determine the AOD (Advanced on Docket) status of an Appeal

Transactional:

SELECT appeals.*, 
       "AOD Details"."appeals.id"            AS "aod_appeals.id", 
       "AOD Details"."granted"               AS "aod_granted", 
       "AOD Details"."reason"                AS "aod_reason", 
       "AOD Details"."veteran.age"           AS "aod_veteran.age", 
       "AOD Details"."is_advanced_on_docket" AS "aod_is_advanced_on_docket" 
FROM   appeals 
       left join (WITH people_with_age 
                       AS (SELECT *, 
                                  datediff ('years', current_date 
                                                     , people.date_of_birth) AS 
                                     "veteran.age" 
                           FROM   people) 
                  SELECT appeals.id                        AS "appeals.id", 
                         advance_on_docket_motions.granted AS "granted", 
                         advance_on_docket_motions.reason  AS "reason", 
                         "people_with_age"."veteran.age"   AS "veteran.age", 
                         Coalesce (CASE 
                                     WHEN "veteran.age" >= 75 THEN TRUE
                                     WHEN advance_on_docket_motions.granted = 'true' THEN TRUE
                                     ELSE false
                                   END)             AS 
                         "is_advanced_on_docket" 
                   FROM   people_with_age 
                          left join public.claimants AS claimants 
                                 ON people_with_age.participant_id = 
                                    claimants.participant_id 
                          left join public.advance_on_docket_motions AS 
                                    advance_on_docket_motions 
                                 ON people_with_age.id = 
                                    advance_on_docket_motions.person_id 
                          left join public.appeals AS appeals 
                                 ON claimants.decision_review_id = appeals.id 
                                    AND claimants.decision_review_type = 
                                        'Appeal') 
                                 "AOD Details" 
              ON ( "appeals"."id" = "AOD Details"."appeals.id" ) 
WHERE  appeals.id = 123

ODS:

SELECT *, 
       COALESCE (CASE 
                   WHEN aod_due_to_dob = 'true' THEN true 
                   WHEN aod_granted = 'true' THEN true 
                   ELSE false 
                 END) AS is_advanced_on_docket 
FROM ods.appeals
WHERE ods.appeals.appeal_id = 123

Scheduling

Redshift (prod environment) is updated once a day. All schema tables are dropped and re-imported. This allows for schema changes in the transactional and ODS data stores to be reflected in Redshift. The schedule is:

public (transactional schema) 11pm ET
ods 12am ET
vacols 1am ET

Redshift (prodTest environment) is updated nightly. ~~This is used by the Reporting Team to test performance against real data.~~

Caseflow Replica (RDS) is updated in near real-time.

The Caseflow ODS (RDS) updates every 6 hours, at 20 minutes after the hour. The schedule is on UTC, so 00:20, 06:20, 12:20, 18:20 (1:20am, 7:20am, 1:20pm, 7:20pm EST).

Schemas

Naming conventions

All the Caseflow schemas (i.e. not including VACOLS) follow the Active Record naming conventions. For example, the column appeal_id on any table should be interpreted as a foreign key referring to the appeals.id column.

In addition, many tables follow the Polymorphic Association pattern which allows for a single table to contain foreign key references to different tables.

The polymorphic naming convention is to pair *_type and *_id suffixes on two columns that refer to the foreign model name and the foreign key value, respectively. Because the model name is read by the Active Record code, it is used to correspond with the table name.

For example, the end_product_establishments table has source_id and source_type columns. The source_type value might be (among others) "HigherLevelReview" or "SupplementalClaim". These model names correspond to the "higher_level_reviews" and "supplemental_claims" tables, respectively. So if you found a end_product_establishments record with source_id of "101" and a source_type of "SupplementalClaim" you would expect to find an associated supplemental_claims record with id of "101".

Do not confuse the *_type polymorphic pattern with the Single Table Inheritance (STI) pattern. They both use the word "type" in the column name, but STI lacks the corresponding *_id column.