Data Architecture - TISTATechnologies/caseflow GitHub Wiki
Prior Caseflow Data Architecture
Definitions
Caseflow transactional DB
The data store of record for Caseflow. Normalized schema (lots of tables, few columns, store values once).
Caseflow transactional replica
An RDS real-time replica of the Caseflow transactional DB.
Caseflow ODS DB
Caseflow Operational Data Store (ODS) is for reporting. Denormalized schema (few tables, lots of columns, store values multiple times). See the ETL section for details on how ODS is populated. ODS is an industry term.
An ODS is a database designed to integrate data from multiple sources for additional operations on the data, for reporting, controls and operational decision support. Unlike a production master data store, the data is not passed back to operational systems. It may be passed for further operations and to the data warehouse for reporting.
VACOLS transactional DB
Oracle data store for legacy appeals. Hosted in VA data center.
VACOLS copy DB
A replica of the production Oracle DB, hosted in AWS GovCloud. Our copy is synced via AWS DMS. Our copy is for reporting only.
Redshift
A single instance of AWS Redshift, with schemas for each of Caseflow, VACOLS, and ODS. The Redshift product is a NoSQL architecture for column-oriented "big data" reporting. Our Redshift instance is sometimes referred to colloquially as a "data warehouse" but strictly speaking it does not follow the data warehouse architecture pattern.
Data is synced via AWS DMS.
NOTE that we currently take advantage of two Redshift features, namely that it integrates with DMS and allows for original source data stores of heterogenous types (in our case, PostgreSQL and Oracle). We do not, however, take advantage of its large data volume scaling since we do not have large volumes of data, nor do we use any of the sorting and distribution key features, nor have we performed any custom schema/table design to take advantage of Redshift's architecture. We are using it primarily because it allows for bringing together different schemas (Caseflow and VACOLS) into one query interface.
DMS tasks
- dsva-appeals-vacols-spoke-prod-dms-task: copies from VACOLS to VACOLS copy
- appeals-caseflow-prodtest-redshift-full-load: copies Caseflow's DB to Redshift prodtest env
- appeals-caseflow-redshift-onetable-fix: manually run to copy one or a few tables to update tables that were locked when another DMS job ran
- dsva-appeals-certification-redshift-onetime: copies from Caseflow's DB to Redshift (prod env)
- dsva-appeals-ods-redshift-onetime: copies from Caseflow's ODS DB to Redshift (prod env)
- dsva-appeals-vacols-redshift-onetime: copies from VACOLS copy to Redshift (prod env)
See https://console.amazonaws-us-gov.com/dms/v2/home?region=us-gov-west-1#tasks for when each DMS task was last Started and Stopped. (Ignore the misleading Status column, which indicates the lambda job that initiated the DMS job.)
Tableau
The Tableau product provides visualizations and reporting to BVA.
Metabase
Web application for running ad hoc queries against one or more data stores. Our instance is currently in development. When ready, it will provide access to Redshift, Caseflow transactional replica, and Caseflow ODS.
ETL
The Extract, Transform, Load (ETL) process in Caseflow is used to prepare transactional data for reporting. Currently we use an ETL process to populate the ODS data store. We Extract from the Caseflow transactional DB, Transform the data, and Load into the ODS schema. The Transform step allows for pre-calculating some algorithmically complex values and reducing the number of SQL JOINs required by denormalizing how the data is stored. The Transform step makes the Tableau reporting SQL used to retrieve those complex values easier to maintain and perform faster.
Some Transform examples can illustrate. These examples look at common reporting needs and show the transactional SQL vs the ODS SQL.
Note the ODS schema ods.
prefix in these examples is specific to Redshift. If you run the SQL in the Caseflow ODS PostgreSQL db, it uses the public.
(default) schema prefix.
Determine the Claimant DOB for an Appeal
Transactional:
select people.date_of_birth as claimant_dob
from people,claimants,appeals
where claimants.participant_id=people.participant_id
and claimants.decision_review_id=appeals.id
and claimants.decision_review_type='Appeal'
and appeals.id=123
ODS:
select claimant_dob
from ods.appeals
where ods.appeals.appeal_id=123
Determine the VACOLS user name for the person assigned a Task
Transactional:
select cached_user_attributes.sattyid
from cached_user_attributes, users, tasks
where cached_user_attributes.sdomainid=users.css_id
and users.id=tasks.assigned_to_id
and tasks.id=123
ODS:
select assigned_to_user_sattyid
from ods.tasks
where ods.tasks.task_id=123
Count the number of Appeals for each possible decision status
BVA has defined a series of status categories. We derive these from the tasks associated with an Appeal.
WITH undistributed_appeals AS (select '1. Not distributed'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('DistributionTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('on_hold', 'assigned', 'in_progress')
AND tasks.appeal_id NOT IN (SELECT appeal_id FROM tasks WHERE tasks.appeal_type='Appeal' AND tasks.type = 'TimedHoldTask' AND tasks.status IN ('on_hold', 'assigned', 'in_progress'))
-- The Appeal has been distributed to the judge, but not yet assigned to an attorney for working.
), distributed_to_judge AS (select '2. Distributed to judge'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('JudgeAssignTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal has been assigned to the Attorney, but is not yet being worked.
), assigned_to_attorney AS (select '3. Assigned to attorney'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('AttorneyTask', 'AttorneyRewriteTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned')
-- The Appeal has been sent to VLJ Support
), assigned_to_colocated AS (select '4. Assigned to colocated'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('AddressVerificationColocatedTask',
'AojColocatedTask',
'ArnesonColocatedTask',
'ExtensionColocatedTask',
'FoiaColocatedTask',
'HearingClarificationColocatedTask',
'IhpColocatedTask',
'MissingHearingTranscriptsColocatedTask',
'MissingRecordsColocatedTask',
'NewRepArgumentsColocatedTask',
'OtherColocatedTask',
'PendingScanningVbmsColocatedTask',
'PoaClarificationColocatedTask',
'PreRoutingColocatedTask',
'RetiredVljColocatedTask',
'ScheduleHearingColocatedTask',
'StayedAppealColocatedTask',
'TranslationColocatedTask',
'UnaccreditedRepColocatedTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal has been assigned to the Attorney and is being worked.
), decision_in_progress AS (select '5. Decision in progress'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('AttorneyTask', 'AttorneyRewriteTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('in_progress')
-- The Appeal decision has been written and is with the judge for sign off
), decision_ready_for_sign AS (select '6. Decision ready for signature'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('JudgeDecisionReviewTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal has been signed by the judge and is in Quality Review or BVA Dispatch
), decision_signed AS (select '7. Decision signed'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('BvaDispatchTask', 'QualityReviewTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
-- The Appeal dispatch is completed. Case is complete with no open tasks.
), decision_dispatched AS (select '8. Decision dispatched'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
JOIN public.appeals AS appeals ON tasks.appeal_id = appeals.id
WHERE tasks.type IN ('BvaDispatchTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('completed')
AND tasks.appeal_id NOT IN (SELECT appeal_id FROM tasks WHERE tasks.appeal_type='Appeal' AND tasks.status IN ('on_hold', 'assigned', 'in_progress'))
-- The Appeal is currently on a timed hold
), appeal_on_hold AS (select 'ON HOLD'::text as decision_status, count(1) as num
FROM tasks
WHERE tasks.type IN ('TimedHoldTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
-- An appeal can be considered cancelled if the RootTask is cancelled.
), appeal_cancelled AS (select 'CANCELLED'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('RootTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('cancelled')
-- Otherwise Unspecified Miscellaneous Status
-- The Appeal is currently returned from Quality Review (and with the Judge or Attorney) or returned from BVA Disptch (and with the Judge or Attorney)
-- Active task for MISC is one of: 'JudgeQualityReviewTask', 'JudgeDecisionReviewTask', 'AttorneyQualityReviewTask', 'AttorneyDecisionReviewTask'
), appeal_misc AS (select 'MISC'::text as decision_status, count(DISTINCT(appeal_id)) as num
FROM tasks
WHERE tasks.type IN ('JudgeQualityReviewTask', 'JudgeDispatchReturnTask', 'AttorneyQualityReviewTask', 'AttorneyDispatchReturnTask')
AND tasks.appeal_type='Appeal'
AND tasks.status IN ('assigned', 'in_progress')
)
---- Stitch together the temp tables into the format the Board would like
SELECT decision_status, num FROM undistributed_appeals
UNION ALL
SELECT decision_status, num FROM distributed_to_judge
UNION ALL
SELECT decision_status, num FROM assigned_to_attorney
UNION ALL
SELECT decision_status, num FROM decision_in_progress
UNION ALL
SELECT decision_status, num FROM assigned_to_colocated
UNION ALL
SELECT decision_status, num FROM decision_ready_for_sign
UNION ALL
SELECT decision_status, num FROM decision_signed
UNION ALL
SELECT decision_status, num FROM decision_dispatched
UNION ALL
SELECT decision_status, num FROM appeal_on_hold
UNION ALL
SELECT decision_status, num FROM appeal_misc
UNION ALL
SELECT decision_status, num FROM appeal_cancelled
ORDER BY decision_status
ODS:
SELECT status, count(*)
FROM ods.appeals
GROUP BY status, decision_status_sort_key
ORDER BY decision_status_sort_key
Determine the AOD (Advanced on Docket) status of an Appeal
Transactional:
SELECT appeals.*,
"AOD Details"."appeals.id" AS "aod_appeals.id",
"AOD Details"."granted" AS "aod_granted",
"AOD Details"."reason" AS "aod_reason",
"AOD Details"."veteran.age" AS "aod_veteran.age",
"AOD Details"."is_advanced_on_docket" AS "aod_is_advanced_on_docket"
FROM appeals
left join (WITH people_with_age
AS (SELECT *,
datediff ('years', current_date
, people.date_of_birth) AS
"veteran.age"
FROM people)
SELECT appeals.id AS "appeals.id",
advance_on_docket_motions.granted AS "granted",
advance_on_docket_motions.reason AS "reason",
"people_with_age"."veteran.age" AS "veteran.age",
Coalesce (CASE
WHEN "veteran.age" >= 75 THEN TRUE
WHEN advance_on_docket_motions.granted = 'true' THEN TRUE
ELSE false
END) AS
"is_advanced_on_docket"
FROM people_with_age
left join public.claimants AS claimants
ON people_with_age.participant_id =
claimants.participant_id
left join public.advance_on_docket_motions AS
advance_on_docket_motions
ON people_with_age.id =
advance_on_docket_motions.person_id
left join public.appeals AS appeals
ON claimants.decision_review_id = appeals.id
AND claimants.decision_review_type =
'Appeal')
"AOD Details"
ON ( "appeals"."id" = "AOD Details"."appeals.id" )
WHERE appeals.id = 123
ODS:
SELECT *,
COALESCE (CASE
WHEN aod_due_to_dob = 'true' THEN true
WHEN aod_granted = 'true' THEN true
ELSE false
END) AS is_advanced_on_docket
FROM ods.appeals
WHERE ods.appeals.appeal_id = 123
Scheduling
Redshift (prod environment) is updated once a day. All schema tables are dropped and re-imported. This allows for schema changes in the transactional and ODS data stores to be reflected in Redshift. The schedule is:
public
(transactional schema) 11pm ETods
12am ETvacols
1am ET
Redshift (prodTest environment) is updated weekly. This is used by the Reporting Team to test performance against real data.
Caseflow Replica (RDS) is updated in near real-time.
The Caseflow ODS (RDS) updates every 6 hours, at 20 minutes after the hour. The schedule is on UTC, so 00:20, 06:20, 12:20, 18:20 (1:20am, 7:20am, 1:20pm, 7:20pm EST).
Schemas
Naming conventions
All the Caseflow schemas (i.e. not including VACOLS) follow the Active Record naming conventions. For example, the column appeal_id
on any table should be interpreted as a foreign key referring to the appeals.id
column.
In addition, many tables follow the Polymorphic Association pattern which allows for a single table to contain foreign key references to different tables.
The polymorphic naming convention is to pair *_type
and *_id
suffixes on two columns that refer to the foreign model name and the foreign key value, respectively. Because the model name is read by the Active Record code, it is used to correspond with the table name.
For example, the end_product_establishments
table has source_id
and source_type
columns. The source_type
value might be (among others) "HigherLevelReview" or "SupplementalClaim". These model names correspond to the "higher_level_reviews" and "supplemental_claims" tables, respectively. So if you found a end_product_establishments
record with source_id
of "101" and a source_type
of "SupplementalClaim" you would expect to find an associated supplemental_claims
record with id
of "101".
Do not confuse the *_type
polymorphic pattern with the Single Table Inheritance (STI) pattern. They both use the word "type" in the column name, but STI lacks the corresponding *_id
column.