Data and Metric Request Workflow - TISTATechnologies/caseflow GitHub Wiki

This page provides a workflow on handling data and metric requests in order to ensure consistency, encourage knowledge sharing, and improve responsiveness.

1. Check for existing solutions

Check if the or a similar metric has been provided before in the Database Query Registry. If not, add a link to the request below for future reference. A solution for the request will be added to the Database Query Registry at the end of this workflow.

Data or metric requests:

2. Understand Caseflow's data model

If the request has been answered before, ensure Caseflow's data model has not changed since the last request. If a new query is needed, determine which Caseflow database tables contain the data needed to respond to the request.

Either way review the following linked pages and related PRs. If the information cannot be found or is outdated, investigate and update the linked pages for future reference.

When investigating an answer to the request, consider:

  • What aspects of the request can be answered using Caseflow's data and with what confidence?
  • What external databases (e.g., BGS, VBMS, VACOLS) need to be used? Should parts of the request be delegated to teams maintaining those external systems?

3. Query to get results

Depending on the request, it can be addressed using any combination of:

  • Rails query to get a quick short-term answer;
  • Metabase SQL for answering recurring questions and visualizing charts that can be run and viewed by non-engineers;
  • Jupyter Notebook for interactive visualizations and results that require data merging (e.g., Caseflow with VACOLS data) or more data transformation.

Note that any of these solutions can query the Caseflow ODS database (see Data Architecture), which has denormalizing data, to provide faster query results.

Details are provided in subsections below.

Recommendation: this process can be interrupted once a satisfactory solution is created:

  1. create Rails query that produces desired results
  2. validate results against previous reports, other data and metrics, and/or expert feedback
  3. port Rails query to SQL (using to_sql). If the Rails query cannot be ported to SQL, skip the next steps involving Metabase and jump to using Jupyter Notebook.
  4. run the SQL query in Metabase and create charts if needed
  5. validate the results with requester
  6. use SQL and Python libraries (such as SQLAlchemy and Pandas) in a Jupyter Notebook to create result, charts, and dashboards
  7. validate the results with requester

Other processes include:

  • Exporting Rails query results to a CSV file and processing the CSV file with a script (e.g., in Ruby or Python) to produce an answer.
  • Exporting a CSV file from Metabase results and loading it in a spreadsheet investigate and produce an answer.

A. Rails query

Determine the correct Rails query based on Caseflow's data model and codebase.

(More details to come)

B. Metabase SQL query and chart

Replicate Rails query in Metabase to enable non-engineers to assess the results and present basic visualizations. Can be skipped if ETL transformations or interactive dashboard needed.

(More details to come)

C. Jupyter Notebook and interactive dashboard

(More details to come)

4. Update Database Query Registry with the solution

Add a description and links to the solution(s) in the Database Query Registry.

Challenges

Keeping SQL and Dashboards up-to-date with Caseflow

Current problem: Tableau reports query Caseflow's database using hand-crafted SQL queries based on Caseflow's logic and interpretation of Caseflow's data model. Over time, Caseflow's logic changes to accommodate new features or Board-requested changes in Caseflow behavior. As a result, Tableau may be presenting outdated information to the Board.

General problem:

  • Rails queries run in the production Rails Console is always up-to-date with Caseflow code changes, however external tools generally don't support Rails (except for example DARU (Data Analysis in RUby)). To capture Caseflow's logic, an external tool would have to run against the Caseflow codebase or somehow import Caseflow's models.
  • Dashboard/BI/Data exploration tools (Tableau, Metabase, Jupyter Notebooks, etc.) support SQL, however the SQL queries could become outdated as Caseflow changes.
    • Side problem: The SQL for Tableau reports can only be captured using Tableau Desktop, so we'll have to ask the Reporting Team to get the SQLs or examine Redshift logs for incoming queries.

Current solution: The current approach is to run the Rails and SQL queries to ensure the results are the same. This can be automated.

  • An alternative solution for Python-based dashboards is (1) to have it call a Caseflow API that effectively runs the Rails query and returns the results, or (2) import results from a file exported by some recurring job that runs the Rails query.