Data and Metric Request Workflow - TISTATechnologies/caseflow GitHub Wiki
This page describes a workflow for handling data and metric requests, in order to ensure consistency, encourage knowledge sharing, and improve responsiveness.
1. Check for existing solutions
Check if the same or a similar metric has been provided before in the Database Query Registry. If not, add a link to the request below for future reference. A solution for the request will be added to the Database Query Registry at the end of this workflow.
Data or metric requests:
- Jan 2020 and Sep 2020 - Caseflow OIT Metrics #12748
- Aug 7, 2020 - OIT Metrics Request: # of Vets served in FY 2020 #14916
- Nov 1, 2019 - Deliver Annual OIT Metrics Report #12555
- Nov 1, 2019 - Annual Metric: Caseflow Reader adoption #12556
- Nov 1, 2019 - Monthly Metric: Caseflow certifications percentage of total certifications #12554
- Oct 15, 2019 - Deliver Quarterly OIT Metrics Report #12389
- Sep 9, 2019 - Caseflow metrics instrumentation #11994
- Sep 17, 2019 - Monthly metrics report job #12150
- Oct 30, 2019 - Caseflow Corp/VBMS Data Integration | AMO Bi-Weekly Reporting #12510
- Jun 25, 2019 - Caseflow metrics for OIT #11198
2. Understand Caseflow's data model
If the request has been answered before, ensure Caseflow's data model has not changed since the last request. If a new query is needed, determine which Caseflow database tables contain the data needed to respond to the request.
Either way, review the following linked pages and related PRs. If the information cannot be found or is outdated, investigate and update the linked pages for future reference.
When investigating an answer to the request, consider:
- What aspects of the request can be answered using Caseflow's data and with what confidence?
- What external databases (e.g., BGS, VBMS, VACOLS) need to be used? Should parts of the request be delegated to teams maintaining those external systems?
3. Query to get results
Depending on the request, it can be addressed using any combination of:
- Rails query to get a quick short-term answer;
- Metabase SQL for answering recurring questions and visualizing charts that can be run and viewed by non-engineers;
- Jupyter Notebook for interactive visualizations and results that require data merging (e.g., Caseflow with VACOLS data) or more data transformation.
Note that any of these solutions can query the Caseflow ODS database (see Data Architecture), which contains denormalized data, to provide faster query results.
Details are provided in subsections below.
Recommendation: the following process can be stopped as soon as a satisfactory solution is created:
- create a Rails query that produces the desired results
- validate the results against previous reports, other data and metrics, and/or expert feedback
- port the Rails query to SQL (using `to_sql`); if the Rails query cannot be ported to SQL, skip the Metabase steps and jump to using a Jupyter Notebook
- run the SQL query in Metabase and create charts if needed
- validate the results with the requester
- use SQL and Python libraries (such as SQLAlchemy and Pandas) in a Jupyter Notebook to create results, charts, and dashboards
- validate the results with the requester
Other processes include:
- Exporting Rails query results to a CSV file and processing the CSV file with a script (e.g., in Ruby or Python) to produce an answer.
- Exporting a CSV file from Metabase results and loading it into a spreadsheet to investigate and produce an answer.
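The export-then-script path above can be sketched with Ruby's standard `csv` library. The file name and columns are hypothetical examples, not an actual Caseflow export format.

```ruby
require "csv"

# Simulate an exported file of Rails query results.
CSV.open("appeals_export.csv", "w") do |csv|
  csv << %w[appeal_id docket_type]
  csv << ["1", "hearing"]
  csv << ["2", "direct_review"]
  csv << ["3", "hearing"]
end

# Post-process the export to answer a hypothetical metric request:
# how many appeals per docket type?
counts = Hash.new(0)
CSV.foreach("appeals_export.csv", headers: true) do |row|
  counts[row["docket_type"]] += 1
end
puts counts
```

The same post-processing could be done in Python or a spreadsheet; a script has the advantage of being re-runnable when the next export arrives.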
A. Rails query
Determine the correct Rails query based on Caseflow's data model and codebase.
(More details to come)
B. Metabase SQL query and chart
Replicate the Rails query in Metabase so that non-engineers can assess the results and view basic visualizations. This step can be skipped if ETL transformations or an interactive dashboard are needed.
(More details to come)
C. Jupyter Notebook and interactive dashboard
(More details to come)
4. Update Database Query Registry with the solution
Add a description and links to the solution(s) in the Database Query Registry.
Challenges
Keeping SQL and Dashboards up-to-date with Caseflow
Current problem: Tableau reports query Caseflow's database using hand-crafted SQL queries based on Caseflow's logic and interpretation of Caseflow's data model. Over time, Caseflow's logic changes to accommodate new features or Board-requested changes in Caseflow behavior. As a result, Tableau may be presenting outdated information to the Board.
General problem:
- Rails queries run in the production Rails console are always up-to-date with Caseflow code changes; however, external tools generally don't support Rails (with exceptions such as DARU (Data Analysis in RUby)). To capture Caseflow's logic, an external tool would have to run against the Caseflow codebase or somehow import Caseflow's models.
- Dashboard/BI/Data exploration tools (Tableau, Metabase, Jupyter Notebooks, etc.) support SQL, however the SQL queries could become outdated as Caseflow changes.
- Side problem: The SQL for Tableau reports can only be captured using Tableau Desktop, so we'll have to ask the Reporting Team to provide the SQL queries, or examine Redshift logs for incoming queries.
Current solution: The current approach is to run the Rails and SQL queries to ensure the results are the same. This can be automated.
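One way to automate this check is to run both queries and compare the result rows. A minimal sketch, where the two lambdas are hypothetical stand-ins for executing the ActiveRecord relation and the hand-written SQL against the same database:

```ruby
# Hypothetical result sets: in practice, rails_query would call the
# ActiveRecord relation and sql_query would run the saved SQL (e.g., the
# Metabase or Tableau query) via a raw database connection.
rails_query = -> { [{ "docket_type" => "hearing", "count" => 2 }] }
sql_query   = -> { [{ "docket_type" => "hearing", "count" => 2 }] }

# Order-insensitive comparison of result rows, so the check does not
# depend on each query's ORDER BY.
def results_match?(rows_a, rows_b)
  rows_a.map { |r| r.to_a.sort }.sort == rows_b.map { |r| r.to_a.sort }.sort
end

if results_match?(rails_query.call, sql_query.call)
  puts "SQL is in sync with the Rails query"
else
  puts "SQL has drifted from Caseflow logic; update the saved query"
end
```

Run on a schedule (e.g., as a recurring job), this would flag drift between Caseflow's logic and the hand-crafted SQL before stale numbers reach the Board.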
- An alternative solution for Python-based dashboards is (1) to have the dashboard call a Caseflow API that effectively runs the Rails query and returns the results, or (2) to import results from a file exported by a recurring job that runs the Rails query.