Chefs Analytics Investigation Redash - bcgov/common-service-showcase GitHub Wiki
CHEFS Analytics Investigation (Redash)
Redash is an open source analytics tool which allows users to view and analyze data. It has abilities beyond current Metabase usage for analysis of non-relational JSON data. We think this could be a good candidate for allowing CHEFS form owners to dig deep into their form submission data and visualize it.
CHEFS BI
At this time, CHEFS gives very limited amount of analytics into form submissions. As the administrative team we have Metabase directly connected to the database and can view metadata about submission numbers and other metrics.
For business users, we've heard a want for the ability to:
- Compute and display metrics based on fielded data in the submissions themselves
- Publicly display sub-sets of form submission data, or other aggregated metrics
A robust BI tool that allows users to self-serve create their own queries would facilitate this. The CHEFS API Key access could allow to ETL data in various ways into Business Intelligence tools.
The considerations we have taken into account for this investigation into Redash as a common hosted BI tool for CHEFS are:
- Being able to ask questions and display visualizations from CHEFS Submission data
- Fetching data ad hoc from the CHEFS API (though an alternate scheme where data is replicated could be investigated)
- Self-serve query/dashboard creation and maintenance for CHEFS business users (CSS team does not have to set up specific questions or dashboards)
- Self-serve onboarding if possible
A business area with form API access of course would have the ability to fetch data in whatever means into any variety of analytics tools, so this particular discussion involves the idea of a common BI service for CHEFS that teams without development resources could potentially leverage
Redash
Redash fits as a BI tool for this purpose as the form submissions outputted from the CHEFS API are non-relational (and in-fact non-relational even in the CHEFS database, stored as JSONB blocks), so a tool that can do operations on raw non-relational data is needed.
Local installation testing
For this investigation, Redash was built from source and run on a local computer.
The Redash repo has a docker-compose that will set up the server side architecture and then the web application. On your local system you will need
- Docker
- Strongly recommend doing this in WSL2 if you are on Windows
The Redash docker setup includes images for Postgres, Redis, and others.
Follow the steps in this Developer Guide to get Redash started on your local computer. At time of writing the version of Redash used was 10 and fetched from this tag: https://github.com/getredash/redash/tree/release/10.0.x
On first use you will set up a local administrative user and password of your choice.
Connecting to CHEFS API
This implementation scheme of a common BI service for CHEFS submissions will try out directly fetching via the CHEFS API. This seems to work well but if considered as a solution then we should think about ramifications (see "API Limits" below).
First an API key for the particular form being analyzed is required. For a form you are the owner of, create an API key.
Add a New Data Source in Redash and include the relevant bearer auth information. Use the JSON Data Source
Note that the actual URL is not supplied during data source creation, rather just the headers. URLs are defined in the Query creation later.
Data Sources in Redash are granted to groups of users. On creation, sources go into the default group. When leveraging this for CHEFS user integration, we would create groups for forms, and the appropriate Form API Key data source with the group.
Querying
The JSON Data Source documentation again has details on Query setup for API access.
And see the documentation on the many options for querying, remember the data being used here is JSON
Set up a query on the new data source of your creation, the /export endpoint in CHEFS is a good start, using your appropriate Form ID.
A large variety of visualizations or tabular data can be generated from the query results, refer to the Redash documentation for guides
Also see the 'Querying' documentation link above.
Dashboarding
Once queries and visualizations are created from the data source, these can be grouped together to display on a Dashboard.
Dashboards can then be published to be displayed to allowed users, or made public to share data to unauthenticated users via an accessible URL. This is a technique that could be used for business users of CHEFS to display sub-sets of secured data, or analytics for public consumption.
Users
Redash users are set up and granted access to data sources, queries, and dashboards as needed.
Local users are the default, with Redash-stored passwords, etc. But for our needs, and to integrate best with CHEFS access (so a form user would not need a separate login for the BI tool necessarily) it would be best to integrate with Keycloak.
Redash supports SAML login, and there is documentation of that integrating correctly with Keycloak, but possibly with some issues that are being looked at right now. See the following documentation:
And links to issues/discussions people are having:
Redash API Access
A potential way of leveraging user API keys for CHEFS integration would be to create a service account. Everything being done through the user interface (creating data source, queries, user mapping, etc.) can be done via API calls as well. This way CHEFS could have the ability to connect over to Redash and set up the data source with the form API key at the time the key is generated, automating the onboarding process.
Future Roadmap
With Redash hosted locally we are able to proof-of-concept the following requirements
- Connecting to a production CHEFS form using the API Key access in CHEFS
- Querying and displaying data derived from that JSON API response including the form submissions
- Restricting queries to specific user groups
- Making a publicly accessible dashboard
- Automating some of this creation process through the Redash API
If this were to be pursued we would want to investigate considerations in the sections below
Business Users
Look at whether business users would be willing and able to go through the query setup and visualization creation processes. If the only types of users who would use Business Intelligence tools are the types of users who would want to extract their own data and host their own tools then it would be a limited user base.
The most likely use cases in estimation are:
- Business users who are able to learn the Redash basics for querying JSON
- Data analysts with the experience, but do not have their own platform to access this data
Hosting
Consider the hosting model for this. A model could be to host this as a Common CHEFS BI Service, specific to CHEFS. The CHEFS team would be responsible for hosting and maintenance, and users would only be granted access specific to the data sources that their CHEFS forms.
If hosted in this model a separate OpenShift namespace is likely the way to go.
This common BI tool instance could theoretically be used to query other sources aside from CHEFS as well.
SAML Login
SAML login appears to be supported in Redash. All of the above could be done with local users, but clearly associating the CHEFS user with the Redash user with Keycloak (and thus allowing for the IDIR login easily as well) would be ideal.
Determine if SAML login can indeed work with the SSO instance we are using, if not is the solution worth going forward with?
Other Considerations
Architecture
What Redash setup would be desirable for hosting?
Redash maintains a Docker image
Redash uses Redis and Postgres, so consider what setup for those 2. The setup above for local builds it all up with Docker Compose, but on OCP hosted instance would likely want to use appropriate Redis and Postgres/Patroni deployments.
Caching
Since queries would be done against the CHEFS API ad hoc in this proposed solution, investigate how/whether Redash could handle caching results to speed up operations and take load off the CHEFS API if the CHEFS API requires this.
There's some documentation about using cached query results as a data source itself. Maybe a place to start
API limits
Again, as this solution would be fetching directly from the CHEFS API, could this prove to be an issue for the CHEFS API at allowable volume for the BI tool.
What considerations should be put in place to account for allowable volume if this were a self-serve service?