# Backend - Metrics
## What is it?
This captures plugins' activity data, which includes Install Activity metrics and GitHub Activity metrics.
## Architecture
The architecture consists of three major parts:
- The external datastore that holds the complete dataset.
- The data fetch workflow that keeps the data fresh.
- The API that surfaces the data to the client.
## The external datastore
This is a Snowflake system managed by the data science team. The team refreshes the data by appending new data to the tables; this update happens every day at 5:00 AM.
### Install Activity
The data is fetched from pip's statistics. A view is created on top of the table to filter out installs performed by CI/CD pipelines, which gives a closer estimate of individual users installing the plugin for their own work.
### GitHub Activity
The data, which represents all commit activity on the users' GitHub repos, is fetched from GitHub's statistics. The GitHub data is stored in the `imaging.github.commits` table in Snowflake, which contains columns such as repo, author id, commit timestamp, commit message, repo URL, and ingestion timestamp.
## Data fetch
The data fetch workflow lives in the data-workflows codebase. It queries Snowflake, transforms the data, and writes it to the relevant DynamoDB tables, to which it has write access.
### CloudWatch event rule
An Amazon EventBridge (formerly CloudWatch Events) rule schedules the workflow as a cron job. The rule runs daily at 13:00 UTC, after the data science team's workflow has updated the tables. The rule publishes the following JSON message to the SQS queue:
{"type": "activity"}
### SQS Message
The message acts as the trigger for the Lambda. Because the message remains on the queue until it is successfully processed, it can be reprocessed if the Lambda fails.
### Parameter Store
It stores state that needs to be passed between successive runs of the Lambda for activity workflow processing. The activity process stores the timestamp up to which all activity ingested into Snowflake has been processed successfully, as the value of `last_activity_fetched_timestamp`.
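A minimal sketch of reading and writing that value with boto3, assuming the parameter named in the Backfill section (`napari-hub/data-workflows/config`) holds a JSON object with a `last_activity_fetched_timestamp` field; the exact layout may differ:

```python
import json

import boto3

ssm = boto3.client("ssm")
PARAM = "napari-hub/data-workflows/config"  # parameter name from the Backfill section


def get_last_activity_fetched_timestamp() -> int:
    """Read the watermark: everything up to this timestamp is already processed."""
    config = json.loads(ssm.get_parameter(Name=PARAM)["Parameter"]["Value"])
    return int(config.get("last_activity_fetched_timestamp", 0))


def set_last_activity_fetched_timestamp(end_time: int) -> None:
    """Advance the watermark after a successful run."""
    config = json.loads(ssm.get_parameter(Name=PARAM)["Parameter"]["Value"])
    config["last_activity_fetched_timestamp"] = end_time
    ssm.put_parameter(Name=PARAM, Value=json.dumps(config), Type="String", Overwrite=True)
```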
### Dynamo
#### Install Activity
The data is stored in the `install-activity` DynamoDB table, and here is the schema.
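As a rough sketch of the key layout implied by the API queries below (partition key `name`, sort key `type_timestamp`), expressed as a boto3 `create_table` call; attribute types and sort-key formats are assumptions:

```python
import boto3

dynamodb = boto3.client("dynamodb")

dynamodb.create_table(
    TableName="install-activity",
    KeySchema=[
        {"AttributeName": "name", "KeyType": "HASH"},             # plugin name
        {"AttributeName": "type_timestamp", "KeyType": "RANGE"},  # "TOTAL:", "MONTH:<ts>", "DAY:<ts>" (assumed)
    ],
    AttributeDefinitions=[
        {"AttributeName": "name", "AttributeType": "S"},
        {"AttributeName": "type_timestamp", "AttributeType": "S"},
    ],
    BillingMode="PAY_PER_REQUEST",
)
```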
#### GitHub Activity
The data is stored in the `github-activity` DynamoDB table, and here is the schema.
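The layout mirrors `install-activity`, except the sort key is `type_identifier` and embeds the repo name. A few hypothetical example items, with key formats taken from the f-strings in the GitHub Activity API section below:

```python
# Illustrative items only; attribute names beyond the keys are assumptions.
example_items = [
    {"name": "napari-demo", "type_identifier": "TOTAL:user/napari-demo", "commits": 123},
    {"name": "napari-demo", "type_identifier": "LATEST:user/napari-demo",
     "commits": 1, "timestamp": 1672531200000},
    {"name": "napari-demo", "type_identifier": "MONTH:2023-01:user/napari-demo",
     "commits": 7, "timestamp": 1672531200000},
]
```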
### Lambda
Fetching the start timestamp of the query window:
- The timestamp of the last run is fetched from the Parameter Store and used as the `start_time` of the window; the current time is used as the `end_time`.
Processing for Install Activity:
- The Snowflake view (`imaging.pypi.labeled_downloads`) is queried to get the plugin names with the earliest install activity added between the last run (`start_time`) and now (`end_time`).
- Starting from each plugin's earliest install activity, day-, month-, and total-level granularities of the data are computed for all the plugins returned by the previous query.
- The fetched records are transformed into DynamoDB records of the relevant types.
- The records are batch-written to the `install-activity` DynamoDB table, as sketched below.
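A minimal sketch of the transform-and-write step, assuming rows have already been aggregated per plugin and granularity; the input shape and key formats are illustrative, not the workflow's actual code:

```python
import boto3


def write_install_activity(rows):
    """rows: iterable of (plugin_name, granularity, period, installs) tuples,
    e.g. ("napari-demo", "MONTH", "2023-01", 42). Key formats are assumptions."""
    table = boto3.resource("dynamodb").Table("install-activity")
    # batch_writer groups puts into 25-item BatchWriteItem calls and retries
    # unprocessed items automatically.
    with table.batch_writer() as batch:
        for name, granularity, period, installs in rows:
            type_timestamp = "TOTAL:" if granularity == "TOTAL" else f"{granularity}:{period}"
            batch.put_item(Item={
                "name": name,
                "type_timestamp": type_timestamp,
                "installs": installs,
            })
```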
Processing for GitHub Activity:
- The Snowflake table (`imaging.github.commits`) is queried to get the plugin names with the earliest GitHub activity added between the last run (`start_time`) and now (`end_time`).
- Starting from each plugin's earliest commit activity, latest-, month-, and total-level granularities of the data are computed for all the plugins returned by the previous query.
- The fetched records are transformed into DynamoDB records of the relevant types.
- The records are batch-written to the `github-activity` DynamoDB table.
Storing the end timestamp of the query window:
- On successful completion of the workflow, the Parameter Store is updated with the `end_time` used in the workflow (see the handler sketch below).
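Putting the steps together, a hedged sketch of the handler's control flow, reusing the Parameter Store helpers sketched earlier; `update_install_activity` and `update_github_activity` are hypothetical stand-ins for the real data-workflows code:

```python
import json
import time


def handle(event, context):
    # Each SQS record carries the {"type": "activity"} trigger from EventBridge.
    for record in event["Records"]:
        if json.loads(record["body"]).get("type") != "activity":
            continue
        start_time = get_last_activity_fetched_timestamp()  # from Parameter Store
        end_time = int(time.time() * 1000)
        update_install_activity(start_time, end_time)  # Snowflake -> install-activity
        update_github_activity(start_time, end_time)   # Snowflake -> github-activity
        # Persist end_time only after both updates succeed, so a failed run
        # leaves the SQS message for reprocessing with the window intact.
        set_last_activity_fetched_timestamp(end_time)
```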
## Install Activity API
`total_installs`: GetItem for records from the `install-activity` table with key condition `name=:plugin_name AND type_timestamp='TOTAL:'` and projection `installs`.
`installs_in_last_30_days`: Query for records from the `install-activity` table with key condition expression `name=:plugin_name AND type_timestamp BETWEEN :start_date AND :end_date` and projection `installs`. The `start_date` and `end_date` are computed dynamically to reflect the last 30 days.
`timeline`: Query for records from the `install-activity` table with key condition expression `name=:plugin_name AND type_timestamp BETWEEN :start_month AND :end_month` and projections `installs` and `timestamp`. The `start_month` and `end_month` are computed dynamically to reflect the number of months over which the timeline data is needed.
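A sketch of these access patterns with boto3; note that `timestamp` is a DynamoDB reserved word, so the timeline projection needs an attribute-name alias. Sort-key value formats are assumptions:

```python
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("install-activity")


def total_installs(plugin_name: str) -> int:
    item = table.get_item(
        Key={"name": plugin_name, "type_timestamp": "TOTAL:"},
        ProjectionExpression="installs",
    ).get("Item", {})
    return int(item.get("installs", 0))


def timeline(plugin_name: str, start_month: str, end_month: str):
    # e.g. start_month="MONTH:2023-01", end_month="MONTH:2023-06" (format assumed)
    return table.query(
        KeyConditionExpression=Key("name").eq(plugin_name)
        & Key("type_timestamp").between(start_month, end_month),
        ProjectionExpression="installs, #ts",
        ExpressionAttributeNames={"#ts": "timestamp"},
    )["Items"]
```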
## GitHub Activity API
`total_commits`: GetItem for records from the `github-activity` table with key condition `name=:plugin_name AND type_identifier=f"TOTAL:{repo_name}"` and projection `commits`.
`latest_commit`: Query for records from the `github-activity` table with key condition expression `name=:plugin_name AND type_identifier=f"LATEST:{repo_name}"` and projection `commits`. This returns the single record that reflects the latest commit.
`timeline`: Query for records from the `github-activity` table with key condition expression `name=:plugin_name AND type_identifier BETWEEN f"MONTH:{start_month}:{repo_name}" AND f"MONTH:{end_month}:{repo_name}"` and projections `commits` and `timestamp`. The `start_month` and `end_month` are computed dynamically to reflect the number of months over which the timeline data is needed.
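The GitHub queries follow the same pattern against the `github-activity` table, with the repo name embedded in the sort key; a sketch, assuming the key formats quoted above:

```python
import boto3
from boto3.dynamodb.conditions import Key

gh_table = boto3.resource("dynamodb").Table("github-activity")


def commit_timeline(plugin_name: str, repo_name: str, start_month: str, end_month: str):
    # Sort-key bounds follow the f-string formats above, e.g. "MONTH:2023-01:user/repo".
    return gh_table.query(
        KeyConditionExpression=Key("name").eq(plugin_name)
        & Key("type_identifier").between(
            f"MONTH:{start_month}:{repo_name}", f"MONTH:{end_month}:{repo_name}"
        ),
        ProjectionExpression="commits, #ts",
        ExpressionAttributeNames={"#ts": "timestamp"},
    )["Items"]
```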
## Backfill
Step 1: Log in to the AWS console, search for **Parameter Store**, and click on it.
Step 2: Search for `napari-hub/data-workflows/config`.
Step 3: To backfill the entire dataset, set the `last_activity_fetched_timestamp` variable to 0 in the staging and prod environments to kickstart the workflow.
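Equivalently, the reset can be done programmatically; a sketch with boto3, assuming the parameter holds a JSON object as described in the Parameter Store section:

```python
import json

import boto3

ssm = boto3.client("ssm")
PARAM = "napari-hub/data-workflows/config"

# Reset the watermark so the next run re-fetches all activity from the beginning.
config = json.loads(ssm.get_parameter(Name=PARAM)["Parameter"]["Value"])
config["last_activity_fetched_timestamp"] = 0
ssm.put_parameter(Name=PARAM, Value=json.dumps(config), Type="String", Overwrite=True)
```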
## Troubleshooting
If any issue occurs while backfilling the data, look through the Lambda's CloudWatch logs to pinpoint the problem.