Applying Data Compression To Historical Data and Running the Legacy iOS Data Recovery Script - onnela-lab/beiwe-backend GitHub Wiki

Applying Data Compression To Historical Data

FYI: I have not yet pushed all updates from staged-updates to the main branch. I had to go in and double-check details as I wrote this up. I think we are ready for the merge, it's just the end of a long day and I don't want to suddenly realize I was wrong. Release imminent, just not today.

There was an issue (guidance below, see details about KMS) that has delayed the release to main, but we are back on for an imminent release.

Executive Summary

  • The beiwe-backend now has one of our long-term wish-list features: Data Compression
  • You will notice it most on your S3 bill, but it also improves data download speeds and reduces total network traffic costs.
  • This work lays the foundation for further savings in the future via the use of binary data formats for the Beiwe Platform.
  • It also lets us ship an immediately-to-follow new data access API: download compressed data!

So what is this guide about then?

This feature comes with some unavoidable sysadmin tasks.

  • In the past Beiwe has not collected the data needed to report on details like stats about your raw data; we are introducing that here.
  • These details are generated at compression time, so the best way to fill them in is to compress your backlog of historical data.
  • We also need to deal with the conclusion to the iOS data corruption saga
    • fortunately we are now at least 1.3 years out from the source bug in the iOS app, and we have a data recovery script.
  • This is an opportunity to look into an optional cost optimization that should be determined on a per-deployment basis.

So, Before Starting

Building that big database table of files is somewhat storage-intensive - there's no way around this, you will have a lot of files.

  • Very roughly - I recommend doubling your current database storage if it is already half used.
  • Check the storage usage on your RDS server.
    • There are ample resources out there on how to view this data that are more elegant and up-to-date than I can keep up with here.
    • AWS' own documentation on how to upgrade storage for your RDS server can be found here.
  • You should also update the database software while you are in there.
    • Best practice is to do these as two separate operations, each of which should cause RDS to create automatic backup-restore points.
    • AWS' RDS has an "Automatic Minor Version Upgrade" option -- it appears to silently-just-not-work-at-all with Postgres.
    • Beiwe-backend does not make use of complex Postgres-specific features and should be compatible with all versions of Postgres.
  • Note that during these processes the database will probably go down for a bit, causing downtime on the site, temporarily blocking upload of data.
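If you prefer the command line to the console, here is a rough sketch of checking your RDS storage with the AWS CLI. Note the assumptions: `your-beiwe-db` is a placeholder for your actual DB instance identifier (the first command lists them), and the `date` invocation assumes GNU date (Linux).

```shell
# List RDS instances and their allocated storage (in GB):
aws rds describe-db-instances \
  --query "DBInstances[].{Id:DBInstanceIdentifier,AllocatedGB:AllocatedStorage}" \
  --output table

# Free storage is reported through CloudWatch (in bytes), not the RDS API.
# Replace "your-beiwe-db" with your instance identifier from the command above.
aws cloudwatch get-metric-statistics \
  --namespace AWS/RDS \
  --metric-name FreeStorageSpace \
  --dimensions Name=DBInstanceIdentifier,Value=your-beiwe-db \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%S)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
  --period 3600 \
  --statistics Average
```

Compare the CloudWatch free-storage number against the allocated storage to decide whether you need the doubling recommended above.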

Review and analyze your S3 encryption settings

  • The Beiwe deployment script enables an AWS S3 feature called KMS (Key Management Service) that enables an extra layer of encryption, but results in some cost overhead.
    • In this developer's professional opinion: KMS is a layer that protects only the raw encryption keys used inside AWS, by AWS.
      • KMS is 100% automated. If a Beiwe server is compromised KMS does not do anything to protect against data exfiltration. It is therefore of questionable utility.
    • KMS is enabled in an abundance of caution for common Health Data security compliance purposes.
    • This may or may not be relevant to your use case.
    • To be very very clear: Beiwe already encrypts all participant data before it ever touches S3.
  • Normally the cost of this feature is limited, because the KMS key is only accessed during read and write operations on S3.
  • Under normal use of the Beiwe Platform that only happens at upload and download time. The compression process, however, downloads and re-uploads all data.
    • Disabling KMS only takes effect on existing data when that data is replaced. Reads (downloads) don't replace it; compressing all historical data will.
  • On one of our platform instances with ~21TB of data (compressed to ~4.8TB of data!) across ~140 million objects, the KMS usage during recompression was ~$700 over a normal month's operating costs.
    • (This cost will be swiftly recouped by the reduced storage costs.)
  • If you want to disable the AWS KMS encryption on the bucket you can do so on the AWS Online Console by:
    • Go to S3 > click on your Beiwe bucket, it will start with the word Beiwe > the Properties tab > Default encryption, then click the edit button.
    • Set encryption mode to "Server-side encryption with Amazon S3 managed keys (SSE-S3)"
      • (Yes, that really is the text of the option that disables this S3-side encryption.)
    • Set "Bucket Key" to Disable.
      • (If you change the first option it will set this one to "Enable" even if it was previously disabled, so double check.)
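The console steps above can also be done with the AWS CLI; a sketch, where `beiwe-your-bucket-name` is a placeholder for your actual bucket name:

```shell
# Check the current default-encryption configuration first:
aws s3api get-bucket-encryption --bucket beiwe-your-bucket-name

# Switch default encryption to SSE-S3 (AES256) and disable the Bucket Key:
aws s3api put-bucket-encryption \
  --bucket beiwe-your-bucket-name \
  --server-side-encryption-configuration '{
    "Rules": [{
      "ApplyServerSideEncryptionByDefault": { "SSEAlgorithm": "AES256" },
      "BucketKeyEnabled": false
    }]
  }'
```

Run the `get-bucket-encryption` command again afterward to confirm both settings took effect, per the double-check warning above.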

As always, please watch the GitHub Issues page for announcement posts. I hope to have a feature that dynamically enables Amazon Glacier on up to roughly half of bulk Beiwe data later this year.

Almost Ready, just one more thing

You've got to be updated to the Python 3.12 version of the AWS Elastic Beanstalk environment of your Beiwe Platform instance.

  • We have a should-be-comprehensive guide here.
  • Do not run the database update mentioned above at the same time as upgrading the AWS python platform version.
  • You should update to the most recent version of Beiwe-backend after you have updated to the special python 3.8-3.12 transitional branch.
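One quick way to confirm which Elastic Beanstalk platform your environments are actually on is the AWS CLI; a sketch (the query fields are standard, but your environment names will obviously differ):

```shell
# List Elastic Beanstalk environments with their platform ARNs, which
# include the Python version - confirm yours says Python 3.12.
aws elasticbeanstalk describe-environments \
  --query "Environments[].{Name:EnvironmentName,Platform:PlatformArn,Status:Status}" \
  --output table
```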

Step 1

  • There has been some code churn since Onnela Lab debugged and ran this process. We cannot re-test it, but will continue to support running the recovery script for a reasonable amount of time.
  • You can go ahead and start running the compression script (Step 3) alongside the corrupted data recovery if you want, but the full process directions are detailed here first.

Run the data recovery script for corrupted iOS data.

  • If you want to know what is going on, there is an issue here.
    • This issue has been resolved for over a year.
    • You only need to run this script if you want to recover data from iOS app versions before the 2.5.X release, which was released in Early 2024.

This script will take quite a while to run, possibly days or weeks.

  • Make sure you have deployed a new Data Processing Manager server with the same version of the code as you have on your EB servers.

  • In the setup process for updating to Python 3.12 you may not have shut down your Python 3.8 manager/worker servers, so double-check to ensure you are using an appropriate server and don't have old servers still up.

  • To initiate the script start by SSHing onto your Data Processing Manager (or a worker) server.

    • You can find the IP address by running the -get-manager-ip or -get-worker-ips options of the launch script; make sure you are pointing at your Python 3.12 deployment.
    • The SSH key is the key in your server management credentials.
  • cd into the beiwe-backend folder.

  • Before running this script you must add the following line to the config/remote_db_env.py file: os.environ["ENABLE_IOS_FILE_RECOVERY"] = "true"

    • This setting allows the data processing server to access the PROBLEM_UPLOADS folder.
    • This setting exists because we have to block new corrupted files from being added by Elastic Beanstalk servers on behalf of participants still running ancient iOS app versions.
    • We, Onnela Lab, cannot permanently commit to supporting participants who just never install updates.
    • This setting is documented in config/settings.py.
  • run ./run_task.sh; you will be presented with output that looks like this (exact items listed may differ):

This task runner will dispatch a script safely, running on the local machine with output redirected to a log file.
Output from the script will immediately be followed, but you can exit the live follow at any time by pressing Ctrl+C.
This action WILL NOT STOP THE SCRIPT, it will just stop following the output.

Available Scripts:
1) script_that_recovers_some_ios_data.py                  4) script_that_compresses_s3_data.py
2) script_that_removes_data_from_invalid_time_sources.py  5) script_that_deletes_data_from_unknown_studies.py
3) script_that_deletes_problem_uploads.py
#? 
  • Run the script_that_recovers_some_ios_data.py script, for the above this means enter 1 (one) and then enter.
    • You will be asked to confirm this choice.
    • Scripts are run under a lower priority via the nice tool, so the normal duties of the data processing servers should continue to work.
    • (Inexplicably, sometimes the order of these items isn't alphabetical.)
  • The task runner will then.... do exactly what it says.
    • "follow" here is a term that means live-viewing the logging output as it is written to a file.
    • The exact operation occurring here is a tail -f some_file_name; you can use that to follow the log manually later.
    • Dismiss the live follow with ctrl-c.
  • ./run_task.sh wraps the operation in some safety nets, ensures any errors will be reported to Sentry, and directs output to the log file.
  • If you run the ll or ls commands here you will see all the files in the beiwe-backend directory, including some very obvious log files.
  • That's it, the script is now running, and it will run until it is finished, or it unexpectedly errors.
  • This script will write Quite A Large log file in real time.
  • Check back periodically to see if the script has finished.
  • When finished, it is time to run the task in step 2.
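For reference, the config edit earlier in this list can also be scripted rather than done in an editor. A minimal sketch, run from the beiwe-backend directory, assuming config/remote_db_env.py already imports os (it is an environment-setup file):

```shell
# Append the recovery flag to the remote env config.
echo 'os.environ["ENABLE_IOS_FILE_RECOVERY"] = "true"' >> config/remote_db_env.py

# Sanity-check that the line is present (should print 1 - if you see 2 or
# more, you appended it multiple times and should clean up the duplicates):
grep -c 'ENABLE_IOS_FILE_RECOVERY' config/remote_db_env.py
```

Remember to remove this line again once the recovery work is done, since it re-enables access to the PROBLEM_UPLOADS folder.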

Step 2 - deletion boogaloo

It is now time to delete the PROBLEM_UPLOADS folder on S3.

  • This is the folder we just recovered data from.
  • All recovered data has been pulled into normal locations, so the folder can be completely cleared out.
  • This is worthwhile because there may be many many gigabytes of data.
  • cd into beiwe-backend, run ./run_task.sh to run the script_that_deletes_problem_uploads.py script.
    • It deletes all the items in that folder, and clears out old database content related to those files.
    • It will take a while; each file deletion is an HTTP operation, and a thread pool runs 25 of these concurrently until the folder is empty.
  • When the script is finished you can move on to step 3.
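If you want independent confirmation that the folder is actually empty before moving on, a sketch using the AWS CLI — `beiwe-your-bucket-name` is a placeholder for your bucket name, and this assumes the PROBLEM_UPLOADS prefix sits at the bucket root:

```shell
# Recursively list the PROBLEM_UPLOADS prefix with a summary footer.
# After the deletion script finishes, "Total Objects:" should read 0.
aws s3 ls s3://beiwe-your-bucket-name/PROBLEM_UPLOADS/ --recursive --summarize
```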

Step 3 - Compression

  • If you want to you can start the historical data compression script alongside iOS data recovery.
    • You should only do this if the server you have allocated has extra cores available.
    • If you use a tool like htop you will see that tasks are run every 6 minutes, notably the processing of uploaded files.
    • As long as there is some downtime between these 6 minute intervals everything should work out fine.
    • Scripts are run under a lower priority via the nice tool, so the normal duties of the data processing servers should continue to work.
    • Assume a script requires 2 cpu cores to run at full speed at all times.
    • (This is a rough approximation. It is particularly wrong for T-series servers, which are not recommended for your data processing servers.)
  • These scripts can be run concurrently because the compression script does not operate on the files in the PROBLEM_UPLOADS folder.
  • If you want to speed up the processing time by using multiple instances of this script, then you Must Read the documentation in the script to do this, and modify the script each time before kicking off each custom instance.
  • All you have to do is run the script_that_compresses_s3_data.py using ./run_task.sh
  • This script will take An Absolute Age If Not Ten to run.
    • Onnela Lab is running this script across 650 million files and many Terabytes of data. It has been running for 3 weeks and is probably 1/4th to 1/3rd done.
    • I don't have exact numbers at time of writing; we had additional S3 Bucket Versioning issues that were fogging up the starting values. I think about 60-80TB.
    • We have a second instance that ran across about 130 million files on a slightly slow server; it is at roughly 3 weeks and nearly finished.
    • I have improved efficiency of the script over time, and I am occasionally kicking off extra parallel runs, as well as continually developing the platform while we do this work.
    • We also had some other custom configuration options like a logs folder created by enabling an automatic S3 logging feature, so this is not directly comparable.
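For a back-of-the-envelope sense of what the compression buys you, the ~21TB-to-~4.8TB figure quoted in the KMS section above works out like this (a throwaway shell calculation, nothing more):

```shell
# Rough storage-reduction math using the figures quoted earlier on this page:
# ~21 TB of raw data compressed down to ~4.8 TB.
original_tb=21
compressed_tb=4.8
awk -v o="$original_tb" -v c="$compressed_tb" \
  'BEGIN { printf "%.0f%% storage reduction\n", (1 - c / o) * 100 }'
# Prints: 77% storage reduction
```

Your ratio will vary with how compressible your particular data streams are, but text-heavy raw data like this tends to compress very well.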