ingesting trip files into the mysql database - accuscore/scoring_engine GitHub Wiki

Four phases:

  1. Parser takes a trip file and outputs a set of text files ready to load into the database.
  2. Loader takes a batch of sets of text files and bulk-loads them into the database.
  3. Postprocess calculates averages and redflags. The first level of summarization.
  4. Scorer reads the summary tables and produces scoring tables.

The Parser, Phase 1, handles one trip file at a time. We launch several of these, one per CPU, and they all work in parallel. There is no chance of deadlock: the only database access is an up-front check of the subscription, and otherwise the Parser only writes files to disk.
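A minimal sketch of that launch pattern, assuming a `parse_trip` stand-in for the real fleet_parser entry point (the name and signature are illustrative, not from the codebase):

```python
# Sketch only: run one Parser per CPU over the pending trip files.
import os
from multiprocessing import Pool


def parse_trip(path):
    # Placeholder for fleet_parser: after the up-front subscription
    # check, it only writes text files to disk -- no database writes.
    return path


def parse_batch(trip_files):
    # One worker per CPU; workers cannot deadlock because they never
    # touch the database after the subscription check.
    with Pool(processes=os.cpu_count()) as pool:
        return pool.map(parse_trip, trip_files)
```

Because the workers share nothing, the pool needs no locking beyond what `multiprocessing` provides.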

Phases 2 through 4 operate on batches of trips. A single cron job runs once a minute and processes the batch of trips that have arrived since the last run. There is no concurrency: just one process, which should always finish in under a minute. With an ETL process, it is better to process batches in a single thread than to process one item at a time from multiple threads. This approach also removes the menace of deadlock, which we have with the current setup.

The current code does Phase 1 and Phase 2 in the same Python file. Extracting the Parser requires two changes:

  • In fleet_parser.py, after creating the output files, do not bulk-insert them.

  • Avoid reserving an id in the fleet_trip table: use the filename as the id at this stage. Filenames are guaranteed to be unique, so if a directory with the same name already exists, stop processing; the trip is a duplicate.

This is the first task. I'm not sure how hard it will be to use filenames rather than ids.

When this is done, the cron job to run the other three modules will wake up to find some directories in the bulk_insert directory. Each directory has a set of files to be loaded into the database.

Phase 2, the Loader, is new code.

  1. Make a list of the directories with fleet_trip.txt files, indicating that they are complete. Since the Parser runs concurrently, more directories may appear while Loader is working on a batch. They will be handled in the next batch.

  2. Concatenate all the fleet_trip.txt files. The resulting file has one line per trip in this batch.

  3. Bulk-load the result into fleet_trip. fleet_trip needs a new column, filename. The table has an auto_increment id field, assigned by the database at this point; we want to read it back and update the files created by the Parser with the correct trip_id.

  4. Read the records just inserted into fleet_trip, to get the trip id for each filename.

  5. Edit the files in each directory to replace the filename with the trip_id.

  6. Concatenate all the remaining files, by type. All the fleet_acceleration records from all trips are put in a single file.

  7. Bulk load the 7 files containing data from all trips in this batch.

When usage is low, there will usually be at most one trip per minute, so it is as if we are processing each trip independently. The approach pays off when usage is high and the batches get large.

Phase 3, the Postprocess, is a legacy Python script that runs on a single trip_id at a time. We need to add a loop to handle all the trip_ids in the current batch. That is a small change. At some point, this work should also be batched properly.
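The loop is trivial; this sketch uses a hypothetical `postprocess_trip` standing in for the legacy script's entry point:

```python
# Sketch: wrap the legacy single-trip Postprocess in a batch loop.
def postprocess_batch(trip_ids, postprocess_trip):
    done = []
    for trip_id in trip_ids:
        # Legacy code handles exactly one trip_id per call.
        postprocess_trip(trip_id)
        done.append(trip_id)
    return done
```

True batching (set-based averages and redflags in SQL) would replace this loop later, as noted above.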

Phase 4 is a series of procedures that maintain the running_totals table and generate tables for reporting. No change here.