Audio and Transcript Data - UTD-CRSS/exploreapollo-import-audio GitHub Wiki

Audio and transcript data can either be uploaded from a local machine or, if the data is already in S3, transferred from S3 to the database. Both scripts require that the config file first be filled out with the correct S3 and API server credentials.

Data requirements

Note that the scripts skip any audio file without a transcript file of the same name, and skip any transcript file without an audio file of the same name. The scripts require each filename to be in the format <mission>_<recorder>_<channel>_<start MET>, because this format supplies all the information needed in the database. For example, A11_HR1U_CH10_00000175000.txt and A11_HR1U_CH10_00000175000.wav are an acceptable pair of transcript and audio files. Examples of both file types can be found under the src/A11_HR1U_CH10_AIR2GND directory.
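The naming and pairing rules above can be sketched as follows. This is a minimal illustration, not the code the scripts actually use: the regular expression, parse_name, and matched_pairs are hypothetical names, and the sketch assumes the four fields contain no underscores and that the start MET is all digits.

```python
import re
from pathlib import Path

# Hypothetical pattern for <mission>_<recorder>_<channel>_<start MET>.
# Assumption: fields contain no underscores and the start MET is numeric.
NAME_RE = re.compile(
    r"^(?P<mission>[^_]+)_(?P<recorder>[^_]+)_(?P<channel>[^_]+)_(?P<met_start>\d+)$"
)


def parse_name(filename):
    """Return the four filename fields as a dict, or None if the name does not match."""
    match = NAME_RE.match(Path(filename).stem)
    if match is None:
        return None
    fields = match.groupdict()
    fields["met_start"] = int(fields["met_start"])  # drop the zero padding
    return fields


def matched_pairs(filenames):
    """Return the sorted base names that have both a .wav and a .txt file."""
    wav = {Path(f).stem for f in filenames if Path(f).suffix == ".wav"}
    txt = {Path(f).stem for f in filenames if Path(f).suffix == ".txt"}
    return sorted(wav & txt)


if __name__ == "__main__":
    names = [
        "A11_HR1U_CH10_00000175000.txt",
        "A11_HR1U_CH10_00000175000.wav",
        "A11_HR1U_CH10_00000180000.wav",  # no transcript, so it would be skipped
    ]
    print(parse_name(names[0]))
    print(matched_pairs(names))
```

The sketch pairs files by base name only and ignores which folder they are in, which is a simplification; the real scripts may handle subfolders differently.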

Running

When uploading from a local machine, from the directory src/ run

python AudioUpload.py <local base folder> <S3 base folder>

where <local base folder> is the base folder containing all the files to be uploaded, and <S3 base folder> is the folder under which the items will be stored in S3. For example, under src/ is a subfolder named A11_HR1U_CH10_AIR2GND. Running

python AudioUpload.py A11_HR1U_CH10_AIR2GND test

uploads all the .wav and .txt files under src/A11_HR1U_CH10_AIR2GND that meet the Data requirements above to S3 under the folder test. The file A11_HR1U_CH10_AIR2GND/A11_HR1U_CH10_00000175000.txt will be located at test/A11_HR1U_CH10_00000175000.txt. Note that files in subfolders are also copied, with their subfolder path preserved, so a file A11_HR1U_CH10_AIR2GND/subfolder/A11_HR1U_CH10_0000000000.txt would end up at test/subfolder/A11_HR1U_CH10_0000000000.txt in S3. After the files are uploaded to S3 from the local machine, they are uploaded to the database via the API entry point specified in config.py.
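The path mapping described above can be illustrated with a short sketch. The function s3_key is a hypothetical helper written for this example and is not taken from AudioUpload.py; it only shows how a path relative to the local base folder carries over to the S3 base folder.

```python
from pathlib import PurePosixPath


def s3_key(local_base, local_path, s3_base):
    """Map a file under local_base to its location under s3_base, keeping subfolders."""
    # Path of the file relative to the local base folder.
    relative = PurePosixPath(local_path).relative_to(local_base)
    # The same relative path, re-rooted under the S3 base folder.
    return str(PurePosixPath(s3_base) / relative)


if __name__ == "__main__":
    base = "A11_HR1U_CH10_AIR2GND"
    print(s3_key(base, base + "/A11_HR1U_CH10_00000175000.txt", "test"))
    print(s3_key(base, base + "/subfolder/A11_HR1U_CH10_0000000000.txt", "test"))
```

With the folder names from the example above, the two calls give test/A11_HR1U_CH10_00000175000.txt and test/subfolder/A11_HR1U_CH10_0000000000.txt.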

When moving files only from S3 to the database, from the directory src/ run

python TransferS3Data.py 

This transfers ALL the files in S3, which will take a long time! When only a subset of the data in S3 is needed, you can specify the channel ID and a range of METs to transfer:

python TransferS3Data.py <channel> <met start> <met end>

Note that the script decides whether to import a file by looking only at the MET start given in its filename. A file that begins before the given MET start but ends after it is therefore skipped, so some data at the beginning of the requested range may be missed. To avoid this, run the script with a lower MET start than needed.
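The effect of filtering on the filename's MET start can be seen in a small sketch. It is only an illustration under stated assumptions: select_files is a hypothetical function, the channel is assumed to be written as it appears in the filename (for example CH10), both range bounds are assumed to be inclusive, and the MET values other than 175000 are made up. TransferS3Data.py may differ on all of these points.

```python
def select_files(filenames, channel, met_start, met_end):
    """Keep files on the given channel whose filename MET start is in [met_start, met_end]."""
    selected = []
    for name in filenames:
        stem = name.rsplit(".", 1)[0]  # drop the extension
        _mission, _recorder, file_channel, file_met = stem.split("_")
        # Only the start MET from the filename is checked; the file's
        # duration is not known here, so overlap cannot be detected.
        if file_channel == channel and met_start <= int(file_met) <= met_end:
            selected.append(name)
    return selected


if __name__ == "__main__":
    files = [
        "A11_HR1U_CH10_00000175000.wav",
        "A11_HR1U_CH10_00000180000.wav",  # illustrative MET value
    ]
    # The first file starts before 176000, so it is skipped even if its
    # audio runs past that point.
    print(select_files(files, "CH10", 176000, 190000))
    # Lowering the MET start picks it up.
    print(select_files(files, "CH10", 170000, 190000))
```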
