Downloading and Preprocessing the Data
Welcome to the Tarteel-ML wiki! Here, we describe how to get started with downloading and preprocessing the Tarteel data and training models on it.
Requirements: `python3` and the Python packages listed in the `environment.yml` file.
- Clone the Tarteel-ML repository by running `git clone git@github.com:Tarteel-io/Tarteel-ML.git`
- From the main folder, run `python download.py -s 1` (here, and in what follows, you may need to replace `python` with `python3`).
  - The flag `-s` indicates which surah you'd like to download, with `1` corresponding to Surah Al-Fatihah. If no flag is provided, all surahs will be downloaded.
- Navigate into the `audio_preprocessing` directory and run `python generate_features.py -f mfcc -s 1 --local_download_dir "../.audio" --output_dir "../.outputs"` to generate the MFCCs (mel-frequency cepstral coefficients).
  - The flag `-f` indicates which format the audio recordings should be preprocessed into. Valid options are `mfcc` or `mel_filter_bank`.
  - The flag `-s` indicates which surah you'd like to preprocess, with `1` corresponding to Surah Al-Fatihah. If no flag is provided, all surahs will be preprocessed.
  - The option `--local_download_dir` should correspond to the local directory into which the audio files were downloaded.
  - The option `--output_dir` should correspond to the local directory in which the preprocessed files should be saved.
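Both `-f` options are built on the mel scale. Mel filter bank features are energies measured in triangular filters spaced evenly in mels, and MFCCs are conventionally obtained by applying a discrete cosine transform to the log of those energies. The sketch below is not taken from `generate_features.py`; it only illustrates how such filters are spaced. The function names, the 26 filters, and the 16 kHz sample rate are assumptions chosen for the example.

```python
import math
from typing import List


def hz_to_mel(hz: float) -> float:
    """Convert a frequency in Hz to mels (the common 2595 * log10 form)."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(num_filters: int = 26,
                           sample_rate: int = 16000,
                           low_hz: float = 0.0) -> List[float]:
    """Return center frequencies (Hz) of filters spaced evenly in mels.

    num_filters and sample_rate are illustrative defaults, not values
    read from the Tarteel-ML code.
    """
    high_hz = sample_rate / 2.0  # Nyquist frequency
    low_mel, high_mel = hz_to_mel(low_hz), hz_to_mel(high_hz)
    # num_filters + 2 evenly spaced points: the first and last points are
    # the outer edges of the lowest and highest triangular filters.
    step = (high_mel - low_mel) / (num_filters + 1)
    points = [low_mel + i * step for i in range(num_filters + 2)]
    return [mel_to_hz(m) for m in points[1:-1]]


if __name__ == "__main__":
    centers = mel_center_frequencies()
    print([round(c) for c in centers])
```

The printed centers are packed closely at low frequencies and spread out at high ones, which is the property that makes mel-based features a common choice for speech.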
- Run `python generate_one_hot_encoding.py -i "../data/data-uthmani.json" -o "../data/one-hot.pkl"`
  - The flag `-i` should be the local path of the `json` file in which the text of the Quran is saved.
  - The flag `-o` should be the local path at which the one-hot encodings should be saved as a `pkl` file.
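To make the one-hot step more concrete, here is a minimal sketch of character-level one-hot encoding followed by pickling. It is not the code in `generate_one_hot_encoding.py`: the layout of `data-uthmani.json` is not described on this page, so the sketch takes a plain list of strings instead. The function names and the keys of the pickled dictionary are hypothetical.

```python
import pickle
from typing import Dict, List


def build_vocabulary(texts: List[str]) -> Dict[str, int]:
    """Map every distinct character in texts to a fixed integer index."""
    characters = sorted({ch for text in texts for ch in text})
    return {ch: index for index, ch in enumerate(characters)}


def one_hot_encode(text: str, vocabulary: Dict[str, int]) -> List[List[int]]:
    """Encode text as one row per character, with a single 1 in each row."""
    rows = []
    for ch in text:
        row = [0] * len(vocabulary)
        row[vocabulary[ch]] = 1
        rows.append(row)
    return rows


def save_encodings(texts: List[str], output_path: str) -> None:
    """Pickle the vocabulary together with the encoded texts.

    The payload layout is an assumption made for this example only.
    """
    vocabulary = build_vocabulary(texts)
    payload = {
        "vocabulary": vocabulary,
        "encodings": [one_hot_encode(text, vocabulary) for text in texts],
    }
    with open(output_path, "wb") as handle:
        pickle.dump(payload, handle)


if __name__ == "__main__":
    # Placeholder strings; real input would come from the Quran JSON file.
    sample = ["abc", "cab"]
    vocab = build_vocabulary(sample)
    print(vocab)
    print(one_hot_encode(sample[1], vocab))
```

Storing the vocabulary alongside the encodings keeps the character-to-index mapping available when the pickle file is loaded later, so the rows can be decoded back into text.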