Downloading and Preprocessing the Data - TarteelAI/tarteel-ml GitHub Wiki

Welcome to the Tarteel-ML wiki! This page describes how to get started with downloading, preprocessing, and training models on the Tarteel data. Requirements: Python 3 and the Python packages listed in the environment.yml file.

  1. Clone the Tarteel-ML repository by running git clone git@github.com:Tarteel-io/Tarteel-ML.git
  2. From the main folder, run python download.py -s 1 (here, and in what follows, you may need to replace python with python3).
  • The flag -s indicates which surah you'd like to download, with 1 corresponding to Surah Al-Fatihah. If no flag is provided, all surahs will be downloaded.
  3. Navigate into the audio_preprocessing directory and run python generate_features.py -f mfcc -s 1 --local_download_dir "../.audio" --output_dir "../.outputs" to generate the MFCC features.
  • The flag -f indicates which format the audio recordings should be preprocessed into. Valid options are mfcc and mel_filter_bank.
  • The flag -s indicates which surah you'd like to preprocess, with 1 corresponding to Surah Al-Fatihah. If no flag is provided, then all surahs will be preprocessed.
  • The option --local_download_dir should correspond to the local directory in which the audio files have been downloaded.
  • The option --output_dir should correspond to the local directory in which the preprocessed files should be saved.
  4. Run python generate_one_hot_encoding.py -i "../data/data-uthmani.json" -o "../data/one-hot.pkl" to generate the one-hot encodings.
  • The flag -i should be the path to the JSON file in which the text of the Quran is saved.
  • The flag -o should be the path to the pkl file in which the one-hot encodings should be saved.
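The steps above can be collected into a small helper that assembles the three commands for a chosen surah. This is only a sketch: the function name build_pipeline_commands and its parameters are hypothetical and not part of the repository, and it assumes that generate_one_hot_encoding.py is run from the audio_preprocessing directory, as the relative paths in step 4 suggest. The block only prints the commands; it does not run them.

```python
import shlex

# The two values documented for the -f flag.
VALID_FORMATS = ("mfcc", "mel_filter_bank")


def build_pipeline_commands(surah=None, feature_format="mfcc",
                            download_dir="../.audio",
                            output_dir="../.outputs",
                            python="python"):
    """Return (working_dir, argv) pairs for the download and preprocessing steps.

    Hypothetical helper: it only assembles the commands shown in the steps.
    surah=None omits -s, which the wiki says processes all surahs.
    """
    if feature_format not in VALID_FORMATS:
        raise ValueError(f"feature_format must be one of {VALID_FORMATS}")
    surah_args = [] if surah is None else ["-s", str(surah)]
    return [
        # Step 2: run from the main folder of the repository.
        (".", [python, "download.py", *surah_args]),
        # Step 3: run from audio_preprocessing.
        ("audio_preprocessing",
         [python, "generate_features.py", "-f", feature_format, *surah_args,
          "--local_download_dir", download_dir, "--output_dir", output_dir]),
        # Step 4: assumed to run from audio_preprocessing as well.
        ("audio_preprocessing",
         [python, "generate_one_hot_encoding.py",
          "-i", "../data/data-uthmani.json", "-o", "../data/one-hot.pkl"]),
    ]


# Print the commands for Surah Al-Fatihah without executing them.
for cwd, argv in build_pipeline_commands(surah=1):
    print(f"(in {cwd}) {shlex.join(argv)}")
```

To actually execute the steps, each pair could be passed to subprocess.run(argv, cwd=cwd, check=True) from the main folder of the repository. Pass python="python3" if that is how the interpreter is named on your system.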
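For readers unfamiliar with one-hot encoding, the following is a minimal, self-contained illustration of the idea behind step 4. It is not the code in generate_one_hot_encoding.py: the page does not say whether the script encodes characters, words, or something else, nor how data-uthmani.json is structured. The sketch assumes character-level encoding over a short hypothetical sample, and the function names are made up for illustration.

```python
import pickle


def build_vocabulary(verses):
    """Map each distinct character to an index (sorted by code point)."""
    chars = sorted(set("".join(verses)))
    return {ch: i for i, ch in enumerate(chars)}


def one_hot_encode(text, vocab):
    """Return one row per character; each row contains a single 1."""
    rows = []
    for ch in text:
        row = [0] * len(vocab)
        row[vocab[ch]] = 1
        rows.append(row)
    return rows


# Hypothetical sample text; the real input would come from the JSON file.
verses = ["بسم الله", "الحمد لله"]
vocab = build_vocabulary(verses)
encoded = [one_hot_encode(verse, vocab) for verse in verses]

# Serialize the way a .pkl file would be written (kept in memory here).
payload = pickle.dumps({"vocab": vocab, "encoded": encoded})
restored = pickle.loads(payload)
print(len(vocab), "distinct characters;", len(restored["encoded"]), "verses encoded")
```

Writing the result to disk would use pickle.dump with a file opened in "wb" mode instead of pickle.dumps.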