Data Pipeline - mlopatka/CANOSP2020 GitHub Wiki


Data

Data comes from two main places:

  • CANOSP2020_ROOT_FOLDER/CANOSP_FIREFOX_SUPPORT_QUESTIONS_TRAINING_DATA/Tagged Tickets on Google Drive contains our tagged tickets. Update it whenever new tickets are manually tagged, then export it as a CSV file for processing.

  • CANOSP2020_ROOT_FOLDER/CANOSP_FIREFOX_SUPPORT_QUESTIONS_TRAINING_DATA/SUMO-data-dump-raw/tickets.json contains tickets pulled from SUMO.
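
As a rough illustration of how the two sources could be read once they are available locally, here is a minimal sketch using only the Python standard library. The column names (ticket_id, tag), the JSON keys (id, title), and the assumption that tickets.json holds a JSON array are all hypothetical; check the real exports before relying on them.

```python
import csv
import io
import json


def load_tagged_tickets(csv_file):
    """Read the exported tagged-ticket CSV into a list of dicts, one per row."""
    return list(csv.DictReader(csv_file))


def load_sumo_tickets(json_file):
    """Read the SUMO dump, assumed here to be a JSON array of ticket objects."""
    return json.load(json_file)


# In-memory stand-ins for the real files; the field names are hypothetical.
tagged_csv = io.StringIO("ticket_id,tag\n101,sync\n102,crash\n")
sumo_json = io.StringIO(
    '[{"id": 101, "title": "Sync fails"}, {"id": 103, "title": "Slow start"}]'
)

tagged = load_tagged_tickets(tagged_csv)
sumo = load_sumo_tickets(sumo_json)
print(len(tagged), "tagged rows;", len(sumo), "SUMO tickets")
```

With real data, the StringIO objects would be replaced by files opened with open(path, newline="", encoding="utf-8") for the CSV and open(path, encoding="utf-8") for the JSON.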


Processing

  • fetch_ticket.py can either pull tickets from Mozilla Support (SUMO) using the Kitsune API, or combine our tagged tickets with the tickets pulled from SUMO.

  • ticket_to_csv.py converts a ticket JSON with this format into a CSV file with this format.

  • json_to_crowdtruth_csv.py converts a ticket JSON with this format into a CSV file with a format compatible with the CrowdTruth library.
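
The combine and convert steps described above can be sketched as follows. This is not the code in fetch_ticket.py or ticket_to_csv.py; it is a minimal illustration under assumed field names (id, title, tag), and the helper names merge_tags and tickets_to_csv are hypothetical.

```python
import csv
import io


def merge_tags(tickets, tags_by_id):
    """Attach a tag to each ticket; tickets without a tag get an empty string."""
    return [{**t, "tag": tags_by_id.get(str(t["id"]), "")} for t in tickets]


def tickets_to_csv(tickets, fieldnames, out_file):
    """Write the chosen fields of each ticket as one CSV row, with a header."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    for ticket in tickets:
        # Missing fields become empty cells instead of raising an error.
        writer.writerow({name: ticket.get(name, "") for name in fieldnames})


# Hypothetical sample data standing in for the SUMO dump and the tagged sheet.
sumo_tickets = [
    {"id": 101, "title": "Sync fails"},
    {"id": 103, "title": "Slow start"},
]
tags_by_id = {"101": "sync", "102": "crash"}

merged = merge_tags(sumo_tickets, tags_by_id)
out = io.StringIO()
tickets_to_csv(merged, ["id", "title", "tag"], out)
print(out.getvalue())
```

Keying the tags by string ID sidesteps the mismatch between CSV values, which are always read as strings, and JSON IDs, which may be integers.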