Data Pipeline - mlopatka/CANOSP2020 GitHub Wiki

Data comes from two main places:

CANOSP2020_ROOT_FOLDER/CANOSP_FIREFOX_SUPPORT_QUESTIONS_TRAINING_DATA/Tagged Tickets on Google Drive contains our tagged tickets. Update it whenever new tickets are manually tagged, then export it as a CSV file for processing.
CANOSP2020_ROOT_FOLDER/CANOSP_FIREFOX_SUPPORT_QUESTIONS_TRAINING_DATA/SUMO-data-dump-raw/tickets.json contains tickets pulled from SUMO (Mozilla Support).
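
As a rough illustration of reading the two sources locally, here is a minimal sketch. It assumes the Tagged Tickets export is a CSV with a header row and that tickets.json holds a JSON array of ticket objects; neither layout is confirmed on this page, and the function names are hypothetical, not part of the repository.

```python
import csv
import json
from pathlib import Path


def load_tagged_tickets(csv_path):
    """Read the CSV exported from Tagged Tickets into a list of dicts.

    Assumption: the first row of the export is a header row.
    """
    with Path(csv_path).open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def load_sumo_tickets(json_path):
    """Read the raw SUMO dump.

    Assumption: the file is a JSON array of ticket objects.
    """
    with Path(json_path).open(encoding="utf-8") as f:
        tickets = json.load(f)
    if not isinstance(tickets, list):
        raise ValueError("expected a JSON array of tickets")
    return tickets
```

Both functions return plain lists of dicts, so later steps can work on either source without depending on how the files are stored.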

fetch_ticket.py can either pull tickets from Mozilla Support using the Kitsune API, or combine our tagged tickets with the tickets pulled from SUMO.
ticket_to_csv.py converts a ticket JSON with this format into a CSV file with this format.
json_to_crowdtruth_csv.py converts a ticket JSON with this format into a CSV file with a format compatible with the CrowdTruth library.
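
The combine and convert steps described above can be sketched roughly as follows. This is not the code in fetch_ticket.py or ticket_to_csv.py: the field names `id`, `title`, and `tags` and both function names are hypothetical, chosen only to show the general shape of joining tagged rows onto SUMO tickets and writing the result as CSV.

```python
import csv
import io


def merge_tags(sumo_tickets, tagged_rows, key="id"):
    """Attach tags from tagged_rows to the SUMO tickets that share the same key.

    Assumption: both sources identify a ticket by the same `key` field.
    Tickets without a tagged counterpart are dropped.
    """
    # Compare keys as strings, since CSV values are always read as text.
    tags_by_id = {str(row[key]): row.get("tags", "") for row in tagged_rows}
    merged = []
    for ticket in sumo_tickets:
        ticket_id = str(ticket[key])
        if ticket_id in tags_by_id:
            merged.append({**ticket, "tags": tags_by_id[ticket_id]})
    return merged


def tickets_to_csv(tickets, fieldnames, out):
    """Write the chosen fields of each ticket to the file-like object `out`."""
    # extrasaction="ignore" skips ticket fields not listed in fieldnames.
    writer = csv.DictWriter(out, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for ticket in tickets:
        writer.writerow(ticket)


if __name__ == "__main__":
    sumo = [{"id": 1, "title": "Crash on start"}, {"id": 2, "title": "Sync"}]
    tagged = [{"id": "1", "tags": "crash"}]
    buffer = io.StringIO()
    tickets_to_csv(merge_tags(sumo, tagged), ["id", "title", "tags"], buffer)
    print(buffer.getvalue())
```

Keeping the merge and the CSV writing in separate functions means the same merged tickets could be written with a different column set, for example one suited to CrowdTruth, without repeating the join.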