Proposed Plan - monum/311-translation GitHub Wiki

Plan

The plan for the GSoC coding period is outlined below.

Week	Dates	Tasks	Links
Week One	June 13th - June 19th	Set up the plan and the Google Colab.	Link
Week Two	June 20th - June 26th	Look at the data and figure out how many pairs would be needed. Figure out which model to use.
Week Three	June 27th - July 3rd	Clean the data - remove unnecessary tokens, tokenise everything, etc.
Week Four	July 4th - July 10th	Train the models
Week Five	July 11th - July 17th	Depending on accuracy, either fine-tune the model further (possibly with data augmentation) or submit results for human evaluation.
Week Six	July 18th - July 24th	Compare the translations with those of available open-source tools. Get BLEU scores.
Week Seven	July 25th - July 31st	Start working on the other language pair.
Week Eight	August 1st - August 7th	Clean the data, and look at the previous models to see which one works best.
Week Nine	August 8th - August 14th	Train the models and send them for human evaluation.
Week Ten	August 15th - August 22nd	Compute BLEU scores and document the code.
Week Eleven	August 23rd - August 30th	If human evaluation is not up to the mark, add more data either by creating more pairs or by data augmentation.
Week Twelve	August 31st - September 6th	Test again and check whether the code can be made faster.
Week Thirteen	September 7th - September 12th	Complete documentation and support for the project.
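
The BLEU scoring planned for Weeks Six and Ten could look roughly like the sketch below. It is a minimal, unsmoothed corpus-level BLEU with whitespace tokenisation and a single reference per sentence. The function names (`ngram_counts`, `corpus_bleu`) and the sample sentences are hypothetical and are not taken from the project. For reported results, an established library such as sacreBLEU would likely be preferable.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(references, hypotheses, max_n=4):
    """Unsmoothed corpus-level BLEU (0-100), one reference per hypothesis.

    Assumption: sentences are plain strings tokenised on whitespace.
    """
    matches = [0] * max_n  # clipped n-gram matches per order
    totals = [0] * max_n   # hypothesis n-grams per order
    ref_len = hyp_len = 0

    for ref, hyp in zip(references, hypotheses):
        ref_tok, hyp_tok = ref.split(), hyp.split()
        ref_len += len(ref_tok)
        hyp_len += len(hyp_tok)
        for n in range(1, max_n + 1):
            ref_counts = ngram_counts(ref_tok, n)
            hyp_counts = ngram_counts(hyp_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum(
                min(count, ref_counts[gram]) for gram, count in hyp_counts.items()
            )
            totals[n - 1] += sum(hyp_counts.values())

    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # The brevity penalty applies when the output is no longer than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


if __name__ == "__main__":
    # Hypothetical example sentences, for illustration only.
    refs = ["the pothole on main street was fixed"]
    hyps = ["the pothole on main street is fixed"]
    print(f"BLEU: {corpus_bleu(refs, hyps):.2f}")
```

Scoring the project's models and the open-source tools against the same held-out references with the same function would keep the comparison consistent.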

Deliverables

Deliverable 1: Model(s) trained on the custom dataset, with documented, modularised code for the models. (After week 6)

Final Evaluation Objectives: A machine translation model trained on custom data with accuracy comparable to, if not better than, general translation services. Well-documented, modular code that can be extended easily later. (End of the project)