Proposed Plan - monum/311-translation GitHub Wiki

Plan

The plan for the GSoC coding period is outlined below.

| Week | Dates | Tasks | Links |
| --- | --- | --- | --- |
| Week One | June 13th - June 19th | Set up the plan and the Google Colab. | Link |
| Week Two | June 20th - June 26th | Look at the data and figure out how many pairs would be needed. Decide which model to use. | |
| Week Three | June 27th - July 3rd | Clean the data: remove unnecessary tokens, tokenise everything, etc. | |
| Week Four | July 4th - July 10th | Train the models. | |
| Week Five | July 11th - July 17th | Depending on accuracy, fine-tune the model further (possibly with data augmentation). Submit results for human evaluation. | |
| Week Six | July 18th - July 24th | Compare the translations with those from available open-source tools. Compute BLEU scores. | |
| Week Seven | July 25th - July 31st | Start working on the other language pair. | |
| Week Eight | August 1st - August 7th | Clean the data, and review the previous models to see which one works best. | |
| Week Nine | August 8th - August 14th | Train the models and send the results for human evaluation. | |
| Week Ten | August 15th - August 22nd | Compute BLEU scores and document the code. | |
| Week Eleven | August 23rd - August 30th | If human evaluation is not up to the mark, add more data, either by creating more pairs or by data augmentation. | |
| Week Twelve | August 31st - September 6th | Test again and check whether the code can be made faster. | |
| Week Thirteen | September 7th - September 12th | Complete documentation and support for the project. | |
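
The cleaning step planned for weeks three and eight could look roughly like the sketch below. It is only an illustration: the plan does not say which tokens count as unnecessary, so treating HTML-like tags and URLs as noise is an assumption, and the function names `clean_and_tokenise` and `clean_pairs` are hypothetical. A subword tokeniser may well replace the simple regex split once a model is chosen.

```python
import re
import unicodedata
from typing import List, Tuple


def clean_and_tokenise(text: str) -> List[str]:
    """Normalise one sentence and split it into lowercase tokens."""
    # Normalise Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Assumption: HTML-like tags and URLs are the "unnecessary tokens".
    text = re.sub(r"<[^>]+>", " ", text)
    text = re.sub(r"https?://\S+", " ", text)
    # Split into runs of word characters and single punctuation marks.
    return re.findall(r"\w+|[^\w\s]", text.lower())


def clean_pairs(pairs: List[Tuple[str, str]]) -> List[Tuple[List[str], List[str]]]:
    """Tokenise (source, target) pairs, dropping pairs with an empty side."""
    cleaned = []
    for src, tgt in pairs:
        src_tokens = clean_and_tokenise(src)
        tgt_tokens = clean_and_tokenise(tgt)
        if src_tokens and tgt_tokens:
            cleaned.append((src_tokens, tgt_tokens))
    return cleaned
```

Dropping pairs where either side ends up empty keeps the source and target lists aligned, which matters when counting how many usable pairs the dataset contains.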

Deliverables

Deliverable 1: Model(s) trained on the custom dataset, with documented, modularised code. (After week 6)

Final Evaluation Objectives: A machine translation model trained on custom data with an accuracy comparable to, if not better than, general translation services. Well-documented, modular code that can be easily extended later. (End of the project)
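
The schedule measures accuracy with BLEU scores (weeks six and ten). As a rough illustration of what that metric computes, here is a minimal corpus-level BLEU sketch in plain Python. It assumes one reference per sentence, pre-tokenised input, and no smoothing; the function names are hypothetical, and the project would more likely rely on an established tool such as sacreBLEU so that scores are comparable with other systems.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Unsmoothed corpus BLEU (0-100) with a single reference per sentence."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            # Clipped matches: each n-gram counts at most as often as in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    # Geometric mean of the modified n-gram precisions.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty for output that is shorter than the reference.
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)
```

Running the same function over the trained model's output and over the output of a general translation service, against the same references, gives the two numbers the final objective asks to compare.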