Proposed Plan - monum/311-translation GitHub Wiki
Plan
The plan for the GSoC coding period is outlined below.
Week | Dates | Tasks | Links |
---|---|---|---|
Week One | June 13th - June 19th | Set up the plan and the Google Colab notebook. | Link |
Week Two | June 20th - June 26th | Inspect the data and estimate how many sentence pairs are needed. Decide which model to use. | |
Week Three | June 27th - July 3rd | Clean the data - remove unnecessary tokens, tokenise everything, etc. | |
Week Four | July 4th - July 10th | Train the models | |
Week Five | July 11th - July 17th | Depending on accuracy, fine-tune the model further (possibly with data augmentation). Submit results for human evaluation. | |
Week Six | July 18th - July 24th | Compare the translations with those of available open-source tools. Compute BLEU scores. | |
Week Seven | July 25th - July 31st | Start working on the other language pair. | |
Week Eight | August 1st - August 7th | Clean the data, and look at the previous models to see which one works best. | |
Week Nine | August 8th - August 14th | Train the models and send them for human evaluation. | |
Week Ten | August 15th - August 22nd | Compute BLEU scores and document the code. | |
Week Eleven | August 23rd - August 30th | If human evaluation is not up to the mark, add more data either by creating more pairs or by data augmentation. | |
Week Twelve | August 31st - September 6th | Test again and check whether the code can be made faster. | |
Week Thirteen | September 7th - September 12th | Complete documentation and support for the project. | |
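
The cleaning step in Weeks Three and Eight (remove unnecessary tokens, tokenise everything) could look roughly like the sketch below. It is only an illustration under assumptions: the function names, the regular expressions, and the choice to strip URLs and HTML-like tags are hypothetical, and the sample strings are made up rather than taken from the project's dataset.

```python
import re
import unicodedata

# Hypothetical patterns for "unnecessary tokens"; the real list depends on the data.
URL_RE = re.compile(r"https?://\S+")
TAG_RE = re.compile(r"<[^>]+>")
# Words, or single punctuation marks, as separate tokens.
TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)


def clean_text(text):
    """Normalise Unicode, drop URLs and HTML-like tags, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    return " ".join(text.split())


def tokenise(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def prepare_pairs(pairs):
    """Clean and tokenise (source, target) pairs, dropping pairs with an empty side."""
    cleaned = []
    for src, tgt in pairs:
        src_tokens = tokenise(clean_text(src))
        tgt_tokens = tokenise(clean_text(tgt))
        if src_tokens and tgt_tokens:
            cleaned.append((src_tokens, tgt_tokens))
    return cleaned


# Made-up example pairs, for illustration only.
sample = [
    ("Pothole on <b>Main St</b>! http://x.y", "Bache en Main St!"),
    ("   ", "vacío"),
]
prepared = prepare_pairs(sample)
```

A subword tokeniser would probably replace the regex tokeniser once a model is chosen in Week Two; the filtering logic would stay much the same.
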
Deliverables
Deliverable 1: Model(s) trained on the custom dataset. Documented, modularised code for the models. (After week 6)
Final Evaluation Objectives: A machine translation model trained on custom data, with accuracy comparable to, if not better than, general translation services. Well-documented and modular code that can be extended easily later. (End of the project)
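
To make "accuracy comparable to general translation services" measurable, the BLEU comparison planned for Weeks Six and Ten could be sketched as follows. This is a simplified, single-reference corpus BLEU written in plain Python so that it runs without dependencies; the function names are hypothetical, no smoothing is applied, and an established implementation would likely be preferable for reported scores.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus-level BLEU (0-100), one reference per hypothesis."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            ref_counts = ngrams(ref, n)
            # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    # Without smoothing, any n-gram order with zero matches gives a score of zero.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Brevity penalty applies when the system output is not longer than the reference.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity_penalty * math.exp(log_precision)


# Made-up example: one system output scored against one reference translation.
reference = ["the street light is out".split()]
system_output = ["the street light is broken".split()]
score = corpus_bleu(system_output, reference)
```

Scoring the project's model and an existing translation tool against the same held-out references with the same function would give a like-for-like comparison, alongside the human evaluation.
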