data.table - rstats-gsoc/gsoc2018 GitHub Wiki

Background

The data.table package is an invaluable tool for data analysis. TODO more background info.

Related work

TODO What other R packages with similar functionality already exist? Why aren’t they good enough?

Details of your coding project

Find one or more students to fix/close some of the hundreds of outstanding issues. TODO add more details.

  • Details about which issues are a priority for a 3-month project.
  • Are there separate groups of issues that would be good to assign to separate students?
  • Maybe one student could work on code coverage and/or performance testing (Rperform).

Expected impact

TODO: describe how this project will help data.table and the larger R community.

Mentors/tests

  • Toby Dylan Hocking <[email protected]> is a user of data.table and can be a mentor, but there need to be two mentors, so any potential student would need to find a second co-mentor who is familiar with the data.table internals. Arun Srinivasan, co-author of data.table, does not have enough time to mentor for GSOC2018. To find a co-mentor, I would suggest finding an outstanding data.table issue and then submitting a Pull Request (PR) that fixes it, in order to prove that you have the skills needed to contribute to data.table. After your PR is merged, ask the person who merged it if they can mentor you in GSOC.

NOTE: Has Matt Dowle or anyone else with data.table commit access agreed to mentor this project? Without a commitment from a data.table committer to review PRs, provide feedback, and push changes, I don’t see how this project works. - BGP

Edit from Matt Dowle. I’ve previously replied to Toby by email:

GSOC 2018 sounds like a great idea. Happy to support it in any way I can but I don’t personally have the free time that it would need to be a mentor.

Toby’s reply :

About data.table in GSOC, it’s good to hear that you think it is a good idea in principle, even if you don’t have time to mentor yourself. As long as you find two mentors from somewhere in the R community, it should be fine.

Three ideas about ways you could help:

  1. label/categorize/prioritize the issues that you would like the summer student(s) to work on (and also list them in Details of the Coding Project on the GSOC project page).
  2. write some documentation about the organization of the internal structure of the data.table R/C code, so that it would be easier for random students to dive in, make modifications, and contribute fixes/improvements. (provide links to relevant docs in Background or Details section of the project page)
  3. write some tests for the GSOC project page. These are coding challenges that a student should be able to complete, in order to demonstrate their coding skills to you. The harder the tests you write, the easier it will be for you to choose the right student for the job. For examples see tests from other projects from previous years, e.g

Matt’s reply :

Those 3 bullets are actually themselves very time consuming. data.table has 559 open issues, 1536 closed.

If a student can look at what exists already, define their own tests, and really wants to be mentored, then I’ll consider it. There is a ton of material in the slides, videos and over 6,000 Stack Overflow questions and answers. If they can find the material for themselves and work out what they feel is most important – that is the test. My measure of a good student is if they can work it out for themselves and define their own project.

If any student shows the initiative to look at the project, search the project and discussions about it, and pick something off to work on that interests them, I’ll consider mentoring them. That’s the test.

It would be easier for me if a student approached and said something like :

  • “I want to do some serious C at low level”. I could then propose something there.
  • “I want to write a paper”. Then I could suggest writing about data.table code that hasn’t been written about before.
  • “I want to close 100 issues, one per day”. Then maybe I could pick the ones where that might be possible.
  • or anything else like that