GSoC 2015 : Various enhancements to the model selection API of scikit learn
###SUB-ORGANIZATION: scikit-learn
###Detailed Project Information.
**NOTE** - I have added descriptive tags to all the links; reviewers may find it handy to hover over the links instead of actually visiting them.
####1. Make CV iterators data independent and provide a clean API w/ deprecations.
Status: Done and merged in #4294
Cross Validation is an important tool to avoid overfitting the model. scikit-learn has a nice set of tools to split the data into train and test sets based on various strategies. However, the CV iterator objects are currently data dependent, in the sense that they are initialized with data-dependent parameters like `y`, `labels` etc. This restricts usability, especially if one wishes to use the generator objects for multiple datasets.
The goal here is to make these generator objects data independent and provide a clean API like estimators enjoy currently.
I have already started work on the same, building upon Ignacio Rosi's work at #3340. Refer PR #4294.
This essentially attempts to separate the data-dependent parameters from `__init__` and to provide a clean API using the `split(X, y=None)` method. This needs to be done without breaking the existing functionality, to allow backward compatibility with previous version(s).
Also at the end, all the examples which directly use CV also need to be modified to conform to/showcase the new API.
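A minimal sketch of the idea, in plain Python: the constructor holds only data-independent settings, and the data arrives through `split(X, y=None)`. The class name `SimpleKFold` and its details are hypothetical and written for illustration; this is not the implementation in PR #4294.

```python
class SimpleKFold:
    """Hypothetical data-independent K-fold splitter (illustration only)."""

    def __init__(self, n_splits=3):
        # Only data-independent configuration lives in the constructor.
        if n_splits < 2:
            raise ValueError("n_splits must be at least 2")
        self.n_splits = n_splits

    def split(self, X, y=None):
        """Yield (train_indices, test_indices) pairs for the given data."""
        n_samples = len(X)
        if n_samples < self.n_splits:
            raise ValueError("cannot have more splits than samples")
        # Spread the remainder over the first folds so sizes differ by at most 1.
        fold_sizes = [n_samples // self.n_splits] * self.n_splits
        for i in range(n_samples % self.n_splits):
            fold_sizes[i] += 1
        start = 0
        for size in fold_sizes:
            test = list(range(start, start + size))
            train = list(range(0, start)) + list(range(start + size, n_samples))
            yield train, test
            start += size


# The same splitter object is reused on two datasets of different sizes.
cv = SimpleKFold(n_splits=3)
small = list("abcdef")      # 6 samples
large = list(range(10))     # 10 samples
splits_small = list(cv.split(small))
splits_large = list(cv.split(large))
```

Because nothing about the data is stored on the object, one configured splitter can be passed around and applied to any number of datasets.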
####2. Group together, clean up and organize model evaluation and optimization modules.
Status: Done and merged in #4294
This was suggested by Joel in the issue #1848.
Cross Validation and Model Selection related modules have grown over time via many great contributions. These need a clean up and assembly into a toplevel module `model_selection`, to group together related algorithms which will enhance their usability. For instance,
contains `RandomizedSearchCV`, which is not quite an appropriate structure, as pointed out in the above issue.
Goal: (Note that there were slight changes in the organization and naming)
The goal is to group together related algorithms / classes / functions into the below structure, which is taken from Joel's comment at issue #1848:
model_selection/
- KFold, train_test_split, check_cv
- cross_val_score, permutation_test_score, learning_curve, validation_curve
- GridSearchCV, RandomizedSearchCV
- make_scorer, get_scorer, check_scorer, etc.
- ParameterGrid (may be used by validation_curve), ParameterSampler
This should probably be trivial, and would involve moving all the classes/functions of
into the respective files as shown above, providing a clean module-level import path as discussed in #1848, and also supporting the current import structure for backward compatibility, while issuing deprecation warnings to those who attempt to use it.
I am planning to salvage PR #4254, for the same.
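The backward-compatibility part can be sketched as a thin wrapper that warns and then forwards to the relocated function. Everything named here (`deprecated_alias`, `new_function`, `old_function`, and the `old_module` / `new_module` paths) is hypothetical and only illustrates the pattern; it is not the mechanism used in PR #4254.

```python
import functools
import warnings


def deprecated_alias(func, old_path, new_path):
    """Return a wrapper that warns about the old import path, then calls func."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        warnings.warn(
            "%s is deprecated; use %s instead." % (old_path, new_path),
            DeprecationWarning,
            stacklevel=2,  # point the warning at the caller, not at this wrapper
        )
        return func(*args, **kwargs)

    return wrapper


def new_function(values):
    """Stand-in for a function that now lives at the new location."""
    return sum(values) / len(values)


# What the old module would keep exporting during the deprecation period.
old_function = deprecated_alias(
    new_function, "old_module.old_function", "new_module.new_function"
)
```

Existing code that calls `old_function` keeps working and sees a `DeprecationWarning`, while new code imports `new_function` directly.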
To keep the diff reviewable (as this was a problem previously), I would like to do this, if possible via 3 different PRs one each for
One important note for this deliverable is that, this needs to be done in the month of April itself as quite a few other GSoC projects/PRs involve touching
et al., and with this as WIP it would be really difficult to keep all of our work synchronized. (Since we will be splitting the code into multiple files, a clean git rebase is impossible, as git would be blind to our refactoring. Only a manual update would work, which may be error prone.)
Also, both of the above goals involve touching most of the code of grid search, cross validation et al., which should give me a good understanding of their design and working. This, I believe, will help me a little in adding the multiple metric support to grid search as cleanly as possible.
####3. Multiple metric support to enhance scikit-learn's CV objects.
Search algorithms that optimize and fine tune the model currently work only with a single metric. Including support for multiple metrics would be really useful as a diagnostic tool to provide more insight into the parameter exploration.
The end goal here is to provide a mechanism with a clean API to allow multiple metric support (with the ability to explore the model simultaneously w.r.t multiple metrics without having to manually repeat the search individually w.r.t every single metric)
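As a rough illustration of this goal, the toy search below computes predictions once per parameter setting and scores them against several metrics in the same pass. It does no fitting or cross validation, and the function names (`search_multi_metric`, `threshold_predict`) and the dict-of-lists result layout are assumptions made for this sketch, not the API proposed in PR #2759.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the targets."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (label 1) that were predicted as 1."""
    predicted_for_positives = [p for t, p in zip(y_true, y_pred) if t == 1]
    return sum(p == 1 for p in predicted_for_positives) / len(predicted_for_positives)


def search_multi_metric(param_grid, predict, X, y, scorers):
    """Score every parameter setting against every metric in one pass."""
    results = {"params": []}
    for name in scorers:
        results[name] = []
    for params in param_grid:
        y_pred = predict(X, **params)  # predictions are computed once per candidate
        results["params"].append(params)
        for name, scorer in scorers.items():
            results[name].append(scorer(y, y_pred))
    return results


def threshold_predict(X, threshold):
    """Toy 'model': predict 1 when the feature reaches the threshold."""
    return [1 if x >= threshold else 0 for x in X]


X = [0.1, 0.4, 0.6, 0.9]
y = [0, 1, 1, 1]
results = search_multi_metric(
    [{"threshold": 0.3}, {"threshold": 0.5}],
    threshold_predict,
    X,
    y,
    {"accuracy": accuracy, "recall": recall},
)
```

Here `results["accuracy"]` and `results["recall"]` each hold one score per candidate, so the two metrics can be compared side by side without repeating the search.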
This is a major work requiring a clean design and a lot of discussions regarding the API structure and backward compatibility as well. I would like to devote the entire month of May to do this as cleanly as possible.
Mathieu Blondel has done most of the work here at PR #2759. I would be building upon his PR and finishing up whatever needs to be done based on further discussions.
There are a few other things that need to be discussed and consolidated before coding them up. Thanks to Vlad for pointing this out; I should probably allocate enough time for API discussions.
On the issue of how the output should be handled, people had suggested using masked arrays, Pandas, a better dict of arrays and a dedicated class to handle the output. This needs to be discussed with the other developers and consolidated into a nice solution, now that we would be supporting multiple metrics.
Before such a discussion, I should also acquaint myself well with the current situation, the ideas proposed in #1020, where Andreas explores goals on how to better present the output of search algorithms, #1034 where he attempts to fix the same by introducing a new class to handle the output and #1842 where Joel explores another way to solve #1020 by introducing a method to index the parameter grid.
Probably related issues/ideas to look at are #2733 and #2079.
This would involve 1 week to discuss what needs to be done further upon Mathieu's existing work and scavenging the PR (#2759 and issue #1850) for comments and discussions to frame a clear TODO.
I believe this can be completed, with full time involvement, by May end.
####4. `sample_weight` support in grid_search et al.
Currently, custom scorers which take in `sample_weight` cannot be used effectively in grid search, which does not support delegating the `sample_weight` parameter. This could hamper usability and hence needs to be fixed.
Support delegating `sample_weight` (and related parameters(?)) to the scorers.
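What "delegating" means can be sketched as follows: the evaluation loop slices the weights with the same test indices it uses for the targets and hands them to the scorer. The helper `score_folds` and the scorer signature are assumptions for illustration only; they do not reflect the designs discussed in PR #1574 or PR #3524, and no model fitting happens here.

```python
def weighted_accuracy(y_true, y_pred, sample_weight=None):
    """Accuracy in which each sample counts proportionally to its weight."""
    if sample_weight is None:
        sample_weight = [1.0] * len(y_true)
    hit = sum(w for t, p, w in zip(y_true, y_pred, sample_weight) if t == p)
    return hit / sum(sample_weight)


def score_folds(predict, X, y, folds, scorer, sample_weight=None):
    """Score a fixed prediction rule on each test fold, forwarding the weights."""
    scores = []
    for _train, test in folds:  # training indices are unused in this toy example
        y_test = [y[i] for i in test]
        y_pred = predict([X[i] for i in test])
        # Slice the weights with the same indices as the test targets.
        w_test = None if sample_weight is None else [sample_weight[i] for i in test]
        scores.append(scorer(y_test, y_pred, sample_weight=w_test))
    return scores


def rule(rows):
    """Toy prediction rule standing in for a fitted estimator."""
    return [1 if r >= 1 else 0 for r in rows]


X = [0, 1, 2, 3]
y = [0, 0, 1, 1]
folds = [([2, 3], [0, 1]), ([0, 1], [2, 3])]
weights = [3.0, 1.0, 1.0, 1.0]

unweighted = score_folds(rule, X, y, folds, weighted_accuracy)
weighted = score_folds(rule, X, y, folds, weighted_accuracy, sample_weight=weights)
```

On the first fold the heavily weighted sample is classified correctly, so the weighted score is higher than the unweighted one; without delegation the scorer would never see that difference.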
Noel Dawe has attempted this and has also got it reviewed, with positive feedback from core devs, at PR #1574. Vlad, at PR #3524, has put forth ideas which provide mechanisms to allow multiple parameters to be neatly passed to the scorer.
To be frank I am a bit fuzzy with the various options here, as I have not looked into the same very well, but I believe the following ideas were attempted/suggested :
- Simply supporting
alone at the top level `GridSearchCV` itself.
- Adding
to allow multiple such estimator / scorer parameters, which add flexibility and also provide a way to support `sample_group`.
Hence I'll start working towards this deliverable by raising a new thread for the same and initiating discussion on the API front and later proceed with the evaluation and implementation.
Since this also involves API discussion, I would take around 3 weeks for the same.
Estimated completion date: June 20th
####5. Generalized grid search and early stopping
There is a detailed discussion here at #1626.
The basic idea behind generalized CV is that estimators should provide a nice functionality for tuning their multiple parameters. This should work seamlessly with grid search. Such a setup will help build the estimator-specific CV functionality inside the estimator itself and the more generalized stuff into our grid search module, which should ideally work together with the estimators' CV functionality.
This involves heavy discussions and exchange of multiple ideas, as this is a major API change which would involve touching several estimators and the grid search module as well. As Andreas and Olivier note, this is probably not easy and might not get fully implemented. But this should at least kindle enough discussion and an attempt at implementing the end goal. This should pave the way for a full-fledged implementation, perhaps in the near future.
Estimated completion date: July 31st
####6. Introducing additional CV strategies for non-trivial CV tasks
Recently (as on March 27th) `DisjointKFold` was proposed in PR #4444.
Olivier suggests including a similar one that is a blend of `ShuffleSplit` and `LeavePLabelOut`.
I would like to include more such CV strategies that can help in non trivial cv tasks for our advanced users.
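One possible reading of such a blend, sketched in plain Python: on each iteration a random subset of labels is held out, so that no label ever appears on both sides of a split. The function name `label_shuffle_split` and its parameters are hypothetical, and this is only a guess at the intended behaviour, not an existing scikit-learn API.

```python
import random


def label_shuffle_split(labels, n_iter=3, test_fraction=0.25, seed=0):
    """Yield (train, test) index lists; whole labels are held out at random."""
    unique = sorted(set(labels))
    n_test = max(1, int(round(test_fraction * len(unique))))
    if n_test >= len(unique):
        raise ValueError("test_fraction leaves no labels for training")
    rng = random.Random(seed)  # seeded so that the splits are reproducible
    for _ in range(n_iter):
        test_labels = set(rng.sample(unique, n_test))
        train = [i for i, lab in enumerate(labels) if lab not in test_labels]
        test = [i for i, lab in enumerate(labels) if lab in test_labels]
        yield train, test


labels = ["a", "a", "b", "b", "c", "c", "d", "d"]
splits = list(label_shuffle_split(labels, n_iter=5, test_fraction=0.5, seed=42))
```

Each split holds out two of the four labels, and samples sharing a label always stay together, which is the property that matters when samples within a label are not independent.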
Estimated completion date: August 10th
####7. Make an extensive Cross Validation tutorial.
Motivation: Cross Validation is an important tool for everyone and our users could benefit from an exhaustive CV tutorial.
Goal: To forge an exhaustive CV tutorial that could help users use scikit-learn's CV tools effectively.
Implementation: I need to have a thorough understanding of CV since my entire GSoC proposal revolves around the same. This deliverable, I hope, will help me understand more on cross validation along with its intricacies and all the trivial as well as non trivial CV usecases. Hence I will do this in parallel with the main work, investing 5 hours per week for the same.
The following sections as suggested by Olivier should be added along with others that I may frame based on discussions with my mentors on the go.
- Selection of the CV strategy. There is also a technical paper highlighting the same.
- On using stratified CV iterators for dichotomous classification tasks.
- How to check if the IID criterion is not broken.
These are just a few of the topics that could be included. I'll add more after an exhaustive survey of available texts on CV.
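For the stratification topic above, the tutorial could start from something as small as the sketch below, which deals the samples of each class round-robin into the folds so that every fold keeps roughly the overall class ratio. The function `stratified_folds` is hypothetical, does no shuffling, and is not how scikit-learn implements stratified splitting.

```python
from collections import defaultdict


def stratified_folds(y, n_splits):
    """Assign sample indices to folds while preserving class proportions."""
    by_class = defaultdict(list)
    for index, label in enumerate(y):
        by_class[label].append(index)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(by_class):
        # Deal the samples of this class out to the folds one at a time.
        for position, index in enumerate(by_class[label]):
            folds[position % n_splits].append(index)
    return [sorted(fold) for fold in folds]


# An imbalanced two-class target: four samples of class 0, two of class 1.
y = [0, 0, 0, 0, 1, 1]
folds = stratified_folds(y, 2)
```

With this target, each of the two folds receives two samples of class 0 and one of class 1, whereas a plain contiguous split would put both minority samples in the same fold.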
####8. Improve docstring, examples and contributor documentation.
- The current contributors guide could be more exhaustive and help new contributors who often get stuck with similar git or other minor issues related to conventions / code formatting etc. In general it should serve as a quick reference guide for most version control / convention / code formatting related issues. This should also be helpful in code reviews as core devs can quickly point to this instead of correcting minor mistakes like code formatting etc.
- The docstring is the first place people look when they get stuck with a particular module. It would definitely be helpful to them if we add a minimal example for each model as a quick reference.
- I've suggested at this wiki page the new structure of the contributors guide. The goal would be to fill up all the sections of the suggested tree of headings. Refer - #3912
- To add
section for as many models as possible. - Refer - #3846
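A minimal docstring example of the kind described above could look like the sketch below. The function is a stand-in written for illustration, not a scikit-learn function; the numpydoc-style `Examples` heading is an assumption about the layout, and the `>>>` lines can be checked with the standard `doctest` module.

```python
def mean_absolute_error(y_true, y_pred):
    """Return the mean absolute difference between targets and predictions.

    Examples
    --------
    >>> mean_absolute_error([2, 4], [3, 7])
    2.0
    >>> mean_absolute_error([0, 0], [0, 0])
    0.0
    """
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Run the examples embedded in the docstrings of this module.
    import doctest

    doctest.testmod()
```

Keeping the example this short means it doubles as a quick reference for the reader and as a regression test for the documented behaviour.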
I intend to work on this in parallel with the main work, since this will also help me understand more about contributor best practices as well as about different machine learning models.
For #3846, the 98 examples can be split up into one example per day, starting from May.
Put together, both the goals of this deliverable should take less than 1 hour per day.
####9. Participate in project wide bug fixes/code reviews along the way.
I am planning to tackle at least one bug fix each week, however minor it may be, unrelated to the scheduled work for that week, to help bring minor improvements across our code-base.
EMAIL: [email protected]
TELEPHONE: +91 9176370278
TIME ZONE: GMT + 1:00 [ Paris / France ]
IRC NICKNAME: rvraghav93
GITHUB HANDLE: rvraghav93
I am Raghav R V, a final year undergrad studying in SVCE under Anna University, India. I have taken up quite a few projects in Python over the past two years. I have also successfully completed my project in Google Summer of Code 2014 under Python Software Foundation / BinPy, where I implemented simulation of various digital components, ASCII based logic visualization tools, binary multiplication algorithms based on the bitstring library and a few selective analog components like the signal generator module, analog buffers etc.
While I would like to note that most of the work that I did in BinPy was nowhere near professional standards, I nevertheless ended up learning a lot of Python / git / boolean algorithms and got the chance to interact with an awesome open source community.
I started with machine learning around September of 2014 and have contributed to scikit-learn from Nov 2014 in the form of minor bug fixes / documentation improvements etc.
I have made bug fixes at quite a few places and hence am quite well versed with our API.
####My Contributions to scikit-learn
- #3891 - (merged) - Decision function for the
- #4295 - (merged) - Scaling scores to `(-0.5, 0.5)` instead of `[-0.4, 0.4]`
- #4023 - (merged) - Fix docstring/signature mismatch
- #4029 - (merged) - Add
class and fix all modules to raise uniform errors when not fitted
- #3907 - (unmerged) -
tests and assert_xmodel helpers
- #4261 - (merged) - Fix broken
deprecations and `svm` example.
- #4362 - (unmerged) - Make PyFuncDistance (cython) picklable.
- #4432 - (unmerged) - docstring --> comments for a cleaner output in nose verbose mode.
- #4226 - (merged) - Handle numerical instability in ElasticNetCV and LassoCV
- #4076 - (merged) - Add silhouette analysis plot for KMeans
- #4126 - (unmerged) -(WIP) - Multilabel Confusion Matrix
- #4115 - (WIP - more tests needed) - sample_weight support to MCC.
- #4294 - (WIP - 20%) - Make CV iterators data independent.
- #4228 - (WIP - 80%) - Allow nan for userdefined metric in dbscan et al
A few other minor bug/documentation fixes are not included.