how_do_i_shot_web
ML study plan suggestions
engineer vs researcher?
it seems that there is a difference between the engineering and research sides of the machine learning field: researchers are the ones at the bleeding edge, developing new algorithms and techniques, whereas engineers are expected to apply the researchers' discoveries to new problems. there's a joke [paraphrased from reddit] where the deep learning researcher tries to apply some complex neural network, the statistician tries to dream up some complex model, and the engineer just applies random forest because she knows it usually works - of course the neural network fails to converge properly, the model's assumptions turn out to be false, and the random forest proves to be the most effective. obviously there is overlap, but i feel being aware of this distinction (as well as however you want to delineate machine learning vs data science) is important in defining your goals and objectives.
so do i need to know all the math?
the more the better, of course, and i feel that a deeper understanding of the underlying algorithms helps develop a better intuition; my own gaps in knowledge still get in the way of my understanding. you don't have to take only my word for it; one of my idols Andrej Karpathy asserts that "Yes you should understand backprop" (https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b). on the other hand, i've seen people argue online, and coworkers in the field assert, that it's fine to jump in and fill in the gaps later, and i do feel there is a certain benefit to getting your hands dirty, especially if the alternative is paralysis that impedes your development. tackling real-world problems with real data both teaches you where your weaknesses are and gives you practical experience with the more technical side of things (cleaning data, using tools, etc.)
how do i build a portfolio?
my friend who is a professional developer at a large company told me that your github isn't your primary resource when job-hunting, and any code on your github should be up to coding standards and demonstrate modularity (i.e. applicability to real-world systems over cleverness in algorithms). But then again, he is entirely on the development side with no machine learning experience.
from personal experience, i know of at least one company that uses github and other personal projects as an initial estimator of skill, barring any publications or released commercial projects. i don't think that pushing well-documented repos to your profile that demonstrate your knowledge and skills can hurt. other than that, i'd just check articles by people who have interviewed with big companies and see what they recommend.
tools of the trade
primary python packages:
- python3 (i recommend miniconda)
- numpy
- scipy
- konlpy
- nltk
- gensim
- sklearn
- tensorflow
- keras
- keras-contrib
other python packages:
- imblearn
- mlxtend
- h5py (for saving models; see the quick example below)
- tqdm
- keras-tqdm
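for reference, this is roughly how the h5py dependency comes into play: keras writes saved models to HDF5 files under the hood. a minimal sketch (the layer sizes and the filename are just placeholders):

```python
from keras.models import Sequential, load_model
from keras.layers import Dense

# a tiny throwaway model just to have something to save
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=4))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

model.save('my_model.h5')             # writes architecture + weights + optimizer state (needs h5py)
restored = load_model('my_model.h5')  # get the whole thing back later
```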
apps:
- jupyter notebook (comes with miniconda)
- pycharm, for version control (github) support
- git, for githubbing if you suck at pycharming like me
hardware:
a macbook, or a computer with linux (no windows, for the love of god). and i mean a real computer, not some piece of shit chromebook with 32GB of eMMC storage and 2GB of RAM. you need some decent firepower to crunch real datasets: 8GB of RAM or more (for caching datasets), a decent multi-core CPU (for computing the algorithms), and a decent hard drive (HDD or SSD, it just needs enough storage for datasets).
an nvidia GPU with 4GB+ of VRAM is optional; you only need it for efficient training of deep learning models.
head-first study plan
only slightly facetious, this is probably a decent way to get your feet wet and see if you enjoy ML:
- python basics: for loops, importing numpy, reading a csv file with pandas (see the sketch just below this list)
- (just watch) Andrew Ng's Coursera ML course
- sklearn web tutorials ( http://scikit-learn.org/stable/tutorial/index.html )
- sklearn and keras tutorials at https://machinelearningmastery.com/blog/
- enter some kaggle projects, or dream up your own pet project
- review/learn the rest as you go
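as a (totally hypothetical) example of the python basics step, something like this is all you need to get started; data.csv is a stand-in for whatever dataset you grab:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')        # read a csv file into a dataframe
print(df.head())                    # peek at the first few rows

values = df.iloc[:, 0].values       # first column as a numpy array
total = 0.0
for v in values:                    # a plain python for loop
    total += v
print(total, np.sum(values))        # same result, the numpy way
```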
full study plan
a long list of resources, presented in rough order of what i'd recommend studying
theoretical background
the three mathematical foundations of machine learning are probably statistics & probability (to understand the basic idea of statistical modeling), multivariate calculus (to understand the idea of optimization), and linear algebra (to understand the [computationally efficient] calculations behind the algorithms). While I was a strong math student in high school and college (enough that i was actually drafted into my high school's Math League), after however many years of neglecting that mental muscle, I was basically jumping in blind again. If I could rewind to 2014, I would spend the year before SNU going through the following material:
statistics & probability, information theory
Probability & Statistics on MIT OpenCourseWare:
https://www.youtube.com/playlist?list=PL1DmdxuyZtDS4o1ZabeVqpWvxmvyy7d8P
ThinkStats II
apparently a popular textbook. a python-based, practical approach to running statistical operations on data. while educational, the author annoyingly relies on custom classes and functions written for the book that are not very useful outside the context of the given exercise.
you... should... have this if you came to study group...
Information Theory:
brush up on stuff like Shannon entropy, expectation, etc.
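as a quick sanity check that the concepts stick, here's a tiny numpy sketch of expectation and Shannon entropy for a made-up discrete distribution:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # a discrete probability distribution (sums to 1)
x = np.array([1, 2, 3, 4])                # the outcomes it is defined over

expectation = np.sum(p * x)               # E[X] = sum_i p_i * x_i  -> 1.875
entropy = -np.sum(p * np.log2(p))         # H(X) = -sum_i p_i * log2(p_i) -> 1.75 bits

print(expectation, entropy)
```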
calculus
If you are really new to calculus and need a quick refresher as to WTF calculus is even about, and you don't have a time machine to travel back to Manlius, NY, 2002 and enroll in former-Professor Steadman's excellent AP class, check out the 3Blue1Brown series Essence of Calculus:
https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr
Then check out this great Multivariable Calculus lecture series from Mizzou:
https://www.youtube.com/playlist?list=PL576C313B98C1419E
linear algebra
practice python and basic linear algebra with the Udacity Linear Algebra Refresher Course:
https://www.udacity.com/course/linear-algebra-refresher-course--ud953
Then keep reviewing more stuff, especially matrix operations (vs. vector operations), with some numpy exercises if possible.
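for example, a few lines of numpy that contrast vector and matrix operations (the numbers are arbitrary):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])       # a vector
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])     # a 2x3 matrix

print(np.dot(v, v))   # vector dot product (a scalar): 14.0
print(v * v)          # elementwise product -- NOT the same thing
print(M @ v)          # matrix-vector product, shape (2,): [7. 2.]
print(M @ M.T)        # matrix-matrix product, shape (2, 2)
print(M.T @ M)        # shape (3, 3): matrix multiplication is not commutative
```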
coding fundamentals
codecademy, udacity, dataquest, whatever. start with python; for ML, high-level languages like matlab, octave and R are nice to know, and for development, things like java and C++ may be handy (for deploying ML models).
handy (non-ML) tools:
- python: data structures etc.
- numpy: vector operations
- pandas: dataframe manipulation (see the short example after this list)
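to give a flavor of what "dataframe manipulation" means in practice, a toy example (all the column names are made up):

```python
import pandas as pd

# a toy dataframe; in real life this would come from pd.read_csv()
df = pd.DataFrame({'name':  ['ann', 'bob', 'cho', 'dev'],
                   'score': [0.9, 0.4, 0.7, 0.2],
                   'group': ['a', 'b', 'a', 'b']})

passed = df[df['score'] > 0.5]                      # boolean filtering
group_means = df.groupby('group')['score'].mean()   # aggregation by group
df['rank'] = df['score'].rank(ascending=False)      # add a derived column

print(passed)
print(group_means)
```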
machine learning fundamentals
Andrew Ng's Coursera Course
seems like everyone and his brother has done this, so you are just at a disadvantage if you don't do it.
https://www.youtube.com/playlist?list=PLgIPpm6tJZoRdxIz247lwQbZuQ5X22DCl
UC Irvine's ML and Data Mining class
more math-y than Andrew Ng's course, it presupposes familiarity with calculus and probability.
https://www.youtube.com/playlist?list=PLaXDtXvwY-oDvedS3f4HW0b4KxqpJ_imw
Caltech's ML course
another good intro class lecture series
https://www.youtube.com/playlist?list=PLD63A284B7615313A
Elements of Statistical Learning II
textbook that @midnightdream1 recommended. it approaches common ML algorithms from a highly statistical POV that may differ significantly from the intuition and/or notation used by Andrew Ng and others. it's no page-turner, but it provides a very detailed and, more usefully, alternative perspective on some algorithms.
you... should... have this if you came to study group...
mathematicalmonk's youtube playlist on ML
if his handle didn't give it away, this is waaay mathy, but it's pretty approachable if you don't mind watching each video a couple of times. i use this as a review/alternative resource and not a course to watch all the way through.
https://www.youtube.com/watch?v=yDLKJtOVx5c&list=PLD0F06AA0D2E8FFBA
coding machine learning models
sklearn, sklearn and more sklearn. imblearn and mlxtend are other free tools that work with sklearn to extend functionality.
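to make that concrete, here's a minimal sklearn sketch of the fit/predict pattern that pretty much everything (including imblearn's samplers and mlxtend's utilities) plugs into; the dataset and classifier here are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)        # every sklearn estimator exposes fit()
preds = clf.predict(X_test)      # ...and predict() / transform()
print(accuracy_score(y_test, preds))
```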
deep learning
MIT Deep Learning Book
a comprehensive resource on all things deep learning. no need to read cover-to-cover at first, but a good reference. also provided at study group.
key algorithm: the multi-layer perceptron (MLP)
first, make sure you understand linear regression and logistic regression. then study the MLP until it makes decent sense. this is the 'heart' of deep learning so it's important to understand. it is discussed in many of the ML resources above. check out our study group code for some more references, and also check out:
3Blue1Brown's intro to NN:
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
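once linear and logistic regression make sense, an MLP is just a stack of those with nonlinearities in between. a minimal keras sketch on dummy data (the layer sizes are arbitrary); note that if you drop the hidden layers, this collapses back to logistic regression:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# dummy data: 1000 samples, 20 features, binary labels
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=20))  # hidden layer 1
model.add(Dense(16, activation='relu'))                # hidden layer 2
model.add(Dense(1, activation='sigmoid'))              # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, epochs=5, batch_size=32, verbose=1)
```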
coding tools
- tensorflow with keras: raw TF is probably more powerful, but it requires explicit definition of a lot of things that keras can infer for you. keras is definitely important for prototyping.
- keras-contrib for extensions to keras
- gensim for training word embeddings (quick example below)
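a minimal gensim sketch for training word embeddings on a toy corpus (obviously you'd use real tokenized text; also note that newer gensim versions renamed the size argument to vector_size):

```python
from gensim.models import Word2Vec

# toy corpus: one tokenized sentence per list
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'rug'],
             ['cats', 'and', 'dogs', 'are', 'animals']]

model = Word2Vec(sentences, size=50, window=3, min_count=1, workers=2)

vec = model.wv['cat']                        # 50-dimensional embedding for 'cat'
print(model.wv.most_similar('cat', topn=3))  # nearest neighbors in embedding space
```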
common layer types:
- Recurrent Networks/LSTM/GRU: for sequential data, including sequences over time. check out chris olah's blog (colah.github.io) for a basic intro.
- Convolutional Networks: commonly used for examining images, though with clever application they can also handle sequential data a la recurrent networks; essentially they detect local changes along a direction in space or time. check out the Computerphile videos on YT: "How Blurs & Filters Work", "Finding the Edges (Sobel Operator)", "Neural Network that Changes Everything", and "Inside a Neural Network".
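to make the "detecting changes along a direction" idea concrete, here's a tiny numpy/scipy sketch of a Sobel-style convolution on a toy image (a fixed, hand-written filter like the one in the edge-detection video, not a learned one; a conv layer learns its kernels instead):

```python
import numpy as np
from scipy.signal import convolve2d

# toy 'image': dark on the left, bright on the right (a vertical edge)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel kernel that responds to horizontal changes (i.e. vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = convolve2d(img, sobel_x, mode='valid')
print(np.abs(edges))   # large magnitudes exactly where the image changes left-to-right
```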
practical exercises
machinelearningmastery.com has some cool tutorials for basic tasks in keras:
https://machinelearningmastery.com
...and beyond
reinforcement learning lectures
https://www.youtube.com/playlist?list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT