how_do_i_shot_web
ML study plan suggestions
engineer vs researcher?
it seems that there is a difference between the engineering and research sides of the machine learning field: researchers are the ones at the bleeding edge, developing new algorithms and techniques, whereas engineers are expected to apply the researchers' discoveries to new problems. there's a joke [paraphrased from reddit] where the deep learning researcher tries to apply some complex neural network, the statistician tries to dream up some complex model, and the engineer just applies random forest because she knows it usually works - of course the neural network fails to converge properly, the model's assumptions turn out to be false, and the random forest proves to be the most effective. obviously there is overlap, but i feel being aware of this distinction (as well as however you want to delineate machine learning vs data science) is important in defining your goals and objectives.
so do i need to know all the math?
the more the better, of course, and i feel that a deeper understanding of the underlying algorithms helps develop a better intuition; my own gaps in knowledge still get in the way of my understanding. you don't have to take only my word for it; one of my idols Andrej Karpathy asserts that "Yes you should understand backprop" (https://medium.com/@karpathy/yes-you-should-understand-backprop-e2f06eab496b). on the other hand, i've seen people argue online, and coworkers in the field assert, that it's fine to jump in and fill in the gaps later, and i do feel there is a certain benefit to getting your hands dirty, especially if the alternative is paralysis that impedes your development. tackling real-world problems with real data both teaches you where your weaknesses are and gives you practical experience with the more technical side of things (cleaning data, using tools, etc.)
how do i build a portfolio?
my friend who is a professional developer at a large company told me that your github isn't your primary resource when job-hunting, and any code on your github should be up to coding standards and demonstrate modularity (i.e. applicability to real-world systems over cleverness in algorithms). But then again, he is entirely on the development side with no machine learning experience.
from personal experience, i know of at least one company that uses github and other personal projects as an initial estimator of skill, barring any publications or released commercial projects. i don't think that pushing well-documented repos to your profile that demonstrate your knowledge and skills can hurt. other than that, i'd just check articles by people who have interviewed with big companies and see what they recommend.
tools of the trade
primary python packages:
- python3 (i recommend miniconda)
- numpy
- scipy
- konlpy
- nltk
- gensim
- sklearn
- tensorflow
- keras
- keras-contrib
other python packages:
- imblearn
- mlxtend
- h5py (for saving models; see the quick example below)
- tqdm
- keras-tqdm
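for reference, this is roughly how the h5py dependency comes into play: keras writes saved models to HDF5 files under the hood. a minimal sketch (the layer sizes and the filename are just placeholders):

```python
from keras.models import Sequential, load_model
from keras.layers import Dense

# a tiny throwaway model just to have something to save
model = Sequential()
model.add(Dense(8, activation='relu', input_dim=4))
model.add(Dense(1, activation='sigmoid'))
model.compile(optimizer='adam', loss='binary_crossentropy')

model.save('my_model.h5')             # writes architecture + weights + optimizer state (needs h5py)
restored = load_model('my_model.h5')  # get the whole thing back later
```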
apps:
- jupyter notebook (comes with miniconda)
- pycharm, for version control (github) support
- git, for githubbing if you suck at pycharming like me
hardware:
a macbook, or a computer with linux (no windows, for the love of god). and i mean a real computer, not some piece of shit chromebook with 32GB of eMMC storage and 2GB of RAM. you need some decent firepower to crunch real datasets: 8GB of RAM or more (for caching datasets), a decent multi-core CPU (for computing the algorithms), and a decent hard drive (HDD or SSD, it just needs enough storage for datasets).
an nvidia GPU with 4GB+ of VRAM is optional; you only need it for efficient training of deep learning models.
head-first study plan
only slightly facetious, this is probably a decent way to get your feet wet and see if you enjoy ML:
- python basics: for loops, importing numpy, reading a csv file with pandas (see the sketch just below this list)
- (just watch) Andrew Ng's Coursera ML course
- sklearn web tutorials ( http://scikit-learn.org/stable/tutorial/index.html )
- sklearn and keras tutorials at https://machinelearningmastery.com/blog/
- enter some kaggle projects, or dream up your own pet project
- review/learn the rest as you go
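as a (totally hypothetical) example of the python basics step, something like this is all you need to get started; data.csv is a stand-in for whatever dataset you grab:

```python
import numpy as np
import pandas as pd

df = pd.read_csv('data.csv')        # read a csv file into a dataframe
print(df.head())                    # peek at the first few rows

values = df.iloc[:, 0].values       # first column as a numpy array
total = 0.0
for v in values:                    # a plain python for loop
    total += v
print(total, np.sum(values))        # same result, the numpy way
```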
full study plan
a long list of resources, presented in rough order of what i'd recommend studying
theoretical background
the three mathematical foundations of machine learning are probably statistics & probability (to understand the basic idea of statistical modeling), multivariate calculus (to understand the idea of optimization), and linear algebra (to understand the [computationally efficient] calculations behind the algorithms). While I was a strong math student in high school and college (enough that i was actually drafted into my high school's Math League), after however many years of neglecting that mental muscle, I was basically jumping in blind again. If I could rewind to 2014, I would spend the year before SNU going through the following material:
statistics & probability, information theory
Probability & Statistics on MIT OpenCourseWare:
https://www.youtube.com/playlist?list=PL1DmdxuyZtDS4o1ZabeVqpWvxmvyy7d8P
ThinkStats II
apparently a popular textbook. a python-based, practical approach to running statistical operations on data. while educational, the author annoyingly relies on custom classes and functions written for the book that are not very useful outside the context of the given exercise.
you... should... have this if you came to study group...
Information Theory:
brush up on stuff like Shannon entropy, expectation, etc.
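as a quick sanity check that the concepts stick, here's a tiny numpy sketch of expectation and Shannon entropy for a made-up discrete distribution:

```python
import numpy as np

p = np.array([0.5, 0.25, 0.125, 0.125])  # a discrete probability distribution (sums to 1)
x = np.array([1, 2, 3, 4])                # the outcomes it is defined over

expectation = np.sum(p * x)               # E[X] = sum_i p_i * x_i  -> 1.875
entropy = -np.sum(p * np.log2(p))         # H(X) = -sum_i p_i * log2(p_i) -> 1.75 bits

print(expectation, entropy)
```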
calculus
If you are really new to calculus and need a quick refresher as to WTF calculus is even about, and you don't have a time machine to travel back to Manlius, NY, 2002 and enroll in former-Professor Steadman's excellent AP class, check out the 3Blue1Brown series Essence of Calculus:
https://www.youtube.com/watch?v=WUvTyaaNkzM&list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr
Then check out this great Multivariable Calculus lecture series from Mizzou:
https://www.youtube.com/playlist?list=PL576C313B98C1419E
linear algebra
practice python and basic linear algebra with the Udacity Linear Algebra Refresher Course:
https://www.udacity.com/course/linear-algebra-refresher-course--ud953
Then keep reviewing more stuff, especially matrix operations (vs. vector operations), with some numpy exercises if possible.
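for example, a few lines of numpy that contrast vector and matrix operations (the numbers are arbitrary):

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])       # a vector
M = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 0.0]])     # a 2x3 matrix

print(np.dot(v, v))   # vector dot product (a scalar): 14.0
print(v * v)          # elementwise product -- NOT the same thing
print(M @ v)          # matrix-vector product, shape (2,): [7. 2.]
print(M @ M.T)        # matrix-matrix product, shape (2, 2)
print(M.T @ M)        # shape (3, 3): matrix multiplication is not commutative
```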
coding fundamentals
codecademy, udacity, dataquest, whatever. start with python; for ML, high-level languages like matlab, octave and R are nice to know, and for development, things like java and C++ may be handy (for deploying ML models).
handy (non-ML) tools:
- python: data structures etc.
- numpy: vector operations
- pandas: dataframe manipulation (see the short example after this list)
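to give a flavor of what "dataframe manipulation" means in practice, a toy example (all the column names are made up):

```python
import pandas as pd

# a toy dataframe; in real life this would come from pd.read_csv()
df = pd.DataFrame({'name':  ['ann', 'bob', 'cho', 'dev'],
                   'score': [0.9, 0.4, 0.7, 0.2],
                   'group': ['a', 'b', 'a', 'b']})

passed = df[df['score'] > 0.5]                      # boolean filtering
group_means = df.groupby('group')['score'].mean()   # aggregation by group
df['rank'] = df['score'].rank(ascending=False)      # add a derived column

print(passed)
print(group_means)
```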
machine learning fundamentals
Andrew Ng's Coursera Course
seems like everyone and his brother has done this, so you are just at a disadvantage if you don't do it.
https://www.youtube.com/playlist?list=PLgIPpm6tJZoRdxIz247lwQbZuQ5X22DCl
UC Irvine's ML and Data Mining class
more math-y than Andrew Ng's course, it presupposes familiarity with calculus and probability.
https://www.youtube.com/playlist?list=PLaXDtXvwY-oDvedS3f4HW0b4KxqpJ_imw
Caltech's ML course
another good intro class lecture series
https://www.youtube.com/playlist?list=PLD63A284B7615313A
Elements of Statistical Learning II
textbook that @midnightdream1 recommended. it approaches common ML algorithms from a highly statistical POV that may differ significantly from the intuition and/or notation used by Andrew Ng and others. it's no page-turner, but it provides a very detailed and, more usefully, alternative perspective on some algorithms.
you... should... have this if you came to study group...
mathematicalmonk's youtube playlist on ML
if his handle didn't give it away, this is waaay mathy, but it's pretty approachable if you don't mind watching each video a couple of times. i use this as a review/alternative resource and not a course to watch all the way through.
https://www.youtube.com/watch?v=yDLKJtOVx5c&list=PLD0F06AA0D2E8FFBA
coding machine learning models
sklearn, sklearn and more sklearn. imblearn and mlxtend are other free tools that work with sklearn to extend functionality.
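to make that concrete, here's a minimal sklearn sketch of the fit/predict pattern that pretty much everything (including imblearn's samplers and mlxtend's utilities) plugs into; the dataset and classifier here are just placeholders:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)        # every sklearn estimator exposes fit()
preds = clf.predict(X_test)      # ...and predict() / transform()
print(accuracy_score(y_test, preds))
```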
deep learning
MIT Deep Learning Book
a comprehensive resource on all things deep learning. no need to read cover-to-cover at first, but a good reference. also provided at study group.
key algorithm: the multi-layer perceptron (MLP)
first, make sure you understand linear regression and logistic regression. then study the MLP until it makes decent sense. this is the 'heart' of deep learning so it's important to understand. it is discussed in many of the ML resources above. check out our study group code for some more references, and also check out:
3Blue1Brown's intro to NN:
https://www.youtube.com/watch?v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi
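once linear and logistic regression make sense, an MLP is just a stack of those with nonlinearities in between. a minimal keras sketch on dummy data (the layer sizes are arbitrary); note that if you drop the hidden layers, this collapses back to logistic regression:

```python
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# dummy data: 1000 samples, 20 features, binary labels
X = np.random.rand(1000, 20)
y = (X.sum(axis=1) > 10).astype(int)

model = Sequential()
model.add(Dense(32, activation='relu', input_dim=20))  # hidden layer 1
model.add(Dense(16, activation='relu'))                # hidden layer 2
model.add(Dense(1, activation='sigmoid'))              # output layer
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

model.fit(X, y, epochs=5, batch_size=32, verbose=1)
```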
coding tools
- tensorflow with keras: raw TF is probably more powerful, but it requires explicit definition of a lot of things that keras can infer for you. keras is definitely important for prototyping.
- keras-contrib for extensions to keras
- gensim for training word embeddings (quick example below)
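a minimal gensim sketch for training word embeddings on a toy corpus (obviously you'd use real tokenized text; also note that newer gensim versions renamed the size argument to vector_size):

```python
from gensim.models import Word2Vec

# toy corpus: one tokenized sentence per list
sentences = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
             ['the', 'dog', 'sat', 'on', 'the', 'rug'],
             ['cats', 'and', 'dogs', 'are', 'animals']]

model = Word2Vec(sentences, size=50, window=3, min_count=1, workers=2)

vec = model.wv['cat']                        # 50-dimensional embedding for 'cat'
print(model.wv.most_similar('cat', topn=3))  # nearest neighbors in embedding space
```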
common layer types:
- Recurrent Networks/LSTM/GRU: for sequential data, including sequences over time. check out chris olah's blog (colah.github.io) for a basic intro.
- Convolutional Networks: commonly used for examining images, though with clever application they can also handle sequential data a la recurrent networks; essentially they detect local changes along a direction in space or time. check out the Computerphile videos on YT: "How Blurs & Filters Work", "Finding the Edges (Sobel Operator)", "Neural Network that Changes Everything", and "Inside a Neural Network".
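to make the "detecting changes along a direction" idea concrete, here's a tiny numpy/scipy sketch of a Sobel-style convolution on a toy image (a fixed, hand-written filter like the one in the edge-detection video, not a learned one; a conv layer learns its kernels instead):

```python
import numpy as np
from scipy.signal import convolve2d

# toy 'image': dark on the left, bright on the right (a vertical edge)
img = np.zeros((6, 6))
img[:, 3:] = 1.0

# Sobel kernel that responds to horizontal changes (i.e. vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

edges = convolve2d(img, sobel_x, mode='valid')
print(np.abs(edges))   # large magnitudes exactly where the image changes left-to-right
```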
practical exercises
machinelearningmastery.com has some cool tutorials for basic tasks in keras:
https://machinelearningmastery.com
...and beyond
reinforcement learning lectures
https://www.youtube.com/playlist?list=PL7-jPKtc4r78-wCZcQn5IqyuWhBZ8fOxT