Machine Learning - Gastove/meanrecipes GitHub Wiki

Update, 6/9

Waugh. So I've been chewing on my desire to use Super Fancy Tools, and I think it is... the wrong one, to begin with. Part A of this is that I did some testing and discovered that web scrapes just aren't enough data to justify spinning up a Hadoop job. Part B is that the round-trip between our app and EC2 is just... well, it's going to be way the hell longer than either processing server-side (for tiny data sets) or just loading a pre-processed result out of the DB.

So.

It looks like the right approach is really this:

  1. Get a basic ML solution up off the ground, written in plain Scala and running locally.

  2. Use that to get the whole rest of this project moving forward, so Hadoop doesn't become a huge headache/heartache/roadblock.

  3. Figure out this overarching architectural question.
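Step 1 above could look something like this: a minimal, bag-of-words multinomial naive Bayes classifier in plain Scala, no Hadoop or Mahout anywhere. This is a sketch, not project code — every name here (`TinyClassifier`, `Model`, the label strings) is a hypothetical illustration:

```scala
// A minimal naive Bayes text classifier in pure Scala.
// Bag-of-words features, add-one smoothing, log-space scoring.
object TinyClassifier {
  type Label = String

  def tokenize(doc: String): Seq[String] =
    doc.toLowerCase.split("\\W+").filter(_.nonEmpty).toSeq

  // Per-label word counts, label frequencies, and the full vocabulary.
  case class Model(wordCounts: Map[Label, Map[String, Int]],
                   labelCounts: Map[Label, Int],
                   vocab: Set[String])

  def train(examples: Seq[(String, Label)]): Model = {
    val byLabel = examples.groupBy(_._2)
    val labelCounts = byLabel.map { case (l, xs) => l -> xs.size }
    val wordCounts = byLabel.map { case (l, xs) =>
      l -> xs.flatMap(e => tokenize(e._1))
             .groupBy(identity)
             .map { case (w, ws) => w -> ws.size }
    }
    val vocab = examples.flatMap(e => tokenize(e._1)).toSet
    Model(wordCounts, labelCounts, vocab)
  }

  // Pick the label maximizing log P(label) + sum of log P(word | label).
  def classify(m: Model, doc: String): Label = {
    val total = m.labelCounts.values.sum.toDouble
    m.labelCounts.keys.maxBy { label =>
      val counts = m.wordCounts(label)
      val denom  = counts.values.sum + m.vocab.size
      val prior  = math.log(m.labelCounts(label) / total)
      prior + tokenize(doc).map { w =>
        math.log((counts.getOrElse(w, 0) + 1).toDouble / denom)
      }.sum
    }
  }
}
```

Usage would be along the lines of training on labeled recipe text and classifying a new snippet:

```scala
val model = TinyClassifier.train(Seq(
  ("flour sugar butter vanilla",  "dessert"),
  ("chicken garlic onion pepper", "entree")))
TinyClassifier.classify(model, "sugar vanilla cake")
```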

Old

Well.

I've wanted to use some super fancy tools to do this, and it's quickly becoming apparent to me that the Super Fancy Tools might be a huge investment of time. This isn't to say I shouldn't use them, but maybe I should use them as part of Step 2? Hrm. I could build a very simple classifier in pure Scala, and get going with that.

Pro: I can probably do that today, and it can definitely be served from the web.

Con: Less useful for professional development; less satisfyingly "cool". Also, this is a one-shot proof-of-concept solution, not a particularly scalable one. (That is, I've got no idea how well a pure, native Scala classifier will perform running on the Real Live Internet. Also, a Mahout job can be extended from a linear classification into a vastly more sophisticated classification pretty trivially; this... uh, 100% can't.)
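One way to hedge that "100% can't be extended" con: keep the classifier behind a small trait from day one, so a Mahout-backed (or otherwise more sophisticated) implementation could be dropped in later without touching callers. A sketch, with entirely hypothetical names:

```scala
// The one thing callers depend on: give me a label for a document.
trait RecipeClassifier {
  def classify(doc: String): String
}

// Today: a trivial keyword-lookup stand-in for the proof of concept.
class KeywordClassifier(keywords: Map[String, String]) extends RecipeClassifier {
  def classify(doc: String): String =
    keywords.collectFirst { case (k, label) if doc.contains(k) => label }
            .getOrElse("unknown")
}

// Later, if the Super Fancy Tools pan out:
// class MahoutClassifier(...) extends RecipeClassifier { ... }
```

The design point is just that the one-shot solution and the scalable one can share an interface, so the proof of concept doesn't have to be thrown away wholesale.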