Something Something Software Architecture - Gastove/meanrecipes GitHub Wiki
So This is Hard
(Updated 6/9) Here's the big damn question:
A user is at our site. They enter a search term. We don't want to make them wait! We can probably scrape and munge web data pretty quickly. We could also try to build some sort of sweet AJAX-y asynchronous results page that loads in more and more data as it comes back. But the question really is: just how much data do we want to return, and how fast do we want to return it?
I've been thinking of using Map/Reduce with Hadoop to do hells of machine learnings -- and I still want to do that, 'cause it's gonna be fucking sweet. But: you can't map or reduce data that's too small -- not meaningfully. A way better solution is to do one of two things: process the data server-side (in native Scala), then return it (good); or simply load an answer out of the DB (way better). Map/Reduce then becomes some kind of web scraper/crawler recipe-finder thingy that goes hunting for more recipes and builds a Sweet Database of Averages, while the web service does a quick "Do I have it? If yes, return it; if no, calculate it from a quick search, then return and store it."
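A minimal sketch of that check-then-compute flow, with an in-memory map standing in for the real DB -- every name here is hypothetical, not actual project code:

```scala
// Hypothetical sketch of the "Do I have it?" flow.
object MeanRecipeService {
  // Toy in-memory stand-in for the real DB of precomputed averages.
  private val store = scala.collection.mutable.Map[String, String]()

  // Placeholder for the real scrape-and-average pipeline.
  def computeMeanRecipe(term: String): String = s"mean recipe for $term"

  // Return the cached answer if we have one; otherwise compute, store, return.
  def lookup(term: String): String =
    store.getOrElseUpdate(term, computeMeanRecipe(term))
}
```

The nice thing about this shape is that the background Map/Reduce crawler and the quick on-demand path both just write into `store`, and the web service never cares which one produced the answer.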
I think that sounds better, but we should Talk About This.
Overall Data Flow
- User enters a search term -- let's say, "Roasted Chicken"
- A search is executed, and the top N URLs are returned.
- URLs are passed as a list into the server Map/Reduce system via API call.
- The server returns the Sum of All Recipes Divided By the Count of Recipes -- that is the Mean Recipe. Maybe also the Modal Recipe, and the Standard Deviation of Recipes. (I am joking, but only barely :-P)
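The flow above, sketched end to end with stub functions (the search and averaging steps are placeholders, not real code):

```scala
// Rough sketch of the overall data flow; `search` and `meanOf` are stubs.
object DataFlow {
  // Stub: the real version would hit a search API for the top N URLs.
  def search(term: String, n: Int): List[String] =
    (1 to n).map(i => s"http://example.com/$term/$i").toList

  // Stub: the server-side averaging step (native Scala or Map/Reduce).
  def meanOf(urls: List[String]): String =
    s"mean of ${urls.size} recipes"

  // Term in, top-N URLs found, Mean Recipe out.
  def run(term: String, n: Int = 10): String = meanOf(search(term, n))
}
```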
A little more detail and some of Ross' notes
Native Scala Architecture
The client layer can trivially do a DB lookup to check whether we've already computed a MeanRecipe for the current search term -- and if one is found, it can be returned. If no result is found, Scala can take over.
The process is basically identical to everything listed below, just with a lot less horsepower and a lot less fucking around with EC2 instances: either an incoming search term or an incoming URL list is received; each page is loaded and scraped; each page is classified, generating a Recipe object; each Recipe object gets combined into a Mean Recipe object.
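Those stages might compose like so -- every function here is a stand-in for code that doesn't exist yet:

```scala
// Sketch of the native-Scala path: scrape -> classify -> combine.
object NativePipeline {
  // Stub: load the page at `url` and return its HTML.
  def scrape(url: String): String = s"<html>recipe at $url</html>"

  // Stub: classify the page text, yielding a Recipe (here, just a label).
  def classify(page: String): String = s"recipe(${page.length} chars)"

  // Stub: fold the classified Recipes into one Mean Recipe.
  def combine(recipes: List[String]): String = s"mean of ${recipes.size}"

  // The whole path, one URL at a time, then one combining fold.
  def run(urls: List[String]): String =
    combine(urls.map(u => classify(scrape(u))))
}
```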
Maybe -- maybe -- on the backend somewhere there's some kind of Map/Reduce whatsit filling the DB with meticulously combed results. Maybe. But I don't think that needs to be part of the app, really -- it should just feed a datastore the app can access.
Map/Reduce Architecture (Not Deprecated per se, but No Longer Current)
The server receives an incoming list of URLs, parses them into arguments, and creates a MeanRecipeMap object that represents the Map job. The MeanRecipeMap is then packaged with a MeanReduce object (specifying the reduce job) and the whole thing is sent to an EC2 instance to run. The return pipeline from EC2 is pretty vague to me; most likely, a serialized result object gets written to an S3 bucket and a callback notifies the server of job completion. On job completion, the server loads the result out of S3, restores the object, and passes it back to the client layer.
MeanRecipeMap
The MeanRecipeMap object implements the Map class, probably via a few layers of abstraction. It will take as an argument an incoming list of URLs. In an improved version, a target term could be passed in order to better optimize results, but that's a v2 sort of thing. Once it's all set up and running, it spins up a Mapper.
The Mapper will then complete the following workflow:
- Create an Extractr object.
- Load each URL and pass it to the Extractr.
- Extractr returns a Recipe object.
- The Mapper serializes the Recipe to a temp file.
- (Optional) There's a lot of potential here for storing the result permanently and building our own DB of recipes, or improving/expanding the training set. This would be one of a few possible places to do so.
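That workflow might look something like the sketch below. `Extractr` is the object named above, but its internals and the temp-file format are invented here, and the actual Hadoop plumbing is omitted entirely:

```scala
import java.io.{File, PrintWriter}

// Minimal stand-in for what the Extractr returns.
case class MappedRecipe(source: String, rawText: String)

object Extractr {
  // Stub: the real version scrapes and classifies the page at `url`.
  def extract(url: String): MappedRecipe = MappedRecipe(url, s"text from $url")
}

object MapperSketch {
  // For each URL: extract a Recipe, serialize it to its own temp file.
  def mapUrls(urls: List[String]): List[File] = urls.map { url =>
    val recipe = Extractr.extract(url)
    val tmp = File.createTempFile("recipe-", ".txt")
    val out = new PrintWriter(tmp)
    try out.println(s"${recipe.source}\t${recipe.rawText}") finally out.close()
    tmp
  }
}
```

The (Optional) step above would slot in right after `Extractr.extract`: instead of only writing a temp file, also persist the Recipe somewhere permanent.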
MeanReducer
MeanReducer is the final piece: the Reducer loads every returned object from the temp files, then processes them to produce the MeanRecipe object. I don't... totally understand how this will work yet. There will probably be regexes. It's gonna be good.
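One guess at what the combining step could look like, assuming each loaded Recipe boils down to a per-ingredient quantity map (that assumption is mine, not settled):

```scala
// Hypothetical MeanReducer core: average each ingredient's quantity
// across all recipes (recipes missing an ingredient count as zero).
object MeanReducer {
  type Ingredients = Map[String, Double]

  def reduce(recipes: List[Ingredients]): Ingredients = {
    val n = recipes.size.toDouble
    recipes.flatMap(_.toList)       // all (ingredient, quantity) pairs
      .groupBy(_._1)                // bucket by ingredient name
      .map { case (name, pairs) => name -> pairs.map(_._2).sum / n }
  }
}
```

Dividing by the total recipe count (rather than by how many recipes mention the ingredient) is one design choice among several -- it naturally down-weights ingredients that only show up in one or two outlier recipes.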
Other Objects Mentioned
Recipe
So this is a single recipe and all the metadata we can extract. Fields:
- Source
- Title (if we have one for it)
- Raw Text
- Ingredients
  - Quantity
  - Name
  - Notes
- Directions
  - Order
  - Text
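Sketched as case classes -- the field names follow the list above, but the types are my guesses:

```scala
// Hypothetical shapes for the Recipe fields listed above.
case class Ingredient(quantity: String, name: String, notes: Option[String])
case class Direction(order: Int, text: String)
case class Recipe(
  source: String,          // where we scraped it from
  title: Option[String],   // "if we have one for it"
  rawText: String,
  ingredients: List[Ingredient],
  directions: List[Direction]
)
```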
MeanRecipe
This'n is way more complicated, and we should talk about this. I've got a couple ideas I find very promising, but... I would still much rather develop this bit together.