Data Challenge Problem

Contents

  1. Simple Solution
    1. Performance Considerations
    2. Possible Improvements
    3. Unit Tests
  2. Distributed Solution

Simple Solution

Usually when tackling a large data problem (or any programming problem, really) I write the easiest naïve solution first. The algorithm uses a sliding window of varying length that iterates through all the strings in the files looking for matches. The count of matches for each string is then added to a map with the string itself as the key.

The performance of this solution is bad. From the sheer combinatorics of how many possible strings a file can contain, the map of strings/counts is orders of magnitude larger than the file itself. Just to get past memory errors I made these assumptions (a sketch follows the list):

  • Strings no longer than 40 characters
  • Throw out non-printable characters
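
A minimal sketch of the sliding-window counting in Python, under the two assumptions above; the names `count_substrings` and `MAX_LEN` are chosen here for illustration and are not taken from the actual code:

```python
from collections import Counter

# Assumption from the list above: ignore strings longer than 40 characters.
MAX_LEN = 40


def count_substrings(path):
    """Count every printable substring of length 1..MAX_LEN in a file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Throw out non-printable characters before counting.
            text = "".join(ch for ch in line if ch.isprintable())
            n = len(text)
            # Sliding window of varying start position and length.
            for start in range(n):
                for length in range(1, min(MAX_LEN, n - start) + 1):
                    counts[text[start:start + length]] += 1
    return counts
```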

This algorithm takes 30 seconds to run per file on my hardware and uses about 0.4 GB of memory per file.

Possible Improvements

OR.... we can throw more hardware at it.

Distributed Solution

This problem is a variation of the classic Map-Reduce word count problem that every Hadoop beginner does.

We will use my simple solution as the example.

The Mappers will split the files/lines and run the algorithm to find the string counts for each chunk. The Reducers will combine all of the partial maps into a single map with 10,000 keys.
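
The wiki doesn't include the job itself, so here is a hedged Hadoop Streaming sketch in Python; the script names `mapper.py` and `reducer.py`, the 40-character limit, and the tab-separated key/value protocol are illustrative assumptions, not details taken from the actual solution.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): counts substrings in its input split and
# emits one "substring<TAB>count" pair per distinct substring.
import sys
from collections import Counter

MAX_LEN = 40  # same assumption as the simple solution

counts = Counter()
for line in sys.stdin:
    text = "".join(ch for ch in line if ch.isprintable())
    n = len(text)
    for start in range(n):
        for length in range(1, min(MAX_LEN, n - start) + 1):
            counts[text[start:start + length]] += 1

for substring, count in counts.items():
    print(f"{substring}\t{count}")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical): Hadoop Streaming sorts mapper output by key,
# so identical substrings arrive consecutively and can be summed in one pass.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

Whatever final selection produces the 10,000 keys mentioned above would happen in the reduce step or in a follow-up pass; that detail is left out of the sketch.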

On average my algorithm uses 0.4 GB of memory per file and takes 30 seconds. That means that to compute the answer over all 100k files in 30 seconds we would need 40k GB of RAM. (This is not taking into account the CPUs needed.)

If we ran the map-reduce job on m4.4xlarge instances (16 CPUs, 64 GB RAM) we would need 625 machines to run the query in 30 seconds. At $0.80 per instance-hour, this would cost about $4.17.
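
As a sanity check, the arithmetic behind those figures (this sketch ignores the fact that EC2 billing may round up to a minimum increment):

```python
# Back-of-the-envelope numbers from the paragraphs above.
files = 100_000
gb_per_file = 0.4
ram_needed_gb = files * gb_per_file              # 40,000 GB

ram_per_machine_gb = 64                          # m4.4xlarge
machines = ram_needed_gb / ram_per_machine_gb    # 625 machines

price_per_hour = 0.80                            # assumed on-demand rate
runtime_hours = 30 / 3600                        # a 30-second query
cost = machines * price_per_hour * runtime_hours
print(f"{machines:.0f} machines, ${cost:.2f}")   # 625 machines, $4.17
```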