Data Challenge Problem

Contents

  1. Simple Solution
    1. Performance Considerations
    2. Possible Improvements
    3. Unit Tests
  2. Distributed Solution

Simple Solution

Usually when tackling a large data problem (or any programming problem, really) I write the easiest naïve solution first. The algorithm uses a sliding window of varying length that iterates through all the strings in the files looking for matches. The count of matches for each string is then added to a map with the string itself as the key.

The performance of this solution is bad. From the sheer combinatorics of how many possible strings a file can contain, the map of strings/counts is orders of magnitude larger than the file itself. Just to get past memory errors I made these assumptions (a sketch follows the list):

  • Strings no longer than 40 characters
  • Throw out non-printable characters
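
A minimal sketch of the sliding-window counting in Python, under the two assumptions above; the names `count_substrings` and `MAX_LEN` are chosen here for illustration and are not taken from the actual code:

```python
from collections import Counter

# Assumption from the list above: ignore strings longer than 40 characters.
MAX_LEN = 40


def count_substrings(path):
    """Count every printable substring of length 1..MAX_LEN in a file."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            # Throw out non-printable characters before counting.
            text = "".join(ch for ch in line if ch.isprintable())
            n = len(text)
            # Sliding window of varying start position and length.
            for start in range(n):
                for length in range(1, min(MAX_LEN, n - start) + 1):
                    counts[text[start:start + length]] += 1
    return counts
```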

This algorithm takes 30 seconds to run per file on my hardware and uses about 0.4 GB of memory per file.

Possible Improvements

OR.... we can throw more hardware at it.

Distributed Solution

This problem is a variation of the classic Map-Reduce word count problem that every Hadoop beginner does.

We will use my simple solution as the example.

The Mappers will split the files/lines and run the algorithm to find the string counts for each chunk. The Reducers will combine all of the partial maps into a single map with 10,000 keys.
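
The wiki doesn't include the job itself, so here is a hedged Hadoop Streaming sketch in Python; the script names `mapper.py` and `reducer.py`, the 40-character limit, and the tab-separated key/value protocol are illustrative assumptions, not details taken from the actual solution.

```python
#!/usr/bin/env python3
# mapper.py (hypothetical): counts substrings in its input split and
# emits one "substring<TAB>count" pair per distinct substring.
import sys
from collections import Counter

MAX_LEN = 40  # same assumption as the simple solution

counts = Counter()
for line in sys.stdin:
    text = "".join(ch for ch in line if ch.isprintable())
    n = len(text)
    for start in range(n):
        for length in range(1, min(MAX_LEN, n - start) + 1):
            counts[text[start:start + length]] += 1

for substring, count in counts.items():
    print(f"{substring}\t{count}")
```

```python
#!/usr/bin/env python3
# reducer.py (hypothetical): Hadoop Streaming sorts mapper output by key,
# so identical substrings arrive consecutively and can be summed in one pass.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{total}")
        current_key, total = key, 0
    total += int(value)

if current_key is not None:
    print(f"{current_key}\t{total}")
```

Whatever final selection produces the 10,000 keys mentioned above would happen in the reduce step or in a follow-up pass; that detail is left out of the sketch.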

On average my algorithm uses 0.4 GB of memory per file and takes 30 seconds. That means that to compute the answer over all 100k files in 30 seconds we would need 40k GB of RAM. (This is not taking into account the CPUs needed.)

If we ran the map-reduce job on m4.4xlarge instances (16 CPUs, 64 GB RAM) we would need 625 machines to run the query in 30 seconds. At $0.80 per instance-hour, this would cost about $4.17.
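
As a sanity check, the arithmetic behind those figures (this sketch ignores the fact that EC2 billing may round up to a minimum increment):

```python
# Back-of-the-envelope numbers from the paragraphs above.
files = 100_000
gb_per_file = 0.4
ram_needed_gb = files * gb_per_file              # 40,000 GB

ram_per_machine_gb = 64                          # m4.4xlarge
machines = ram_needed_gb / ram_per_machine_gb    # 625 machines

price_per_hour = 0.80                            # assumed on-demand rate
runtime_hours = 30 / 3600                        # a 30-second query
cost = machines * price_per_hour * runtime_hours
print(f"{machines:.0f} machines, ${cost:.2f}")   # 625 machines, $4.17
```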