Sprints: PyCon2012
- Where: PyCon 2012, Santa Clara
- When: March 12-15
- What: Let's sprint on prototyping low-overhead, memory-efficient infrastructure tools suitable for distributed, short-lived, CPU-bound iterative numerical analysis on small- to medium-sized clusters.
Who:
- Olivier Grisel (scikit-learn)
- Min Ragan-Kelley (IPython)
- Fernando Perez (IPython)
- Wes McKinney (Pandas & statsmodels)
- David Cournapeau (SciPy)
- Timmy Wilson (scikit-learn)
- Subhodeep Moitra ([email protected])
- Clay Woolam
- Jake Vanderplas (scikit-learn - March 12 only)
Low overhead means:
- do not force copying intermediate data to disk when it fits in memory and the copy would hurt the performance of the whole process
- do not copy big data chunks from kernel space to user space when it is not necessary, so as to limit GC overhead and working memory exhaustion
- adding a second node should decrease the runtime compared to a single node (once the machines are up and running)
- it should bring value when used in interactive data exploration sessions with IPython
Non-goals for this first prototype:
- long-term fault-tolerant storage with replicated data and checksummed files
- fault tolerance of computational nodes while processing jobs (the job duration and number of nodes should be small enough to make failures unlikely)
- support for unstructured data (e.g. raw text) or sparse numerical data (e.g. Compressed Sparse Rows) that is not as trivially partitionable and "memmappable" as dense numpy arrays (it would be nice to add support for those later, though)
- support for Pregel-style distributed graph processing (e.g. PageRank, transitive closures...)
Let's drive the design of this PoC with some real needs expressed by sprint participants:
Generic data manipulation / exploration (e.g. with Pandas) use cases:
- Distributed time series alignment between 2 (or more) large transaction logs "joined" by time codes.
- Distributed data aggregation operations, e.g. a distributed variant of the Pandas GroupBy (see the partial-aggregation sketch after this list).
- Distributed, streaming computation of means, variances and other statistical moments.
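As a concrete illustration of the aggregation use case, here is a minimal, single-machine sketch of the partial-aggregation pattern a distributed GroupBy could follow: each partition is reduced to a small (sum, count) table that is cheap to ship and to merge. The `partial_groupby` / `combine` helpers and the in-memory "partitions" are made up for illustration; a real implementation would run the first step on the engines that already hold the data.

```python
# Sketch: GroupBy mean via partial aggregation per partition.
import numpy as np
import pandas as pd

def partial_groupby(partition):
    """Run on each node: aggregate one partition by key."""
    return partition.groupby('key')['value'].agg(['sum', 'count'])

def combine(partials):
    """Run on the client: merge the per-partition aggregates."""
    merged = pd.concat(partials).groupby(level=0).sum()
    merged['mean'] = merged['sum'] / merged['count']
    return merged

# Local stand-in for data scattered over the cluster.
rng = np.random.RandomState(0)
partitions = [
    pd.DataFrame({'key': rng.randint(0, 3, 1000),
                  'value': rng.randn(1000)})
    for _ in range(4)
]
print(combine([partial_groupby(p) for p in partitions]))
```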
Machine learning related use cases (e.g. with scikit-learn):
- Embarrassingly parallel implementation of Sparse Coding with a fixed dictionary (for non-linear feature extraction)
- Distributed implementation of (A)SGD on a partitioned training set with non-blocking averaging using message passing (a model-averaging sketch follows this list).
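A minimal sketch of the averaging idea behind the distributed (A)SGD use case, using plain NumPy least-squares SGD and a single blocking averaging step (the actual design calls for non-blocking averaging over message passing while training continues). The `sgd_partition` helper and the synthetic partitions are hypothetical stand-ins.

```python
# Sketch: run SGD independently on each partition of the training
# set, then average the per-node weight vectors on the client.
import numpy as np

def sgd_partition(X, y, n_epochs=5, lr=0.01):
    """Plain per-sample SGD for least squares on one data partition."""
    w = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in range(X.shape[0]):
            grad = (np.dot(X[i], w) - y[i]) * X[i]
            w -= lr * grad
    return w

rng = np.random.RandomState(0)
w_true = rng.randn(10)
# Stand-in for partitions living on different nodes.
partitions = []
for _ in range(4):
    X = rng.randn(500, 10)
    partitions.append((X, np.dot(X, w_true) + 0.1 * rng.randn(500)))

# Each node trains on its own shard; the client averages the models.
w_avg = np.mean([sgd_partition(X, y) for X, y in partitions], axis=0)
print(np.linalg.norm(w_avg - w_true))
```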
Candidate tools / building blocks:
- IPython.parallel & ZeroMQ (suitable for low-latency, small to medium transfers of data in user space)
- https://github.com/traviscline/gevent-zeromq/ (for lower-level control and speed)
- numpy memmapped arrays: let the Linux kernel handle the disk access efficiently and free working memory from unused data
- the UNIX sendfile syscall for sending partitioned memmapped files around the cluster without copying their content into user space.
- Compressed numpy arrays, for instance using carray to speed up both disk / memory access and network transfers.
- Some kind of partitioned array data structure built on top of memmapped arrays plus shared metadata, to leverage data locality on the cluster and limit the amount of data transferred over the network (a rough sketch follows this list). We should steal the good ideas of Spark RDDs (resilient distributed datasets) or the FlumeJava API. However we don't necessarily need to address resiliency in this first iteration.
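To make the partitioned-array idea more tangible, here is a rough sketch combining np.memmap chunk files, a small shared-metadata record, and an IPython.parallel map over the chunks. The chunk file layout and the `write_partitions` / `chunk_sum` helpers are invented for illustration; it assumes IPython 0.12+ with engines already started (e.g. `ipcluster start -n 4`) and a filesystem all engines can see.

```python
# Sketch: split a large array into memmapped chunk files plus small
# metadata records, then map a reduction over the chunks with
# IPython.parallel.  Each engine only opens its own file, so the
# kernel page cache handles the I/O and only the metadata and the
# small per-chunk results travel over ZeroMQ.
import numpy as np

def write_partitions(data, prefix, n_parts):
    """Split `data` into n_parts memmapped chunk files on disk."""
    meta = []
    for i, chunk in enumerate(np.array_split(data, n_parts)):
        path = '%s_%03d.dat' % (prefix, i)
        mm = np.memmap(path, dtype=chunk.dtype, mode='w+',
                       shape=chunk.shape)
        mm[:] = chunk
        mm.flush()
        meta.append({'path': path, 'dtype': str(chunk.dtype),
                     'shape': chunk.shape})
    return meta

def chunk_sum(part):
    """Runs on an engine: open the chunk read-only and reduce it."""
    import numpy as np
    mm = np.memmap(part['path'], dtype=part['dtype'], mode='r',
                   shape=part['shape'])
    return mm.sum()

meta = write_partitions(np.arange(1000000.0), '/tmp/demo', n_parts=4)

# Requires a running cluster, e.g. `ipcluster start -n 4`.
from IPython.parallel import Client
view = Client()[:]
print(sum(view.map_sync(chunk_sum, meta)))
```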
References:
- Hadoop AllReduce and Terascale Learning - blog post by John Langford on large-scale distributed machine learning (1k nodes)
- Spark: a Scala framework for distributed, iterative analytics with a high-level, dataset-oriented API and explicit in-memory caching of intermediate results
- Is Teaching MapReduce Healthy for Students? - opinion blog post on the suitability of MapReduce for teaching distributed computing
TODO:
- Get ssh access to a small (~10-20 nodes) cluster for all participants. Fernando will do this.