GSoC_2018_project_arrow - shogun-toolbox/shogun GitHub Wiki
Arrow Buffer as CFeatures memory backend
Now that more and more data science projects use Apache Arrow as a memory backend,
or at least support exporting data into an Arrow Buffer (see for example SPARK-13534), it would be great if some of Shogun's CFeatures
classes could use an Arrow Buffer as a memory backend.
Mentors
Difficulty & Requirements
Medium.
You need to know:
- C++
- basic software engineering
Description
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data. It also provides computational libraries, zero-copy streaming messaging, and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby.
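To make "columnar memory format" more concrete, here is a minimal C++ sketch of how a single nullable column of doubles can be laid out: one contiguous values buffer plus a validity bitmap with one bit per slot, least-significant bit first. `Float64Column` and `sum_valid` are hypothetical names used only for illustration. They are not part of Arrow or Shogun, and the sketch ignores the padding and alignment rules of the real format.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified model of one Arrow-style primitive column.
// This is an illustration only, not an Arrow or Shogun type.
struct Float64Column
{
    std::vector<std::uint8_t> validity; // 1 bit per slot, least-significant bit first
    std::vector<double> values;         // contiguous values; null slots hold arbitrary data

    // A slot is valid (non-null) when its bit in the validity bitmap is set.
    bool is_valid(std::size_t i) const
    {
        return ((validity[i / 8] >> (i % 8)) & 1u) != 0;
    }
};

// Sum all non-null entries by scanning the contiguous values buffer.
double sum_valid(const Float64Column& column)
{
    double total = 0.0;
    for (std::size_t i = 0; i < column.values.size(); ++i)
    {
        if (column.is_valid(i))
            total += column.values[i];
    }
    return total;
}
```

Because all values of a column sit next to each other in memory, a scan like `sum_valid` touches one contiguous block, which is the access pattern most numerical code prefers.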
Using Arrow as the memory backend of CFeatures
would not only allow us, for example, to work directly on a pandas DataFrame via pyarrow.
In the long run, as the number of languages supported by Arrow grows, we could gradually
get rid of some of the SWIG-based typemaps, which would significantly reduce the memory footprint and improve
performance.
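The zero-copy idea behind this can be sketched in plain C++. What follows is a minimal illustration under stated assumptions, not Shogun's or Arrow's actual API: `SharedBuffer` is a stand-in for an Arrow buffer (a real integration would presumably hold a `std::shared_ptr<arrow::Buffer>` instead), and `DenseFeaturesView` is a hypothetical class name. The sketch assumes a column-major layout in which each feature vector occupies one contiguous column.

```cpp
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for an Arrow buffer: a shared, immutable, contiguous
// block of doubles. A real integration would wrap an arrow::Buffer instead.
using SharedBuffer = std::shared_ptr<const std::vector<double>>;

// Hypothetical feature matrix that shares externally owned memory.
// Assumed layout: column-major, one feature vector per contiguous column.
class DenseFeaturesView
{
public:
    DenseFeaturesView(SharedBuffer buffer, std::size_t num_features,
                      std::size_t num_vectors)
        : m_buffer(std::move(buffer)), m_num_features(num_features),
          m_num_vectors(num_vectors)
    {
        if (!m_buffer || m_buffer->size() != num_features * num_vectors)
            throw std::invalid_argument("buffer size does not match dimensions");
    }

    std::size_t num_features() const { return m_num_features; }
    std::size_t num_vectors() const { return m_num_vectors; }

    // Pointer to the first element of one feature vector; nothing is copied.
    const double* feature_vector(std::size_t vec) const
    {
        if (vec >= m_num_vectors)
            throw std::out_of_range("feature vector index out of range");
        return m_buffer->data() + vec * m_num_features;
    }

    // Value of feature `feat` in feature vector `vec`.
    double at(std::size_t feat, std::size_t vec) const
    {
        if (feat >= m_num_features)
            throw std::out_of_range("feature index out of range");
        return feature_vector(vec)[feat];
    }

private:
    SharedBuffer m_buffer; // keeps the shared memory alive while the view exists
    std::size_t m_num_features;
    std::size_t m_num_vectors;
};
```

The view holds a shared reference to the buffer rather than a copy, so the producer of the data (for example a pyarrow table exported from pandas) and the feature object can read the same memory. Shared ownership is one possible design choice here. It avoids dangling pointers when the producer releases its handle first.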
Useful resources
Start by checking out the prototype in the feature/arrow
branch.