
Activeloop GSoC 2022 project ideas


Introduction

Data scientists lose a lot of time managing data. We want to fix that by introducing a new standard for working with unstructured data. Will you help us achieve that?

This is the Google Summer of Code 2022 (GSoC'22) ideas page for Hub. Hub is the fastest way to store, access & manage datasets with version-control for PyTorch/TensorFlow. Hub unifies the storage for datasets, making them streamable & accessible from any machine at any scale for AI/ML.

Activeloop is building the Database for AI. The first step in achieving this is building the dataset format for AI: Hub. With Hub, we store your (even petabyte-scale) datasets as a single NumPy-like array on the cloud, so you can seamlessly access and work with them from any machine. Hub makes any data type (images, text files, audio, or video) stored in the cloud usable as fast as if it were stored on-premise. With the same dataset view, your team can always be in sync.
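As a rough illustration of that workflow, here is a minimal sketch assuming the hub.load and ds.pytorch calls from the public Hub 2.x API; the dataset path is just an example of a publicly hosted dataset:

```python
import hub  # pip install hub

# Stream a publicly hosted dataset; samples are fetched lazily, nothing is downloaded up front.
ds = hub.load("hub://activeloop/mnist-train")  # example path

# Individual samples behave like NumPy arrays.
first_image = ds.images[0].numpy()
print(first_image.shape)

# Wrap the same dataset as a PyTorch dataloader for training.
dataloader = ds.pytorch(batch_size=32, num_workers=2)
```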

This page lists a number of ideas for Google Summer of Code projects for Hub and gives some pointers to potential GSoC students on how to get started with contributing and putting together their application. All the projects are available in both part-time and full-time capacity, but preference may be given to full-time GSoC contributors.

Who is using Hub?

Hub is developed by team Activeloop (activeloop.ai) and is used by Google, Waymo, Red Cross, Oxford University, Johns Hopkins University, and others.

Hub Features

  • Store and retrieve large datasets with version-control
  • Collaborate as in Google Docs: Multiple data scientists working on the same data in sync with no interruptions
  • Access from multiple machines simultaneously
  • Deploy anywhere - locally, on Google Cloud, S3, Azure as well as Activeloop (by default - and for free!)
  • Integrate with your ML tools like NumPy, Dask, Ray, PyTorch, or TensorFlow
  • Create arrays as big as you want. You can store images as big as 100k by 100k!
  • Keep the shape of each sample dynamic. This way you can store small and big arrays as one array (see the short sketch after this list).
  • Visualize any slice of the data in a matter of seconds without redundant manipulations via app.activeloop.ai
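A couple of these features in one short sketch (a minimal illustration; the local path is an example, and the dataset-creation, append, and commit calls are assumed from the Hub 2.x API):

```python
import numpy as np
import hub

# Create a new local dataset (the same calls work against s3://, gcs://, or hub:// paths).
ds = hub.empty("./demo_dataset", overwrite=True)
ds.create_tensor("images", htype="image", sample_compression="png")

# Samples of different shapes can live in the same tensor.
ds.images.append(np.zeros((32, 32, 3), dtype="uint8"))
ds.images.append(np.zeros((512, 512, 3), dtype="uint8"))

# Built-in version control: snapshot the current state of the dataset.
ds.commit("add two samples of different shapes")
```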



Visualization of a dataset uploaded to Hub via app.activeloop.ai (free tool)

Guidelines & requirements

Hub is participating in GSoC'22 under the umbrella of Python Software Foundation.

Contacting us

Before you apply, you may join the Activeloop Hub Slack (slack.activeloop.ai) to connect with potential mentors and find interesting projects. Ask questions in #gsoc. This will help you get a head-start! While PRs before applying are not required, they are highly welcome (check out good first issues or upload a requested dataset to Hub)! Please introduce yourself in #introductions and let us know you're considering contributing to Hub for GSoC. We are super friendly (or so they say).

Please hit up @Mikayel anytime in the community Slack if you need any help. Alternatively, you may send a note to gsoc@activeloop.ai.

Getting Started

Note: We don't use any unusual libraries apart from Bugout for telemetry. We use git for source control (the contributing guide should be helpful here).

Contributing to Hub within GSoC

We strongly advise reviewing the Python GSoC guidelines. Don't forget to mention our sub-org name (Activeloop). Here are some of the best practices from our side to make a great first impression with your application.

The key factors we evaluate are:

  1. Do you have what it takes to deliver an interesting project on time?
  2. Is the project reasonably scoped with clear milestones?
  3. Do you have prior OSS contributions (preferably to Activeloop Hub)?
  4. Is the proposed mentor enthusiastic about the project?
  • The Project
    • Deliverables
    • A feasibly scoped project with quantifiable milestones
    • How will you prioritize different aspects of the project like features, API usability, documentation, and robustness?
  • Things to include about yourself:
    • Code you're proud of. It doesn't have to be Hub code (but that doesn't hurt either). You don't need to be a star programmer as long as you can demonstrate interest in and commitment to your project, and, ideally, quantifiable results of that code being useful to users.
    • Your CV/website/blog
    • Why did you choose Hub? (max 100 words)
    • Are you part of an underrepresented group in STEM?
    • Email and GitHub
    • Time commitments over the summer

Share drafted applications in the #gsoc Slack channel for feedback! For accessibility needs, please email gsoc@activeloop.ai.

Project Ideas List

Add efficient support for new types of data

[Enhancement]

Rating: Easy

Description: Hub datasets are a way to store unstructured data in a structured format. We already support many data types such as Image, Audio, Video, Text, etc. There are other data types that would also be very useful for some datasets, such as point clouds, seismic data, Xarray support, and more.

  • Skills: Good experience with Python and familiarity with different types of data
  • Expected outcome: This project would add efficient support for more types of data in Hub.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
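As a sketch of what user-facing support for one new type could look like, point clouds can already be approximated with a generic tensor; a dedicated, efficiently compressed htype is the kind of improvement this project targets. The htype name "point_cloud" and the compression below are hypothetical; the other calls are assumed from the Hub 2.x API:

```python
import numpy as np
import hub

ds = hub.empty("./point_cloud_dataset", overwrite=True)

# Today: store each cloud as a generic float32 tensor of shape (N, 3); N may differ per sample.
ds.create_tensor("points", htype="generic", dtype="float32")
ds.points.append(np.random.rand(1024, 3).astype("float32"))
ds.points.append(np.random.rand(2048, 3).astype("float32"))

# This project: a dedicated htype with suitable compression, e.g. (hypothetical)
# ds.create_tensor("points", htype="point_cloud", sample_compression="las")
```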

Explore and optimize popular datasets for Hub

[Enhancement]

Rating: Easy

Description: Swift access to a large number of datasets is the most basic premise of Hub. This task requires preparing, converting, splitting and uploading public domain datasets used in most popular data science contexts. With the datasets ready, one might look into problems in which these datasets are used and create tutorials aiming for optimized use of specific datasets in a given use case.

  • Skills: Python, Data Engineering
  • Expected outcome: The project leads to an addition of a number of datasets to Hub along with tutorials for each of the datasets.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
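As a rough sketch of what preparing and uploading one such dataset involves (the directory layout and destination path are assumptions; create_tensor, hub.read, and append are assumed from the Hub 2.x API):

```python
import os
import hub

src_dir = "./cats_vs_dogs"  # assumed layout: one subfolder per class, image files inside
ds = hub.empty("hub://<your-org>/cats-vs-dogs", overwrite=True)  # destination is an example

ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("labels", htype="class_label")

with ds:  # batches the writes for better upload throughput
    for label, class_name in enumerate(sorted(os.listdir(src_dir))):
        class_dir = os.path.join(src_dir, class_name)
        for fname in sorted(os.listdir(class_dir)):
            ds.images.append(hub.read(os.path.join(class_dir, fname)))
            ds.labels.append(label)
```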

Integrate Hub with Kubeflow, Metaflow, MLFlow, Horovod, and the like

[Enhancement]

Rating: Easy

Description: Easy conversion from and to various formats, loaders, and machine learning flows allows for a seamless experience of users migrating from other packages. Kubeflow, Metaflow, MLFlow, and Horovod represent four different approaches to a machine learning cycle. However, choosing one of these is currently limiting a data scientist to that specific package. The possibility of migrating data through Hub datasets offers a vital alternative to scientists who need to change their machine learning workflow (e.g. as a result of incorporating the infrastructure into Kubernetes). In this project, you can make Hub bring all the platforms to the table.

  • Skills: Python, Kubernetes, Kubeflow, Metaflow, and MLFlow
  • Expected outcome: The project should lead to a Python package that integrates Hub datasets in pipelines and training of Kubeflow, Metaflow, MLFlow, and Horovod packages.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
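A minimal sketch of the kind of glue such an integration would wrap (the training loop is a placeholder; hub.load and ds.pytorch are assumed from the Hub 2.x API, and the logging calls follow MLflow's standard Python API):

```python
import hub
import mlflow

def pipeline_step(dataset_path: str = "hub://activeloop/mnist-train"):
    # The same Hub dataset can feed a Kubeflow component, a Metaflow step, or an MLflow run.
    ds = hub.load(dataset_path)
    loader = ds.pytorch(batch_size=64, num_workers=2)

    with mlflow.start_run():
        mlflow.log_param("dataset", dataset_path)
        num_batches = 0
        for batch in loader:
            num_batches += 1  # placeholder for the actual training step
        mlflow.log_metric("num_batches", num_batches)
```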

Query Datasets using natural language

[Enhancement]

Rating: Medium

Description: As Datasets get larger and larger, it becomes harder to find and filter the samples that you require. Currently, in Hub, such querying can be done using a lambda function as a filter. Allowing users to use natural language querying would greatly enhance the usability of these filters. This project seeks to design and implement a solution that converts users’ queries into a lambda function that is used to filter datasets.

  • Skills: Experience with Python and NLP
  • Expected outcome: This project would result in a better querying system integrated into Hub's API as well as our visualization web app.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
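For reference, the current lambda-style filter and the shape of the proposed feature might look like this (ds.filter with a Python function is assumed from the existing Hub API; query_natural_language is a hypothetical function this project would design):

```python
import hub

ds = hub.load("hub://activeloop/mnist-train")  # example dataset

# Today: filter the dataset with a Python lambda evaluated per sample.
zeros = ds.filter(lambda sample: sample.labels.numpy() == 0)

# Proposed: translate a natural-language query into an equivalent filter (hypothetical API).
def query_natural_language(ds, text: str):
    """Sketch: an NLP model would map `text` (e.g. "images labeled zero") to a predicate,
    then reuse ds.filter under the hood."""
    raise NotImplementedError("to be designed as part of this project")
```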

Simplify high-level API for Hub

[Enhancement]

Rating: Medium

Description: Clarity in user-facing API has been a fundamental virtue of many successful machine learning libraries, such as scikit-learn. Currently, to create a Hub dataset, a number of parameters need to be specified. Hub's API involves quite a few classes that should be restricted to its internal structures, e.g. sharded datasets. The project involves researching what the API should look like, implementing it, and rewriting relevant parts of the documentation to lower the barriers of entry for new users of Hub.

  • Skills: general Python, Data Engineering
  • Expected outcome: The project should lead to a proposal of a new Hub API in the form of a Python package.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
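A sketch of the kind of simplification this project could propose (the "current-style" calls below are assumed from the existing API; the simplified entry point is purely hypothetical):

```python
import hub

# Current-style flow: each tensor and its parameters are declared explicitly.
ds = hub.empty("./my_dataset", overwrite=True)
ds.create_tensor("images", htype="image", sample_compression="jpeg")
ds.create_tensor("labels", htype="class_label")

# One possible simplified entry point this project might propose (hypothetical):
# ds = hub.quick_dataset("./my_dataset", tensors={"images": "image", "labels": "class_label"})
```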

Auto Dataset Tuning: auto dataset generation using hub & hub.transform, generating experiments that produce datasets to improve the overall accuracy of the model

[Feature]

Rating: Hard

Description: Generate datasets with the goal of improving accuracy. One of the best ways to improve a model is by improving the dataset itself. The goal here is to generate multiple "plugin" type strategies to auto-generate datasets and train a model on each generated dataset. This model is then tested on the original test data to generate accuracy metrics and further tune the dataset.

Allow automatic improvement of model performance (accuracy, etc.) by applying various transformations to the given dataset and retraining.

Create a list of generic hub.transforms:

  • Augmentation
  • Segmentation
  • Crop
  • Rotation
  • Zoom
  • etc.

Use various generic hub.transforms to create new datasets, train on these new datasets, and see if accuracy improves.

Process:

  • Search the hyperparameter space, where the hyperparameters are the various strategies and their internal parameters.
  • Create 10 datasets using various strategies and check accuracies.
  • Pick the dataset with the highest accuracy and regenerate the datasets for the next experiment.

Use hyperparameter search algorithms to generate these experiments to run on the dataset.
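A rough sketch of one such strategy expressed as a dataset-generating transform (a minimal example assuming Hub 2.x's hub.compute and hub.like, the successors of the hub.transform mentioned above; the rotation strategy and dataset path are just examples):

```python
import numpy as np
import hub

@hub.compute
def rotate_90(sample_in, sample_out):
    # One "plugin" strategy: rotate every image by 90 degrees, keep the label unchanged.
    sample_out.images.append(np.rot90(sample_in.images.numpy()))
    sample_out.labels.append(sample_in.labels.numpy())
    return sample_out

source = hub.load("hub://activeloop/mnist-train")   # example dataset
augmented = hub.like("./mnist_rot90", source)       # new dataset with the same tensors

rotate_90().eval(source, augmented, num_workers=2)
# The augmented dataset is then trained on and scored against the original test split.
```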

Deliverables:

  • Ability to train & validate each dataset and use the results to rank the strategy.
  • Use previous results to generate new experiments

Requirements:

  • Experience with Python programming
  • Experience with TensorFlow, PyTorch, and other ML frameworks
  • Experience with hyperparameter search, auto-tuning and autoML for existing frameworks
  • Prior experience in delivering and productionizing ML models.

Metrics include:

  1. Training loss/accuracy
  2. Validation loss/accuracy
  3. IOU
  4. Confusion Matrix
  5. Custom Metrics

  • Use strategies like rotation, augmentation, cropping, etc. for image data
  • Use strategies like tf-idf, one-hot encoding, etc. for text data
  • Use strategies like normalizing, scaling, etc. for numeric data

The various plugins should be able to work with minimal configuration on various kinds of datasets including images, audio, videos & text data.
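One way such plugins could be shaped is a small strategy interface that each modality implements (entirely hypothetical; the class and method names are placeholders for what the project would design):

```python
from abc import ABC, abstractmethod

import numpy as np

class AugmentationStrategy(ABC):
    """Hypothetical plugin interface: one strategy = one way to derive a new dataset."""

    modality: str  # e.g. "image", "text", "numeric"

    @abstractmethod
    def apply(self, sample):
        """Return the transformed sample."""

class RandomRotation(AugmentationStrategy):
    modality = "image"

    def __init__(self, degrees: float = 90.0):
        self.degrees = degrees

    def apply(self, sample):
        # Rotate in 90-degree steps; a real plugin would interpolate arbitrary angles.
        return np.rot90(sample, k=int(self.degrees // 90) % 4)
```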

  • Skills: Data Engineering, Machine Learning, Python, Tensorflow/Pytorch, Data Science
  • Expected outcome: Pipelines to generate datasets that are plugged into a model and trained to improve the overall model accuracy.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)

Automatic Generation of schema from directory

[Enhancement]

Rating: Hard

Description: Hub datasets are a way to store unstructured data in a structured format. We refer to this structure as the schema. This structure is currently created manually. The goal of this project is to automatically infer the schema from an underlying directory.

  • Skills: Good experience with Python and Dataset management. ML experience might also be required if the proposed solution uses ML to auto infer the schema from the directory structure.
  • Expected outcome: This project would make the Hub API much easier for end users by allowing them to automatically infer the schema from a directory.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
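A minimal sketch of a rule-based version of that inference (the heuristic, function name, and returned schema dictionary are illustrative assumptions; an ML-based approach would replace the hand-written rules):

```python
import os

def infer_schema(root: str) -> dict:
    """Illustrative heuristic: one subfolder per class containing image files
    maps to an {"images": "image", "labels": "class_label"} schema."""
    image_exts = {".jpg", ".jpeg", ".png"}
    subdirs = [d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))]
    has_images = any(
        os.path.splitext(f)[1].lower() in image_exts
        for d in subdirs
        for f in os.listdir(os.path.join(root, d))
    )
    if subdirs and has_images:
        return {"images": "image", "labels": "class_label"}
    raise ValueError("could not infer a schema for this directory layout")

print(infer_schema("./cats_vs_dogs"))  # assumed layout: root/<class_name>/<image files>
```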

Custom learned compression for storing arrays

[Enhancement]

Rating: Hard

Description: Some of our users store petabyte-scale datasets. They use widely adopted compression techniques (gzip, png, jpeg, lz4, you name it). If a compression technique could instead be learned using ML algorithms, it could significantly reduce costs and optimize storage.

  • Skills: Good experience with Python, ML, and C++.
  • Expected outcome: A custom learned lossless compression technique for storing a dataset. Either the model performs the compression itself, or it picks the best kernel among classic algorithms. For specific datasets, the model first learns the data distribution and then compresses the data efficiently. Requires extensive benchmarking of both the efficiency and the effectiveness of the compression technique.
  • Mentor: Fariz Rahman (slack: @Fariz Rahman)
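As a starting point, the "pick the best kernel among classic algorithms" variant could be prototyped with a simple per-sample benchmark (a sketch using only NumPy and the Python standard library; a learned codec would replace or extend the candidate list):

```python
import bz2
import lzma
import zlib

import numpy as np

def best_classic_codec(array: np.ndarray):
    """Compress the raw bytes with several classic codecs and return the best one."""
    raw = array.tobytes()
    candidates = {
        "zlib": zlib.compress(raw, level=6),
        "lzma": lzma.compress(raw),
        "bz2": bz2.compress(raw),
    }
    name, blob = min(candidates.items(), key=lambda kv: len(kv[1]))
    return name, len(raw) / len(blob)  # codec name and compression ratio

sample = (np.random.rand(256, 256) * 16).astype("uint8")  # synthetic low-entropy sample
print(best_classic_codec(sample))
```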