mlr3hf - rstats-gsoc/gsoc2026 GitHub Wiki
Background
The Hugging Face Hub hosts a large number of data sets.
mlr3 is a machine learning framework in R in which a data set is represented by a Task, including meta-data about column roles (feature, target, group, …).
It would be useful to be able to download data from Hugging Face and get a Task directly.
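As a minimal sketch of what "a Task with column roles" means, the example below builds a classification Task from a small made-up data frame and assigns one column the group role. The data frame, its column names, and the id "toy" are invented for illustration; only `as_task_classif()` and the Task methods come from mlr3.

```r
library(mlr3)

# Toy data standing in for a data set obtained from an external source
# (values and column names are made up for this example).
toy <- data.frame(
  bill_length = c(39.1, 46.5, 50.0, 38.2, 47.3, 49.8),
  body_mass   = c(3750, 5200, 3800, 3400, 5100, 3900),
  island_id   = c(1L, 2L, 3L, 1L, 2L, 3L),
  species     = factor(c("a", "b", "c", "a", "b", "c"))
)

# The target column is declared when the Task is created.
task <- as_task_classif(toy, target = "species", id = "toy")

# Column roles are meta-data stored in the Task; here island_id is moved
# from the feature role to the group role.
task$set_col_roles("island_id", roles = "group")

task$col_roles$target
task$col_roles$feature
task$col_roles$group
```

A function that returns a Task for a Hugging Face data set would need to obtain this kind of role information (at least the target column) from somewhere, which is what otsk() does for OpenML.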
Related work / impact
- hfhub is a port of the huggingface_hub Python library, for downloading data sets from https://huggingface.co/datasets
- This is similar to mlr3oml::otsk(), which also downloads meta-data on columns to use for inputs, outputs, etc. (from OpenML, not Hugging Face)
Details of your coding project
- new package mlr3hf
- function htsk() like otsk()
- support for classification and regression
- docs in Rd + vignettes
- tests
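One way to think about "support for classification and regression" is to choose the Task type from the class of the target column. The sketch below is only an illustration under that assumption: `df_to_task()` is a hypothetical helper, not part of mlr3 or of the proposed mlr3hf package, and it uses built-in data frames rather than anything downloaded from Hugging Face.

```r
library(mlr3)

# Hypothetical helper (an assumption for illustration, not an existing API):
# numeric targets give a regression Task, anything else a classification Task.
df_to_task <- function(df, target, id = "hf_data") {
  stopifnot(is.data.frame(df), target %in% names(df))
  y <- df[[target]]
  if (is.numeric(y)) {
    as_task_regr(df, target = target, id = id)
  } else {
    # Classification targets are converted to factor before creating the Task.
    df[[target]] <- as.factor(y)
    as_task_classif(df, target = target, id = id)
  }
}

regr_task <- df_to_task(mtcars, target = "mpg")
classif_task <- df_to_task(iris, target = "Species")

regr_task$task_type
classif_task$task_type
```

A rule like this misclassifies integer-coded class labels as regression targets, so a real implementation would probably also need to consult data set meta-data or accept an explicit task type from the user.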
Tests
- easy: demonstrate how to use otsk() for two classification data sets.
- hard: download a Hugging Face data set and convert it to a Task, using only R code (neither Python nor reticulate).
Mentors
When you have finished at least one test, please add a link to it on this page, then contact the following mentors:
Potential contributor test results (to edit)
IMPORTANT: please avoid using AI code generation tools (Copilot, ChatGPT, etc) for this project. If your test results seem to be AI-generated, then you will probably not be selected as a contributor for this project.
- Contributor Name, links
- Kushal Chhajed - mlr3 + Hugging Face GSoC 2026 Tasks, Live Report | GitHub Repository
- Aksh Kaushik - mlr3 + Hugging Face GSoC 2026, Easy Task: Live Report | GitHub ✅ Hard Task: Live Report | GitHub