mlr3hf - rstats-gsoc/gsoc2026 GitHub Wiki

Background

The Hugging Face Hub hosts a large number of data sets.

mlr3 is a machine learning framework in R in which a data set is represented by a Task, including meta-data about column roles (feature, target, group, …).
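As a rough illustration of how a Task stores column roles, here is a minimal sketch using a small made-up data frame. The data, the column names (x1, x2, subject, label) and the task id "toy" are assumptions for this example only; the mlr3 calls used are as_task_classif() and the Task method set_col_roles().

```r
library(mlr3)

# Made-up toy data (an assumption for illustration):
# two numeric features, a grouping column, and a factor target.
dat <- data.frame(
  x1 = c(0.1, 0.4, 0.35, 0.8, 0.9, 0.7),
  x2 = c(1.0, 0.9, 0.2, 0.1, 0.5, 0.6),
  subject = factor(c("a", "a", "b", "b", "c", "c")),
  label = factor(c("neg", "neg", "pos", "pos", "neg", "pos"))
)

# Create a classification Task; every non-target column starts as a feature.
task <- as_task_classif(dat, target = "label", id = "toy")

# Reassign "subject" so that it has the "group" role and no other role.
task$set_col_roles("subject", roles = "group")

# Inspect the meta-data stored in the Task.
print(task$feature_names)
print(task$target_names)
print(task$col_roles$group)
```

A Task built from a downloaded data set would need the same kind of meta-data, which is why the source of the column roles matters for this project.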

It would be useful to be able to download a data set from Hugging Face and get a Task directly.

Related work / impact

  • hfhub is a port of the huggingface_hub Python library, and can be used to download data sets from https://huggingface.co/datasets
  • The proposed function is similar to mlr3oml::otsk(), which downloads a data set along with meta-data on which columns to use as inputs, outputs, etc. (from OpenML, not Hugging Face)

Details of your coding project

  • new package mlr3hf
  • function htsk() like otsk()
  • support for classification and regression
  • docs in Rd + vignettes
  • tests

Tests

  • easy: demonstrate how to use otsk() for two classification data sets.
  • hard: download a Hugging Face data set and convert it to a Task, using only R code (no Python, no reticulate).

Mentors

When you have finished at least one test, please add a link to it on this page, then contact the following mentors:

Potential contributor test results (to edit)

IMPORTANT: please avoid using AI code generation tools (Copilot, ChatGPT, etc.) for this project. If your test results seem to be AI-generated, you will probably not be selected as a contributor for this project.