The Taskframe Class: Adding Datasets

This page describes the methods for adding a Dataset to a Taskframe. As in most data science tools, a Dataset is simply an iterable collection containing the data you want to annotate. Each item in the Dataset results in one annotation Task.

The library offers several convenience methods for loading Datasets from different formats.

Note that these methods are lazily evaluated: the Dataset is not actually submitted until you call taskframe.submit().
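
To make the lazy behaviour concrete, here is a toy model of the pattern. The `LazyFrame` class and its attributes are hypothetical and are not the library's implementation; the sketch only illustrates that adding a dataset records items locally and that nothing is sent until `submit()` is called.

```python
class LazyFrame:
    """Hypothetical stand-in for a Taskframe; illustrates deferred submission only."""

    def __init__(self):
        self.pending = []    # items recorded locally, not yet sent
        self.submitted = []  # items "sent" once submit() is called

    def add_dataset_from_list(self, items):
        # Only remember the items; no submission happens here.
        self.pending = list(items)

    def submit(self):
        # The actual work is deferred to this call.
        self.submitted, self.pending = self.pending, []
        return len(self.submitted)


frame = LazyFrame()
frame.add_dataset_from_list(["img1.jpg", "img2.jpg"])
before = len(frame.submitted)  # nothing has been submitted yet
count = frame.submit()         # submission happens here
```

In practice this means you can build up or correct a Dataset freely before anything leaves your machine.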

Quick Summary


# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])

# Adding a dataset from a list of urls or local files:
tf.add_dataset_from_list(["https://server.com/img1.jpg", "https://server.com/img2.jpg"])
tf.add_dataset_from_list(["local/path/img1.jpg", "local/path/img2.jpg"])

# Adding a dataset from a folder:
tf.add_dataset_from_folder("local/path")

# Adding a dataset from a CSV containing raw data, urls or paths to local files:
tf.add_dataset_from_csv("local/data.csv", column="items")

# Adding a dataset from a Pandas dataframe containing raw data, urls or paths to local files:
dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="items")


add_dataset_from_list

Add a Dataset directly from a list of items. The items may be raw data (for example text), URLs, or paths to local files. Signature:

def add_dataset_from_list(
    self, items, input_type=None, custom_ids=None, labels=None
):

Parameters:

  • items: a list of items to annotate. Items may be file paths, URLs, or raw data (see below).
  • input_type (optional): the type of the items: file, url, or data. If not provided, it is inferred.
  • custom_ids (optional): a list of unique item ids. Its length should match that of items.
  • labels (optional): a list of initial labels for your items, used to initialize the annotation form. This is useful, for example, when you already have a machine learning model that can generate baseline labels and you want workers to correct them. The length of the labels list should match that of items (fill with None values if necessary).

Returns: None

Example:

# Assuming you have a text classification Taskframe:
tf = Taskframe(data_type="text", task_type="classification", classes=["positive", "negative"])

tf.add_dataset_from_list(["this product is really awesome!", "I don't like it"])
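
The `custom_ids` and `labels` lists have to line up with `items` position by position. Below is a minimal sketch of preparing them when baseline predictions exist for only some items. The `predictions` dict, the id scheme, and the use of plain class names as label values are assumptions made for illustration; the final call is left commented out because it needs a configured Taskframe.

```python
items = [
    "this product is really awesome!",
    "I don't like it",
    "arrived on time",
]

# Hypothetical baseline predictions, keyed by item index; item 2 has none.
predictions = {0: "positive", 1: "negative"}

# One id per item, unique across the dataset.
custom_ids = [f"review-{i}" for i in range(len(items))]

# One label per item; gaps are filled with None so the lengths match.
labels = [predictions.get(i) for i in range(len(items))]

# Sanity checks before handing the lists to the library.
assert len(custom_ids) == len(items) == len(labels)
assert len(set(custom_ids)) == len(custom_ids)

# tf.add_dataset_from_list(items, custom_ids=custom_ids, labels=labels)
```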

add_dataset_from_folder

Add a Dataset from all files in a specific folder. Signature:

def add_dataset_from_folder(
    self, path, custom_ids=None, labels=None, recursive=False, pattern="*"
):

Parameters:

  • path: a string or Path instance pointing to the folder containing your files.
  • recursive: Boolean. If True, files in sub-directories are also loaded.
  • pattern: a pattern used to filter which files are loaded, for example by extension.

Returns: None

Example:

# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])

tf.add_dataset_from_folder("local/path/to/images")
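
Before adding a folder, it can help to preview which files a given `pattern` and `recursive` combination would select. The sketch below assumes the library matches files with glob-style patterns, which the default `pattern="*"` suggests but this page does not confirm. `preview_folder` is a hypothetical helper, not part of the library.

```python
import tempfile
from pathlib import Path


def preview_folder(path, recursive=False, pattern="*"):
    """Hypothetical helper: list the files a glob-style pattern would match."""
    root = Path(path)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p.relative_to(root).as_posix() for p in matches if p.is_file())


# Build a small throwaway folder to try the helper on.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["a.jpg", "b.png", "sub/c.jpg"]:
        (root / name).touch()

    top_level_jpgs = preview_folder(root, pattern="*.jpg")
    all_jpgs = preview_folder(root, recursive=True, pattern="*.jpg")
```

If the preview matches your expectations, the same `recursive` and `pattern` values can be passed to `add_dataset_from_folder`, keeping in mind that the library's actual matching rules may differ.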

add_dataset_from_csv

Add a Dataset from a local CSV file. Rows may contain raw data (for example text), URLs, or paths to local files. Signature:

def add_dataset_from_csv(
    self,
    csv_path,
    column=None,
    input_type=None,
    base_path=None,
    custom_id_column=None,
    label_column=None,
):

Parameters:

  • csv_path (required): a string or Path containing the path to the CSV file.
  • column: the column of the CSV containing your data. If undefined, the first column is used.
  • input_type: the type of the items: file, url, or data. If not provided, it is inferred.
  • base_path: if you are passing relative file paths, this base path is prepended to each file's path.
  • custom_id_column: the column containing unique item ids.
  • label_column: the column containing initial labels for your items.

Returns: None

Example:

# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])

tf.add_dataset_from_csv("mydata.csv", column="item")
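
As an illustration of a CSV layout these parameters could describe, the sketch below writes a small file with an item column, an id column, and a label column, then shows what prepending a base path to relative file paths looks like. The column names, file names, and the `images` base path are invented, and joining with `os.path.join` is an assumption about how `base_path` is applied. The final call is commented out because it needs a configured Taskframe.

```python
import csv
import os
import tempfile

rows = [
    {"item": "img1.jpg", "id": "a1", "label": "cat"},
    {"item": "img2.jpg", "id": "a2", "label": "dog"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "mydata.csv")

    # Write the CSV with a header row naming each column.
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "id", "label"])
        writer.writeheader()
        writer.writerows(rows)

    # Read it back to check the columns the call below would refer to.
    with open(csv_path, newline="") as f:
        loaded = list(csv.DictReader(f))

# Assumed effect of base_path on relative file paths.
base_path = "images"
resolved = [
    os.path.join(base_path, row["item"]).replace(os.sep, "/") for row in loaded
]

# tf.add_dataset_from_csv(
#     csv_path, column="item", base_path=base_path,
#     custom_id_column="id", label_column="label",
# )
```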

add_dataset_from_dataframe

Add a Dataset from a Pandas dataframe. Signature:

def add_dataset_from_dataframe(
    self,
    dataframe,
    column=None,
    input_type=None,
    base_path=None,
    custom_id_column=None,
    label_column=None,
):

Parameters:

  • dataframe: the Pandas dataframe.
  • column: the column of the dataframe containing your data. If undefined, the first column is used.
  • input_type: the type of the items: file, url, or data. If not provided, it is inferred.
  • base_path: if you are passing relative file paths, this base path is prepended to each file's path.
  • custom_id_column: the column containing unique item ids.
  • label_column: the column containing initial labels for your items.

Returns: None

Example:

import pandas as pd

# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])

dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="item")
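
The `pd.DataFrame(...)` in the example above is left elided. As one possible shape, here is a sketch with invented column names and data. Whether a missing value is accepted in the label column is an assumption carried over from the `labels` parameter of `add_dataset_from_list`; the final call is commented out because it needs a configured Taskframe.

```python
import pandas as pd

# Invented example data: one row per item to annotate.
dataframe = pd.DataFrame(
    {
        "item": ["https://server.com/img1.jpg", "https://server.com/img2.jpg"],
        "id": ["a1", "a2"],
        "label": ["cat", None],  # None where no initial label is available
    }
)

# Ids should be unique, one per row.
assert dataframe["id"].is_unique

# tf.add_dataset_from_dataframe(
#     dataframe, column="item", custom_id_column="id", label_column="label"
# )
```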