The Taskframe Class: Adding Datasets
This page details methods to add a Dataset to a Taskframe. As in usual data science tools, a Dataset is simply an iterable collection containing the data you want to annotate. Each item in the Dataset will result in an annotation Task.
The library offers several convenient methods to load a Dataset from different formats.
Note that these methods are lazily evaluated: the Dataset is not actually submitted until you call taskframe.submit().
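The lazy behaviour described above can be pictured with a toy model. This is only a sketch of the general pattern: `LazyFrame` and its attributes are hypothetical and say nothing about how the library is implemented internally.

```python
class LazyFrame:
    """Hypothetical stand-in for a Taskframe; not the library's implementation."""

    def __init__(self):
        self.pending = None   # dataset registered but not yet sent
        self.submitted = []   # items that have actually been "sent"

    def add_dataset_from_list(self, items):
        # Only remember the items; nothing is uploaded at this point.
        self.pending = list(items)

    def submit(self):
        # The actual work happens only when submit() is called.
        if self.pending is not None:
            self.submitted.extend(self.pending)
            self.pending = None


frame = LazyFrame()
frame.add_dataset_from_list(["first text", "second text"])
before = len(frame.submitted)   # nothing submitted yet
frame.submit()
after = len(frame.submitted)    # both items submitted now
print(before, after)
```

The practical consequence is that adding a Dataset is cheap, and mistakes can still be corrected before the submit call.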
Quick Summary
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
# Adding a dataset from a list of urls or local files:
tf.add_dataset_from_list(["https://server.com/img1.jpg", "https://server.com/img2.jpg"])
tf.add_dataset_from_list(["local/path/img1.jpg", "local/path/img2.jpg"])
# Adding a dataset from a folder:
tf.add_dataset_from_folder("local/path")
# Adding a dataset from a CSV containing raw data, urls or paths to local files:
tf.add_dataset_from_csv("local/data.csv", column="items")
# Adding a dataset from a Pandas dataframe containing raw data, urls or paths to local files:
import pandas as pd
dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="items")
add_dataset_from_list
Add Dataset directly from a list of items. The items may be direct raw data (for example text), urls or paths to local files. Signature:
def add_dataset_from_list(
    self, items, input_type=None, custom_ids=None, labels=None
):
Parameters:
- `items`: a list of items that will be annotated. Items may be file paths, urls, or raw data (see below).
- `input_type` (optional): the type of items: `file`, `url`, or `data`. If not provided, it will be inferred.
- `custom_ids` (optional): list of unique item ids. Its length should match `items`.
- `labels` (optional): list of initial labels for your items, used to initialize the annotation form. This is useful, for example, when you already have a machine learning model that can generate baseline labels and you want workers to correct them. The length of the `labels` list should match `items` (fill with `None` values if necessary).
Returns: None
Example:
# Assuming you have a text classification Taskframe:
tf = Taskframe(data_type="text", task_type="classification", classes=["positive", "negative"])
tf.add_dataset_from_list(["this product is really awesome!", "I don't like it"])
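Because `custom_ids` and `labels` must both match `items` in length, it can help to build and check them before calling the method. The sketch below is one way to do that; the helpers `align_labels` and `check_custom_ids` are hypothetical, and it assumes that labels for a classification task are simply the class names, which this page does not specify.

```python
def align_labels(items, known_labels):
    """Return one label per item, using None where no baseline label is known.

    known_labels is a hypothetical dict mapping an item to its baseline label.
    """
    return [known_labels.get(item) for item in items]


def check_custom_ids(items, custom_ids):
    """Raise ValueError if the ids are not unique or do not match items in length."""
    if len(custom_ids) != len(items):
        raise ValueError("custom_ids must have the same length as items")
    if len(set(custom_ids)) != len(custom_ids):
        raise ValueError("custom_ids must be unique")


items = ["this product is really awesome!", "I don't like it", "it arrived on Tuesday"]

# Baseline predictions from some existing model (illustrative values only).
model_predictions = {
    "this product is really awesome!": "positive",
    "I don't like it": "negative",
}

labels = align_labels(items, model_predictions)
custom_ids = ["review-1", "review-2", "review-3"]
check_custom_ids(items, custom_ids)

# With a real Taskframe `tf`, the lists would then be passed as:
# tf.add_dataset_from_list(items, custom_ids=custom_ids, labels=labels)
```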
add_dataset_from_folder
Add Dataset from all files from a specific folder. Signature:
def add_dataset_from_folder(
    self, path, custom_ids=None, labels=None, recursive=False, pattern="*"
):
Parameters:
- `path`: string or `Path` instance of the folder containing your files.
- `recursive`: Boolean. If true, files in sub-directories are also loaded.
- `pattern`: filters the allowed file extensions.
Returns: None
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
tf.add_dataset_from_folder("local/path/to/images")
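Before adding a folder, it may be useful to preview which files a given `recursive` and `pattern` combination would pick up. The sketch below assumes `pattern` follows glob semantics (suggested by the `"*"` default but not confirmed on this page); `preview_folder` is a hypothetical helper built on `pathlib`, not part of the library.

```python
import tempfile
from pathlib import Path


def preview_folder(path, recursive=False, pattern="*"):
    """List files under path that match a glob pattern (assumed semantics)."""
    root = Path(path)
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    return sorted(p for p in matches if p.is_file())


# Build a throwaway folder so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ["cat1.jpg", "dog1.jpg", "notes.txt", "sub/cat2.jpg"]:
        (root / name).write_bytes(b"")

    # Top-level images only.
    top_level = [p.name for p in preview_folder(root, pattern="*.jpg")]
    # Images in sub-directories as well.
    everything = [p.name for p in preview_folder(root, recursive=True, pattern="*.jpg")]

print(top_level)
print(everything)

# With a real Taskframe `tf`, the matching call would be:
# tf.add_dataset_from_folder("local/path/to/images", recursive=True, pattern="*.jpg")
```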
add_dataset_from_csv
Add dataset from a local CSV file. Rows may contain either raw data (for example text), urls, or paths to local files. Signature:
def add_dataset_from_csv(
    self,
    csv_path,
    column=None,
    input_type=None,
    base_path=None,
    custom_id_column=None,
    label_column=None,
):
Parameters:
- `csv_path` (required): string or `Path` containing the path to the CSV file.
- `column`: the column of the CSV containing your data. If undefined, the first column is used.
- `input_type`: the type of items: `file`, `url`, or `data`. If not provided, it will be inferred.
- `base_path`: if you are passing relative file paths, you may pass this `base_path`, which will be prepended to each file's path.
- `custom_id_column`: the column containing unique item ids.
- `label_column`: the column containing initial labels for your items.
Returns: None
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
tf.add_dataset_from_csv("mydata.csv", column="item")
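To make the column parameters concrete, the sketch below writes a small CSV with an item column, an id column and a label column, then shows what prepending a `base_path` to relative paths would produce. The column names, file names and the `/data/images` directory are all made up for illustration, and the path joining is an assumption about what "prepended" means here.

```python
import csv
import tempfile
from pathlib import Path, PurePosixPath

rows = [
    {"item": "cats/img1.jpg", "id": "a1", "label": "cat"},
    {"item": "dogs/img2.jpg", "id": "a2", "label": "dog"},
    {"item": "dogs/img3.jpg", "id": "a3", "label": "dog"},
]

with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "mydata.csv"

    # Write the CSV with a header row naming each column.
    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["item", "id", "label"])
        writer.writeheader()
        writer.writerows(rows)

    # Read it back to check that the columns are what we expect.
    with open(csv_path, newline="") as f:
        loaded = list(csv.DictReader(f))

# Assumed effect of base_path: each relative path is joined onto it.
base_path = PurePosixPath("/data/images")  # hypothetical directory
resolved = [str(base_path / row["item"]) for row in loaded]
print(resolved)

# With a real Taskframe `tf`, the matching call would be:
# tf.add_dataset_from_csv(
#     "mydata.csv", column="item", input_type="file", base_path="/data/images",
#     custom_id_column="id", label_column="label",
# )
```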
add_dataset_from_dataframe
Add a Dataset from a Pandas dataframe. Signature:
def add_dataset_from_dataframe(
    self,
    dataframe,
    column=None,
    input_type=None,
    base_path=None,
    custom_id_column=None,
    label_column=None,
):
Parameters:
- `dataframe`: the Pandas dataframe.
- `column`: the column of the dataframe containing your data. If undefined, the first column is used.
- `input_type`: the type of items: `file`, `url`, or `data`. If not provided, it will be inferred.
- `base_path`: if you are passing relative file paths, you may pass this `base_path`, which will be prepended to each file's path.
- `custom_id_column`: the column containing unique item ids.
- `label_column`: the column containing initial labels for your items.
Returns: None
Example:
# Assuming you have an image classification Taskframe:
tf = Taskframe(data_type="image", task_type="classification", classes=["cat", "dog"])
import pandas as pd
dataframe = pd.DataFrame(...)
tf.add_dataset_from_dataframe(dataframe, column="item")
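For a fuller picture of what such a dataframe might look like, here is a small sketch with an item column, an id column and a label column. The column names and URLs are illustrative only, and the uniqueness check is a precaution suggested by the parameter description rather than something the library is known to require up front.

```python
import pandas as pd

dataframe = pd.DataFrame(
    {
        "item": ["https://server.com/img1.jpg", "https://server.com/img2.jpg"],
        "id": ["img-1", "img-2"],
        "label": ["cat", "dog"],
    }
)

# Custom ids are described as unique, so check that before submitting.
ids_are_unique = bool(dataframe["id"].is_unique)

# If `column` is left undefined, the first column is the one that is used.
first_column = dataframe.columns[0]
print(ids_are_unique, first_column)

# With a real Taskframe `tf`, the matching call would be:
# tf.add_dataset_from_dataframe(
#     dataframe, column="item", input_type="url",
#     custom_id_column="id", label_column="label",
# )
```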