Datasets - HLT-ISTI/QuaPy GitHub Wiki

Datasets

QuaPy makes available several datasets that have been used in quantification literature, as well as an interface to allow anyone import their custom datasets.

A Dataset object in QuaPy is roughly a pair of LabelledCollection objects, one playing the role of the training set, another the test set. LabelledCollection is a data class consisting of the (iterable) instances and labels. This class handles most of the sampling functionality in QuaPy. Take a look at the following code:

import quapy as qp
import quapy.functional as F

instances = [
    '1st positive document', '2nd positive document',
    'the only negative document',
    '1st neutral document', '2nd neutral document', '3rd neutral document'
]
labels = [2, 2, 0, 1, 1, 1]

data = qp.data.LabelledCollection(instances, labels)
print(F.strprev(data.prevalence(), prec=2))

Output the class prevalences (showing 2 digit precision):

[0.17, 0.50, 0.33]

One can easily produce new samples at desired class prevalence values:

sample_size = 10
prev = [0.4, 0.1, 0.5]
sample = data.sampling(sample_size, *prev)

print('instances:', sample.instances)
print('labels:', sample.labels)
print('prevalence:', F.strprev(sample.prevalence(), prec=2))

Which outputs:

instances: ['the only negative document' '2nd positive document'
 '2nd positive document' '2nd neutral document' '1st positive document'
 'the only negative document' 'the only negative document'
 'the only negative document' '2nd positive document'
 '1st positive document']
labels: [0 2 2 1 2 0 0 0 2 2]
prevalence: [0.40, 0.10, 0.50]

Samples can be made consistent across different runs (e.g., to test different methods on the same exact samples) by sampling and retaining the indexes, that can then be used to generate the sample:

index = data.sampling_index(sample_size, *prev)
for method in methods:
    sample = data.sampling_from_index(index)
    ...

However, generating samples for evaluation purposes is tackled in QuaPy by means of the evaluation protocols (see the dedicated entries in the Wiki for evaluation and protocols).

Reviews Datasets

Three datasets of reviews about Kindle devices, Harry Potter's series, and the well-known IMDb movie reviews can be fetched using a unified interface. For example:

import quapy as qp
data = qp.datasets.fetch_reviews('kindle')

These datasets have been used in:

Esuli, A., Moreo, A., & Sebastiani, F. (2018, October). 
A recurrent neural network for sentiment quantification. 
In Proceedings of the 27th ACM International Conference on 
Information and Knowledge Management (pp. 1775-1778).

The list of reviews ids is available in:

qp.datasets.REVIEWS_SENTIMENT_DATASETS

Some statistics of the fhe available datasets are summarized below:

Dataset	classes	train size	test size	train prev	test prev	type
hp	2	9533	18399	[0.018, 0.982]	[0.065, 0.935]	text
kindle	2	3821	21591	[0.081, 0.919]	[0.063, 0.937]	text
imdb	2	25000	25000	[0.500, 0.500]	[0.500, 0.500]	text

Twitter Sentiment Datasets

11 Twitter datasets for sentiment analysis. Text is not accessible, and the documents were made available in tf-idf format. Each dataset presents two splits: a train/val split for model selection purposes, and a train+val/test split for model evaluation. The following code exemplifies how to load a twitter dataset for model selection.

import quapy as qp
data = qp.datasets.fetch_twitter('gasp', for_model_selection=True)

The datasets were used in:

Gao, W., & Sebastiani, F. (2015, August). 
Tweet sentiment: From classification to quantification. 
In 2015 IEEE/ACM International Conference on Advances in 
Social Networks Analysis and Mining (ASONAM) (pp. 97-104). IEEE.

Three of the datasets (semeval13, semeval14, and semeval15) share the same training set (semeval), meaning that the training split one would get when requesting any of them is the same. The dataset "semeval" can only be requested with "for_model_selection=True". The lists of the Twitter dataset's ids can be consulted in:

# a list of 11 dataset ids that can be used for model selection or model evaluation
qp.datasets.TWITTER_SENTIMENT_DATASETS_TEST

# 9 dataset ids in which "semeval13", "semeval14", and "semeval15" are replaced with "semeval"
qp.datasets.TWITTER_SENTIMENT_DATASETS_TRAIN

Some details can be found below:

Dataset	classes	train size	test size	features	train prev	test prev	type
gasp	3	8788	3765	694582	[0.421, 0.496, 0.082]	[0.407, 0.507, 0.086]	sparse
hcr	3	1594	798	222046	[0.546, 0.211, 0.243]	[0.640, 0.167, 0.193]	sparse
omd	3	1839	787	199151	[0.463, 0.271, 0.266]	[0.437, 0.283, 0.280]	sparse
sanders	3	2155	923	229399	[0.161, 0.691, 0.148]	[0.164, 0.688, 0.148]	sparse
semeval13	3	11338	3813	1215742	[0.159, 0.470, 0.372]	[0.158, 0.430, 0.412]	sparse
semeval14	3	11338	1853	1215742	[0.159, 0.470, 0.372]	[0.109, 0.361, 0.530]	sparse
semeval15	3	11338	2390	1215742	[0.159, 0.470, 0.372]	[0.153, 0.413, 0.434]	sparse
semeval16	3	8000	2000	889504	[0.157, 0.351, 0.492]	[0.163, 0.341, 0.497]	sparse
sst	3	2971	1271	376132	[0.261, 0.452, 0.288]	[0.207, 0.481, 0.312]	sparse
wa	3	2184	936	248563	[0.305, 0.414, 0.281]	[0.282, 0.446, 0.272]	sparse
wb	3	4259	1823	404333	[0.270, 0.392, 0.337]	[0.274, 0.392, 0.335]	sparse

UCI Machine Learning

Binary datasets

A set of 32 datasets from the UCI Machine Learning repository used in:

Pérez-Gállego, P., Quevedo, J. R., & del Coz, J. J. (2017).
Using ensembles for problems with characterizable changes 
in data distribution: A case study on quantification.
Information Fusion, 34, 87-100.

The list does not exactly coincide with that used in Pérez-Gállego et al. 2017 since we were unable to find the datasets with ids "diabetes" and "phoneme".

These dataset can be loaded by calling, e.g.:

import quapy as qp

data = qp.datasets.fetch_UCIBinaryDataset('yeast', verbose=True)

This call will return a Dataset object in which the training and test splits are randomly drawn, in a stratified manner, from the whole collection at 70% and 30%, respectively. The verbose=True option indicates that the dataset description should be printed in standard output. The original data is not split, and some papers submit the entire collection to a kFCV validation. In order to accommodate with these practices, one could first instantiate the entire collection, and then creating a generator that will return one training+test dataset at a time, following a kFCV protocol:

import quapy as qp

collection = qp.datasets.fetch_UCIBinaryLabelledCollection("yeast")
for data in qp.data.Dataset.kFCV(collection, nfolds=5, nrepeats=2):
  ...

Above code will allow to conduct a 2x5FCV evaluation on the "yeast" dataset.

All datasets come in numerical form (dense matrices); some statistics are summarized below.

Dataset	classes	instances	features	prev	type
acute.a	2	120	6	[0.508, 0.492]	dense
acute.b	2	120	6	[0.583, 0.417]	dense
balance.1	2	625	4	[0.539, 0.461]	dense
balance.2	2	625	4	[0.922, 0.078]	dense
balance.3	2	625	4	[0.539, 0.461]	dense
breast-cancer	2	683	9	[0.350, 0.650]	dense
cmc.1	2	1473	9	[0.573, 0.427]	dense
cmc.2	2	1473	9	[0.774, 0.226]	dense
cmc.3	2	1473	9	[0.653, 0.347]	dense
ctg.1	2	2126	22	[0.222, 0.778]	dense
ctg.2	2	2126	22	[0.861, 0.139]	dense
ctg.3	2	2126	22	[0.917, 0.083]	dense
german	2	1000	24	[0.300, 0.700]	dense
haberman	2	306	3	[0.735, 0.265]	dense
ionosphere	2	351	34	[0.641, 0.359]	dense
iris.1	2	150	4	[0.667, 0.333]	dense
iris.2	2	150	4	[0.667, 0.333]	dense
iris.3	2	150	4	[0.667, 0.333]	dense
mammographic	2	830	5	[0.514, 0.486]	dense
pageblocks.5	2	5473	10	[0.979, 0.021]	dense
semeion	2	1593	256	[0.901, 0.099]	dense
sonar	2	208	60	[0.534, 0.466]	dense
spambase	2	4601	57	[0.606, 0.394]	dense
spectf	2	267	44	[0.794, 0.206]	dense
tictactoe	2	958	9	[0.653, 0.347]	dense
transfusion	2	748	4	[0.762, 0.238]	dense
wdbc	2	569	30	[0.627, 0.373]	dense
wine.1	2	178	13	[0.669, 0.331]	dense
wine.2	2	178	13	[0.601, 0.399]	dense
wine.3	2	178	13	[0.730, 0.270]	dense
wine-q-red	2	1599	11	[0.465, 0.535]	dense
wine-q-white	2	4898	11	[0.335, 0.665]	dense
yeast	2	1484	8	[0.711, 0.289]	dense

Issues:

All datasets will be downloaded automatically the first time they are requested, and stored in the quapy_data folder for faster further reuse. However, some datasets require special actions that at the moment are not fully automated.

Datasets with ids "ctg.1", "ctg.2", and "ctg.3" (Cardiotocography Data Set) load an Excel file, which requires the user to install the xlrd Python module in order to open it.
The dataset with id "pageblocks.5" (Page Blocks Classification (5)) needs to open a "unix compressed file" (extension .Z), which is not directly doable with standard Pythons packages like gzip or zip. This file would need to be uncompressed using OS-dependent software manually. Information on how to do it will be printed the first time the dataset is invoked.
It is a good idea to ignore datasets acute.a, acute.b and balance.2, since the former two are very easy (many classifiers would score 100% accuracy) while the latter is extremely difficult (probably there is some problem with this dataset, the errors it tends to produce are orders of magnitude greater than for other datasets, and this has a disproportionate impact in the average performance).

Multiclass datasets

A collection of 5 multiclass datasets from the UCI Machine Learning repository. The datasets were first used in this paper and can be instantiated as follows:

import quapy as qp
data = qp.datasets.fetch_UCIMulticlassLabelledCollection('dry-bean', test_split=0.4, verbose=True)

There are no pre-defined train-test partitions for these datasets, but you can easily create your own with the split_stratified method, e.g., data.split_stratified(). This is equivalent to calling the following method directly:

data = qp.datasets.fetch_UCIMulticlassDataset('dry-bean', test_split=0.4, verbose=True)

The datasets correspond to all the datasets that can be retrieved from the platform using the following filters:

datasets for classification
more than 2 classes
containing at least 1,000 instances
can be imported using the Python API.

Some statistics about these datasets are displayed below:

Dataset	classes	train size	test size
dry-bean	7	9527	4084
wine-quality	7	3428	1470
academic-success	3	3096	1328
digits	10	3933	1687
letter	26	14000	6000

LeQua 2022 Datasets

QuaPy also provides the datasets used for the LeQua 2022 competition. In brief, there are 4 tasks (T1A, T1B, T2A, T2B) having to do with text quantification problems. Tasks T1A and T1B provide documents in vector form, while T2A and T2B provide raw documents instead. Tasks T1A and T2A are binary sentiment quantification problems, while T2A and T2B are multiclass quantification problems consisting of estimating the class prevalence values of 28 different merchandise products.

Every task consists of a training set, a set of validation samples (for model selection) and a set of test samples (for evaluation). QuaPy returns this data as a LabelledCollection (training) and two generation protocols (for validation and test samples), as follows:

training, val_generator, test_generator = fetch_lequa2022(task=task)

See the lequa2022_experiments.py in the examples folder for further details on how to carry out experiments using these datasets.

The datasets are downloaded only once, and stored for fast reuse.

Some statistics are summarized below:

Dataset	classes	train size	validation samples	test samples	docs by sample	type
T1A	2	5000	1000	5000	250	vector
T1B	28	20000	1000	5000	1000	vector
T2A	2	5000	1000	5000	250	text
T2B	28	20000	1000	5000	1000	text

For further details on the datasets, we refer to the original paper:

Esuli, A., Moreo, A., Sebastiani, F., & Sperduti, G. (2022).
A Detailed Overview of LeQua@ CLEF 2022: Learning to Quantify.

IFCB Plankton dataset

IFCB is a dataset of plankton species in water samples hosted in Zenodo <https://zenodo.org/records/10036244>. This dataset is based on the data available publicly at WHOI-Plankton repo <https://github.com/hsosik/WHOI-Plankton> and in the scripts for the processing are available at P. González's repo <https://github.com/pglez82/IFCB_Zenodo>_.

This dataset comes with precomputed features for testing quantification algorithms.

Some statistics:

	Training	Validation	Test
samples	200	86	678
total instances	584474	246916	2626429
mean per sample	2922.3	2871.1	3873.8
min per sample	266	59	33
max per sample	6645	7375	9112

The number of features is 512, while the number of classes is 50. In terms of prevalence, the mean is 0.020, the minimum is 0, and the maximum is 0.978.

The dataset can be loaded for model selection (for_model_selection=True, thus returning the training and validation) or for test (for_model_selection=False, thus returning the training+validation and the test).

Additionally, the training can be interpreted as a list (a generator) of samples (single_sample_train=False) or as a single training set (single_sample_train=True).

Example:

train, val_gen = qp.datasets.fetch_IFCB(for_model_selection=True, single_sample_train=True)
# ... model selection

train, test_gen = qp.datasets.fetch_IFCB(for_model_selection=False, single_sample_train=True)
# ... train and evaluation

Adding Custom Datasets

QuaPy provides data loaders for simple formats dealing with text, following the format:

class-id \t first document's pre-processed text \n
class-id \t second document's pre-processed text \n
...

and sparse representations of the form:

{-1, 0, or +1} col(int):val(float) col(int):val(float) ... \n
...

The code in charge in loading a LabelledCollection is:

@classmethod
def load(cls, path:str, loader_func:callable):
    return LabelledCollection(*loader_func(path))

indicating that any loader_func (e.g., a user-defined one) which returns valid arguments for initializing a LabelledCollection object will allow to load any collection. In particular, the LabelledCollection receives as arguments the instances (as an iterable) and the labels (as an iterable) and, additionally, the number of classes can be specified (it would otherwise be inferred from the labels, but that requires at least one positive example for all classes to be present in the collection).

The same loader_func can be passed to a Dataset, along with two paths, in order to create a training and test pair of LabelledCollection, e.g.:

import quapy as qp

train_path = '../my_data/train.dat'
test_path = '../my_data/test.dat'

def my_custom_loader(path):
    with open(path, 'rb') as fin:
        ...
    return instances, labels

data = qp.data.Dataset.load(train_path, test_path, my_custom_loader)

Data Processing

QuaPy implements a number of preprocessing functions in the package qp.data.preprocessing, including:

text2tfidf: tfidf vectorization
reduce_columns: reducing the number of columns based on term frequency
standardize: transforms the column values into z-scores (i.e., subtract the mean and normalizes by the standard deviation, so that the column values have zero mean and unit variance).
index: transforms textual tokens into lists of numeric ids)