Dataset - Sablayrolles/debates GitHub Wiki
The following scripts help you manage the dataset:
- directory_creator.sh: creates the directories automatically
echo "Usage : directory_creator.sh -[h|c|y|d] (name) (year)"
echo "-h : show the help and exit"
echo "-c : create the country 'name'"
echo "-y : add the year 'year' for the country 'name'"
echo "-d : add a debate in the country 'name' for the year 'year'"
- create_aam_as_files.py: creates the .aam and .as files (directory_creator.sh must be run first). See below for the structure of infos.xml.
The dataset is grouped by country and by year of debate, following this tree:
dataset/
    country_name/
        year/
            num_of_debate/
                annotated/
                    ac-aa/
                        debate.aam
                        debate.as
                brut/
                    x.txt
                    ...
                full/
                    full.txt
                reactions/
                    x.txt
                    ...
                segmented/
                    x.txt
                    ...
                infos.xml
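As an illustration of the tree above, here is a minimal Python sketch that builds the path of one debate and reports which expected entries are missing. The function names `debate_dir` and `missing_entries` are hypothetical and not part of the repository; the list of expected entries is simply copied from the tree.

```python
from pathlib import Path

# Sub-directories listed in the tree above (relative to one debate directory).
SUBDIRS = ("annotated/ac-aa", "brut", "full", "reactions", "segmented")


def debate_dir(root, country, year, num):
    """Return the directory of one debate: root/country/year/num."""
    return Path(root) / country / str(year) / str(num)


def missing_entries(debate):
    """List the expected sub-directories and files that do not exist yet.

    Hypothetical helper: it only checks the entries named in the tree,
    not the numbered x.txt files or the Glozz files.
    """
    expected = [debate / sub for sub in SUBDIRS]
    expected += [debate / "full" / "full.txt", debate / "infos.xml"]
    return [path for path in expected if not path.exists()]
```

For example, `missing_entries(debate_dir("dataset", "usa", 2016, 1))` returns an empty list once directory_creator.sh has created the directories and both `full.txt` and `infos.xml` are in place.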
Descriptions
- full.txt contains the full transcript of the debate
- brut/ contains the raw transcript split into topics (questions), numbered from 1 to x
- segmented/ contains the segmented transcript split into topics (questions), numbered from 1 to x, with '&' as the separator character (an auto-generation method still needs to be added)
- reactions/ uses the same format as brut/ but also contains the participants' reactions (the format will change in the future for easier use and annotation with Glozz)
- annotated/ contains the annotated Glozz files and the parameters file (an auto-generation method still needs to be added)
- infos.xml contains metadata about the debate
(Scripts still need to be added to auto-generate the Glozz files and the segmented files, and to make annotating reactions easier.)
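To show how the '&' separator in segmented/ could be read back, here is a small Python sketch. It assumes that '&' never appears in the transcript text itself, which the wiki does not state; `split_segments` is a hypothetical name, not an existing script.

```python
def split_segments(text, sep="&"):
    """Split the content of one segmented file into its segments.

    Assumption: `sep` is used only as the split character. Surrounding
    whitespace is trimmed and empty pieces are dropped.
    """
    return [part.strip() for part in text.split(sep) if part.strip()]
```

Calling `split_segments("NAME: first unit & second unit &")` yields the two trimmed segments in order.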
The infos.xml file must strictly follow this format:
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<informations>
    <debateNum>1</debateNum>
    <country>usa</country>
    <language>en</language>
    <numberQuestion>9</numberQuestion>
    <date>09/26/16</date>
    <participant>
        <presentator>HOLT</presentator>
        <candidate>CLINTON</candidate>
        <candidate>TRUMP</candidate>
    </participant>
</informations>
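A file in this format can be read with Python's standard `xml.etree.ElementTree` module. The sketch below is only an illustration, not the code used by create_aam_as_files.py; `read_infos` is a hypothetical name, and the date layout (month/day/two-digit year) is an assumption inferred from the single example above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime


def read_infos(xml_data):
    """Parse the content of an infos.xml file into a plain dictionary."""
    root = ET.fromstring(xml_data)
    participant = root.find("participant")
    return {
        "debateNum": int(root.findtext("debateNum")),
        "country": root.findtext("country"),
        "language": root.findtext("language"),
        "numberQuestion": int(root.findtext("numberQuestion")),
        # Assumption: dates are written as MM/DD/YY, as in the example.
        "date": datetime.strptime(root.findtext("date"), "%m/%d/%y").date(),
        "presentator": participant.findtext("presentator"),
        "candidates": [c.text for c in participant.findall("candidate")],
    }
```

To read a file from disk, pass its bytes, for example `read_infos(Path("infos.xml").read_bytes())` with `Path` imported from `pathlib`.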
The transcript must be formatted as below:
NAME: text
NAME: text
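The `NAME: text` layout can be turned into (speaker, text) pairs with a short Python sketch. Two points are assumptions rather than rules stated here: speaker names are taken to be upper-case (as in the infos.xml example), and a line without a `NAME:` prefix is treated as the continuation of the previous turn. `parse_turns` is a hypothetical name.

```python
import re

# Assumption: a speaker name starts with an upper-case letter and contains
# only upper-case letters, spaces, dots, apostrophes or hyphens before ':'.
TURN = re.compile(r"^([A-Z][A-Z .'\-]*):\s*(.*)$")


def parse_turns(text):
    """Return a list of (speaker, text) tuples from a transcript."""
    turns = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        match = TURN.match(line)
        if match:
            turns.append((match.group(1), match.group(2)))
        elif turns:
            # No "NAME:" prefix: append the line to the previous turn.
            speaker, said = turns[-1]
            turns[-1] = (speaker, f"{said} {line}".strip())
    return turns
```

Lines that appear before the first `NAME:` prefix are ignored by this sketch.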