Shoonya Dataset Types - AI4Bharat/Shoonya GitHub Wiki

Dataset Types

Shoonya currently supports the following dataset types: Block Texts, Conversations, OCR Documents, Sentence Texts, Speech Conversations and Translation Pairs.

Conversations

The Conversation dataset is used for translating text conversations between two or more people from one language to another. Its fields, with their data types and descriptions, are as follows:

  1. Parent Data (a reference to a database object by its integer ID field) - The ID of the parent data item of this record. When a source data item in a particular language is to be translated into other languages, the source data item ID is used as the 'Parent Data', and a separate child conversation data item is created for each language into which the source data item is to be translated.

  2. Instance id (integer) - The ID of the dataset instance to which this record belongs.

  3. Metadata_json (JSON) - The metadata of this conversation data.

  4. Domain (String) - The domain to which this conversation belongs.

  5. Topic (String) - The topic on which this conversation is based.

  6. Prompt (String) - The prompt to which this conversation belongs.

  7. Speaker Count (Integer) - The number of speakers who are part of this conversation

  8. Speaker Details (JSON) - Details of the speakers like speaker_id, name and gender

    Sample Speaker Details JSON: [{"name": "Kaushik", "gender": "M", "speaker_id": 0}, {"name": "Raman", "gender": "M", "speaker_id": 1}]

  9. Language (String) - The language of this conversation if it is the source conversation. Otherwise, the language into which the conversation will be translated.

  10. Conversation_details (JSON) - The source conversation in JSON format. For a child conversation data item, this is initially 'null', since the source data item is referenced in the 'Parent Data' field. The field is later populated with the translated conversation produced by the language expert on the Shoonya project annotation page.

    Sample Conversation Details JSON: [{"sentences": ["Are we planning to visit Chittorgarh Fort?", "I have heard that it has lots of palaces and temples inside it."], "speaker_id": 0}, {"sentences": ["Yes, it does.", "It also has some beautiful lakes"], "speaker_id": 1}, {"sentences": ["I believe this is also a UNESCO World Heritage Site"], "speaker_id": 0}]

  11. Machine_translated_conversation (JSON) - Machine translation of the source conversation.

    Sample Machine Translated Conversation JSON: [{"sentences": ["మనం చిత్తోర్గఢ్ కోటను సందర్శించాలని యోచిస్తున్నామా?", "దీని లోపల చాలా రాజభవనాలు, దేవాలయాలు ఉన్నాయని నేను విన్నాను."], "speaker_id": 0}, {"sentences": ["అవును, అది చేస్తుంది.", "ఇక్కడ కొన్ని అందమైన సరస్సులు కూడా ఉన్నాయి."], "speaker_id": 1}, {"sentences": ["ఇది కూడా యునెస్కో ప్రపంచ వారసత్వ సంపద అని నేను నమ్ముతున్నాను."], "speaker_id": 0}]

Sample Conversation Dataset in CSV format can be found here.
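The two sample JSON structures above are linked by speaker_id, so a quick consistency check before importing a record can be useful. The following is a minimal sketch, not part of Shoonya itself: the function name count_sentences is hypothetical, and it assumes the Speaker Details and Conversation_details values have the shape shown in the samples.

```python
import json


def count_sentences(speaker_details, conversation_details):
    """Return the total number of sentences in a conversation.

    Raises ValueError if a turn refers to a speaker_id that is not
    listed in the speaker details, or if a turn has no sentences.
    """
    known_ids = {speaker["speaker_id"] for speaker in speaker_details}
    total = 0
    for index, turn in enumerate(conversation_details):
        if turn["speaker_id"] not in known_ids:
            raise ValueError(f"turn {index}: unknown speaker_id {turn['speaker_id']}")
        if not turn["sentences"]:
            raise ValueError(f"turn {index}: no sentences")
        total += len(turn["sentences"])
    return total


# Sample values copied from the field descriptions above.
speaker_details = json.loads(
    '[{"name": "Kaushik", "gender": "M", "speaker_id": 0},'
    ' {"name": "Raman", "gender": "M", "speaker_id": 1}]'
)
conversation_details = json.loads("""
[
  {"sentences": ["Are we planning to visit Chittorgarh Fort?",
                 "I have heard that it has lots of palaces and temples inside it."],
   "speaker_id": 0},
  {"sentences": ["Yes, it does.", "It also has some beautiful lakes"],
   "speaker_id": 1},
  {"sentences": ["I believe this is also a UNESCO World Heritage Site"],
   "speaker_id": 0}
]
""")

print(count_sentences(speaker_details, conversation_details))  # prints 5
```

The same check could be run on a Machine_translated_conversation value, since the sample above uses the same structure.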

Speech Conversations

The Speech Conversation dataset is used for transcribing audio data containing either single-speaker speech or a multi-speaker conversation. Its fields, with their data types and descriptions, are as follows:

  1. Parent Data (A reference to database object referenced by integer ID field) - The parent data item id of this record.

  2. Instance id (integer) - The ID of the dataset instance to which this record belongs.

  3. Metadata_json (JSON) - The metadata of this audio file conversation data.

  4. Domain (String) - The domain to which this audio file belongs.

  5. Scenario (String) - The scenario to which this audio belongs.

  6. Speaker Count (Integer) - The number of speakers who are part of this audio data

  7. Speaker Details (JSON) - Details of the speakers like speaker_id, name and gender

  8. Language (String) - This will be the language of the audio.

  9. Audio URL (String) - The location of the audio file to be transcribed.

  10. Audio Duration (Float) - The duration of the audio file.

  11. Reference Raw Transcript (String) - The raw transcript of the speech in the audio file

  12. Prediction JSON (JSON) - The prediction transcription for the audio file

  13. Machine Transcribed JSON (JSON) - The machine-generated transcription for the audio file

  14. Transcribed JSON (JSON) - The transcription done by the annotator.
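As an illustration of how these fields relate to each other, the sketch below checks a single Speech Conversation record for obvious inconsistencies. It is only a sketch under assumptions: the snake_case keys (audio_url, audio_duration, speaker_count, speaker_details), the function name check_speech_record, and the example URL are all hypothetical, and the layout of the transcription JSON fields is not specified here, so they are not inspected.

```python
def check_speech_record(record: dict) -> list:
    """Return a list of human-readable problems found in the record."""
    problems = []

    # Audio URL must be a non-empty string.
    audio_url = record.get("audio_url")
    if not isinstance(audio_url, str) or not audio_url:
        problems.append("audio_url must be a non-empty string")

    # Audio Duration is a float; it should be a positive number.
    duration = record.get("audio_duration")
    if isinstance(duration, bool) or not isinstance(duration, (int, float)) or duration <= 0:
        problems.append("audio_duration must be a positive number")

    # Speaker Count should agree with the number of Speaker Details entries.
    speakers = record.get("speaker_details") or []
    if record.get("speaker_count") != len(speakers):
        problems.append("speaker_count does not match speaker_details")

    return problems


# Hypothetical record; the URL is a placeholder, not a real Shoonya location.
record = {
    "audio_url": "https://example.com/audio/sample.wav",
    "audio_duration": 12.5,
    "speaker_count": 2,
    "speaker_details": [
        {"name": "Kaushik", "gender": "M", "speaker_id": 0},
        {"name": "Raman", "gender": "M", "speaker_id": 1},
    ],
}
print(check_speech_record(record))  # prints []
```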

Sentence Text

The Sentence Text dataset is used for assessing the quality of sentence text data, that is, whether a given sentence is clean or not. Its fields, with their data types and descriptions, are as follows:

  1. Language (String) - Language of the sentence

  2. Text (String) - The sentence to be verified

  3. Context (String) - Context from where the sentence is taken

  4. Corrected Text (String) - The sentence after correcting the mistakes, if any

  5. Domain (String) - The domain to which this sentence belongs.

  6. Quality Status (String) - The quality of the given sentence. It can have one of the following values: Unchecked, Clean, Profane, Difficult Vocabulary, Ambiguous Sentence, Context Incomplete, Corrupt.
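Because Quality Status is a string restricted to a fixed set of values, it maps naturally onto an enumeration. The sketch below is one possible representation, with the status names read from the list above; the class name QualityStatus and the helper parse_quality_status are hypothetical and not taken from the Shoonya code base.

```python
from enum import Enum


class QualityStatus(Enum):
    """Allowed values of the Quality Status field of a Sentence Text record."""
    UNCHECKED = "Unchecked"
    CLEAN = "Clean"
    PROFANE = "Profane"
    DIFFICULT_VOCABULARY = "Difficult Vocabulary"
    AMBIGUOUS_SENTENCE = "Ambiguous Sentence"
    CONTEXT_INCOMPLETE = "Context Incomplete"
    CORRUPT = "Corrupt"


def parse_quality_status(value: str) -> QualityStatus:
    """Convert a raw string into a QualityStatus.

    Surrounding whitespace is ignored; any other value raises ValueError.
    """
    return QualityStatus(value.strip())


print(parse_quality_status("Context Incomplete"))  # prints QualityStatus.CONTEXT_INCOMPLETE
```

Rejecting unknown strings at load time keeps misspelled statuses from entering a dataset unnoticed.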

Translation Pair

The Translation Pair dataset is used for translating a sentence from one language to another. Its fields, with their data types and descriptions, are as follows:

  1. Input Language (String) - Language of the source sentence

  2. Output Language (String) - Language into which the source sentence has to be translated

  3. Input Text (String) - The sentence to be translated.

  4. Output Text (String) - The translation of the given sentence

  5. Context (String) - Context from where the source sentence is taken

  6. Machine Translation (String) - Machine-generated translation of the source sentence

  7. Labse Score (Decimal) - The LaBSE (Language-agnostic BERT Sentence Embedding) score of the translation

  8. Rating (String) - The rating of the translation.
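The field list above can be mirrored in a small record type, which makes it explicit which values are present from the start and which are filled in later. This is a sketch under assumptions: the class name TranslationPair, the snake_case attribute names, the choice of which fields are optional, and the language values "English" and "Telugu" are all hypothetical; the sample sentences are reused from the Conversations section.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class TranslationPair:
    """One Translation Pair record; optional fields default to None."""
    input_language: str
    output_language: str
    input_text: str
    output_text: Optional[str] = None
    context: Optional[str] = None
    machine_translation: Optional[str] = None
    labse_score: Optional[Decimal] = None
    rating: Optional[str] = None

    def is_translated(self) -> bool:
        """True once a non-blank Output Text has been provided."""
        return bool(self.output_text and self.output_text.strip())


pair = TranslationPair(
    input_language="English",   # assumed representation of the language value
    output_language="Telugu",   # assumed representation of the language value
    input_text="It also has some beautiful lakes",
    machine_translation="ఇక్కడ కొన్ని అందమైన సరస్సులు కూడా ఉన్నాయి.",
)
print(pair.is_translated())  # prints False
```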