Shoonya Dataset Types - AI4Bharat/Shoonya GitHub Wiki

Dataset Types

Shoonya currently supports the following dataset types: Block Texts, Conversations, OCR Documents, Sentence Texts, Speech Conversations, and Translation Pairs.

`SentenceText`

The SentenceText dataset stores monolingual sentences along with relevant metadata. Its fields, with their data types and descriptions, are as follows:

Language (String) - The language of the sentence.
Text (String) - The sentence text.
Context (String) - The context in which the sentence is used (optional).
Corrected Text (String) - The corrected version of the sentence text (optional).
Domain (String) - The domain to which the sentence belongs. Choices include:
- None
- Business
- Culture
- General
- News
- Education
- Legal
- Government-Press-Release
- Healthcare
- Agriculture
- Automobile
- Tourism
- Financial
- Movies
- Subtitles
- Sports
- Technology
- Lifestyle
- Entertainment
- Parliamentary
- Art-and-Culture
- Economy
- History
- Philosophy
- Religion
- National-Security-and-Defence
- Literature
- Geography
Quality Status (String) - The quality status of the sentence: either unchecked, or the outcome of the quality check. Choices include:
- Unchecked
- Clean
- Profane
- Difficult vocabulary
- Ambiguous sentence
- Context incomplete
- Corrupt
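
The field list above can be sketched as a small Python record with choice validation. This is only an illustration: the class name `SentenceTextRecord`, the snake_case attribute names, and the sample language value are assumptions, not Shoonya's actual model or API. Only the choice lists are taken from this page.

```python
from dataclasses import dataclass
from typing import Optional

# Choices copied from the Domain and Quality Status lists above.
DOMAIN_CHOICES = {
    "None", "Business", "Culture", "General", "News", "Education", "Legal",
    "Government-Press-Release", "Healthcare", "Agriculture", "Automobile",
    "Tourism", "Financial", "Movies", "Subtitles", "Sports", "Technology",
    "Lifestyle", "Entertainment", "Parliamentary", "Art-and-Culture",
    "Economy", "History", "Philosophy", "Religion",
    "National-Security-and-Defence", "Literature", "Geography",
}
QUALITY_STATUS_CHOICES = {
    "Unchecked", "Clean", "Profane", "Difficult vocabulary",
    "Ambiguous sentence", "Context incomplete", "Corrupt",
}


@dataclass
class SentenceTextRecord:  # hypothetical name, not Shoonya's model class
    language: str
    text: str
    domain: str
    quality_status: str
    context: Optional[str] = None         # optional per the field list
    corrected_text: Optional[str] = None  # optional per the field list

    def __post_init__(self) -> None:
        # Reject values that are not in the documented choice lists.
        if self.domain not in DOMAIN_CHOICES:
            raise ValueError(f"unknown domain: {self.domain!r}")
        if self.quality_status not in QUALITY_STATUS_CHOICES:
            raise ValueError(f"unknown quality status: {self.quality_status!r}")


# Sample values are made up; the expected format of `language` is an assumption.
record = SentenceTextRecord(
    language="Hindi",
    text="यह एक वाक्य है।",
    domain="General",
    quality_status="Unchecked",
)
```

Constructing a record with a value outside either list, such as `quality_status="Checked"`, raises `ValueError`.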

`TranslationPair`

The TranslationPair dataset stores pairs of text translations between different languages. Its fields, with their data types and descriptions, are as follows:

Input Language (String) - The language of the input text.
Output Language (String) - The language of the output text.
Input Text (String) - The text to be translated.
Output Text (String) - The translated text (optional).
Machine Translation (String) - The machine-translated version of the input text (optional).
Context (String) - The context in which the input text is used (optional).
LaBSE Score (Decimal) - The LaBSE similarity score between the input and output text, used as an indicator of translation quality (optional).
Rating (Integer) - The rating of the translation (optional).
Domain (String) - The domain to which the translation pair belongs. Choices include:
- None
- Business
- Culture
- General
- News
- Education
- Legal
- Government-Press-Release
- Healthcare
- Agriculture
- Automobile
- Tourism
- Financial
- Movies
- Subtitles
- Sports
- Technology
- Lifestyle
- Entertainment
- Parliamentary
- Art-and-Culture
- Economy
- History
- Philosophy
- Religion
- National-Security-and-Defence
- Literature
- Geography
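
To make the split between required and optional fields concrete, here is a minimal sketch that checks a TranslationPair-like dictionary for missing required values. The snake_case keys, the helper `missing_required`, and the sample values (including the score) are hypothetical. Which fields count as required is read off the "(optional)" markers in the list above, not taken from Shoonya's code.

```python
from decimal import Decimal

# Fields not marked "(optional)" in the list above; key names are assumed.
REQUIRED_FIELDS = ("input_language", "output_language", "input_text", "domain")


def missing_required(record: dict) -> list:
    """Return the required keys that are absent or empty in `record`."""
    return [key for key in REQUIRED_FIELDS if record.get(key) in (None, "")]


pair = {
    "input_language": "English",
    "output_language": "Hindi",
    "input_text": "Good morning.",
    "domain": "General",
    "labse_score": Decimal("0.87"),  # made-up value, for illustration only
}

print(missing_required(pair))                             # []
print(missing_required({"input_text": "Good morning."}))  # ['input_language', 'output_language', 'domain']
```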

`OCRDocument`

The OCRDocument dataset stores information related to OCR files and their annotations. Its fields, with their data types and descriptions, are as follows:

File Type (String) - The type of OCR file. Choices include:
- PDF
- JPG
- JPEG
- PNG
File URL (String) - The URL of the OCR file (optional).
Image URL (String) - The URL of the image associated with the OCR document.
Page Number (Integer) - The page number in the OCR document.
Language (String) - The language of the OCR document.
OCR Type (String) - The type of OCR performed. Choices include:
- ST (ScenicText)
- DT (DenseText)
- PR (Printed)
- HN (Handwritten)
OCR Domain (String) - The domain of the OCR content. Choices include:
- BO (Books)
- FO (Forms)
- OT (Others)
- TB (Textbooks)
- NV (Novels)
- NP (Newspapers)
- MG (Magazines)
- RP (Research_Papers)
- FM (Form)
- BR (Brochure_Posters_Leaflets)
- AR (Acts_Rules)
- PB (Publication)
- NT (Notice)
- SY (Syllabus)
- QP (Question_Papers)
- MN (Manual)
OCR Transcribed JSON (JSON) - The transcribed content of the OCR document (optional).
OCR Prediction JSON (JSON) - The predicted OCR content (optional).
Image Details JSON (JSON) - Details of the image used in the OCR process (optional).
BBoxes Relation JSON (JSON) - The relationship between bounding boxes in the OCR document (optional).
BBoxes Relation Prediction JSON (JSON) - The predicted relationships between bounding boxes (optional).
Annotated Document Details JSON (JSON) - Details of the annotated OCR document (optional).
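
The OCR Type and OCR Domain fields take two-letter codes. A plain lookup table, sketched below, can expand them into the labels listed above. The dictionary and function names are hypothetical; the code-to-label pairs are copied from this page.

```python
# Code-to-label pairs copied from the OCR Type and OCR Domain lists above.
OCR_TYPES = {
    "ST": "ScenicText",
    "DT": "DenseText",
    "PR": "Printed",
    "HN": "Handwritten",
}
OCR_DOMAINS = {
    "BO": "Books", "FO": "Forms", "OT": "Others", "TB": "Textbooks",
    "NV": "Novels", "NP": "Newspapers", "MG": "Magazines",
    "RP": "Research_Papers", "FM": "Form", "BR": "Brochure_Posters_Leaflets",
    "AR": "Acts_Rules", "PB": "Publication", "NT": "Notice", "SY": "Syllabus",
    "QP": "Question_Papers", "MN": "Manual",
}


def describe_ocr(ocr_type: str, ocr_domain: str) -> str:
    """Expand the two codes into labels; raises KeyError for unknown codes."""
    return f"{OCR_TYPES[ocr_type]} / {OCR_DOMAINS[ocr_domain]}"


print(describe_ocr("PR", "NP"))  # Printed / Newspapers
```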

`BlockText`

The BlockText dataset stores monolingual blocks of text that contain multiple sentences. Its fields, with their data types and descriptions, are as follows:

Language (String) - The language of the block of text.
Text (String) - The block of text containing multiple sentences.
Splitted Text Prediction (JSON) - The predicted splitting of the block text into individual sentences (optional).
Splitted Text (String) - The actual sentences split from the block of text (optional).
Domain (String) - The domain to which the block text belongs.

`Conversation`

The Conversation dataset stores conversation data, including details about speakers and the content of the conversation. Its fields, with their data types and descriptions, are as follows:

Domain (String) - The domain to which the conversation belongs.
Topic (String) - The topic of the conversation (optional).
Scenario (String) - The scenario in which the conversation takes place (optional).
Prompt (String) - The prompt that initiated the conversation (optional).
Speaker Count (Integer) - The number of speakers involved in the conversation.
Speakers Details (JSON) - Detailed information about the speakers (optional).
Language (String) - The language of the conversation.
Conversation Details (JSON) - The content of the conversation in JSON format (optional).
Machine Translated Conversation (JSON) - The machine-translated version of the conversation (optional).
Unverified Conversation (JSON) - The unverified version of the conversation data (optional).
Conversation Quality Status (String) - The quality status of the conversation: either unchecked, or the outcome of the quality check. Choices include:
- Unchecked
- Clean
- Profane
- Difficult vocabulary
- Ambiguous sentence
- Context incomplete
- Corrupt

`SpeechConversation`

The SpeechConversation dataset stores conversation data derived from speech, including speaker information and transcription details. Its fields, with their data types and descriptions, are as follows:

Domain (String) - The domain of the speech conversation.
Scenario (String) - The scenario in which the speech conversation takes place (optional).
Speaker Count (Integer) - The number of speakers involved in the speech conversation.
Speakers Details (JSON) - Detailed information about the speakers.
Language (String) - The language of the speech conversation.
Transcribed JSON (JSON) - The transcribed conversation data (optional).
Machine Transcribed JSON (JSON) - The machine-transcribed version of the conversation data (optional).
Audio URL (String) - The URL of the audio file associated with the conversation.
Audio Duration (Float) - The length of the audio file in seconds.
Reference Raw Transcript (String) - The plain text transcription used by the speaker (optional).
Prediction JSON (JSON) - The transcription predicted by the implemented models (optional).
Final Transcribed JSON (JSON) - The final version of the transcribed conversation data (optional).
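
As a rough illustration of how the required SpeechConversation fields relate to each other, the sketch below runs a few consistency checks on a record held as a dictionary. Everything here is an assumption made for the example: the snake_case keys, the helper `check_speech_record`, the sample URL, and in particular the idea that Speakers Details is a JSON list with one entry per speaker. This page does not specify the JSON structure.

```python
def check_speech_record(record: dict) -> list:
    """Return a list of human-readable problems found in `record`."""
    problems = []

    # Assumption: speakers_details is a list with one entry per speaker.
    speakers = record.get("speakers_details") or []
    if record.get("speaker_count") != len(speakers):
        problems.append("speaker_count does not match speakers_details")

    if not record.get("audio_url"):
        problems.append("audio_url is missing")

    # Audio Duration is documented as a float number of seconds.
    duration = record.get("audio_duration")
    if not isinstance(duration, (int, float)) or duration <= 0:
        problems.append("audio_duration must be a positive number of seconds")

    return problems


sample = {
    "domain": "General",
    "language": "Hindi",
    "speaker_count": 2,
    "speakers_details": [{"name": "Speaker 1"}, {"name": "Speaker 2"}],
    "audio_url": "https://example.com/audio/sample.wav",  # placeholder URL
    "audio_duration": 12.5,
}

print(check_speech_record(sample))  # []
```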