Documentation - QAML/S3QACoreFramework GitHub Wiki

Input Data Format

Different community question answering datasets use different data formats. For instance, the one used at CQA-QL-2016 consists of XML files, whereas Ask Ubuntu consists of plain-text files together with ID files.

In S3QA we opt for a generic XML format, together with readers that convert the original corpus formats into ours (e.g., this one).

We contemplate two potential tasks, described here. XSD templates are provided for each as follows.

Task A: comment ranking given a forum question

Each instance consists of one question and a number of attached thread comments. The elements and attributes are as follows:

  • related_question

    • subject - subject of the question (mandatory)

    • body - body of the question (mandatory)

    • required attributes

      • id - unique identifier
      • lang - language of the text (e.g., en, ar)
      • numberOfCandidates - number of comments in this thread
    • optional attributes

      • category - if the forum has different categories
      • dateTime - posting time
      • userID - unique id of the poster
      • userName - name of the poster
    • comment - the text of a comment in the current thread

      • required attributes

        • id - unique identifier
        • lang - language of the text (e.g., en, ar)
        • index - position of the comment in the thread
        • totalNumberExamples - total number of comments in the dataset
      • optional attributes

        • relevance - whether the comment is relevant to the question or not (e.g., Good, Bad)
        • date - posting time
        • userID - unique id of the poster
        • userName - name of the poster

    The question can have one or more comments within its thread.
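The Task A fields listed above can be checked with a few lines of Python. This is a minimal sketch, not the framework's own reader: the sample instance, the exact nesting of `comment` inside `related_question`, and the helper names `missing_attributes` and `check_task_a` are assumptions for illustration, and the XSD templates remain the authoritative definition.

```python
import xml.etree.ElementTree as ET

# Hypothetical Task A instance. The nesting and all values are assumptions
# derived from the field list above, not copied from the XSD or a real corpus.
SAMPLE_TASK_A = """
<related_question id="Q1" lang="en" numberOfCandidates="2" category="Visas">
  <subject>Work visa renewal</subject>
  <body>How long does it take to renew a work visa?</body>
  <comment id="Q1_C1" lang="en" index="1" totalNumberExamples="2" relevance="Good">It took me about two weeks.</comment>
  <comment id="Q1_C2" lang="en" index="2" totalNumberExamples="2" relevance="Bad">Try asking in another forum.</comment>
</related_question>
"""

QUESTION_REQUIRED = ("id", "lang", "numberOfCandidates")
COMMENT_REQUIRED = ("id", "lang", "index", "totalNumberExamples")


def missing_attributes(element, required):
    """Return the required attribute names that the element lacks."""
    return [name for name in required if name not in element.attrib]


def check_task_a(xml_text):
    """Return a list of problems found in one Task A instance."""
    question = ET.fromstring(xml_text.strip())
    problems = []
    for name in missing_attributes(question, QUESTION_REQUIRED):
        problems.append(f"related_question: missing attribute {name}")
    # subject and body are mandatory child elements of the question.
    for child in ("subject", "body"):
        if question.find(child) is None:
            problems.append(f"related_question: missing element {child}")
    comments = question.findall("comment")
    for comment in comments:
        for name in missing_attributes(comment, COMMENT_REQUIRED):
            problems.append(
                f"comment {comment.get('id', '?')}: missing attribute {name}"
            )
    # numberOfCandidates should match the number of comments in the thread.
    declared = question.get("numberOfCandidates")
    if declared is not None and int(declared) != len(comments):
        problems.append("related_question: numberOfCandidates does not match")
    return problems


if __name__ == "__main__":
    print(check_task_a(SAMPLE_TASK_A))
```

An empty list means that every required element and attribute from the list above is present in the instance.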

Task B: forum question ranking given a fresh question

Each instance consists of one new question and a number of attached forum questions. The elements and attributes are as follows:

  • user_question

    • subject - subject of the question (mandatory)

    • body - body of the question (mandatory)

    • required attributes

      • id - unique identifier
      • lang - language of the text (e.g., en, ar)
      • numberOfCandidates - number of related questions (to be ranked)
    • related_question

      • subject - subject of the question (mandatory)

      • body - body of the question (mandatory)

      • required attributes

        • id - unique identifier
        • lang - language of the text (e.g., en, ar)
        • index - position in the list of related questions (e.g., given by a search engine)
        • totalNumberExamples - total number of question candidates
      • optional attributes

        • relevance - whether the forum question is relevant to the user question or not
        • category - if the forum has different categories
        • date - posting time
        • rank - ranking of the question
        • userID - unique id of the poster
        • userName - name of the poster
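A reader for a new corpus has to emit this structure. The sketch below builds one Task B instance with Python's standard `xml.etree.ElementTree`. The function names, the input dictionaries, the sample values, the 1-based `index`, and the choice to set `totalNumberExamples` to the number of candidates in the instance are all assumptions made for illustration; the XSD template for Task B is the reference for the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical input data; none of these values come from a real corpus.
USER_Q = {"id": "U1", "lang": "en",
          "subject": "Boot problem", "body": "My laptop hangs at boot."}
RELATED = [
    {"id": "U1_R1", "lang": "en", "subject": "System freezes on startup",
     "body": "The machine stops at the splash screen.", "relevance": "Relevant"},
    {"id": "U1_R2", "lang": "en", "subject": "Wifi not detected",
     "body": "No wireless networks are listed."},
]


def add_question_text(parent, subject, body):
    """Attach the mandatory subject and body elements to a question node."""
    ET.SubElement(parent, "subject").text = subject
    ET.SubElement(parent, "body").text = body


def build_task_b(user_q, related):
    """Build a user_question element with nested related_question elements."""
    root = ET.Element("user_question", {
        "id": user_q["id"],
        "lang": user_q["lang"],
        "numberOfCandidates": str(len(related)),
    })
    add_question_text(root, user_q["subject"], user_q["body"])
    for position, rel in enumerate(related, start=1):
        attrs = {
            "id": rel["id"],
            "lang": rel["lang"],
            "index": str(position),  # assumed to be 1-based
            "totalNumberExamples": str(len(related)),  # assumed scope
        }
        # relevance is optional, so it is written only when it is known.
        if "relevance" in rel:
            attrs["relevance"] = rel["relevance"]
        node = ET.SubElement(root, "related_question", attrs)
        add_question_text(node, rel["subject"], rel["body"])
    return root


if __name__ == "__main__":
    print(ET.tostring(build_task_b(USER_Q, RELATED), encoding="unicode"))
```

Keeping the optional attributes out of the output unless they are known means unlabeled test data can share the same writer as labeled training data.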