Quality Assurance process - d-fine/Dataland GitHub Wiki

In Dataland, external contributors can support quality assurance by uploading Quality Assurance (QA) reports for individual data sets. Based on these reports, data uploaders or Dataland itself can improve the quality of the data sets. QA reports are visible to every Dataland user and will be displayed in the front end next to the data set itself. Generally, the company affiliation of the provider of a QA report can also be made visible.

Uploading QA reports requires REVIEWER rights for the API. If you are interested in supporting quality assurance for Dataland and receiving REVIEWER rights, please get in contact with Erik Breen via [email protected].

The following sections describe how to create and upload a QA report.

Downloading a data set to perform QA

Quality assurance can be performed for data sets in QaStatus ACCEPTED or PENDING.

Datasets for which quality assurance is outstanding can be identified with the GET /datasets endpoint of the Dataland QA API. REVIEWER rights are required to access this endpoint.
An overview of datasets that are already accepted is available through the GET /metadata endpoint of the Dataset API.
For datasets in both PENDING and ACCEPTED status, the GET /datasets/{dataId} endpoint can be used to identify the DataTypeEnum of the dataset, e.g. sfdr.
A data set can be retrieved with the GET endpoint of its framework, e.g. GET /data/sfdr/{dataId} for an SFDR data set.
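
As a rough illustration, the requests described above could be prepared in Python as follows. This is only a sketch: the base URLs `QA_API` and `DATASET_API`, the Bearer token header, and the helper names are assumptions that this page does not confirm, and placing the framework endpoint on the Dataset API is likewise assumed. The requests are built but not sent.

```python
from urllib.request import Request

# Assumed base URLs, not confirmed by this page; adjust to your environment.
QA_API = "https://dataland.com/qa"
DATASET_API = "https://dataland.com/api"


def build_get(base: str, path: str, token: str) -> Request:
    """Build (but do not send) a GET request with an assumed Bearer token header."""
    return Request(
        base + path,
        headers={"Authorization": f"Bearer {token}"},
        method="GET",
    )


def pending_datasets_request(token: str) -> Request:
    # Datasets for which QA is outstanding (REVIEWER rights required).
    return build_get(QA_API, "/datasets", token)


def accepted_metadata_request(token: str) -> Request:
    # Overview of datasets that are already accepted.
    return build_get(DATASET_API, "/metadata", token)


def framework_data_request(data_type: str, data_id: str, token: str) -> Request:
    # Framework-specific endpoint, e.g. /data/sfdr/{dataId}.
    return build_get(DATASET_API, f"/data/{data_type}/{data_id}", token)


request = framework_data_request("sfdr", "example-data-id", "example-token")
print(request.get_method(), request.full_url)
```

A prepared request can then be sent with `urllib.request.urlopen(request)` once the base URLs and the token have been checked against the Swagger UI of the respective API.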

Downloading referenced documents

In a data set, many data points are accompanied by a data source. Take this sample extract of an SFDR data set:

{
  "companyId": "7475cfd8-4715-4495-bfcf-ae1bdaf92466",
  "reportingPeriod": "2023",
  "data": {
    "general": {
    },
    "environmental": {
      "greenhouseGasEmissions": {
        "scope1GhgEmissionsInTonnes": {
          "value": 1100,
          "quality": "Reported",
          "comment": "The company's greenhouse gas emissions for Scope 1 is 1,100 tons of CO2e in 2023.",
          "dataSource": {
            "page": 60,
            "tagName": null,
            "fileName": "YearlyReport.pdf",
            "fileReference": "dfc3d090d4d1265f1b7dc41f52fdadf7b249af9fd852079936c4c830f4b91200"
          }
        }
      }
    },
    "social": {
    }
  }
}

The value of 1,100 tonnes of CO2e Scope 1 greenhouse gas emissions should be verifiable against page 60 of the document with the reference dfc3d090d4d1265f1b7dc41f52fdadf7b249af9fd852079936c4c830f4b91200.

With this reference, the document in question can be downloaded from the GET /{documentId} endpoint of the Document API.
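
To find every document that needs to be downloaded for a data set, the `dataSource` entries can be collected by walking the JSON structure. The following is a minimal sketch; the helper names `iter_data_sources` and `document_url` are hypothetical, and the base URL `DOCUMENT_API` is an assumption that should be checked against the Document API documentation.

```python
from typing import Any, Iterator

# Assumed base URL of the Document API; not confirmed by this page.
DOCUMENT_API = "https://dataland.com/documents"


def iter_data_sources(node: Any, path: tuple = ()) -> Iterator[tuple]:
    """Yield (data point path, fileReference, page) for every dataSource found."""
    if isinstance(node, dict):
        source = node.get("dataSource")
        if isinstance(source, dict) and source.get("fileReference"):
            yield ".".join(path), source["fileReference"], source.get("page")
        for key, child in node.items():
            if key != "dataSource":
                yield from iter_data_sources(child, path + (key,))
    elif isinstance(node, list):
        for child in node:
            yield from iter_data_sources(child, path)


def document_url(file_reference: str) -> str:
    """URL for the GET /{documentId} endpoint, under the assumed base URL."""
    return f"{DOCUMENT_API}/{file_reference}"


dataset = {
    "data": {
        "environmental": {
            "greenhouseGasEmissions": {
                "scope1GhgEmissionsInTonnes": {
                    "value": 1100,
                    "dataSource": {
                        "page": 60,
                        "fileName": "YearlyReport.pdf",
                        "fileReference": "dfc3d090d4d1265f1b7dc41f52fdadf7b249af9fd852079936c4c830f4b91200",
                    },
                }
            }
        }
    }
}

sources = list(iter_data_sources(dataset["data"]))
for point, reference, page in sources:
    print(point, "page", page, document_url(reference))
```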

Creating a QA Report

For each framework, a corresponding QA report exists, along with an endpoint through which QA reports for that framework can be uploaded. Examples are:
SFDR: https://dataland.com/qa/swagger-ui/index.html#/sfdr-data-qa-report-controller/postSfdrDataQaReport
EU Taxonomy non-financials: https://dataland.com/qa/swagger-ui/index.html#/eutaxonomy-non-financials-data-qa-report-controller/postEutaxonomyNonFinancialsDataQaReport

Generally, the QA report data model mimics the framework data model but has a comment, verdict, and correctedData field for each data point. The correctedData field should repeat all values of the data point that were correct and contain the corrected versions of the incorrect ones. For instance, if in the example above the value of 1100 was correct but the information can be found on page 62 instead of page 60, the QA report should look as follows:

{
  "companyId": "7475cfd8-4715-4495-bfcf-ae1bdaf92466",
  "reportingPeriod": "2023",
  "data": {
    "general": {
    },
    "environmental": {
      "greenhouseGasEmissions": {
        "scope1GhgEmissionsInTonnes": {
          "comment": "The stated page is incorrect",
          "verdict": "QaRejected",
          "correctedData": {
            "value": 1100,
            "quality": "Reported",
            "comment": "The company's greenhouse gas emissions for Scope 1 is 1,100 tons of CO2e in 2023.",
            "dataSource": {
              "page": 62,
              "tagName": null,
              "fileName": "YearlyReport.pdf",
              "fileReference": "dfc3d090d4d1265f1b7dc41f52fdadf7b249af9fd852079936c4c830f4b91200"
            }
          }
        }
      }
    },
    "social": {
    }
  }
}

If the quality assurer can't correct any values, the correctedData field should be left empty.
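
The wrapping of a data point into its QA report entry can be sketched in a few lines of Python. The helpers `deep_merge` and `qa_entry` are hypothetical and not part of any Dataland client. Representing an empty correctedData field as an empty object is an assumption; the API may expect a different representation.

```python
import copy
from typing import Optional


def deep_merge(base: dict, overrides: dict) -> dict:
    """Recursively write the overrides into base and return base."""
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(base.get(key), dict):
            deep_merge(base[key], value)
        else:
            base[key] = value
    return base


def qa_entry(data_point: dict, verdict: str, comment: str,
             corrections: Optional[dict] = None) -> dict:
    """Build the comment/verdict/correctedData entry for one data point."""
    if corrections is None:
        corrected: dict = {}  # assumption: "empty" is modelled as an empty object
    else:
        # Keep all correct values and overwrite only the corrected ones.
        corrected = deep_merge(copy.deepcopy(data_point), corrections)
    return {"comment": comment, "verdict": verdict, "correctedData": corrected}


point = {
    "value": 1100,
    "quality": "Reported",
    "dataSource": {"page": 60, "fileName": "YearlyReport.pdf"},
}
entry = qa_entry(
    point,
    verdict="QaRejected",
    comment="The stated page is incorrect",
    corrections={"dataSource": {"page": 62}},
)
```

Copying the data point before merging leaves the downloaded data set untouched, so it can still be compared with the report afterwards.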

QA Report Norms

When creating a QA report, please follow these norms:

QA Verdict

The verdict field can be filled with the following values:

Verdict value	Meaning
`QaAccepted`	Some or all of the fields of the data point were validated and no data quality issue was found.
`QaInconclusive`	For some of the fields for which QA was attempted, no clear verdict could be reached. More details should be given in the `comment` field.
`QaRejected`	Some or all of the fields of the data point were validated and a data quality issue was found in at least one of them. If possible, the corrected value should be provided in the `correctedData` field. If this is not possible, the `comment` field should state why the data point was rejected.
`QaNotAttempted`	None of the fields of the data point were validated

This implies that a data point can be marked as QaAccepted even if not all fields were reviewed, e.g. if the quality assurer does not check the quality field or the page field.
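
Before uploading, a report entry can be checked locally against these norms. The sketch below is one possible reading of the table, not an official validator; `QaVerdict` and `check_entry` are hypothetical names.

```python
from enum import Enum


class QaVerdict(str, Enum):
    ACCEPTED = "QaAccepted"
    INCONCLUSIVE = "QaInconclusive"
    REJECTED = "QaRejected"
    NOT_ATTEMPTED = "QaNotAttempted"


def check_entry(entry: dict) -> list:
    """Return a list of norm violations for one QA report entry."""
    try:
        verdict = QaVerdict(entry.get("verdict"))
    except ValueError:
        return [f"unknown verdict: {entry.get('verdict')!r}"]

    has_comment = bool((entry.get("comment") or "").strip())
    has_correction = bool(entry.get("correctedData"))

    problems = []
    if verdict is QaVerdict.REJECTED and not (has_correction or has_comment):
        problems.append("QaRejected needs correctedData or an explanatory comment")
    if verdict is QaVerdict.INCONCLUSIVE and not has_comment:
        problems.append("QaInconclusive should give details in the comment field")
    return problems


print(check_entry({"verdict": "QaRejected", "comment": "", "correctedData": {}}))
```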

Page Numbers

On Dataland, page numbers are defined as the n-th page of the document, i.e. the page number shown when viewing the PDF in a browser such as Chrome, Edge or Firefox. This may deviate from the "human-readable" page number printed on the page itself. It may also deviate from the page number displayed in Adobe Acrobat Reader, which may pick up specially defined page labels such as Roman numerals at the beginning of the document.
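
The difference between the two numbering schemes can be illustrated with a toy example. The list of printed page labels is made up for illustration, and `dataland_page` is a hypothetical helper.

```python
def dataland_page(page_labels: list, printed_label: str) -> int:
    """Return the 1-based position of the first page that carries the printed label."""
    return page_labels.index(printed_label) + 1


# A document with three front-matter pages in Roman numerals,
# followed by pages whose printed numbering restarts at 1.
labels = ["i", "ii", "iii", "1", "2", "3"]

page = dataland_page(labels, "2")
```

Here the page printed as "2" is the fifth page of the file, so 5 is the page number to enter on Dataland.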

Promoting a dataset from `PENDING` to `ACCEPTED`

[!NOTE] The preconditions under which a data set will be moved from PENDING to ACCEPTED have not been determined yet.