Coding conflict resolution - grambank/grambank GitHub Wiki
Due to data imports from other databases and rare mishaps, we sometimes have more than 1 coding sheet for one language. Most of the time, the different coding sheets agree on the same value for the same feature, but sometimes they don't - these are the 'coding conflicts'. In order to better understand the coding process, we want to go through all conflicts and (a) classify them into different types of conflicts and (b) resolve them to the best of our abilities (i.e. decide which is the appropriate coding).
In order to do this, we have merged all sheets for the same languoid where there is at least 1 conflict into 1 sheet. In the merged sheet for each languoid, there is one row per unique value. That means that if there is a conflict, there is more than 1 row for one feature, see the example below for illustration.
All rows with conflicts are marked by True in the Conflict column. Please disregard completely all rows with False in the conflict column.
Columns
This sheet has some new columns which are relevant for the workflow. Below is a list of all columns in these merged sheets and a description.
Feature_ID
- Feature ID as usualValue
- Value as usualConflict
- This is set to True or False depending on whether the feature has conflicting values. Disregard any row with False.Classification of conflict
- This is to be filled in by the coder reviewing the conflicts from a set or pre-determined categoriesSelect
- This is to be filled in by the coder reviewing the conflicts. The value(s) that is the best coding received a "True" in this column.Sheet
- Specification of from which sheet the coding was takenSource
- Source as listed for this datapoint in the original coding sheet for that valueContributed_Datapoint
- Inherited from original sheets, but is also to be used by reviewing coderComment
- Source as listed for this datapoint in the original coding sheet for that valueWarnings
- If this datapoint triggered any of our automatic quality checks, it would be indicated here. If so, take that into account when reviewing
Example
For fiji1243 we have coding both inherited from the old Sahul-survey (as incited by the coders being MD, GR and RSI) and more recent coding by our grambank coder RSC. For feature GB020, they disagree on the value. This is why the column Conflict
is set to True
. However, for GB021 they agree, which is why Conflict
is False
. Neither coding sheet for neither of the features have any comment, and the source happens to be the same, but the page numbers differ.
Feature_ID | Value | Conflict | Classification of conflict | Select | Sheet | Source | Contributed_Datapoint | Comment | Warnings |
---|---|---|---|---|---|---|---|---|---|
GB020 | 0 | True | MD-GR-RSI_fiji1243 | Dixon 1988:114 | |||||
GB020 | 1 | True | RSC_fiji1243 | Dixon 1988:32, 35-36, 114-115 | |||||
GB021 | 0 | False | MD-GR-RSI_fiji1243 | Dixon 1988:114 | |||||
GB021 | 0 | False | RSC_fiji1243 | Dixon 1988:36 |
Steps
- Assign yourself to a languoid to review (see sheet "Revision (conflicts)" in the general GB to do workbook)
- Find the relevant sources listed in the merged sheet
- Evaluate if the sources are indeed describing the same variety, or if this is an instance where the coding sheets should be assigned to different dialect glottocodes
- Review the coding conflict and determine what is the best coding. Follow the same principles as revisions for Git workflow, i.e.: don't move the file, don't reorder the file, make PR and tag reviewers.
5a. If the best coding is one that exists in the sheet, mark that as "True" in the select column, mark your initials in "Contributed_Datapoint"
5b. If the best coding is not one that is already in the sheet, make a new row, leave Sheet empty and mark that row as "True" in the select column, mark your initials in
Contributed_Datapoint
- Determine the reason for the conflict. See list of pre-determined reasons below.
- Make a PR.
Reasons for conflicts
For each value that you determine is incorrect, you need to select a possible reason for the erroneous coding. For correct coding, just write "correct" in the column "classification of conflict".
correct
- this is the correct value for this datapointtypo
- there is no apparent reason, given the source(s) and the comment the inaccurate coding appears to be a typoovercautious
overconfident
feature interpretation
- the different coders have interpreted the feature differently (some perhaps coded outside of the project and weren't in the loop with the specifics)source interpretation
- the different coders have interpreted the source(s) differentlysources disagree
new source was consulted in other coding
missed
- some coders have missed a section in a sourceless thorough coding
- the other coder(s) were particularly thorough and spotted something that could easily have been missed or contacted expert(s)hard to code
- the language phenomena is difficult to neatly fit into the questionnaire and took considerable discussion to resolveunclear
- it is not clear why there is a conflict
Example of outcome (one, or more, coder is correct)
After reviewing the case, it is determined that the appropriate coding for GB020 for this language is indeed "1", like RSC had coded. The correct row is marked with True
for Select
and Classification of conflict
and Contributed_Datapoint
are also filled out appropriately. Optionally, Comment
is also filled out.
Feature_ID | Value | Conflict | Classification of conflict | Select | Sheet | Source | Contributed_Datapoint | Comment | Warnings |
---|---|---|---|---|---|---|---|---|---|
GB020 | 0 | True | feature interpretation | MD-GR-RSI_fiji1243 | Dixon 1988:114 | ||||
GB020 | 1 | True | True | RSC_fiji1243 | Dixon 1988:32, 35-36, 114-115 | HS | The article 'na' qualifies as a specific article | ||
GB021 | 0 | False | MD-GR-RSI_fiji1243 | Dixon 1988:114 | |||||
GB021 | 0 | False | RSC_fiji1243 | Dixon 1988:36 |
Example of outcome (neither coder is correct)
After reviewing the case, it is determined that the appropriate coding for GB020 for this language is "?", which none of the other coders had selected. A row is added, with "?" as Value
and is marked with True
for Select
and Contributed_Datapoint
are also filled out. Classification of conflict
is filled out. Optionally, Comment
is also filled out.
Feature_ID | Value | Conflict | Classification of conflict | Select | Sheet | Source | Contributed_Datapoint | Comment | Warnings |
---|---|---|---|---|---|---|---|---|---|
GB020 | 0 | True | feature interpretation | MD-GR-RSI_fiji1243 | Dixon 1988:114 | ||||
GB020 | 1 | True | feature interpretation | RSC_fiji1243 | Dixon 1988:32, 35-36, 114-115 | ||||
GB020 | ? | True | True | Dixon 1988:32, 35-36, 114-115 | HS | It is not clear from the grammar what the exact semantics is of 'na' | |||
GB021 | 0 | False | MD-GR-RSI_fiji1243 | Dixon 1988:114 | |||||
GB021 | 0 | False | RSC_fiji1243 | Dixon 1988:36 |
Adding content to the selected data row
If one (or more) coder is correct and you find that you can improve on the data by adding page numbers, more references or commentary - please do so. Add this content to a row which you have tagged as True for the Select
-column.
More than one correct data row
If you are comparing coding of more than 2 sheets, you may find that more than one row is correct and should be marked as selected. In these cases, mark True for Select
for the coding that has the best Source
and Comment
fields. If there are several that are good, mark all of them as True for the Select
-column. Mark all correct values as "correct" in classification of conflict, include those you may not "select" as the best datapoint.
Other
Ignore any rows where the "Value" field is blank. Also ignore any features that are inactive. Inactive features can be flagged for conflicts, but they are marked as "True (inactive feature)" in the column "Conflict". You can use this to filter them out.