Fetching and analysing Grambank data with R - grambank/grambank GitHub Wiki
We are doing a lot of analysis with the Grambank data.
To streamline the workflow and prevent mishaps, the DLCE programmers, Russell Gray and Simon Greenhill have decided on the following recommendations for DLCE-colleagues for getting data and analysing it.
Please remember to cite Grambank, guidelines here.
For more practical guides for using Grambank data, go here.
- you work on analysis project-wise in dedicated private GitHub reposes, preferably under Glottobank. Later, this repos is made public upon publication of the relevant paper and contains the relevant analysis scripts for your science
- preferably: make the main branch "protected" to avoid major disruptions
- we want to make reproducible science. you want to make sure that future people can re-create all the steps of your work from fetching data, doing analysis, making summaries, plots etc. This means that all necessary input data (including what version) should be possible to easily get for another researcher either straight from the codebase (scripts to fetch data, git submodules etc) or by careful documentation and/or is just published straight in the codebase (for example through a zipped SQLlite file, see below). Other users should be able to run all the code on the specified data easily and without errors and disruptions. Ideally, Grambank projects should qualify for the OSF-badges "Open Data" and "Open Analytical Code".
CLDF-data is published continuously as versions on GitHub (see here how to view releases on GitHub) and Zenodo (see here for DOI versioning on Zenodo). Each version has a unique URL and ID.
If you grab the data via GitHub, make sure to grab a specific version and not just the latest version. For example, https://github.com/grambank/grambank/tree/master/cldf will update to the most recent state of the remote head of the main branch whereas https://github.com/grambank/grambank/archive/refs/tags/v1.0.3.zip points specifically to zipped file of the contents of version 1.0.3. Generally, we recommend fetching published data from Zenodo rather than GitHub.
The Grambank-project has several GitHub reposes, for an overview go here. There is a separate one for the clld-app (website), pygrambank, release paper analysis code, etc.
Each Zenodo dataset has a unique record ID and an associated download link. The link can be obtained by right clicking the button "download" on the web page of any given Zenodo-dataset. Below we use the link to Grambank v1.0.3 https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip
. Each release of Grambank has a separate Zenodo entry with a different version number and DOI.
Please note that in recent Zenodo releases, the URL changed partially from "record" to "records". Any older links with only "record" will give a 404.
- Fetch the grambank CLDF data from zenodo e.g. running
curl -o grambank-v1.0.3.zip https://zenodo.org/records/7844558/files/grambank/grambank-v1.0.3.zip?download=1 unzip grambank-v1.0.3.zip
Install cldfzenodo
via
pip install cldfzenodo
Then download the CLDF dataset to a fresh directory:
$ mkdir gb-v1.0.3
$ cldfbench zenodo.download --directory gb-v1.0.3/ 10.5281/zenodo.7844558
$ tree gb-v1.0.3/
gb-v1.0.3/
├── codes.csv
├── contributors.csv
├── families.csv
├── languages.csv
├── parameters.csv
├── sources.bib
├── StructureDataset-metadata.json
└── values.csv
0 directories, 8 files
The R-package rcldf is being developed by Simon Greenhill and reads in cldf data correctly as one R-object which contains the tables etc.
GB <- rcldf::cldf(mdpath = "https://zenodo.org/records/7740140/files/grambank/grambank-v1.0.zip")
You can also read in Zenodo data without rcldf, example here)
You can also click the button "download" on the relevant Zenodo dataset web page.
-
Document clearly what version you are using - for example in a Makefile, README or other document easily available to a reader/user
-
Read the CLDF data from the CSV tables in
grambank-grambank-7ae000c/cldf
or load the data into a SQLite database and access it from there (see below).7ae000c
refers to the most recent commit of the relevant version, this will change with each release.
-
run the
createdb
subcommand from pycldfcldf createdb grambank-grambank-7ae000c/cldf grambank.sqlite
on the unzipped directory, to create a (~94MB) SQLite file "grambank.sqlite".
-
Commit the resulting SQLite file to your GitHub repos (possibly after zipping if it's above GitHub's 100MB file size limit for later versions of Grambank). You can keep scripts for steps 1-4 in your project, but you and other users can from this point use this SQLite file as the starting point for further analysis. By committing it to the repos and including the file in supplementary code etc, the rest of the code you write for the analysis, plotting etc will be executable for any other users as long as it's stored in the same place as the SQLite file.
-
Read in the SQLite into R with the package RSQLite (see example)
There are several reasons for using an SQL interface to the data, one of them is the relationship between the files in a CLDF-dataset and the CLDF-ontology, i.e. is the "LanguageTable" found in the file called "languages.csv"? By making a SQLite file, you can point to files reliably using the ontology-terms and not rely on stable file names. Accessing the data via SQLite will also help unifying the interface for functions in a Grambank R package - see below. Since manipulating (e.g. joining or filtering) data using SQL is very efficient, this will also make it possible to run analyses without the need write/read intermediate data representations to/from files.
In different projects, we often have to do the same tasks. For example:
- reducing dialects
- binarise mutlistate features correctly
- prune EDGE-tree or other phylogenies appropriately
- render theoretical metrics (fusion, informativity etc)
So far we have used ad-hoc solutions, mainly by either copy-pasting scripts from grambank-analysed or using grambank-analysed as a git submodule inside other projects. We want to streamline this by creating an R-package instead. Simon Greenhill is coordinating this.
Next time you need to do a common task with Grambank data, file an issue at rgrambank to create a function in the package. The functions need to be things we are doing repeatedly in different projects.
In the meantime, here are some useful links:
Grambank data v.10
Table with meta-data on features for analysis (fusion score, informativity etc)
Specific R scripts associated with the release paper (including binarisation and dialect-merging)
- README
- make wide and dialect merge
- binarise multistate features
- impute missing data in pruned dataset
- prune global EDGE-tree
- make fusion score
- make rest of theoretical scores
- make PCA RGD color worldmap
NB: some scripts require other scripts to be run beforehand. For example, make_wide_binarized.R requires make_glottolog-cldf_table.R and make_wide.R to be run first. You can see run order in the makefile rules. If you have questions about grambank-analysed scripts, contact Hedvig Skirgård.
If you are working with specifically the Gray et al-tree of Austronesian from 2009, here is a script for matching it to the Grambank v1.0 dataset (list of Oceanic duplicates to drop here).
If you use the EDGE-tree by Bockaert et al, please take note to cite:
Bouckaert, R., Redding, D., Sheehan, O., Kyritsis, T., Gray, R., Jones, K. E., & Atkinson, Q. (2022). Global language diversification is linked to socio-ecology and threat status. Preprint. https://osf.io/preprints/socarxiv/f8tr6/
rcldf is a package for cldf-manipulation, coordinated by Simon Greenhill. It is not on CRAN, but you can install it directly from GitHub.
Install RCLDF:
devtools::install_github("SimonGreenhill/rcldf", dependencies = TRUE)
> library(rcldf)
> gb <- cldf('grambank/cldf/')
# what tables are there?
> summary(gb)
A Cross-Linguistic Data Format (CLDF) dataset:
Name: Grambank v1.0
Identifier: https://grambank.clld.org
JSON: /Users/simon/Desktop/gb/data/grambank/cldf
Type: http://cldf.clld.org/v1.0/terms.rdf#StructureDataset
Tables:
1/6: CodeTable (4 columns, 398 rows)
2/6: contributors.csv (5 columns, 139 rows)
3/6: families.csv (2 columns, 215 rows)
4/6: LanguageTable (13 columns, 2467 rows)
5/6: ParameterTable (12 columns, 195 rows)
6/6: ValueTable (9 columns, 441663 rows)
Sources: 4241
# get languages:
head(gb$tables$LanguageTable)
# A tibble: 6 × 13
ID Name Macroarea Latitude Longitude Glottocode ISO639P3code provenance
<chr> <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
1 abad1241 Abadi Papunesia -9.03 147. abad1241 NA JLA_abad1…
2 abar1238 Mung… Africa 6.58 10.2 abar1238 NA ML_abar12…
3 abau1245 Abau Papunesia -3.97 141. abau1245 NA MD-GR-RSI…
4 abee1242 Abé Africa 5.60 -4.38 abee1242 NA RHE_abee1…
5 aben1249 Aben… Papunesia 15.4 120. aben1249 NA SR_aben12…
6 abip1241 Abip… South Am… -29 -61 abip1241 NA RHE_abip1…
# get values:
head(gb$tables$ValueTable)
# A tibble: 6 × 9
ID Language_ID Parameter_ID Value Code_ID Comment Source Source_comment
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 GB020-ab… abad1241 GB020 ? NA Author… s_OaP… Oa & Paul 201…
2 GB021-ab… abad1241 GB021 ? NA Author… s_OaP… Oa & Paul 201…
3 GB022-ab… abad1241 GB022 ? NA Author… s_OaP… Oa & Paul 201…
4 GB023-ab… abad1241 GB023 ? NA Author… s_OaP… Oa & Paul 201…
5 GB024-ab… abad1241 GB024 2 GB024-2 NA s_OaP… Oa & Paul 201…
6 GB025-ab… abad1241 GB025 1 GB025-1 NA s_OaP… Oa & Paul 201…
# get all the data, resolving ID keys:
> df.wide <- as.cldf.wide(gb, 'ValueTable')
GRAMBANK_VERSION=grambank-v1.0.3.zip
$(shell mkdir -p data)
all: data/grambank.sqlite
### get data
data/$(GRAMBANK_VERSION):
curl -o $@ "https://zenodo.org/records/7844558/files/grambank/$(GRAMBANK_VERSION)?download=1"
data/grambank/cldf/StructureDataset-metadata.json: data/$(GRAMBANK_VERSION)
$(shell mkdir -p data/grambank)
bsdtar -C data/grambank --strip-components=1 -xvf $<
data/grambank.sqlite: data/grambank/cldf/StructureDataset-metadata.json env
./env/bin/cldf createdb $< $@
### bootstrap python
env:
python -m venv $@
./$@/bin/python -m pip install --upgrade pip
./$@/bin/python -m pip install pycldf
### clean: removes auto-generated files
.PHONY: clean
clean:
rm -rf data env