data management concept and context - cccs-web/soc-maps GitHub Wiki

Toward a Coherent System of Managing and Publishing Geospatial Information

Managing Spatial Data Repositories

Concept

CCCS maintains separate geospatial data repositories: one for publicly available data that can be displayed on the CCCS website, and one for each of our client projects (which typically include a mixture of both public and proprietary data sources). CCCS' archive of publicly available data is currently managed via 'GitHub' at /cccs-web/soc-maps/.

Our first and most fundamental data management challenge relates to how to store and link GIS data (and databases) so that data are appropriately managed, version-controlled, stored and shared across projects. We wish to avoid excessive and untracked branching of information. Related to this challenge are decisions about data storage formats. Most GIS data we receive come in the form of shapefiles. To archive these data (and to preserve along with them the associated metadata, including 'original' file names and the name of the data source), CCCS is currently using PostgreSQL. Our idea is that using databases to manage shapefiles will also facilitate better interoperability and referential integration with other data types. Among database applications, PostgreSQL is arguably the best suited to managing spatial data. CCCS is not yet sufficiently informed, however, to judge whether the added effort of using a database is merited.

Initial Approach and Current Standing

Repositories and GeoSpatial Data Management

CCCS' early development work used GitHub to version-control both the web mapping software application and the spatial data that we intended to be referenced by that application: /cccs-web/soc-maps/. While this approach made it convenient to share both the spatial data and the application's code base within one repository with our various developers and application users, both CCCS' IT expert, Paul Whipp, and our GIS consultants, Kartoza Pty., recommended against this 'combined' approach. Factors cited in their recommendations include: 1) the Git version-control system is not optimal for managing spatial data unless the data are in a plain-text format such as GML (Geography Markup Language) or CSV; 2) repositories of spatial data tend to grow too large for Git to manage effectively [e.g., one tends to start encountering timeout errors when cloning and pulling larger repositories, and the system may not be able to manage any single file over 2 gigabytes in size (such as a dump of a spatial database)].
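The file-size concern can be checked before anything is committed. The following is a minimal sketch, not part of the CCCS tooling: the helper name `find_oversized_files` is hypothetical, and the default ceiling simply reuses the 2-gigabyte figure cited above (the real limit depends on the Git host and its configuration).

```python
from pathlib import Path

# Assumed ceiling, taken from the 2-gigabyte figure cited above.
SIZE_LIMIT_BYTES = 2 * 1024 ** 3


def find_oversized_files(root, limit=SIZE_LIMIT_BYTES):
    """Return (path, size_in_bytes) pairs for files under root larger than limit."""
    root = Path(root)
    oversized = []
    for path in root.rglob("*"):
        # Ignore Git's own object store; only working-tree files matter here.
        if ".git" in path.relative_to(root).parts or not path.is_file():
            continue
        size = path.stat().st_size
        if size > limit:
            oversized.append((path, size))
    # Largest first, so the worst offenders appear at the top.
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling `find_oversized_files(".")` from the top of a working copy lists the files (for example, database dumps) that would be better kept outside Git.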

Following the advice of our consultants, CCCS adopted an alternative system for sharing spatial data using BTSync. While this approach has the advantage of allowing us to share larger repositories of spatial data outside of Git, it has the disadvantage of creating parallel data sources (i.e., the shared BTSync repository on the client side and the server-side 'production' repository). Another disadvantage of the BTSync approach is that data are not version-controlled. We therefore view the BTSync data-management approach as a 'temporary' solution.

CCCS and Kartoza Pty have settled on using the 'GeoGig' platform (formerly GeoGit) for distributed version control of our vector datasets, that is, the shapefile data. The GeoGig software allows users to import raw geospatial data (currently from Shapefiles, PostGIS or SpatiaLite).

Kartoza has investigated use of the GeoGig application, though CCCS has yet to receive a briefing on the details of this work. Kartoza's invoices to CCCS suggest that most of the work relates to embedding GeoGig into a Docker image; CCCS has asked Kartoza to review the work charged for Docker-related development, on the grounds that our request to stop all Docker-related development was issued before this work was conducted.

Data Storage Formats

With regard to the creation and management of geospatial data in a PostgreSQL database format, the current standing is as follows:

Kartoza Pty. created scripts to load both CCCS' public geospatial data repository and that of our clients into PostgreSQL. As introduced above, these scripts reference data that are currently stored in BTSync repositories. CCCS' 'public' data reside on our local machines and on the server at /home/sync/cccs-maps/public/ [15.0 GB]. Our client projects are organized by name within the umbrella directory /home/sync/cccs-maps/private/.

In their current formulation, the data import scripts load shapefiles into a PostgreSQL database that is embedded within a Docker container [^1]. As indicated above, CCCS has asked to eliminate Docker from our system architecture and team workflow.

[NOTE^1: Docker is a software application that CCCS has requested be eliminated from our software application stack.]

To facilitate the move away from Docker, CCCS dumped these databases and re-imported them directly into our server's PostgreSQL installation. In this process, we renamed the databases as follows:

cccs_gis
abadi_gis [^2]

Copies of these database dumps are available via their respective repositories (linked above). Please verify that your databases are appropriately renamed on import.
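The renaming check could be scripted rather than done by eye. The sketch below is illustrative only: it assumes `psql` is on the PATH with connection settings supplied by the environment, and the function names are hypothetical. It parses the pipe-separated listing produced by `psql -lqt` and reports which of the expected database names are absent.

```python
import subprocess

# Database names taken from the list above.
EXPECTED_DATABASES = ("cccs_gis", "abadi_gis")


def fetch_database_listing():
    """Run `psql -lqt` and return its raw text output.

    Assumes psql is installed and that connection settings (for example
    PGHOST and PGUSER) are provided by the environment.
    """
    result = subprocess.run(
        ["psql", "-lqt"], capture_output=True, text=True, check=True
    )
    return result.stdout


def parse_database_names(listing):
    """Extract database names from the text produced by `psql -lqt`."""
    names = set()
    for line in listing.splitlines():
        # Rows are pipe-separated; the first column holds the database name.
        # Continuation rows (wrapped privileges) have an empty first column.
        name = line.split("|")[0].strip()
        if name:
            names.add(name)
    return names


def missing_databases(listing, expected=EXPECTED_DATABASES):
    """Return the expected database names that do not appear in the listing."""
    present = parse_database_names(listing)
    return [name for name in expected if name not in present]
```

After an import, `missing_databases(fetch_database_listing())` should return an empty list if both databases carry the names given above.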

[NOTE^2: The Docker-oriented data import scripts for CCCS' repository appear to have worked fine. Those for our client project, by contrast, return many errors. It is unclear whether the data we are seeing are the same as what Kartoza is seeing. In addition, our client project included numerous *.lyr files. CCCS has recently converted these to a non-proprietary format so that they can also be loaded. We raise these points here for context; they are elaborated in the respective client project wiki.]

Overcoming Challenges—Next Steps

Toward a Revised Software Stack for GeoSpatial Repository Management

As indicated above, CCCS would like to revise our current stack of software applications for geospatial data management. The tools we would like to use are:

This is our current vision for a data management workflow:

Geospatial data is obtained from public sources and clients either as vector data [such as shapefiles] or as raster data [such as high-resolution satellite imagery]. These data are initially stored in a 'regular', non-version-controlled file system (i.e. downloaded to your local hard drive upon receiving the data).
Vector data are loaded into a PostgreSQL database that is version-controlled using 'GeoGig'.
Raster data are loaded into S3 and version-controlled using git-annex [NOTE: This solution may not be feasible. See further discussion, below.].
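The routing rule in this workflow can be sketched in a few lines. This is an illustration under stated assumptions, not an agreed implementation: the extension lists and the function names `choose_target` and `plan_workflow` are hypothetical, and shapefile sidecar files (.shx, .dbf, .prj) are assumed to travel with their .shp file.

```python
from collections import defaultdict
from pathlib import Path

# Assumed extension lists; extend them to match the formats actually received.
VECTOR_EXTENSIONS = {".shp", ".shx", ".dbf", ".prj"}
RASTER_EXTENSIONS = {".tif", ".tiff", ".jp2", ".img"}

VECTOR_TARGET = "PostgreSQL + GeoGig"
RASTER_TARGET = "S3 + git-annex"
UNKNOWN_TARGET = "needs manual review"


def choose_target(filename):
    """Pick the storage and version-control target for one incoming file."""
    suffix = Path(filename).suffix.lower()
    if suffix in VECTOR_EXTENSIONS:
        return VECTOR_TARGET
    if suffix in RASTER_EXTENSIONS:
        return RASTER_TARGET
    return UNKNOWN_TARGET


def plan_workflow(filenames):
    """Group incoming file names by the target chosen for each of them."""
    plan = defaultdict(list)
    for name in filenames:
        plan[choose_target(name)].append(name)
    return dict(plan)
```

Running `plan_workflow` over a freshly downloaded folder gives a quick overview of what would go to the database and what would go to S3, and flags anything that fits neither category.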

This proposed software stack and workflow should allow us to share and manage data without the many challenges associated with reliance on both Git and BTSync (both of which require that users download the full contents of a data repository, even if they are only interested in using a limited subset of that data). Use of PostgreSQL and S3 has the added advantage of freeing us from constantly changing import scripts to account for revisions to our file directory structure.

Next Steps for Integrating GeoGig

With regard to moving forward with 'GeoGig' in particular:

CCCS needs to create separate GeoGig repositories for 'public' data and for each of our 'client' projects (focusing initially on 'abadi'). The GeoGig repositories must have their 'MASTER' branch hosted on CCCS' servers.

CCCS' progress with regard to enabling GeoGig on our server for shared team access:

We installed GeoGig to our data server:

geogig@ip-10-167-186-14:/home/aaron$ geogig version

Project Version : 1.0-beta1
Build Time : August 14, 2014 at 17:44:46 ART
Build User Name : Gabriel Roldan
Build User Email : [email protected]
Git Branch : r1.0-beta1
Git Commit ID : 9aae709f4f451802a09c14293c92a46372c868bd
Git Commit Time : August 14, 2014 at 17:43:33 ART
Git Commit Author Name : Gabriel Roldan
Git Commit Author Email : [email protected]
Git Commit Message : Set version to 1.0-beta1

We created a 'geogig' user and added the appropriate PATH variable to the user's .bashrc file to allow the user to call the geogig application
We created a DNS entry to link traffic coming in from http://geogig.crossculturalconsult.com to our desired geogig server

Remaining tasks to get the GeoGig set-up "working" for our current map-production needs are:

We need to configure nginx appropriately to allow us to push and pull data to each GeoGig project (this is an issue for @pwhipp)
Once GeoGig is up on the server, we then need to import all our existing data (this is an issue for Kartoza, @timlinux). In addition to shapefiles, it is possible to import PostgreSQL data into GeoGig.

It remains unclear to CCCS what occurs in the GeoGig 'import' process. That is: What happens if we import both a PostgreSQL database and shapefiles? Is each of these "objects" a unique GeoGig entity, meaning that (as with our current Git-managed repositories) the shapefiles are managed in situ and would remain separate and distinct from any PostgreSQL database entities? Or does GeoGig restructure the data as part of its 'import' process (as happens when importing data into PostgreSQL), so that neither 'source' file object is relevant to GeoGig after the initial import? If the latter is the case, are the challenges of importing multiple, different PostgreSQL databases into GeoGig the same as they would be for merging databases within PostgreSQL (e.g., conflicting schema names)? Does GeoGig allow us to re-export its data as a single PostgreSQL database, or does it keep each file entity separate?

[Tangentially related: Is GeoGig linked to its own database back-end? To what extent does the GeoGig system act as the database, and how does it version-control changes to its own database?]

IMPORTANT: With respect to the data uploaded to PostgreSQL, please remember to keep track of metadata, including the project file name. [@pwhipp ideally, we'd also want some form of a script that would allow us to upload shapefile data into the document management system.]
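One way to keep track of such metadata is to write a small record alongside each import. The sketch below is only a suggestion: the field names, the checksum, and the function `build_metadata_record` are assumptions introduced for illustration, not an existing CCCS convention.

```python
import hashlib
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so that large shapefiles are not read in one go."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_metadata_record(shapefile_path, data_source, project):
    """Describe one shapefile before it is loaded into the database."""
    path = Path(shapefile_path)
    # Sidecar files share the base name of the .shp file (roads.dbf, roads.prj).
    sidecars = sorted(
        other.name
        for other in path.parent.iterdir()
        if other != path and other.name.startswith(path.stem + ".")
    )
    return {
        "original_file_name": path.name,
        "data_source": data_source,
        "project": project,
        "sidecar_files": sidecars,
        "sha256": sha256_of(path),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }
```

A record like this could be stored in a metadata table next to the imported layer, so that the original file name and source remain traceable after the shapefile itself is no longer the working copy.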

We'll need to improve our documentation and team understanding of GeoGig database management. As the several questions raised above indicate, CCCS' current understanding of how GeoGig manages data is limited. We would like to learn more about the extent to which it is possible (and recommended) to use GeoGig to manage and version-control other data (such as census data on socio-economic indicators). We should spend some time identifying and prioritizing the tutorials and coaching sessions we need on data management using GeoGig as our version-control system.

Next Steps for Integrating Git-annex

The git-annex documentation suggests that people are using it as a VCS for files stored on S3 [ex. 1, ex. 2]. The challenge for our use case, I suspect, is to make raster data stored in S3 accessible to data manipulation software such as QGIS without too much hassle (especially given that all client data must be kept private, which requires the use of permission rules). One option that may help in this regard is to mount S3 as a file system directory. Greater investigation into the implications is needed before pursuing this option.