2. Collections - tjmisko/sspi-data-webapp GitHub Wiki

MongoDB Collections

MongoDB collections (often infelicitously referred to as "databases" in the course of discussion) are persistent stores of documents. In the SSPI, we use collections to store metadata and the data produced at each stage of processing.

Metadata

  • sspi_metadata: The SSPI metadata collection structures the flow of data. It describes the structure of the overall data and provides information used at various stages of processing. It functions as the single source of truth for information about the items and datasets in the SSPI.

Dataflow Collections

  1. sspi_raw_api_data
  2. sspi_clean_api_data
  3. sspi_indicator_data
  4. sspi_incomplete_indicator_data
  5. sspi_imputed_data
  6. sspi_item_data

Final Collections

Static Collections

The SSPI first existed as a "static index" with data only for 2018. This version of the SSPI was painstakingly assembled by hand without the benefit of a dataflow model to guarantee reproducibility and reliability.

Documents

MongoDB is a NoSQL document database. It stores documents and indexes them based on the information they contain, which allows for fast queries. See [What are Documents?](1.-Introduction#what-are-documents) for a high-level overview, and refer to the examples of expected document formats in each collection listed above.
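As a rough illustration of what "document" means here, the sketch below shows a record as a Python dictionary, which is how PyMongo represents documents, together with an equality filter of the kind passed to find. The field names (CountryCode, IndicatorCode, Year, Value) and their values are hypothetical placeholders, not the actual SSPI schema, and matches_filter is only a toy stand-in for the matching MongoDB performs on the server for simple equality filters on scalar fields.

```python
# Hypothetical document: field names and values are illustrative only,
# not the real SSPI schema.
document = {
    "CountryCode": "XXX",
    "IndicatorCode": "EXAMPL",
    "Year": 2018,
    "Value": 0.5,
}

# An equality filter: each key must be present with exactly this value.
query = {"CountryCode": "XXX", "Year": 2018}


def matches_filter(doc: dict, flt: dict) -> bool:
    """Toy equality matching on top-level scalar fields.

    Real MongoDB filters also support operators, nested fields, and
    array semantics, none of which are modeled here.
    """
    return all(key in doc and doc[key] == value for key, value in flt.items())


print(matches_filter(document, query))           # True
print(matches_filter(document, {"Year": 2019}))  # False
```

In practice the filter dictionary is handed to the database, which does the matching itself and can use an index to avoid scanning every document.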

Collection Wrappers

All of our MongoDB collections are "wrapped" in classes that add some validation logic, a few helper methods, and code that handles repetitive steps. The wrapper exposes the core PyMongo API methods insert_one, insert_many, find, delete_one, and delete_many. Other functionality (e.g. aggregation pipelines) is implemented as purpose-built methods (e.g. the tabulate_ids and delete_duplicates methods). Should you need other MongoDB functionality, implement it in the wrapper class.
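The wrapper pattern described above might look roughly like the following. This is a minimal sketch under stated assumptions, not the project's actual implementation: the class name CollectionWrapper, the required_fields check, and the count_by_field helper are all hypothetical. The only thing assumed about the wrapped object is that it offers the PyMongo Collection methods insert_one, insert_many, find, delete_one, delete_many, and aggregate.

```python
class CollectionWrapper:
    """Hypothetical wrapper around a PyMongo-style collection object."""

    def __init__(self, collection, required_fields=("CountryCode", "Year")):
        # `collection` is expected to behave like pymongo's Collection.
        # The default required fields are illustrative assumptions.
        self._collection = collection
        self.required_fields = tuple(required_fields)

    def validate_document(self, document):
        """Reject documents that are not dicts or lack required fields."""
        if not isinstance(document, dict):
            raise TypeError("document must be a dict")
        missing = [f for f in self.required_fields if f not in document]
        if missing:
            raise ValueError(f"document is missing fields: {missing}")

    def insert_one(self, document):
        self.validate_document(document)
        return self._collection.insert_one(document)

    def insert_many(self, documents):
        documents = list(documents)
        # Validate everything first so a bad document blocks the whole batch.
        for document in documents:
            self.validate_document(document)
        return self._collection.insert_many(documents)

    def find(self, query=None, projection=None):
        # Materialize the cursor so callers get a plain list.
        return list(self._collection.find(query or {}, projection))

    def delete_one(self, query):
        return self._collection.delete_one(query).deleted_count

    def delete_many(self, query):
        return self._collection.delete_many(query).deleted_count

    def count_by_field(self, field):
        """Hypothetical purpose-built method backed by an aggregation pipeline."""
        pipeline = [{"$group": {"_id": f"${field}", "count": {"$sum": 1}}}]
        return list(self._collection.aggregate(pipeline))
```

Keeping validation in one place means every insert passes through the same checks, and exposing aggregation only through named methods keeps pipeline definitions out of calling code. With a live database, the wrapped object would be a PyMongo Collection obtained from a MongoClient; the sketch itself does not import PyMongo, so it can be exercised with any stand-in that offers the same methods.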