Data Model - the-engine-room/Malawi-Replication GitHub Wiki
Data Model notes by Lesedi
Below are some notes on the current data model and some sample queries to answer questions that have been predicted. Bold refers to object types represented by tables in the data model, or the tables themselves.
A schema diagram is not included here as the naming conventions should hopefully be sufficient to identify foreign keys at this stage.
Note: MySQL was part of the original project spec; I’m not sure if there were particular reasons for this but given the potential use of GIS data from the cadastre, and the limitations of modelling tree structures in MySQL it may be that other options such as PostgreSQL should be considered. Excepting minor changes to query syntax, nothing in the model relies on MySQL specifically.
Licenses and Concessions
Each license consists of a number of concessions. The license is identified by license number (e.g. PEL0034). As a matter of style, the current data model assigns separate primary keys to each object rather than meaningful identifiers such as license numbers, company registration numbers etc.
License holdings
To model both the current state of license ownership, and the historical state, a separate table license_holdings is used, with start and end dates marking the validity of the ownership record. Start and end dates throughout the data model work as such:
A null start date means that the object existed at the start of the period of interest. For a license holding, this means that the government ‘owned’ it before it was granted. A null end date means that the object exists at the present time. Where date_approximate is true (a TINYINT of 1 in MySQL), the date entered is not precise; information on bounds is found in the notes column.
##Transfers
There is a certain degree of denormalization in the current data model, in order to make querying simpler (at the cost of application complexity, of course!).
Transfers is one example. Whereas license_holdings models license ownership state, transfers models the change of state. That means that during data entry of a change in license ownership, for example, the application would be expected to update the existing row in the license_holdings table to show a new end date, insert a new license_holdings row to represent the new ownership, and insert a new transfers row to represent the change in ownership.
It may be that this complexity can be removed, and modelling the state itself is sufficient, but my reading of the questions being asked of the data suggests that the ability to query change in state is necessary, and pulling this from the license_holdings table could be tricky.
##Operators
The operators table models the current and historical state of the company operating each license.
##Companies
The concept of a company has expanded to cover any entity, governmental or otherwise.
A company has a Jurisdiction, which may be in a Tax haven.
Companies have ownership relationships with each other. One company may own part of another, and both companies could conceivably own a stake in a license. So in addition to Transfers of ownerships of licenses between companies, we also have to consider transfers of Company Holdings between companies. In this way, a single license may potentially have multiple companies (both governmental and commercial) involved as stakeholders, a company involved as an Operator and multiple companies involved as parent companies of some of the stakeholders.
The company_holdings table models the ownership in a tree structure. This part of the data model is deliberately incomplete as MySQL doesn’t provide tools for managing or querying arbitrary tree structures, and if a nested set structure is used it is likely to be application-specific. Hopefully the structure that is there already provides enough to get started with, and the implementation can be worked out by the application developers!
##People, roles and relationships
Individual people may have multiple roles at the same company, and separate roles at multiple companies.
Further, people may have relationships to each other (for example, family relationships or separate business relationships outside the scope of the data model). We envisage that some of these relationships records may only be exposed to some users of the database – for example, it may be useful to flag to a researcher that a transfer of a license or company holding happened between companies represented by two people with some kind of relationship, without necessarily revealing the precise nature of the relationship as it may be uncertain or without a public source.