File format - wattzhikang/terrainHydrology GitHub Wiki
With PR #56, this project uses a new save file. Note that this change will likely lead to significant architectural changes, so this format will change. Moreover, with Issue #57, support for PostGIS will be added.
This document will not describe the data model for this program; a separate document should be written for that. This document will merely describe how it is encoded.
Legacy format
The original save file format was developed in April and May 2021. The first logic that serialized parts of the data model into a sequence of bytes was written to enable communication between hydrology.py
and the native module. (Some of the earliest logic to do this was committed in 224fe7c
on 13 April 2021.) This logic was soon adapted to serialize the entire data model. This architecture was largely finalized with commit b9adaff
on 26 May. Breaking changes were often introduced to the serialization logic, necessitating that every save file be prepended with a version number to indicate these breaking changes.
The file was written by painstakingly encoding every value into binary one at a time, and it was not done very well. Interested parties can examine the details of this format by inspecting the git history of SaveFile.py
.
Motivation to develop a new file
I always knew that this was a terrible scheme. For a while I had looked into the possibility of writing the data model as a set of ESRI Shapefiles (see Issue #9), but I was not satisfied with this solution. Finally, it occurred to me that SpatiaLite was the obvious solution. SpatiaLite would:
- Keep all of the data in a single file
- Perfectly preserve the relationships among the data
- Be easily usable by end users
- Allow the basic task of serialization to be done by developers who are smarter than me
- Make schema changes less likely to make older files entirely unusable
- Speed up development by not forcing me to write a lot of verbose logic for every minor change to the data model
New schema
The new file is a SQLite database that uses SpatiaLite extensions.
This diagram can be recreated with dbdiagram.io using the following markup:
Table Shoreline {
id int [pk]
loc geom
}
Table RiverNodes {
id int [pk]
parent int [ref: > RiverNodes.id]
elevation float
localwatershed float
inheritedwatershed float
flow float
contourIndex int [ref: - Shoreline.id]
loc geom
}
Table Qs {
id int [pk]
elevation float
loc geom
}
Table Cells {
rivernode int [ref: > RiverNodes.id]
polygonOrder int
q int [ref: > Qs.id]
}
Table Ts {
id int [pk]
rivercell int [ref: > RiverNodes.id]
elevation float
loc geom
}
Table Edges {
id int [pk]
q0 int [ref: > Qs.id]
q1 int [ref: > Qs.id]
hasRiver bool
isShore bool
shore0 int [ref: > Shoreline.id]
shore1 int [ref: > Shoreline.id]
}
Table RiverPaths {
id int [pk]
rivernode int [ref: > RiverNodes.id]
path geom
}
Table Parameters {
key text [pk]
value text
}
All technical details of the schema can be found in src/db-init.sql
, which creates all the tables and sets up their relationships.
The following sections will review each table.
RiverNodes
Each record in this table represents a single river node in the Hydrology.
This is one of the most important tables in the schema. A river node doesn't just have a river. A river node is an area enclosed by a polygon, partitioning its cell from its neighbors. That cell consists of ridge primitives. It also has a set of terrain primitives.
There is data in these records that is redundant, and will probably be removed in a future release:
localwatershed
andinheritedwatershed
are just the areas of cell polygonsflow
is computed from the areas of cell polygons
RiverPaths
Each record represents the path of the river that flows through a river node.
In this table, there is a 1:1 relationship between river nodes and river paths. In the data model, every leaf node has a unique river, and the rest of the river nodes just reference one of those. This will be reconciled eventually.
Qs
Each record represents a ridge primitive.
Edges
Each record represents an edge that either divides 2 river node cells, or demarcates the shore between a river node and the ocean.
Each edge lies between 2 ridge primitives.
There is some redundant data that will probably be removed in a future release:
hasRiver
just means that the 2 cells that this edge divides have a parent-child relationship. This could be easily determined by a query.isShore
is more dubious, because it could just mean thatshore0
andshore1
areNULL
.
Shoreline
Each record represents a point on the shoreline.
This table is probably redundant. In reality, each point on the shore is a ridge primitive, but there is no relationship between Qs
and Shoreline
. This doesn't make any sense, and will be changed in a future release.
Cells
Each record indicates a relationship between a ridge primitive and a river node.
This is a through table. Every river node will have a number of ridge primitives, and many ridge primitives will border 3 or more river nodes.
The column polygonOrder
indicates the sort order for the Qs relative to a particular river node. This can sort the Qs in counterclockwise order for a particular river node. As suggested by the name, this column is important for deriving the geometry of cells.
Ts
Each record represents a terrain primitive.
Parameters
Each record represents a parameter that was used to generate the terrain.
The parameters currently stored in the table, edgeLength
and resolution
, will probably be rendered irrelevant by future architecture changes. But there will be future architecture changes, such as the effort to make this program more deterministic, that will probably preserve the relevance of this table.
Usage
This schema avoids storing redundant information. For example, Edges are lines. Yet the Edges
table is not a spatial table. Instead, it represents edges by relating edges to the Qs of which they are comprised. This avoids redundancy, but it does mean that a user who wishes to inspect edge geometry must derive this information using queries. Furthermore, SpatiaLite does not support spatial views that dynamically generate geometry.
But because this schema preserves the data model's structure, it is possible to derive many different kinds of information with SQL queries. Moreover, there is no reason why a user cannot add more tables to the database. If a user chooses to do this, he is encouraged to a) not modify the core schema and b) use triggers and/or other techniques to ensure that the data is consistent.
This section will describe queries that can derive commonly-used data.
Edge lines
This is a simple example.
SELECT
Edges.id
,MakeLine(q0s.loc, q1s.loc) AS edgeline
FROM
Edges
JOIN Qs AS q0s ON q0s.id = Edges.q0
JOIN Qs AS q1s ON q1s.id = Edges.q1
;
Cell polygons
This example is less simple.
Polygons must be constructed according to rules. Two rules that are especially relevant to cell polygons are:
- Polygons must be assembled with the vertices in counterclockwise order
- Polygons must be closed: That is, polygons must end with the same vertex that they started with.
It is trivial to determine the order of cell polygons. The polygonOrder
column of the Cells
table contains exactly this. And polygonOrder
does arrange Qs in counterclockwise order around a cell. However, SpatiaLite aggregate functions are not guaranteed to respect ORDER BY
. In practice, a naive queries that use ORDER BY
and an aggregate function often work as expected, but in order to ensure that results are correct, a recursive common table expression must be used.
Eventually, when PostGIS support is added, it will be possible to do this with a simpler query. PostgreSQL allows ORDER BY
within aggregate functions.
Polygons must be closed. This can be done by 'starting' the polygon with the end of the polygon.
WITH RECURSIVE CellLineStrings(rivernode, polygonOrder, seg) AS (
-- the anchor members are the last segments of all the cells
SELECT
order0.rivernode
,(qCounts.numQs - 1) AS polygonOrder -- this indicates the polygonOrder of the first Q in the linestring so far
,MakeLine(q0.loc, q1.loc) AS seg
FROM
-- get the last Q in a cell
(
SELECT
rivernode
,COUNT(*) AS numQs
FROM
Cells
GROUP BY
rivernode
) AS qCounts
JOIN Cells AS order0 ON order0.rivernode = qCounts.rivernode AND order0.polygonOrder = (qCounts.numQs - 1)
-- get the first Q in a cell (see the WHERE clause)
JOIN Cells AS order1 ON order1.rivernode = order0.rivernode
-- get the actual Qs
JOIN Qs AS q0 ON q0.id = order0.q
Join Qs AS q1 ON q1.id = order1.q
WHERE
order1.polygonOrder = 0
UNION ALL
-- the recursive query prepends points to the linestring until there aren't any more points in the cell
-- because the 2nd JOIN is an inner join on the cell's next Q, this query will return nothing when there
-- are no more Qs in the cell. This will terminate the recursion.
SELECT
CellLineStrings.rivernode
,Cells.polygonOrder AS polygonOrder
,AddPoint(CellLineStrings.seg, Qs.loc, 0) AS seg
FROM
CellLineStrings
JOIN Cells ON Cells.rivernode = CellLineStrings.rivernode AND Cells.polygonOrder = (CellLineStrings.polygonOrder - 1)
JOIN Qs ON Qs.id = Cells.q
)
SELECT
rivernode
,MIN(polygonOrder)
,AsText(MakePolygon(seg)) -- per SQLite documentation, this will be the string associated with the min polygonOrder
FROM
CellLineStrings
GROUP BY
rivernode
;
Watershed polygons
Getting the watershed of a root node is a task. The query below shows that this can be done in pure SQL, but using a programming language that is intended for general use would result in a program that is both easier to read and probably more efficient.
NOTE: QGIS has had trouble creating a layer from this query. This seems to be a problem with its integration with SQLite. The error messages that I have seem come down to "SQLite objects created in a thread can only be used in that same thread.". This doesn't seem to affect other queries, just this one.
WITH RECURSIVE NodeAncestors(rivernode, ancestor) AS (
-- the anchor members are just all nodes that have a parent
SELECT
RiverNodes.id AS rivernode
,RiverNodes.id AS ancestor -- the node is its own ancestor. That may not make sense, but it makes the following query easier
FROM
RiverNodes
-- WHERE
-- RiverNodes.parent IS NOT NULL
UNION ALL
-- given a river node and one of its ancestors, return a record with that ancestor's parent, if it exists
SELECT
NodeAncestors.rivernode AS rivernode
,ancestorOfAncestor.id AS ancestor
FROM
NodeAncestors
JOIN RiverNodes AS ancestor ON ancestor.id = NodeAncestors.ancestor
-- since this is an inner join, no records will be returned if this ancestor has no parent
JOIN RiverNodes AS ancestorOfAncestor ON ancestorOfAncestor.id = ancestor.parent
), EdgeWatershedDivisions (edge, watershed0, watershed1) AS (
SELECT
Edges.id AS edge
,side0cells.ancestor AS watershed0
,side1cells.ancestor AS watershed1
FROM
Edges
JOIN EdgeCells ON EdgeCells.edge = Edges.id
-- line up all the nodes that are on one side of the edge, down to its root
JOIN NodeAncestors AS side0cells ON
side0cells.rivernode = EdgeCells.node0
-- only continue if this cell doesn't appear in the ancestor list of the other side
-- determine this by looking for it, and if the join doesn't find it, then it isn't
-- there
LEFT JOIN NodeAncestors AS counterfactual0 ON
counterfactual0.rivernode = EdgeCells.node1
AND counterfactual0.ancestor = side0cells.ancestor
-- for all the nodes that aren't in the intersection, all of the nodes are divided
-- from all the others
JOIN NodeAncestors AS side1cells ON
side1cells.rivernode = EdgeCells.node1
-- we also have to make sure that these cells aren't in the ancestor list of the other side
LEFT JOIN NodeAncestors AS counterfactual1 ON
counterfactual1.rivernode = EdgeCells.node0
AND counterfactual1.ancestor = side1cells.ancestor
WHERE
counterfactual0.ancestor IS NULL
AND counterfactual1.ancestor IS NULL
), WatershedFirstBound (rivernode, edge) AS (
SELECT
RiverNodes.id AS rivernode
,divisions.edge -- this will be arbitrarily selected, but that doesn't matter
FROM
RiverNodes
JOIN EdgeWatershedDivisions AS divisions ON RiverNodes.id IN (divisions.watershed0, divisions.watershed1)
GROUP BY
RiverNodes.id
), Watersheds (rivernode, count, firstQ, lastEdge, lastQ, watershedline) AS (
-- the anchor members are the first line segment of the watershed for every node
SELECT
RiverNodes.id AS rivernode
,0 AS count
,CASE
WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN q0s.id
ELSE q1s.id
END AS firstQ
,Edges.id AS lastEdge
,CASE
WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN q1s.id
ELSE q0s.id
END AS lastQ
,CASE
WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN MakeLine(q0s.loc, q1s.loc)
ELSE MakeLine(q1s.loc, q0s.loc)
END AS watershedline
FROM
RiverNodes
JOIN WatershedFirstBound ON WatershedFirstBound.rivernode = RiverNodes.id
JOIN Edges ON Edges.id = WatershedFirstBound.edge
JOIN Cells AS cellq0s ON cellq0s.rivernode = RiverNodes.id AND cellq0s.q = Edges.q0
JOIN Qs AS q0s ON q0s.id = cellq0s.q
JOIN cells AS cellq1s ON cellq1s.rivernode = RiverNodes.id AND cellq1s.q = Edges.q1
JOIN Qs AS q1s ON q1s.id = cellq1s.q
UNION ALL
SELECT
Watersheds.rivernode AS rivernode
,Watersheds.count + 1 AS count
,Watersheds.firstQ AS firstQ
,Edges.id AS lastEdge
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
ELSE Edges.q0
END AS lastQ
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
END AS watershedline
FROM
Watersheds
JOIN Edges ON Watersheds.lastQ IN (Edges.q0, Edges.q1) AND Edges.id <> Watersheds.lastEdge
-- JOIN Qs AS q0s ON q0s.id = Edges.q0
-- JOIN Qs AS q1s ON q1s.id = Edges.q1
JOIN EdgeCells ON EdgeCells.edge = Edges.id
-- So far, we have 1 record for every edge that could potentially be next
-- Now we're going to narrow it down to those edges that actually bound the watershed
JOIN EdgeWatershedDivisions AS divisions ON
divisions.edge = Edges.id
AND Watersheds.rivernode IN (divisions.watershed0, divisions.watershed1)
-- to be sure that it narrows down to 1 record, match up against the other cell,
-- the cell on the outside of the watershed
-- but, in case we happen to be around the root node, make sure we're not just
-- matching that
AND (
(EdgeCells.node0 <> Watersheds.rivernode AND EdgeCells.node0 IN (divisions.watershed0, divisions.watershed1))
OR (EdgeCells.node1 <> Watersheds.rivernode AND EdgeCells.node1 IN (divisions.watershed0, divisions.watershed1))
)
JOIN Qs AS q0s ON q0s.id = Edges.q0
JOIN Qs AS q1s ON q1s.id = Edges.q1
WHERE
Watersheds.lastQ <> Watersheds.firstQ
UNION ALL
-- the edge that is downstream from the root node technically doesn't divide one watershed
-- from another, but for our purposes, it is part of the boundary of the watershed
SELECT
Watersheds.rivernode AS rivernode
,Watersheds.count + 1 AS count
,Watersheds.firstQ AS firstQ
,Edges.id AS lastEdge
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
ELSE Edges.q0
END AS lastQ
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
END AS watershedline
FROM
Watersheds
JOIN DownstreamEdges ON DownstreamEdges.rivernode = Watersheds.rivernode
JOIN Edges ON
Edges.id = DownstreamEdges.downstreamEdge
AND Watersheds.lastQ IN (Edges.q0, Edges.q1)
AND Edges.id <> Watersheds.lastEdge
JOIN Qs AS q0s ON q0s.id = Edges.q0
JOIN Qs AS q1s ON q1s.id = Edges.q1
UNION ALL
-- don't forget shorelines. They won't show up in the first part of the query. But at
-- least they're easier to find.
SELECT
Watersheds.rivernode AS rivernode
,Watersheds.count + 1 AS count
,Watersheds.firstQ AS firstQ
,Edges.id AS lastEdge
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
ELSE Edges.q0
END AS lastQ
,CASE
WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
END AS watershedline
FROM
Watersheds
JOIN Edges ON Watersheds.lastQ IN (Edges.q0, Edges.q1) AND Edges.id <> Watersheds.lastEdge
-- since this is a shore edge, it only borders 1 node. We just need to confirm that
-- the node that it borders traces back to the root
JOIN Cells AS q0s ON q0s.q = Edges.q0
JOIN Cells AS q1s ON q1s.q = Edges.q1 AND q1s.rivernode = q0s.rivernode
JOIN NodeAncestors ON
NodeAncestors.rivernode = q1s.rivernode
AND NodeAncestors.ancestor = Watersheds.rivernode
JOIN Qs AS q0s ON q0s.id = Edges.q0
JOIN Qs AS q1s ON q1s.id = Edges.q1
WHERE
Edges.isShore = 1
)
SELECT
Watersheds.rivernode AS rivernode
,MAX(Watersheds.count) AS count
,MakePolygon(Watersheds.watershedline) AS watershedgeometry
FROM
Watersheds
GROUP BY
Watersheds.rivernode
;