File format - wattzhikang/terrainHydrology GitHub Wiki

With PR #56, this project uses a new save file. Note that this change will likely lead to significant architectural changes, so this format will change. Moreover, with Issue #57, support for PostGIS will be added.

This document will not describe the data model for this program; a separate document should be written for that. This document will merely describe how it is encoded.

Legacy format

The original save file format was developed in April and May 2021. The first logic that serialized parts of the data model into a sequence of bytes was written to enable communication between hydrology.py and the native module. (Some of the earliest logic to do this was committed in 224fe7c on 13 April 2021.) This logic was soon adapted to serialize the entire data model. This architecture was largely finalized with commit b9adaff on 26 May. Breaking changes were often introduced to the serialization logic, necessitating that every save file be prepended with a version number to indicate these breaking changes.

The file was written by painstakingly encoding every value into binary one at a time, and it was not done very well. Interested parties can examine the details of this format by inspecting the git history of SaveFile.py.

Motivation to develop a new file

I always knew that this was a terrible scheme. For a while I had looked into the possibility of writing the data model as a set of ESRI Shapefiles (see Issue #9), but I was not satisfied with this solution. Finally, it occurred to me that SpatiaLite was the obvious solution. SpatiaLite would:

  • Keep all of the data in a single file
  • Perfectly preserve the relationships among the data
  • Be easily usable by end users
  • Allow the basic task of serialization to be done by developers who are smarter than me
  • Make schema changes less likely to make older files entirely unusable
  • Speed up development by not forcing me to write a lot of verbose logic for every minor change to the data model

New schema

The new file is a SQLite database that uses SpatiaLite extensions.

A diagram of the database schema. Note that the relationship between RiverNodes and RiverPaths is incorrectly indicated as being 1 : many. It is actually 1:1.

This diagram can be recreated with dbdiagram.io using the following markup:

Table Shoreline {
  id int [pk]
  loc geom
}

Table RiverNodes {
  id int [pk]
  parent int [ref: > RiverNodes.id]
  elevation float
  localwatershed float
  inheritedwatershed float
  flow float
  contourIndex int [ref: - Shoreline.id]
  loc geom
}

Table Qs {
  id int [pk]
  elevation float
  loc geom
}

Table Cells {
  rivernode int [ref: > RiverNodes.id]
  polygonOrder int
  q int [ref: > Qs.id]
}

Table Ts {
  id int [pk]
  rivercell int [ref: > RiverNodes.id]
  elevation float
  loc geom
}

Table Edges {
  id int [pk]
  q0 int [ref: > Qs.id]
  q1 int [ref: > Qs.id]
  hasRiver bool
  isShore bool
  shore0 int [ref: > Shoreline.id]
  shore1 int [ref: > Shoreline.id]
}

Table RiverPaths {
  id int [pk]
  rivernode int [ref: > RiverNodes.id]
  path geom
}

Table Parameters {
  key text [pk]
  value text
}

All technical details of the schema can be found in src/db-init.sql, which creates all the tables and sets up their relationships.

The following sections will review each table.

RiverNodes

Each record in this table represents a single river node in the Hydrology.

This is one of the most important tables in the schema. A river node doesn't just have a river. A river node is an area enclosed by a polygon, partitioning its cell from its neighbors. That cell consists of ridge primitives. It also has a set of terrain primitives.

There is data in these records that is redundant, and will probably be removed in a future release:

  • localwatershed and inheritedwatershed are just the areas of cell polygons
  • flow is computed from the areas of cell polygons

RiverPaths

Each record represents the path of the river that flows through a river node.

In this table, there is a 1:1 relationship between river nodes and river paths. In the data model, every leaf node has a unique river, and the rest of the river nodes just reference one of those. This will be reconciled eventually.

Qs

Each record represents a ridge primitive.

Edges

Each record represents an edge that either divides 2 river node cells, or demarcates the shore between a river node and the ocean.

Each edge lies between 2 ridge primitives.

There is some redundant data that will probably be removed in a future release:

  • hasRiver just means that the 2 cells that this edge divides have a parent-child relationship. This could be easily determined by a query.
  • isShore is more dubious, because it could just mean that shore0 and shore1 are NULL.

Shoreline

Each record represents a point on the shoreline.

This table is probably redundant. In reality, each point on the shore is a ridge primitive, but there is no relationship between Qs and Shoreline. This doesn't make any sense, and will be changed in a future release.

Cells

Each record indicates a relationship between a ridge primitive and a river node.

This is a through table. Every river node will have a number of ridge primitives, and many ridge primitives will border 3 or more river nodes.

The column polygonOrder indicates the sort order for the Qs relative to a particular river node. This can sort the Qs in counterclockwise order for a particular river node. As suggested by the name, this column is important for deriving the geometry of cells.

Ts

Each record represents a terrain primitive.

Parameters

Each record represents a parameter that was used to generate the terrain.

The parameters currently stored in the table, edgeLength and resolution, will probably be rendered irrelevant by future architecture changes. But there will be future architecture changes, such as the effort to make this program more deterministic, that will probably preserve the relevance of this table.

Usage

This schema avoids storing redundant information. For example, Edges are lines. Yet the Edges table is not a spatial table. Instead, it represents edges by relating edges to the Qs of which they are comprised. This avoids redundancy, but it does mean that a user who wishes to inspect edge geometry must derive this information using queries. Furthermore, SpatiaLite does not support spatial views that dynamically generate geometry.

But because this schema preserves the data model's structure, it is possible to derive many different kinds of information with SQL queries. Moreover, there is no reason why a user cannot add more tables to the database. If a user chooses to do this, he is encouraged to a) not modify the core schema and b) use triggers and/or other techniques to ensure that the data is consistent.

This section will describe queries that can derive commonly-used data.

Edge lines

This is a simple example.

SELECT
    Edges.id
    ,MakeLine(q0s.loc, q1s.loc) AS edgeline
FROM
    Edges
    JOIN Qs AS q0s ON q0s.id = Edges.q0
    JOIN Qs AS q1s ON q1s.id = Edges.q1
;

Cell polygons

This example is less simple.

Polygons must be constructed according to rules. Two rules that are especially relevant to cell polygons are:

  1. Polygons must be assembled with the vertices in counterclockwise order
  2. Polygons must be closed: That is, polygons must end with the same vertex that they started with.

It is trivial to determine the order of cell polygons. The polygonOrder column of the Cells table contains exactly this. And polygonOrder does arrange Qs in counterclockwise order around a cell. However, SpatiaLite aggregate functions are not guaranteed to respect ORDER BY. In practice, a naive queries that use ORDER BY and an aggregate function often work as expected, but in order to ensure that results are correct, a recursive common table expression must be used.

Eventually, when PostGIS support is added, it will be possible to do this with a simpler query. PostgreSQL allows ORDER BY within aggregate functions.

Polygons must be closed. This can be done by 'starting' the polygon with the end of the polygon.

WITH RECURSIVE CellLineStrings(rivernode, polygonOrder, seg) AS (
    -- the anchor members are the last segments of all the cells
    SELECT
        order0.rivernode
        ,(qCounts.numQs - 1) AS polygonOrder -- this indicates the polygonOrder of the first Q in the linestring so far
        ,MakeLine(q0.loc, q1.loc) AS seg
    FROM
        -- get the last Q in a cell
        (
            SELECT
                rivernode
                ,COUNT(*) AS numQs
            FROM
                Cells
            GROUP BY
                rivernode
        ) AS qCounts
        JOIN Cells AS order0 ON order0.rivernode = qCounts.rivernode AND order0.polygonOrder = (qCounts.numQs - 1)
        -- get the first Q in a cell (see the WHERE clause)
        JOIN Cells AS order1 ON order1.rivernode = order0.rivernode
        -- get the actual Qs
        JOIN Qs AS q0 ON q0.id = order0.q
        Join Qs AS q1 ON q1.id = order1.q
    WHERE
        order1.polygonOrder = 0

    UNION ALL

    -- the recursive query prepends points to the linestring until there aren't any more points in the cell
    -- because the 2nd JOIN is an inner join on the cell's next Q, this query will return nothing when there
    -- are no more Qs in the cell. This will terminate the recursion.
    SELECT
        CellLineStrings.rivernode
        ,Cells.polygonOrder AS polygonOrder
        ,AddPoint(CellLineStrings.seg, Qs.loc, 0) AS seg
    FROM
        CellLineStrings
        JOIN Cells ON Cells.rivernode = CellLineStrings.rivernode AND Cells.polygonOrder = (CellLineStrings.polygonOrder - 1)
        JOIN Qs ON Qs.id = Cells.q
)
SELECT
    rivernode
    ,MIN(polygonOrder)
    ,AsText(MakePolygon(seg)) -- per SQLite documentation, this will be the string associated with the min polygonOrder
FROM
    CellLineStrings
GROUP BY
    rivernode
;

Watershed polygons

Getting the watershed of a root node is a task. The query below shows that this can be done in pure SQL, but using a programming language that is intended for general use would result in a program that is both easier to read and probably more efficient.

NOTE: QGIS has had trouble creating a layer from this query. This seems to be a problem with its integration with SQLite. The error messages that I have seem come down to "SQLite objects created in a thread can only be used in that same thread.". This doesn't seem to affect other queries, just this one.

WITH RECURSIVE NodeAncestors(rivernode, ancestor) AS (
    -- the anchor members are just all nodes that have a parent
    SELECT
        RiverNodes.id AS rivernode
        ,RiverNodes.id AS ancestor -- the node is its own ancestor. That may not make sense, but it makes the following query easier
    FROM
        RiverNodes
    -- WHERE
    --     RiverNodes.parent IS NOT NULL

    UNION ALL

    -- given a river node and one of its ancestors, return a record with that ancestor's parent, if it exists
    SELECT
        NodeAncestors.rivernode AS rivernode
        ,ancestorOfAncestor.id AS ancestor
    FROM
        NodeAncestors
        JOIN RiverNodes AS ancestor ON ancestor.id = NodeAncestors.ancestor
        -- since this is an inner join, no records will be returned if this ancestor has no parent
        JOIN RiverNodes AS ancestorOfAncestor ON ancestorOfAncestor.id = ancestor.parent
), EdgeWatershedDivisions (edge, watershed0, watershed1) AS (
    SELECT
        Edges.id AS edge
        ,side0cells.ancestor AS watershed0
        ,side1cells.ancestor AS watershed1
    FROM
        Edges
        JOIN EdgeCells ON EdgeCells.edge = Edges.id
        -- line up all the nodes that are on one side of the edge, down to its root
        JOIN NodeAncestors AS side0cells ON
            side0cells.rivernode = EdgeCells.node0
        -- only continue if this cell doesn't appear in the ancestor list of the other side
        -- determine this by looking for it, and if the join doesn't find it, then it isn't
        -- there
        LEFT JOIN NodeAncestors AS counterfactual0 ON
            counterfactual0.rivernode = EdgeCells.node1
            AND counterfactual0.ancestor = side0cells.ancestor
        -- for all the nodes that aren't in the intersection, all of the nodes are divided
        -- from all the others
        JOIN NodeAncestors AS side1cells ON
            side1cells.rivernode = EdgeCells.node1
        -- we also have to make sure that these cells aren't in the ancestor list of the other side
        LEFT JOIN NodeAncestors AS counterfactual1 ON
            counterfactual1.rivernode = EdgeCells.node0
            AND counterfactual1.ancestor = side1cells.ancestor
    WHERE
        counterfactual0.ancestor IS NULL
        AND counterfactual1.ancestor IS NULL
), WatershedFirstBound (rivernode, edge) AS (
    SELECT
        RiverNodes.id AS rivernode
        ,divisions.edge -- this will be arbitrarily selected, but that doesn't matter
    FROM
        RiverNodes
        JOIN EdgeWatershedDivisions AS divisions ON RiverNodes.id IN (divisions.watershed0, divisions.watershed1)
    GROUP BY
        RiverNodes.id
), Watersheds (rivernode, count, firstQ, lastEdge, lastQ, watershedline) AS (
    -- the anchor members are the first line segment of the watershed for every node
    SELECT
        RiverNodes.id AS rivernode
        ,0 AS count
        ,CASE
            WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN q0s.id
            ELSE q1s.id
        END AS firstQ
        ,Edges.id AS lastEdge
        ,CASE
            WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN q1s.id
            ELSE q0s.id
        END AS lastQ
        ,CASE
            WHEN cellq0s.polygonOrder < cellq1s.polygonOrder OR cellq1s.polygonOrder = 0 THEN MakeLine(q0s.loc, q1s.loc)
            ELSE MakeLine(q1s.loc, q0s.loc)
        END AS watershedline
    FROM
        RiverNodes
        JOIN WatershedFirstBound ON WatershedFirstBound.rivernode = RiverNodes.id
        JOIN Edges ON Edges.id = WatershedFirstBound.edge
        JOIN Cells AS cellq0s ON cellq0s.rivernode = RiverNodes.id AND cellq0s.q = Edges.q0
        JOIN Qs AS q0s ON q0s.id = cellq0s.q
        JOIN cells AS cellq1s ON cellq1s.rivernode = RiverNodes.id AND cellq1s.q = Edges.q1
        JOIN Qs AS q1s ON q1s.id = cellq1s.q

    UNION ALL

    SELECT
        Watersheds.rivernode AS rivernode
        ,Watersheds.count + 1 AS count
        ,Watersheds.firstQ AS firstQ
        ,Edges.id AS lastEdge
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
            ELSE Edges.q0
        END AS lastQ
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
            ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
        END AS watershedline
    FROM
        Watersheds
        JOIN Edges ON Watersheds.lastQ IN (Edges.q0, Edges.q1) AND Edges.id <> Watersheds.lastEdge
        -- JOIN Qs AS q0s ON q0s.id = Edges.q0
        -- JOIN Qs AS q1s ON q1s.id = Edges.q1
        JOIN EdgeCells ON EdgeCells.edge = Edges.id
        -- So far, we have 1 record for every edge that could potentially be next
        -- Now we're going to narrow it down to those edges that actually bound the watershed
        JOIN EdgeWatershedDivisions AS divisions ON
            divisions.edge = Edges.id
            AND Watersheds.rivernode IN (divisions.watershed0, divisions.watershed1)
            -- to be sure that it narrows down to 1 record, match up against the other cell,
            -- the cell on the outside of the watershed
            -- but, in case we happen to be around the root node, make sure we're not just
            -- matching that
            AND (
                (EdgeCells.node0 <> Watersheds.rivernode AND EdgeCells.node0 IN (divisions.watershed0, divisions.watershed1))
                OR (EdgeCells.node1 <> Watersheds.rivernode AND EdgeCells.node1 IN (divisions.watershed0, divisions.watershed1))
            )
        JOIN Qs AS q0s ON q0s.id = Edges.q0
        JOIN Qs AS q1s ON q1s.id = Edges.q1
    WHERE
        Watersheds.lastQ <> Watersheds.firstQ

    UNION ALL

    -- the edge that is downstream from the root node technically doesn't divide one watershed
    -- from another, but for our purposes, it is part of the boundary of the watershed
    SELECT
        Watersheds.rivernode AS rivernode
        ,Watersheds.count + 1 AS count
        ,Watersheds.firstQ AS firstQ
        ,Edges.id AS lastEdge
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
            ELSE Edges.q0
        END AS lastQ
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
            ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
        END AS watershedline
    FROM
        Watersheds
        JOIN DownstreamEdges ON DownstreamEdges.rivernode = Watersheds.rivernode
        JOIN Edges ON
            Edges.id = DownstreamEdges.downstreamEdge
            AND Watersheds.lastQ IN (Edges.q0, Edges.q1)
            AND Edges.id <> Watersheds.lastEdge
        JOIN Qs AS q0s ON q0s.id = Edges.q0
        JOIN Qs AS q1s ON q1s.id = Edges.q1

    UNION ALL

    -- don't forget shorelines. They won't show up in the first part of the query. But at
    -- least they're easier to find.
    SELECT
        Watersheds.rivernode AS rivernode
        ,Watersheds.count + 1 AS count
        ,Watersheds.firstQ AS firstQ
        ,Edges.id AS lastEdge
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN Edges.q1
            ELSE Edges.q0
        END AS lastQ
        ,CASE
            WHEN Watersheds.lastQ = Edges.q0 THEN AddPoint(Watersheds.watershedline, q1s.loc, -1)
            ELSE AddPoint(Watersheds.watershedline, q0s.loc, -1)
        END AS watershedline
    FROM
        Watersheds
        JOIN Edges ON Watersheds.lastQ IN (Edges.q0, Edges.q1) AND Edges.id <> Watersheds.lastEdge
        -- since this is a shore edge, it only borders 1 node. We just need to confirm that
        -- the node that it borders traces back to the root
        JOIN Cells AS q0s ON q0s.q = Edges.q0
        JOIN Cells AS q1s ON q1s.q = Edges.q1 AND q1s.rivernode = q0s.rivernode
        JOIN NodeAncestors ON
            NodeAncestors.rivernode = q1s.rivernode
            AND NodeAncestors.ancestor = Watersheds.rivernode
        JOIN Qs AS q0s ON q0s.id = Edges.q0
        JOIN Qs AS q1s ON q1s.id = Edges.q1
    WHERE
        Edges.isShore = 1
)
SELECT
    Watersheds.rivernode AS rivernode
    ,MAX(Watersheds.count) AS count
    ,MakePolygon(Watersheds.watershedline) AS watershedgeometry
FROM
    Watersheds
GROUP BY
    Watersheds.rivernode
;