PostGIS - w4111/w4111.github.io GitHub Wiki

Bane Trajkovic (bt2639)
- Wrote the problem, the solution, and the relationship with 4111
Andrew Yang (asy2130)
- Summary, alternatives
Christopher Henry (cch2201)
- Tutorial, Spatial Indexing

PostGIS

The Problem and Solution

The Problem that PostGIS solves

Relational databases such as PostgreSQL, while powerful, face limitations when dealing with spatial data. These limitations include:

Lack of built-in spatial queries
- Operations like finding the nearest neighbor or checking containment are not natively supported.
- Implementing these queries manually adds complexity and slows down performance.
No support for spatial data types
- Databases lack native support for spatial data types, making it difficult to represent objects like points (e.g., specific GPS locations), lines (e.g., highways), and polygons (e.g., park boundaries).
- Without a specialized framework, developers are forced to store spatial data in separate files or as unstructured text fields, leading to inefficiency and difficulty in querying.
No integrated geographic calculations
- Calculating distances between two points considering the Earth's curvature or finding the area of a polygon is mathematically complex.
- Relational databases such as PostgreSQL lack built-in functionality to perform these calculations efficiently and accurately.

How does PostGIS solve the problem?

PostGIS extends PostgreSQL with powerful geospatial capabilities, solving these challenges through:

Spatial queries
- Provides many built-in spatial functions to execute queries such as ST_Distance (returns the distance between two geometries) and ST_Within (returns true if one geometry is contained within another).
- PostGIS example: Identifying which parks are completely contained within city boundaries:
```
SELECT cities.name, parks.name
FROM cities
JOIN parks
ON ST_Contains(cities.boundary, parks.boundary);
```
  This query finds all cities where the entire park boundary falls within the city's boundary.
Support for spatial data types
- PostGIS introduces new data types, such as geometry and geography, enabling the storage and manipulation of spatial data directly in the database.
- These data types allow for representing points, lines, polygons, and more, facilitating direct integration of geospatial data.
Built-in geographic calculations
- Handles complex geographic computations with precision, such as great-circle distances (shortest distance on Earth's surface).
- PostGIS example: Calculating the total area of a city's parks using ST_Area to sum up the polygons' areas:
```
SELECT cities.name AS city_name, SUM(ST_Area(parks.boundary)) AS total_park_area
FROM cities
JOIN parks
ON ST_Within(parks.boundary, cities.boundary)
GROUP BY cities.name;
```
  This query calculates the total area of parks contained within each city by summing the areas of park polygons (ST_Area) and grouping by city.

Relation to concepts from 4111

PostGIS relates directly to several core database concepts covered during the lectures:

Schema design and data types
- PostGIS extends the relational model with spatial data types such as geometry and geography, while supporting schemas that store spatial and non-spatial data together.
Indexing and query optimization
- PostGIS employs GiST (Generalized Search Tree) spatial indexing, as opposed to B-tree indexing, covered during the lecture, which is less useful for spatial data.
- Spatial indexing significantly improves performance for queries involving geospatial relationships.
Joins
- PostGIS enables joins between different tables with the join key being the spatial relationship itself.
- PostGIS example: Finding all parks within a 10-kilometer radius of a city center:
```
SELECT parks.name
FROM parks
JOIN cities
ON ST_DWithin(parks.location, cities.center, 10000);
```
  This query finds parks near each city using spatial join based on distance.
Spatial Indexing
- Like the classic DBMS, postGIS allows for the creation of indices for query optimization.
```
CREATE INDEX nyc_census_blocks_geom_idx
ON nyc_census_blocks
USING GIST (geom);
```
- Here, GIST is the spatial index framework (one of three), and geom is the underlying spatial datatype (line, point, polygon, etc).
- However, unlike the indices studied in class, the spatial indexing is unable to construct an index based on the geometric object itself. Instead, they index the boundary boxes surrounding the object.
- In this example, we ask about which of the lines are intersecting the star. If we construct a spatial index then run the query, the query optimizer will use a 2-step plan: 1. If a line's boundary box intersects the star's boundary box, feed it to the next query. If not, return false. 2. Check if the line in question actually intersects the star.
- The GIST framework leverages an R-tree hierarchy (which is coincidentally extremely similar to the B-tree studied in class) which constructs a height-balanced tiered hierarchy of boundary boxes using a dynamic algorithm, with leaf nodes containing pointers to the spatial data objects.
- INSERTION: when a leaf node is inserted, begin at root. Traverse the tree, choosing whichever branch's boundary box needs to be expanded the least to accommodate the new leaf. Repeat until a sibling leaf node is found
- SEARCH: create a boundary box around the spatial object being queried, S. Begin at root node, label T. For all T's children, if that boundary box intersects S, label the child T and repeat. If not, then you know for sure that none of it's children can intersect S, so that branch of the tree can be ignored safely.
- For further reading, this is the original R-Tree search index paper published in 1984: https://postgis.net/docs/support/rtree.pdf
- For the sake of completeness, there are two other possible search indices when querying spatial objects: SP-GIST and BRIN.
- SP-GIST (space-partitioning) creates non-overlapping, distinct boundary boxes. It is generally faster to both build and query for non-overlapping data sets. It also tends to struggle when the number of spatial objects starts to hit the millions, which is when most switch to GIST.
- BRIN is a little weird. It's designed for sorted tables and builds "boxes" around the spatial objects based on their adjacency within the table itself (?). It is extremely fast to build the index within this narrow context, though, and handily beats the other two when it comes to using memory space.
- For Further Reading: https://postgis.net/docs/using_postgis_dbmanagement.html#brin_indexes

Summary

PostGIS builds upon PostgreSQL by focusing specifically on geospatial data. People who work in the GIS (Geographic Information System) sphere often work with large amounts of spatial data, and require a streamlined way to work with and analyze this data. This is especially difficult to do by default using Postgres, since much of GIS data is in geometric form (i.e. points, lines, polygons) in both 2D and 3D.

PostGIS not only allows for the storage of different types of spatial data, but also allows users to query that data.

In addition, PostGIS allows users to derive specific features of geometric data, such as calculating distances, areas, intersections, and more.

Alternatives with pros & cons

Postgres
- One of the alternatives to PostGIS is using Postgres itself, which can be sufficient in some use cases, especially when the data is basic, and the need to perform complex mathematical operations isn't present.
- Specifically, Postgres only supports basic data types like floats, which means that it is sufficient for basic information like points and lines in 2D space. However, more complex geometric operations like calculating the area of a polygon, or computing a buffer is significantly more difficult to do without PostGIS, which takes advantage of additional spatial types.
NoSQL solutions
- PostGIS is the de facto SQL solution for working with GIS data, but geohashes (which allows the encoding of a location to a hash) make it possible to perform geospatial queries.
- MongoDB's 2d and 2dsphere types use geohashes, allowing for geospatial queries like "find all restaurants in a specific radius" or "calculate the distance between two points."
  - This approach falls apart when trying to perform geometric operations like calculating areas or doing complex transformations.
  - NoSQL options like MongoDB are more easily scalable.
- Elasticsearch, which although it isn't a database solution in itself, it provides powerful geospatial query tools that can be used for GIS data.
  - Elasticsearch supports specialized field types like geo_point and geo_shape for storing and indexing geographic data. Similar to MongoDB's 2d/2dsphere, it supports queries like “find all documents within this bounding box,” “return documents within this polygon,” or “calculate the nearest point.”
  - The limitations are similar to MongoDB in the sense that complex geospatial operations are not supported.
Oracle Spatial Database
- Oracle has their own closed source database solution that also supports spatial data, built on their own flavor of SQL, that can be significantly more expensive to use than PostGIS.
  - In some domains ODS has more features, but in others PostGIS has more features, so it is mostly usecase dependent which is superior for each scenario.
  - Oracle's solution may make sense for people with larger budgets (i.e. enterprise customers) and for people that are already within the Oracle ecosystem.
  - PostGIS has the advantage of being an established open source project, which means that there is a greater chance that it will be maintained in the future, while also reducing upfront costs.

Tutorial and Example

Provides all the sudo installations and psql commands needed to install and run postgis in the remote Colab env, just make a copy in your own drive. (Note the link sharing stuff might be weird, it might be lionmail exclusive to open it?)

It also recreates a sample database using backup data, downloaded from one of the author's github: https://github.com/chrispower43/postgisdemo Alt link: https://s3.amazonaws.com/s3.cleverelephant.ca/postgis-workshop-2020.zip

The sample data is all NYC centric, including things like subway stations, streets, neighborhoods, and census data.

The tutorial includes some intro to the essential spatial datatypes, some sql queries that are very familiar to the relational ones we've been doing all semester, and then some postgis exclusive spatial joins and queries. It finishes with an extension of a Project 1 library database management system, using the actual coordinates of real libraries, converting it to the necessary geometric datatype, and running some queries on it.

Colab Notebook tutorial (⌐■_■): https://colab.research.google.com/drive/1DyH-hNTOldGNIneofR8i6hzylH9xbHdv?usp=sharing