technical details - Remi-C/Pointcloud_in_db GitHub Wiki

## Technical details

About performance:

  • Loading
    • If you have binary data, use pg_bulkload to speed things up.
    • The query that builds patches is much faster with fewer points, so it may be interesting to split your points into small files.
  • Querying
  • Server

  • Details on architecture choices
    • We assumed that querying a database table can be fast if there are indexes and the table is not too big.
    • The cost of building/storing/maintaining indexes strongly depends on the number of rows in a table, so we cannot store billions of points as billions of rows. This led to the idea that we have to regroup points.
    • The way we regroup points is directly linked to the way we would like to retrieve them. If we intend to find points near something or in a certain area, we should group points spatially. If we intend to get all points acquired between time T and T+X, we should group points temporally. If we intend to get points sharing a common property (like intensity), we could group by attribute. To a certain extent it can be efficient to query points using several criteria.
    • For the use case we tested, the time of acquisition is somewhat linked to the position, as points acquired around the same time tend to be relatively close together. Therefore it can be efficient to filter by time of acquisition on small, spatially grouped groups of points.
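    As a sketch of spatial grouping with pgpointcloud, assuming a hypothetical `points` table with a `pcpoint` column `pt` and the pointcloud_postgis extension installed (table, column, and grid size are assumptions):

    ```sql
    -- Group points into one patch per 10 m grid cell, so points that are
    -- close in space (and, in our use case, in acquisition time) end up
    -- in the same row.
    CREATE TABLE patches AS
    SELECT PC_Patch(pt) AS pa
    FROM points
    GROUP BY ST_SnapToGrid(pt::geometry, 10.0);
    ```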

  • Why should the number of rows be kept low?

    • The way Postgres works, querying a table is more efficient when the table's index fits into the server's memory. If the table itself also fits into memory, it is even better.
    • We also have to take update/insert/delete operations into account (cost, locking).
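    A quick way to sanity-check whether the index fits in memory, sketched with hypothetical table and index names:

    ```sql
    -- Compare index and table size against the memory you have
    -- ('patches' and 'patches_pkey' are assumptions).
    SELECT pg_size_pretty(pg_relation_size('patches'))      AS table_size,
           pg_size_pretty(pg_relation_size('patches_pkey')) AS index_size;
    -- If index_size fits comfortably within shared_buffers / RAM,
    -- index lookups stay fast.
    SHOW shared_buffers;
    ```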
  • The alternatives

    • The problem of having lots of data and querying it efficiently in Postgres can be reduced to "not having one table with too many rows". So either we split the table into several smaller tables, or we regroup rows to lower the count.
      • The classical solution is table partitioning, that is, splitting the table into several smaller tables while using the Postgres inheritance feature. This means that all the tables are children of one master table representing all the data. This is used with great success, but has several cons. In short, a lot of work is pushed onto the user.
        • First, when querying the master table, in order for the server not to look into every inherited table (which would not scale, and would keep indexes from fitting into memory), we need to define constraints so the query planner can skip whole tables. The power of table partitioning therefore lies in the constraints we define on the inherited tables. For instance, we could decide that table 1 corresponds to one area, table 2 to another area, and so on.
        • So to make it work, we have to manually take care of the constraints for every table, meaning enforcing and updating them. This can be a lot of work, and the operations are slow.
        • Currently, Postgres can't handle inserts or updates on the master table, so it's up to the user to deal with them. If we want to change a point attribute, we have to programmatically find which table it is in, then adapt the query to update it in that table. This can be done using rewrite rules, but again it is a lot of work.
        • Lastly, we are generally not interested in a single point, so why bother creating and maintaining a complex system to get points one by one when we generally want to work with hundreds of thousands?
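        A minimal sketch of this classical approach (Postgres 9.x inheritance; table names and the split criterion are assumptions):

        ```sql
        -- Master table plus two children, each constrained to one area.
        CREATE TABLE points_master (id bigint, x float8, y float8);
        CREATE TABLE points_west (CHECK (x < 1000))  INHERITS (points_master);
        CREATE TABLE points_east (CHECK (x >= 1000)) INHERITS (points_master);

        -- Route inserts by hand, since the master table won't do it for us.
        CREATE FUNCTION points_insert_router() RETURNS trigger AS $$
        BEGIN
          IF NEW.x < 1000 THEN INSERT INTO points_west VALUES (NEW.*);
          ELSE                 INSERT INTO points_east VALUES (NEW.*);
          END IF;
          RETURN NULL;  -- cancel the insert into the master table itself
        END; $$ LANGUAGE plpgsql;
        CREATE TRIGGER route_points BEFORE INSERT ON points_master
          FOR EACH ROW EXECUTE PROCEDURE points_insert_router();

        -- With constraint exclusion on, the planner skips points_east here:
        SET constraint_exclusion = partition;
        SELECT count(*) FROM points_master WHERE x < 500;
        ```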
      • The other solution is to regroup rows so as to have fewer but larger rows. With this solution we have to deal with the maximum row size issue.
        • The maximum row size is user defined and is around 8 kB, which means between 100 and 300 points per row. A multi-billion-point acquisition would therefore still take several tens of millions of rows, which is too much.
        • Therefore, to be able to scale, we use an internal Postgres feature: the TOAST table. When storing more than the maximum size allowed per row, Postgres stores the object in a TOAST table, which can be seen as a shadow table, as it is totally transparent to the user.
        • With previous versions of Postgres this could have been a problem: when filtering on a row, the whole row was read, which could be very costly with large rows. This is why we chose Postgres 9.2, where index-only scans were introduced. In short, Postgres tries not to read the row when it doesn't need it. This guarantees good performance.
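        A minimal way to see the index-only scan at work, assuming a hypothetical `patches` table with a primary key `id` (in practice the patch column would be a `pcpatch`; `bytea` stands in here):

        ```sql
        CREATE TABLE patches (id bigserial PRIMARY KEY, pa bytea);
        -- ...load patches, then make sure the visibility map is up to date:
        VACUUM ANALYZE patches;
        -- Only indexed columns are requested, so the large (TOASTed) patch
        -- column is never read; the plan should show "Index Only Scan".
        EXPLAIN SELECT id FROM patches WHERE id BETWEEN 100 AND 200;
        ```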
  • Balance between row size/patch size/table size/data compression

    • It may be interesting to try to keep the row size under the maximum row size limit, so as to avoid using the TOAST table. Alternatively, the maximum row size could be increased.
    • The distribution of the number of points per patch could be studied, as having almost empty patches is inefficient both for querying (it increases the number of rows) and for compression (points within a patch are compressed together).
    • The TOAST table is compressed by Postgres; it could be interesting to turn this off. In our experiments it was not worth it, as the time gained on compressing/uncompressing is less than the additional time needed to read/write the bigger data.
    • The patches themselves are also compressed; we didn't notice faster processing when storing them uncompressed. This may however be very implementation dependent.
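    Both knobs can be sketched as follows, assuming a hypothetical `patches(pa)` table where `pa` is a `pcpatch` column:

    ```sql
    -- Turn off TOAST compression for the patch column while keeping
    -- out-of-line storage (EXTERNAL = out of line, uncompressed):
    ALTER TABLE patches ALTER COLUMN pa SET STORAGE EXTERNAL;

    -- Study the distribution of points per patch to spot almost-empty
    -- patches (PC_NumPoints is a pgpointcloud function):
    SELECT PC_NumPoints(pa) AS points_per_patch, count(*) AS n_patches
    FROM patches
    GROUP BY 1
    ORDER BY 1;
    ```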