Case study::batching in psycopg2

Vince: Is there a reason not to use a server-side cursor with an INSERT statement? In my head, I could feed that cursor a bunch of records and wait for it to return a batch of work, but it looks like `from psycopg2.extras import execute_batch` does that already? Is that why a server-side cursor doesn't help us here?

Vince 6:04: I promise I'm only 5 more years of experience away from understanding database cursors :face_with_rolling_eyes:

Nick Potts 6:11 PM: Unless you absolutely know you need a database cursor, you don't need a database cursor.

Vincent Buscarello 6:17 PM: Right, I'm just curious as to why, especially because I do care about the data returned from the insert.

Nick Potts 6:21 PM: Most likely because of performance issues. Your operation of inserting 2M rows will be IO-limited. ... [execute_batch and execute_values] allow you to perform updates in blocks that would otherwise take forever because of table locks.
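
To make the suggestion concrete, here is a minimal sketch of `execute_batch` and `execute_values` from `psycopg2.extras`; the DSN, table names, and rows are placeholders, not anything from the discussion above:

```python
import psycopg2
from psycopg2.extras import execute_batch, execute_values

conn = psycopg2.connect("dbname=example")  # placeholder DSN
rows = [(1, "a"), (2, "b"), (3, "c")]      # placeholder data

with conn, conn.cursor() as cur:
    # execute_batch groups many INSERT statements into far fewer
    # server round-trips.
    execute_batch(
        cur,
        "INSERT INTO items (id, name) VALUES (%s, %s)",
        rows,
        page_size=1000,  # statements sent per round-trip
    )
    # execute_values folds all rows into a single multi-row VALUES
    # statement, which is usually faster still for plain INSERTs.
    execute_values(cur, "INSERT INTO items_copy (id, name) VALUES %s", rows)
```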

Matt Wood 9:04 AM: I agree with pretty much everything Nick wrote above. Server-side cursors are to be avoided unless you have a few specific needs:

  • You intend to query multiple times to get the full data set (or at least the portion you want).
    • The canonical example here is pagination, though that can run afoul of specific need #2 below. All of this indicates that your returned data is too large, or takes too long, to query in one trip to the DB. If you're getting your data in one trip, you very much want to avoid a server-side cursor.
  • You're going to continue your query soon. If you want to query the second half of your data tomorrow, don't use a server-side cursor; we don't want to leave the server-side cursor sitting around for a significant stretch of time. (A minimal sketch of a server-side cursor follows this list.)
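
For concreteness, here is a minimal sketch of what a server-side cursor looks like in psycopg2, with a placeholder DSN and table and a hypothetical `process()` handler; giving the cursor a name is what makes it server-side:

```python
import psycopg2

conn = psycopg2.connect("dbname=example")  # placeholder DSN

# Passing name= makes this a server-side (named) cursor: the result set
# stays on the server and is pulled over the wire in batches.
with conn, conn.cursor(name="big_read") as cur:
    cur.itersize = 2000  # rows fetched per network round-trip
    cur.execute("SELECT * FROM big_table")  # placeholder query
    for row in cur:  # iteration fetches itersize-row batches behind the scenes
        process(row)  # hypothetical per-row handler
# Exiting the block closes the cursor, releasing the server-side state.
```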

To my knowledge, we use server-side cursors in exactly 2 places:

  • Update-metrics, to pull all policies that st_intersects with a geometry. This dataset is potentially HUGE (Tokyo) and the query can be slow. We intend to run through the entirety of the dataset (not just look for the top 10 or top 100), so the server-side cursor allows us to pull the potentially massive dataset in manageable pieces, and we clean up the server-side cursor quickly once we're done.
  • Download-scripts, to pull all the policies that were in the most recent scrape. This dataset is also potentially HUGE (in fact, potentially larger than update-metrics, as its query is a single geom-type). In this case the query isn't as slow because it doesn't need to run st_intersects, but the Download-scripts server has an ugly tendency to run out of memory, so we have to handle the giant dataset in pieces and push each chunk off to AWS S3 before we handle the next portion of the query.

In both cases, we know the data is potentially too large to handle in one trip to the DB or to hold in memory at once, and we know we'll clean up after ourselves quickly once we're done.
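
The pattern both places follow might look roughly like this sketch; the DSN, query, chunk size, and `handle_chunk()` step (e.g. computing metrics or pushing a piece to S3) are placeholders, not the actual update-metrics or download-scripts code:

```python
import psycopg2

conn = psycopg2.connect("dbname=example")  # placeholder DSN

with conn, conn.cursor(name="policy_scan") as cur:
    cur.execute("SELECT * FROM policies")  # placeholder query
    while True:
        chunk = cur.fetchmany(size=5000)  # one manageable piece at a time
        if not chunk:
            break
        handle_chunk(chunk)  # hypothetical: compute metrics / upload to S3
# The cursor is closed as soon as the block exits, so the server-side
# state is cleaned up quickly once we're done.
```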

If you were going to run a simple, normal select or insert, there's little to no benefit from the server-side cursor. It's all about creating a stateful marker on the DB to keep track of what has been queried thus far. So if you're getting all your data in one go… that cost brings zero benefit.

More info - https://hakibenita.com/fast-load-data-python-postgresql