FAQ - heterodb/pg-strom GitHub Wiki

This note collects frequently asked questions (FAQ) from people interested in PG-Strom.

PG-Strom features

Why is a GPU-based Sorting plan not supported?

Actually, it has been supported since v6.0. Combining the sort with LIMIT or a window function (e.g., rank() < 10) reduces the number of rows that must be written back to CPU memory.
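For illustration, a query of the following shape lets the GPU-side sort return only the top rows per group instead of the full sorted result (the table t and its columns are hypothetical, used only for this sketch):

```sql
-- Hypothetical example: only rows satisfying rank() < 10 per category
-- need to be written back to CPU memory, not the whole sorted table.
SELECT *
  FROM (SELECT *,
               rank() OVER (PARTITION BY category
                            ORDER BY score DESC) AS rk
          FROM t) AS sub
 WHERE rk < 10;
```

A plain ORDER BY ... LIMIT N query benefits in the same way, since only N rows survive the GPU-side sort.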

Supported devices/platforms

Which GPUs support GPU-Direct SQL?

We implemented the GPU-Direct SQL feature on top of NVIDIA GPUDirect RDMA (https://docs.nvidia.com/cuda/gpudirect-rdma/index.html). It allows a third-party Linux kernel module (such as nvme_strom.ko by HeteroDB) to mediate direct data transfer from other PCI-E devices to the GPU. However, this functionality is enabled only on Tesla/Quadro class devices, so we cannot support SSD-to-GPU Direct SQL on GeForce RTX/GTX devices. In addition, most GPU devices, except for high-end Tesla models, expose only 256MB of PCI-E BAR1 memory space, which is used as the window for P2P DMA/RDMA data transfers. For PG-Strom's usage, 256MB is too small for concurrent and multiplexed data transfer. Right now, only the Tesla V100, P100 and P40 offer PCI-E BAR1 memory space larger than their physical device memory. (The NVIDIA A100 also has a large PCI-E BAR1 memory space.) So, we support SSD-to-GPU Direct SQL on the following devices only:

  • NVIDIA A100 (Ampere gen)
  • NVIDIA A40 (Ampere gen)
  • NVIDIA Tesla V100 (Volta gen)
  • NVIDIA Tesla P100 (Pascal gen)
  • NVIDIA Tesla P40 (Pascal gen)

Why are Maxwell/Kepler GPUs not supported?

The Pascal generation newly supports demand paging, which allocates physical page frames at run-time, much like a modern operating system does. This minimizes device memory consumption and thus provides a significant improvement for database workloads.

Although we estimate the number of result rows for SCAN, JOIN and GROUP BY during query optimization, we basically cannot know the exact result size until the query is actually executed. On the other hand, when we run these workloads on the GPU, we have to allocate device memory for the result buffer prior to launching the GPU kernel. We could allocate device memory based on the estimation plus some margin, but this is not perfect: the result size may still exceed the estimation with margin, while a large margin increases dead space that other concurrent jobs could otherwise use.

Once we utilize the demand paging feature of Pascal or later, the implementation becomes much simpler. Even if we reserve a very large memory address space, it does not consume physical device memory immediately; device memory is assigned as the result buffer actually grows. PG-Strom is now designed to rely entirely on this demand paging feature. Supporting the older architectures is not an easy job, and we have no plan to support them again.
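The approach above can be sketched with CUDA managed memory. This is an illustrative fragment, not PG-Strom's actual code, and the 8GB reservation size is an arbitrary assumption:

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void)
{
    void   *results;
    size_t  reserved = 8UL << 30;   /* reserve 8GB of address space (arbitrary) */

    /* On Pascal or later, cudaMallocManaged reserves address space without
     * immediately consuming physical device memory; pages are populated on
     * demand as kernels actually touch the result buffer. */
    if (cudaMallocManaged(&results, reserved, cudaMemAttachGlobal) != cudaSuccess)
    {
        fprintf(stderr, "failed to reserve managed memory\n");
        return 1;
    }
    /* ... launch SCAN/JOIN/GROUP BY kernels that write into 'results';
     * only the pages actually written consume device memory ... */
    cudaFree(results);
    return 0;
}
```

The key point is that the reservation can safely be far larger than any reasonable estimate of the result size, eliminating the need to guess a margin up front.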