mpp Massively parallel processing - ghdrako/doc_snipets GitHub Wiki

Massively parallel processing (MPP) refers to the use of many processors (or computers) to perform coordinated computations simultaneously by communicating with each other as needed.

MPP technologies encompass databases and query engines designed to handle large-scale data processing across many independent nodes. Traditional MPP databases, such as HP Vertica and Teradata, are self-contained systems that manage data storage, processing, and optimization internally. They are engineered for high performance on dedicated hardware and are optimized for complex analytics on structured data.

In contrast, modern MPP query engines such as Athena, Presto, and Impala decouple compute from storage. These engines are designed to work with data stored in distributed filesystems such as the Hadoop Distributed File System (HDFS) or in object storage such as Amazon S3 and Azure Data Lake Storage (ADLS). They are inherently more flexible, allowing users to query data where it resides without first importing it into a proprietary system. You only need to describe the data to be read, which is generally done in a technical data catalog (metadata layer).

The role of a technical catalog, also known as a metadata layer (for example, AWS Glue or Hive Metastore), is critical in modern MPP technologies. It stores metadata about the data sources, including the location of the files, the format, and the structure of the data. This metadata is used by the query engines to understand where and how to access the data, enabling them to perform queries efficiently without needing to manage the storage layer. It acts as a map of data, allowing MPP engines to optimize query execution by distributing workloads and retrieving only relevant data.
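To make the three pieces of metadata named above (location, format, structure) concrete, here is a minimal sketch of what a catalog entry might hold. It is not the AWS Glue or Hive Metastore API: the `Column`, `TableEntry`, and `Catalog` classes, the `sales.products` table, and the `s3://example-bucket/...` path are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: str


@dataclass(frozen=True)
class TableEntry:
    """Hypothetical, simplified catalog record for one table."""
    database: str
    name: str
    location: str     # where the files live
    file_format: str  # how to decode them
    columns: tuple    # structure of the data (a tuple of Column objects)


class Catalog:
    """Toy in-memory catalog; a real one is a shared, persistent service."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        self._tables[(entry.database, entry.name)] = entry

    def lookup(self, database, name):
        return self._tables[(database, name)]


catalog = Catalog()
catalog.register(TableEntry(
    database="sales",
    name="products",
    location="s3://example-bucket/sales/products/",  # made-up path
    file_format="parquet",
    columns=(Column("ProductName", "string"), Column("Quantity", "int")),
))

# A query engine would perform a lookup like this before touching any file.
entry = catalog.lookup("sales", "products")
print(entry.location, entry.file_format, [c.name for c in entry.columns])
```

Note that the entry holds no data, only a description of it. This is why the same files can be queried by several engines at once without being copied.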

Figure: The architecture of modern MPP

In this architecture, S3 is the storage system and the Glue Data Catalog is the technical catalog. SQL-like queries are translated into operations against the storage layer. For example, when a query selects products with a quantity greater than 15, the engine uses the catalog to resolve the table referenced by the query: where its files are located on S3, and how to read the ProductName column for projection and the Quantity column for filtering. The catalog thereby ensures that the engine queries the correct datasets and interprets the file contents accurately, which is vital for returning correct results quickly.
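The flow described here can be sketched as a toy program. This is a simplified illustration under stated assumptions, not how Athena or Presto is implemented: a dictionary stands in for S3, another dictionary stands in for the catalog, threads stand in for worker nodes, and the sample rows, paths, and function names (`scan_file`, `run_query`) are invented for the example.

```python
import csv
import io
from concurrent.futures import ThreadPoolExecutor

# Stand-in for object storage: path -> file contents (made-up sample data).
STORAGE = {
    "s3://example-bucket/products/part-0.csv": "Widget,10\nGadget,20\n",
    "s3://example-bucket/products/part-1.csv": "Sprocket,30\nBolt,5\n",
}

# Stand-in for the technical catalog: table -> location, format, schema.
CATALOG = {
    "products": {
        "location": "s3://example-bucket/products/",
        "format": "csv",
        "columns": ["ProductName", "Quantity"],
        "types": {"ProductName": str, "Quantity": int},
    }
}


def scan_file(path, table, projection, predicate):
    """One worker's task: decode a file with the catalog schema, filter, project."""
    meta = CATALOG[table]
    rows = []
    for raw in csv.reader(io.StringIO(STORAGE[path])):
        record = {
            col: meta["types"][col](val)
            for col, val in zip(meta["columns"], raw)
        }
        if predicate(record):
            rows.append(tuple(record[col] for col in projection))
    return rows


def run_query(table, projection, predicate):
    """Coordinator: find the files via the catalog, then scan them in parallel."""
    meta = CATALOG[table]
    paths = sorted(p for p in STORAGE if p.startswith(meta["location"]))
    with ThreadPoolExecutor(max_workers=4) as pool:
        partials = list(
            pool.map(lambda p: scan_file(p, table, projection, predicate), paths)
        )
    # Merge the per-file partial results into one result set.
    return [row for part in partials for row in part]


# Equivalent in spirit to: SELECT ProductName FROM products WHERE Quantity > 15
result = run_query("products", ["ProductName"], lambda r: r["Quantity"] > 15)
print(result)  # [('Gadget',), ('Sprocket',)]
```

The raw files carry no column names or types. Without the catalog entry, the worker could not tell which field is `Quantity` or that it should be compared as an integer.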