Data Engineering - bobbae/gcp GitHub Wiki

Data Engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. Data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists.

There are awesome and freely available resources, roadmaps and courses.

Data Science

Data Science is about finding patterns in data in order to make predictions.

Data Analytics

What is the difference between Data Science and Data Engineering?

What is the difference between Data Analytics and Data Engineering?

Google Analytics 4

Google Analytics 4 is an analytics service that enables you to measure traffic and engagement across your websites and apps.

https://developers.google.com/analytics/devguides/collection/ga4

GCP Data Analyst training

https://github.com/GoogleCloudPlatformTraining/training-data-analyst

Security for data engineering

https://medium.com/p/26516b5d28e4

Data analyst driven organizations

https://cloud.google.com/blog/products/data-analytics/building-the-data-analyst-driven-organization

Data Analytics developments

https://cloud.google.com/blog/products/data-analytics/google-cloud-next-rollup-for-data-analytics

Data Analytics Design Patterns

https://cloud.google.com/blog/products/data-analytics/accelerate-time-to-value-with-google-cloud-design-patterns

Data transformation

https://cloud.google.com/blog/products/data-analytics/unlocking-opportunities-data-transformation

Open data system

https://cloud.google.com/blog/products/data-analytics/building-most-open-data-cloud-all-data-all-source-any-platform/

Database options

One of Google Cloud Platform's competitive advantages is the strong ecosystem of managed databases.

Choosing the right database for your workloads can be confusing. We can compare different GCP Database services and make the best decision for each use case.

Besides the GCP database services, there are also a lot of other databases.

https://dbdb.io/browse

Data Engineering Cookbook

https://github.com/andkret/Cookbook

Cloud SQL

Cloud SQL is a fully managed relational database service on Google Cloud Platform.

You can use Cloud SQL with MySQL, PostgreSQL, or SQL Server.

SQL

SQL stands for Structured Query Language. SQL is used to communicate with a database.
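
As a minimal sketch of what "communicating with a database" looks like, the following uses Python's built-in sqlite3 module with an in-memory database. The `events` table and its columns are hypothetical, chosen only for illustration, and the statements may need dialect adjustments on the engines that Cloud SQL offers.

```python
import sqlite3

# In-memory SQLite database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Define a table (hypothetical schema, for illustration only).
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT, amount REAL)")

# Insert rows with parameter placeholders rather than string formatting.
rows = [
    ("alice", "purchase", 20.0),
    ("bob", "purchase", 5.5),
    ("alice", "refund", -20.0),
    ("alice", "purchase", 12.5),
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Query: total purchase amount per user.
totals = conn.execute(
    "SELECT user_id, SUM(amount) FROM events "
    "WHERE action = 'purchase' GROUP BY user_id ORDER BY user_id"
).fetchall()
print(totals)
conn.close()
```

The same three steps (define, insert, query) carry over to any relational database; only the connection setup and some syntax details differ.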

Cloud Bigtable

Bigtable is ideal for storing very large amounts of data in a key-value store. Bigtable supports high read and write throughput at low latency.
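
To illustrate the key-value access pattern, here is a toy in-memory sketch. It is not the Bigtable client API. It assumes only that rows are kept sorted by row key, so that reading all rows sharing a key prefix becomes a contiguous range read. The class `ToyTable` and the `sensor#sequence` key layout are hypothetical.

```python
import bisect

class ToyTable:
    """Toy sorted key-value table illustrating lookups and prefix scans."""

    def __init__(self):
        self._keys = []   # row keys, kept in sorted order
        self._rows = {}   # row key -> value

    def put(self, key, value):
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = value

    def get(self, key):
        return self._rows.get(key)

    def scan_prefix(self, prefix):
        # Binary-search to the first key >= prefix, then read while keys match.
        start = bisect.bisect_left(self._keys, prefix)
        result = []
        for key in self._keys[start:]:
            if not key.startswith(prefix):
                break
            result.append((key, self._rows[key]))
        return result

table = ToyTable()
table.put("sensor-b#0001", 19.0)
table.put("sensor-a#0002", 21.5)
table.put("sensor-a#0001", 21.0)
scan = table.scan_prefix("sensor-a#")
print(scan)
```

The design point is that the row key is the only index, so the key layout decides which reads are cheap.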

Cloud Dataproc

Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.

Cloud Pub/Sub

Pub/Sub works as messaging middleware for traditional service integration or as a simple communication medium for modern microservices.
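
The publish/subscribe pattern itself can be sketched in a few lines. This is an in-memory stand-in, not the Pub/Sub client library: delivery here is synchronous and nothing is persisted, whereas a real messaging service delivers asynchronously. The class `ToyBroker` and the `orders` topic are hypothetical.

```python
from collections import defaultdict

class ToyBroker:
    """In-memory stand-in for a message broker (illustration only)."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        # Deliver the message to every subscriber of the topic.
        callbacks = self._subscribers[topic]
        for callback in callbacks:
            callback(message)
        return len(callbacks)

broker = ToyBroker()
billing_inbox, email_inbox = [], []
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", email_inbox.append)

# The publisher only knows the topic name, not who is listening.
delivered = broker.publish("orders", {"order_id": 42})
print(delivered, billing_inbox, email_inbox)
```

Because the publisher addresses a topic rather than a service, new consumers can be added without changing the producer.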

BigQuery

BigQuery is a serverless, cost-effective, multi-cloud data warehouse designed to help you turn big data into valuable business insights.

Cloud Spanner

Cloud Spanner is a distributed SQL database management and storage service that is scalable, multi-version, globally distributed, and synchronously replicated.

Cloud Data Fusion

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines.

CDAP

Cloud Data Fusion is powered by the open source project CDAP.

https://cloud.google.com/data-fusion/docs

Dataflow

Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing.

Cloud Dataprep

Dataprep is an interactive web application in which users define the data preparation rules by interacting with a sample of their data.

Cloud Composer

Cloud Composer is a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines.
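
Workflow orchestration comes down to running tasks in dependency order. Cloud Composer is built on Apache Airflow, where pipelines are declared as directed acyclic graphs (DAGs). The sketch below does not use Airflow; it only illustrates the ordering idea with Python's standard-library graphlib (Python 3.9+). The task names and the `run` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks it depends on (hypothetical pipeline).
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

def run(task):
    # Placeholder for real work such as a query or a file copy.
    return f"ran {task}"

# static_order() yields tasks so that every dependency comes first.
order = list(TopologicalSorter(pipeline).static_order())
log = [run(task) for task in order]
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top of this ordering step.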

Google Data Studio

Data Studio is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.

https://support.google.com/datastudio#topic=6267740

Demonstration of Data Studio.

https://www.youtube.com/watch?v=NhGLOVkyKjg

Datalab

Cloud Datalab can be used to easily explore, visualize, analyze, and transform data using familiar languages, such as Python and SQL, interactively.

Looker

Looker is business intelligence software and a big data analytics platform that helps you explore, analyze, and share real-time business analytics easily.

Data Catalog

Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, understand, and manage all of their data.

https://cloud.google.com/data-catalog/docs

Cloud Life Sciences

Cloud Life Sciences is a suite of services and tools for managing, processing, and transforming life sciences data. It also enables advanced insights and operational workflows using highly scalable and compliant infrastructure.

Firestore

Cloud Firestore is a cloud-hosted, NoSQL database that your iOS, Android, and web apps can access directly via native SDKs.

Datastore

Datastore is a schema-less database, which allows you to worry less about making changes to your underlying data structure as your application evolves.
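
To make "schema-less" concrete, here is a sketch that uses plain Python dictionaries as stand-ins for entities. It does not call the Datastore API. The kind `Task`, its property names, and the `priority_of` helper are hypothetical.

```python
# Entities of the same kind need not share the same set of properties.
tasks = [
    {"kind": "Task", "id": 1, "title": "Load data"},
    # A later version of the application adds a property without a migration.
    {"kind": "Task", "id": 2, "title": "Clean data", "priority": 3},
]

def priority_of(entity, default=0):
    # Readers must tolerate entities written before the property existed.
    return entity.get("priority", default)

priorities = [priority_of(task) for task in tasks]
print(priorities)
```

The trade-off is that the application code, rather than the database, becomes responsible for handling missing or differently shaped properties.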

Firestore in Datastore mode

Firestore in Datastore mode is a NoSQL document database built for automatic scaling, high performance, and ease of application development.

https://cloud.google.com/datastore/docs

Memorystore

Memorystore automates complex tasks for open source Redis and Memcached, such as enabling high availability, failover, patching, and monitoring.

Firebase

Firebase is Google's mobile platform that helps you quickly develop high-quality apps and grow your business.

Data Transfer

There are various Data Transfer options in GCP.

Data Ingestion

https://cloud.google.com/blog/products/data-analytics/data-ingestion-planning-principles

Data Lakehouse

https://cloud.google.com/blog/products/data-analytics/open-data-lakehouse-on-google-cloud

Data Mesh

https://martinfowler.com/articles/data-mesh-principles.html

https://medium.com/google-cloud/10-reasons-why-you-should-not-adopt-data-mesh-7a0b045ea40f

Data Mesh on GCP using BigQuery and Dataplex

https://cloud.google.com/blog/products/data-analytics/building-a-data-mesh-on-google-cloud-using-bigquery-and-dataplex/

Examples

Data stack architecture at Monzo

https://medium.com/data-monzo/an-introduction-to-monzos-data-stack-827ae531bc99

Data architecture at Twitter

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2021/processing-billions-of-events-in-real-time-at-twitter-

Ingesting GA4 Events for Discovery solutions

https://cloud.google.com/blog/topics/developers-practitioners/ingesting-ga4-events-discovery-solutions