Data Engineering - bobbae/gcp GitHub Wiki
Data Engineering is a set of operations aimed at creating interfaces and mechanisms for the flow and access of information. Data engineers set up and operate the organization’s data infrastructure, preparing it for further analysis by data analysts and scientists.
Many free resources, roadmaps, and courses are available for learning data engineering.
Data Science
Data Science is about finding patterns in data in order to make predictions.
Data Analytics
What is the difference between Data Science vs Data Engineering?
What's the difference between Data Analytics vs Data Engineering?
Google Analytics 4
Google Analytics 4 is an analytics service that enables you to measure traffic and engagement across your websites and apps.
https://developers.google.com/analytics/devguides/collection/ga4
GCP Data Analyst training
https://github.com/GoogleCloudPlatformTraining/training-data-analyst
Security for data engineering
https://medium.com/p/26516b5d28e4
Data analyst driven organizations
https://cloud.google.com/blog/products/data-analytics/building-the-data-analyst-driven-organization
Data Analytics developments
https://cloud.google.com/blog/products/data-analytics/google-cloud-next-rollup-for-data-analytics
Data Analytics Design Patterns
Data transformation
https://cloud.google.com/blog/products/data-analytics/unlocking-opportunities-data-transformation
Open data system
Database options
One of Google Cloud Platform's competitive advantages is the strong ecosystem of managed databases.
Choosing the right database for your workloads can be confusing. We can compare different GCP Database services and make the best decision for each use case.
Beyond the GCP database services, many other databases are available.
Data Engineering Cookbook
https://github.com/andkret/Cookbook
Cloud SQL
Cloud SQL is a fully-managed relational database service on Google Cloud Platform.
You can use Cloud SQL with MySQL, PostgreSQL, or SQL Server.
SQL
SQL stands for Structured Query Language. SQL is used to communicate with a database.
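To make "communicating with a database" concrete, here is a minimal sketch that uses Python's built-in sqlite3 module rather than Cloud SQL itself. The `events` table and its rows are hypothetical and exist only for illustration; the same statements would look much the same against MySQL or PostgreSQL.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, used only for illustration.
conn.execute("CREATE TABLE events (user_id TEXT, action TEXT)")
conn.executemany(
    "INSERT INTO events (user_id, action) VALUES (?, ?)",
    [("alice", "view"), ("alice", "click"), ("bob", "view")],
)

# Ask a question of the data: how many events did each user generate?
rows = conn.execute(
    "SELECT user_id, COUNT(*) FROM events GROUP BY user_id ORDER BY user_id"
).fetchall()

print(rows)
conn.close()
```

The `CREATE`, `INSERT`, and `SELECT` statements are the SQL; Python only carries them to the database and collects the result.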
Cloud Bigtable
Bigtable is ideal for storing very large amounts of data in a key-value store. Bigtable supports high read and write throughput at low latency.
Cloud Dataproc
Dataproc is a fully managed and highly scalable service for running Apache Hadoop, Apache Spark, Apache Flink, Presto, and 30+ open source tools and frameworks.
Cloud Pub/Sub
Pub/Sub works as messaging middleware for traditional service integration or as a simple communication medium for modern microservices.
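The publish/subscribe pattern behind this can be sketched in a few lines. This is a toy in-memory broker, not the Google Cloud Pub/Sub client library; the `InMemoryBroker` class, the `orders` topic, and the inbox lists are all hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List


class InMemoryBroker:
    """Toy message broker: each topic fans messages out to its subscribers."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[str], None]) -> None:
        # Register a callback to be invoked for every message on the topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, message: str) -> int:
        # Deliver the message to every subscriber; return how many received it.
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)


broker = InMemoryBroker()
billing_inbox: List[str] = []
shipping_inbox: List[str] = []

# Two independent services subscribe to the same topic.
broker.subscribe("orders", billing_inbox.append)
broker.subscribe("orders", shipping_inbox.append)

delivered = broker.publish("orders", "order-1 created")
print(delivered, billing_inbox, shipping_inbox)
```

The publisher never refers to the billing or shipping services directly, which is the decoupling that makes the pattern useful for service integration. A managed service adds durability, retries, and delivery across machines, none of which this sketch attempts.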
BigQuery
BigQuery is a serverless, cost-effective, multi-cloud data warehouse designed to help you turn big data into valuable business insights.
Cloud Spanner
Cloud Spanner is a distributed SQL database management and storage service that is scalable, multi-version, globally-distributed, and synchronously-replicated.
Cloud Data Fusion
Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines.
CDAP
Cloud Data Fusion is powered by the open source project CDAP.
https://cloud.google.com/data-fusion/docs
Dataflow
Dataflow is a fully managed streaming analytics service that minimizes latency, processing time, and cost through autoscaling and batch processing.
Cloud Dataprep
Dataprep is an interactive web application in which users define the data preparation rules by interacting with a sample of their data.
Cloud Composer
Cloud Composer is a fully managed data workflow orchestration service that empowers you to author, schedule, and monitor pipelines.
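Cloud Composer is built on Apache Airflow, where a pipeline is a directed acyclic graph of tasks. The core idea of orchestration, running each task only after its dependencies, can be sketched with the standard library's `graphlib` (Python 3.9+). This is not the Airflow API; the task names below are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
    "report": {"load"},
}

# static_order() yields tasks so that every dependency precedes its dependents.
order = list(TopologicalSorter(pipeline).static_order())

for task in order:
    print(f"running {task}")
```

A real orchestrator layers scheduling, retries, and monitoring on top of this ordering step.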
Google Data Studio
Data Studio is a free tool that turns your data into informative, easy to read, easy to share, and fully customizable dashboards and reports.
https://support.google.com/datastudio#topic=6267740
Demonstration of Data Studio.
https://www.youtube.com/watch?v=NhGLOVkyKjg
Datalab
Cloud Datalab is an interactive tool for exploring, visualizing, analyzing, and transforming data using familiar languages such as Python and SQL.
Looker
Looker is business intelligence software and a big data analytics platform that helps you explore, analyze, and share real-time business analytics easily.
Data Catalog
Data Catalog is a fully managed and scalable metadata management service that empowers organizations to quickly discover, understand, and manage all of their data.
https://cloud.google.com/data-catalog/docs
Cloud Life Sciences
Cloud Life Sciences is a suite of services and tools for managing, processing, and transforming life sciences data. It also enables advanced insights and operational workflows using highly scalable and compliant infrastructure.
Firestore
Cloud Firestore is a cloud-hosted, NoSQL database that your iOS, Android, and web apps can access directly via native SDKs.
Datastore
Datastore is a schema-less database, which allows you to worry less about making changes to your underlying data structure as your application evolves.
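What "schema-less" means in practice can be illustrated with plain Python dictionaries standing in for entities. This is only a sketch of the idea, not the Datastore client library; the `users` collection, its properties, and the `get_email` helper are hypothetical.

```python
from typing import Any, Dict

# Hypothetical collection of entities, keyed by entity ID.
users: Dict[str, Dict[str, Any]] = {}

# An early version of the application stores only a name.
users["u1"] = {"name": "Ada"}

# A later version adds a property; existing entities are not migrated.
users["u2"] = {"name": "Grace", "email": "grace@example.com"}


def get_email(entity: Dict[str, Any], default: str = "unknown") -> str:
    # Readers must tolerate entities written before the property existed.
    return entity.get("email", default)


emails = [get_email(users[key]) for key in sorted(users)]
print(emails)
```

The trade-off is that the application code, rather than the database, becomes responsible for handling missing or differently shaped properties.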
Firestore in Datastore mode
Firestore in Datastore mode is a NoSQL document database built for automatic scaling, high performance, and ease of application development.
https://cloud.google.com/datastore/docs
Memorystore
Memorystore automates complex tasks for open source Redis and Memcached like enabling high availability, failover, patching, and monitoring.
Firebase
Firebase is Google's mobile platform that helps you quickly develop high-quality apps and grow your business.
Data Transfer
There are various Data Transfer options in GCP.
Data Ingestion
https://cloud.google.com/blog/products/data-analytics/data-ingestion-planning-principles
Data Lakehouse
https://cloud.google.com/blog/products/data-analytics/open-data-lakehouse-on-google-cloud
Data Mesh
https://martinfowler.com/articles/data-mesh-principles.html
https://medium.com/google-cloud/10-reasons-why-you-should-not-adopt-data-mesh-7a0b045ea40f
Data Mesh on GCP using BigQuery and Dataplex
Examples
Data stack architecture at Monzo
https://medium.com/data-monzo/an-introduction-to-monzos-data-stack-827ae531bc99