System design of data engineering - RatneshKumarSrivastava/Ratnesh GitHub Wiki
https://medium.com/@poojatripathi0697/how-to-read-a-100gb-file-in-pyspark-without-breaking-your-cluster-f23961e0f7b0
https://medium.com/@vishalbarvaliya/system-design-for-data-engineers-fdcd5b626f1a
System design -
- Data source - connectors, API calls, files
- Data ingestion - for batch, Apache NiFi and AWS Glue; for real time, Kafka and GCP Pub/Sub
- Data storage - for structured data, relational databases (PostgreSQL, MySQL, Oracle); for unstructured data, NoSQL databases (MongoDB) and object storage (GCS, S3); for semi-structured data (JSON), Firestore
- Data processing - for large-scale datasets, Apache Spark and Hadoop; for smaller workloads, other tools such as Python scripts, Talend, Informatica
- Data analytics - SQL, Tableau, MicroStrategy, Power BI, BigQuery
- Data visualization - dashboards
- Data governance - GDPR

Steps - requirements: volume, velocity (frequency), and variety of data, plus the business requirements (purpose and goal).
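
The stages listed above can be sketched end to end on a toy scale. This is only an illustrative sketch using the Python standard library: an inline CSV string stands in for a file source, an in-memory SQLite database stands in for a relational store such as PostgreSQL, and a SQL aggregate stands in for the analytics layer. The sample data, the `orders` table, and all function names are hypothetical and not part of the original notes.

```python
import csv
import io
import sqlite3

# Hypothetical sample input standing in for a file-based data source.
RAW_CSV = """order_id,region,amount
1,north,120.50
2,south,80.00
3,north,42.25
"""


def ingest(raw_text):
    """Batch ingestion: parse CSV text into a list of dicts."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def store(conn, rows):
    """Storage: load structured rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, region TEXT, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (:order_id, :region, :amount)", rows
    )
    conn.commit()


def analyze(conn):
    """Processing and analytics: total amount per region via SQL."""
    cur = conn.execute(
        "SELECT region, ROUND(SUM(amount), 2) FROM orders "
        "GROUP BY region ORDER BY region"
    )
    return cur.fetchall()


def run_pipeline(raw_text=RAW_CSV):
    """Wire the stages together: source -> ingestion -> storage -> analytics."""
    conn = sqlite3.connect(":memory:")
    try:
        store(conn, ingest(raw_text))
        return analyze(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    for region, total in run_pipeline():
        print(f"{region}: {total}")
```

In a real system each function would be replaced by the tool named in the list (for example Kafka for ingestion or Spark for processing); keeping the stages as separate functions makes that kind of swap easier.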
Best practices - scalability, reliability (recovery from failure), performance, security, documentation.
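
One common way to approach the reliability point (recovering from failure) is to retry a failed pipeline step with exponential backoff. The sketch below is a minimal example under that assumption, not a prescribed method; `retry`, `flaky_load`, and the default values are hypothetical names and numbers chosen for illustration.

```python
import time


def retry(task, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Run task(); after a failure wait base_delay * 2**n, then try again."""
    for attempt in range(attempts):
        try:
            return task()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)


# Hypothetical flaky step: fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_load():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient failure")
    return "loaded"


if __name__ == "__main__":
    print(retry(flaky_load))
```

The `sleep` parameter is injectable so the waiting behaviour can be replaced in tests. In production code it is usually better to catch only the exception types known to be transient instead of `Exception`.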