System design of data engineering - RatneshKumarSrivastava/Ratnesh GitHub Wiki

https://medium.com/@poojatripathi0697/how-to-read-a-100gb-file-in-pyspark-without-breaking-your-cluster-f23961e0f7b0
https://medium.com/@vishalbarvaliya/system-design-for-data-engineers-fdcd5b626f1a

System design:

  1. Data source - connectors, API calls, files
  2. Data ingestion - batch: Apache NiFi, AWS Glue; real time: Kafka, GCP Pub/Sub
  3. Data storage - structured data: relational databases (PostgreSQL, MySQL, Oracle); unstructured data: NoSQL databases (MongoDB) and object storage (GCS, S3); semi-structured data (JSON): Firestore
  4. Data processing - large-scale datasets: Apache Spark and Hadoop; smaller workloads: Python scripts, Talend, Informatica
  5. Data analytics - SQL, Tableau, MicroStrategy, Power BI, BigQuery
  6. Data visualization - dashboards
  7. Data governance - GDPR

Steps: start from the requirements - volume, velocity (frequency) and variety of the data, plus the business requirement (purpose and goal).
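
The stages listed above can be sketched end to end on a toy scale. This is only a minimal illustration, assuming a small in-memory CSV as the file source and SQLite (Python standard library) standing in for the relational store; the sample data, the `orders` table, and all function names are hypothetical, and a real design would swap in the tools named in the list (Kafka, Spark, PostgreSQL, and so on).

```python
import csv
import io
import json
import sqlite3

# Hypothetical sample input standing in for a file-based data source.
RAW_CSV = "order_id,amount\n1,10.5\n2,4.5\n3,20.0\n"


def ingest(raw_text):
    """Batch ingestion: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def store(conn, rows):
    """Storage: load structured rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, amount REAL)"
    )
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (?, ?)",
        [(int(r["order_id"]), float(r["amount"])) for r in rows],
    )
    conn.commit()


def analyze(conn):
    """Processing and analytics: aggregate the stored rows with SQL."""
    count, total = conn.execute(
        "SELECT COUNT(*), SUM(amount) FROM orders"
    ).fetchone()
    return {"order_count": count, "total_amount": total}


def run_pipeline(raw_text):
    """Chain the stages: source -> ingestion -> storage -> analytics."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing persisted
    try:
        store(conn, ingest(raw_text))
        return analyze(conn)
    finally:
        conn.close()


if __name__ == "__main__":
    # The JSON summary is what a dashboard layer would consume.
    print(json.dumps(run_pipeline(RAW_CSV)))
```

Keeping each stage as a separate function mirrors the list: any one stage can be replaced (for example, a streaming consumer instead of `ingest`) without touching the others.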

Best practices - scalability, reliability (recovery from failure), performance, security, documentation.
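
As one possible illustration of the reliability point, a pipeline step can be wrapped in a retry with exponential backoff so that transient failures do not fail the whole run. This is a sketch, not a prescribed approach: `retry`, `FlakyLoad`, and the default delays are hypothetical names and values chosen for the example.

```python
import time


def retry(task, max_attempts=3, base_delay=0.1, sleep=time.sleep):
    """Call task(); on failure wait base_delay * 2**attempt, then try again.

    The last failure is re-raised once max_attempts is exhausted.
    """
    for attempt in range(max_attempts):
        try:
            return task()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))


class FlakyLoad:
    """Hypothetical pipeline step that fails a fixed number of times."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("transient failure")
        return "loaded"


if __name__ == "__main__":
    # Fails twice, succeeds on the third attempt.
    print(retry(FlakyLoad(2), base_delay=0.01))
```

Passing `sleep` as a parameter keeps the helper easy to test, since a test can record the delays instead of actually waiting.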