Big Data - amitbhilagude/userfullinks GitHub Wiki

  1. Big Data Technology Evolutions
    1. Teradata
    2. Apache Hadoop
    3. Apache Spark
    4. Data Lake
    5. Data Warehouse
    6. AI and Machine learning using ML Studio, TensorFlow, etc.
  2. Big Data Scenarios
    1. Analytics on captured clicks and logs
    2. IoT devices such as RFID tags
  3. What is big data
    1. Data volumes in the hundreds of TBs or PBs
    2. Data is processed in parallel using technologies such as Hadoop and Spark.
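
The split, process, and combine pattern behind that parallel processing can be sketched in a few lines. This is only an illustration on one machine using threads; Hadoop and Spark distribute the same kind of work across many machines. The function names and the sample log lines are made up for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Split the input into roughly equal chunks, one per worker."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words(partition):
    """Map step: count words within a single partition."""
    counts = Counter()
    for line in partition:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: combine the partial counts of two partitions."""
    return left + right


def parallel_word_count(records, num_partitions=4):
    partitions = split_into_partitions(records, num_partitions)
    # Each partition is counted by its own worker (threads here, nodes in a cluster).
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words, partitions))
    return reduce(merge_counts, partial_counts, Counter())


# Hypothetical click log, standing in for a much larger data set.
log_lines = [
    "click home",
    "click cart",
    "view home",
    "click home",
]
totals = parallel_word_count(log_lines, num_partitions=2)
```

Each worker only sees its own partition, which is why the approach scales: adding workers shrinks the share of data each one has to handle.
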
  4. Data Transformation and Pipelines
    1. Transformation is performed in one of two ways
      1. ETL: Extract, Transform and Load
        1. Extract data from the source and store it in the data lake
        2. Transform the data using Azure Data Factory or Databricks
        3. Load the data into a destination such as SQL Data Warehouse
      2. ELT: Extract, Load and Transform
        1. Extract data from the source and store it in the data lake
        2. Load the data into a destination such as SQL Data Warehouse
        3. Transform the data using Azure Data Factory or Databricks
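
The only difference between the two approaches is the order of the last two steps. A minimal sketch, assuming a plain dictionary as a stand-in for the warehouse and made-up sales rows (none of these names come from Azure Data Factory or Databricks):

```python
def extract(source_rows):
    """Extract: copy raw rows from the source into a staging area (the data lake)."""
    return list(source_rows)


def transform(rows):
    """Transform: trim and uppercase the country, cast the amount to a float."""
    return [
        {"country": row["country"].strip().upper(), "amount": float(row["amount"])}
        for row in rows
    ]


def run_etl(source_rows, warehouse):
    """ETL: transform before loading, so the warehouse only holds clean rows."""
    staged = extract(source_rows)
    cleaned = transform(staged)
    warehouse["sales"] = cleaned


def run_elt(source_rows, warehouse):
    """ELT: load the raw rows first, then transform them inside the destination."""
    staged = extract(source_rows)
    warehouse["sales_raw"] = staged
    warehouse["sales"] = transform(warehouse["sales_raw"])


# Hypothetical source data with the kind of mess a transform step cleans up.
source = [{"country": " in ", "amount": "10.5"}, {"country": "us", "amount": "3"}]

etl_warehouse = {}
run_etl(source, etl_warehouse)

elt_warehouse = {}
run_elt(source, elt_warehouse)
```

Both runs end with the same cleaned `sales` table. The ELT run also keeps `sales_raw` in the destination, so the raw data stays available there for later re-transformation.
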
  5. Common Big data technologies
    1. Hadoop

      1. Cloud providers that offer their own managed Hadoop service
        1. HDInsight in Azure
        2. EMR in AWS
        3. Dataproc in GCP
    2. Spark

      1. Successor to Hadoop MapReduce that processes data sets in memory instead of on disk. If you use Spark SQL, those data sets are held in DataFrames.
      2. Databricks is the most commonly used platform in the Spark space.
      3. Azure offers Azure Databricks, which is built on top of the Databricks platform.
    3. Kafka

    4. Hive

    5. Presto

  6. Big Data Roles
    1. Data Analyst: Focuses on analyzing data and understanding it in a business context
    2. Data Engineer: Coder who builds data pipelines and transforms data using code or visual tools
    3. Data Steward: Puts governance on data
    4. Data Scientist: AI and machine learning expert
    5. Machine Learning Engineer: Handles administrative tasks for machine learning
    6. Chief Data/Analytics Officer: In charge of data and business decisions
  7. Data Lake
    1. Repository for storing Big Data.
  8. Parquet file
    1. Newer, columnar file format used as an alternative to CSV
    2. Heavily used for storing files in a data lake because the compressed files take up less space
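
One reason a columnar format such as Parquet tends to compress well is that values from the same column are stored together, so repeated values end up next to each other. The sketch below is not Parquet's actual on-disk format. It only pivots rows into columns and applies a simple run-length encoding, one of the encodings Parquet uses. The function names and sample records are made up for illustration.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot row-oriented records into one list of values per column."""
    names = list(rows[0])
    return {name: [row[name] for row in rows] for name in names}


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


# Hypothetical device readings, as they would appear row by row in a CSV file.
rows = [
    {"device": "rfid-1", "status": "ok"},
    {"device": "rfid-1", "status": "ok"},
    {"device": "rfid-1", "status": "ok"},
    {"device": "rfid-2", "status": "ok"},
    {"device": "rfid-2", "status": "error"},
]

columns = to_columns(rows)
encoded_device = run_length_encode(columns["device"])
encoded_status = run_length_encode(columns["status"])
```

Here the five `status` values shrink to two pairs. In a row-by-row layout the same values are interleaved with other fields, so runs like this are much harder to exploit.
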