Course :: Data Science and Big Data

Outline

Module 1 :: Introduction to Big Data, Data Science and Machine Learning

  • Understanding Big Data
    • Definition and types (structured, semi-structured, unstructured)
    • Characteristics: The 5 V’s (Volume, Velocity, Variety, Veracity, Value)
    • Big Data Ecosystem
      • Tools and technologies
      • Architecture and storage
  • Overview of Data Science
    • Definition and importance of data science
    • Key components: data collection, cleaning, analysis, and visualization
  • Introduction to Machine Learning
    • Basics of Machine Learning
      • Supervised, unsupervised, and reinforcement learning
      • Model evaluation and metrics (accuracy, precision, recall, F1 score)
    • Supervised Learning Algorithms
      • Regression and classification (linear regression, decision trees)
    • Unsupervised Learning Algorithms
      • Clustering and dimensionality reduction (K-means, PCA)
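The evaluation metrics named in this module (accuracy, precision, recall, F1 score) can be illustrated with a small sketch. It assumes a binary classification task; the function name `classification_metrics` and the label lists are made up for illustration and are not part of the course material.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells relative to the positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

With these labels there are 3 true positives, 1 false positive, 1 false negative, and 3 true negatives, so all four metrics come out to 0.75. On imbalanced data the four values usually diverge, which is why they are taught together.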

Module 2 :: Data Collection and Preprocessing

  • Data Collection Techniques
    • Web scraping, APIs, and databases
    • Considerations for big data sources
  • Data Cleaning and Preparation
    • Handling missing data, outliers, and duplicates
    • Data normalization and transformation techniques
  • Workshop :: Data Wrangling with Python and Pandas
    • Pandas, NumPy, and data manipulation techniques
    • Implementing a data-cleaning pipeline
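A data-cleaning pipeline of the kind covered in this module might look like the sketch below. It is one possible arrangement, not the workshop's actual solution: the `clean` function, the column names, and the sample values are hypothetical, and median imputation, 1.5 × IQR clipping, and min-max scaling are just one common choice for each step.

```python
import numpy as np
import pandas as pd


def clean(df: pd.DataFrame, column: str) -> pd.DataFrame:
    """Drop duplicates, fill missing values, clip outliers, min-max scale."""
    out = df.drop_duplicates().copy()
    # Missing data: fill with the column median.
    out[column] = out[column].fillna(out[column].median())
    # Outliers: clip to the 1.5 * IQR fences.
    q1, q3 = out[column].quantile([0.25, 0.75])
    iqr = q3 - q1
    out[column] = out[column].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    # Normalization: rescale to the [0, 1] range.
    lo, hi = out[column].min(), out[column].max()
    out[column] = (out[column] - lo) / (hi - lo) if hi > lo else 0.0
    return out.reset_index(drop=True)


# Hypothetical raw data: one duplicate row, one missing value, one outlier.
raw = pd.DataFrame({
    "id": [1, 2, 2, 3, 4, 5],
    "value": [10.0, 12.0, 12.0, np.nan, 11.0, 500.0],
})
cleaned = clean(raw, "value")
print(cleaned)
```

The order of the steps matters: duplicates are removed first so they do not skew the median, and outliers are clipped before scaling so a single extreme value does not compress every other value toward zero.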

Module 3 :: Data Analysis and Visualization

  • Exploratory Data Analysis (EDA)
    • Analyzing datasets and identifying trends
    • Basic statistics for data analysis
  • Visualization Techniques
    • Creating impactful visualizations for data insights
  • Workshop :: EDA and creating visualizations for a sample dataset
    • Python
    • Matplotlib, Seaborn, and Plotly
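The "basic statistics" step of EDA can be sketched with Python's standard library alone, before any plotting library is involved. The `summarize` helper and the `daily_visits` values are hypothetical and serve only as an illustration.

```python
import statistics


def summarize(values):
    """Return a small set of descriptive statistics for a numeric sample."""
    ordered = sorted(values)
    return {
        "count": len(ordered),
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "stdev": statistics.stdev(ordered),  # sample standard deviation
        "min": ordered[0],
        "max": ordered[-1],
    }


# Hypothetical sample, for illustration only.
daily_visits = [120, 135, 128, 150, 142, 160, 155]
summary = summarize(daily_visits)
for name, value in summary.items():
    print(f"{name}: {value}")
```

Comparing the mean with the median is a quick first check for skew; in a Pandas workflow, `DataFrame.describe()` reports a similar summary per column.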

Module 4 :: Machine Learning with Big Data

  • Scalable Machine Learning Approaches
    • Using Spark MLlib and TensorFlow for large datasets
  • Feature Engineering and Selection
    • Techniques to enhance model performance in big data contexts
  • Model Evaluation and Hyperparameter Tuning
    • Cross-validation, grid search, and automated tuning tools
  • Workshop :: Applying Spark MLlib for big data analysis