Course :: Data Science and Big Data
Outline
Module 1 :: Introduction to Big Data, Data Science and Machine Learning
- Understanding Big Data
- Definition and types (structured, semi-structured, unstructured)
- Characteristics: The 5 V's (Volume, Velocity, Variety, Veracity, Value)
- Big Data Ecosystem
- Tools and technologies
- Architecture and storage
- Overview of Data Science
- Definition and importance of data science
- Key components: data collection, cleaning, analysis, and visualization
- Introduction to Machine Learning
- Basics of Machine Learning
- Supervised, unsupervised, and reinforcement learning
- Model evaluation and metrics (accuracy, precision, recall, F1 score)
- Supervised Learning Algorithms
- Regression and classification (linear regression, decision trees)
- Unsupervised Learning Algorithms
- Clustering and dimensionality reduction (K-means, PCA)
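As a small illustration of the evaluation metrics listed in this module (accuracy, precision, recall, F1 score), here is a minimal sketch in plain Python for a binary classifier. The function name `binary_metrics` and the sample labels are hypothetical and chosen only for this example; they are not taken from the course material.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")

    pairs = list(zip(y_true, y_pred))
    # Count the confusion-matrix cells for the chosen positive class.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical labels, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With these sample labels there are 3 true positives, 1 false positive, 1 false negative and 3 true negatives, so all four metrics come out at 0.75.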
Module 2 :: Data Collection and Preprocessing
- Data Collection Techniques
- Web scraping, APIs, and databases
- Considerations for big data sources
- Data Cleaning and Preparation
- Handling missing data, outliers, and duplicates
- Data normalization and transformation techniques
- Workshop :: Data Wrangling with Python and Pandas
- Pandas, NumPy, and data manipulation techniques
- Implementing a data-cleaning pipeline
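The data-cleaning pipeline named in the workshop could look roughly like the sketch below. It assumes pandas is installed. The function name `clean`, the column names, and the specific choices (median imputation, clipping at 1.5 times the interquartile range, min-max normalization) are illustrative assumptions, not the workshop's actual steps.

```python
import pandas as pd


def clean(df: pd.DataFrame, numeric_cols) -> pd.DataFrame:
    """Drop duplicate rows, then impute, clip and normalize numeric columns."""
    out = df.drop_duplicates().copy()
    for col in numeric_cols:
        # Missing data: fill gaps with the column median (one common choice).
        out[col] = out[col].fillna(out[col].median())

        # Outliers: clip values to the 1.5 * IQR fences.
        q1, q3 = out[col].quantile(0.25), out[col].quantile(0.75)
        iqr = q3 - q1
        out[col] = out[col].clip(lower=q1 - 1.5 * iqr, upper=q3 + 1.5 * iqr)

        # Normalization: min-max scale to [0, 1]; constant columns become 0.
        lo, hi = out[col].min(), out[col].max()
        out[col] = (out[col] - lo) / (hi - lo) if hi > lo else 0.0
    return out.reset_index(drop=True)


# Hypothetical sample data: one duplicate row, one missing age, one outlier.
raw = pd.DataFrame({
    "age": [25, 25, 31, None, 40, 120],
    "city": ["A", "A", "B", "C", "B", "C"],
})
cleaned = clean(raw, numeric_cols=["age"])
print(cleaned)
```

Clipping keeps the outlier row but limits its influence. Dropping such rows instead is an equally valid choice, depending on the dataset.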
Module 3 :: Data Analysis and Visualization
- Exploratory Data Analysis (EDA)
- Analyzing datasets and identifying trends
- Basic statistics for data analysis
- Visualization Techniques
- Creating impactful visualizations for data insights
- Workshop :: EDA and creating visualizations for a sample dataset
- Python
- Matplotlib, Seaborn, and Plotly
Module 4 :: Machine Learning with Big Data
- Scalable Machine Learning Approaches
- Using Spark MLlib and TensorFlow for large datasets
- Feature Engineering and Selection
- Techniques to enhance model performance in big data contexts
- Model Evaluation and Hyperparameter Tuning
- Cross-validation, grid search, and automated tuning tools
- Workshop :: Applying Spark MLlib for big data analysis
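To make the cross-validation and grid search item above concrete, here is a minimal sketch in plain Python rather than Spark MLlib. It uses a one-parameter ridge regression so that the whole procedure fits in a few lines. All function names, the data, and the candidate `lambdas` are hypothetical and chosen only for illustration.

```python
def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds over n items."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size


def fit_ridge_1d(xs, ys, lam):
    """Closed-form ridge fit of y = w * x (no intercept), penalty lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the model y = w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grid_search_cv(xs, ys, lambdas, k=3):
    """Return the lam with the lowest mean validation MSE, plus all scores."""
    scores = {}
    for lam in lambdas:
        fold_errors = []
        for train, val in k_fold_indices(len(xs), k):
            w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
            fold_errors.append(mse(w, [xs[i] for i in val], [ys[i] for i in val]))
        scores[lam] = sum(fold_errors) / len(fold_errors)
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical noise-free data following y = 2x, so no penalty is needed.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
best_lam, scores = grid_search_cv(xs, ys, lambdas=[0.0, 1.0, 10.0], k=3)
print(best_lam, scores)
```

Because the sample data is noise-free, the search selects the unpenalized model. On real data, a non-zero penalty often wins. The same loop structure (for each candidate, average a validation score over folds, keep the best) is what library tuning tools automate.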