Data Science and Machine Learning - CameronAuler/python-devops GitHub Wiki
Python is a leading language for data science and machine learning, offering powerful libraries for numerical computing, data manipulation, visualization, and machine learning algorithms.
Table of Contents
- NumPy (Numerical Computing)
- Pandas (Data Manipulation)
- Matplotlib & Seaborn (Data Visualization)
- Scikit-learn (Machine Learning)
NumPy (Numerical Computing)
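NumPy (Numerical Python) provides efficient multi-dimensional arrays and mathematical operations. For numerical work it is typically much faster than plain Python lists because operations are vectorized (they run in compiled code over whole arrays) and array elements share a single data type stored in contiguous memory.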
Install NumPy
pip install numpy
Creating NumPy Arrays
Use case: Efficiently storing and processing large numerical datasets.
import numpy as np
arr = np.array([1, 2, 3, 4, 5]) # Create 1D array
print(arr)
arr2d = np.array([[1, 2, 3], [4, 5, 6]]) # Create 2D array
print(arr2d)
# Output:
[1 2 3 4 5]
[[1 2 3]
[4 5 6]]
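Once an array exists, its layout can be inspected through a few attributes. The following is a minimal sketch that reuses the 2D array from above; the printed element type is omitted because it depends on the platform.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr2d.shape)  # (2, 3): 2 rows, 3 columns
print(arr2d.ndim)   # 2: number of dimensions
print(arr2d.size)   # 6: total number of elements
print(arr2d.dtype)  # integer type; exact width depends on the platform
```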
NumPy Array Operations
Use case: Fast vectorized operations for scientific computing.
arr = np.array([1, 2, 3, 4])
print(arr + 10) # Element-wise addition
print(arr * 2) # Element-wise multiplication
print(np.mean(arr)) # Compute mean
print(np.sqrt(arr)) # Compute square root
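To illustrate what "vectorized" means here, the sketch below computes the same squares twice: once with a Python-level loop and once with a single array expression. The variable names are illustrative only, and no timing is claimed; on large arrays the vectorized form is usually the faster one.

```python
import numpy as np

values = np.arange(1, 6)  # array holding 1, 2, 3, 4, 5

# Pure-Python loop: the interpreter handles one element at a time
squared_loop = [v * v for v in values.tolist()]

# Vectorized: one expression applied to the whole array in compiled code
squared_vec = values ** 2

print(squared_loop)          # [1, 4, 9, 16, 25]
print(squared_vec.tolist())  # [1, 4, 9, 16, 25]
```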
Pandas (Data Manipulation)
Pandas provides DataFrames, a powerful table-like structure for handling structured data. It supports loading, cleaning, transforming, and analyzing data.
Installing Pandas
pip install pandas
Creating & Viewing a DataFrame
Use case: Organizing tabular data (like spreadsheets & databases).
import pandas as pd
data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
df = pd.DataFrame(data)
print(df) # View the DataFrame
print(df.head()) # View first 5 rows
df.info() # Display column types and non-null counts (prints directly; returns None)
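Transforming data often means selecting a column and deriving a new one from it. A minimal sketch, assuming the same sample DataFrame; the column name `AgeNextYear` is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]})

ages = df["Age"]                    # Select one column as a Series
df["AgeNextYear"] = df["Age"] + 1   # Hypothetical derived column

print(df.shape)                     # (3, 3)
print(df["AgeNextYear"].tolist())   # [26, 31, 36]
```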
Loading & Saving Data
Use case: Handling real-world datasets from files, APIs, and databases.
df = pd.read_csv("data.csv") # Load from CSV
df.to_csv("output.csv", index=False) # Save to CSV
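The same calls also accept file-like objects, which makes it possible to try a load/save round trip without touching the disk. This sketch uses in-memory buffers and made-up sample data in place of `data.csv` and `output.csv`.

```python
import io
import pandas as pd

csv_text = "Name,Age\nAlice,25\nBob,30\n"  # Made-up sample data

# Load from an in-memory text buffer instead of a file path
df = pd.read_csv(io.StringIO(csv_text))

# Save back to another buffer; index=False omits the row index column
buffer = io.StringIO()
df.to_csv(buffer, index=False)
round_trip = buffer.getvalue()

print(df["Age"].tolist())  # [25, 30]
```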
Filtering & Aggregating Data
Use case: Cleaning and analyzing structured data.
filtered_df = df[df["Age"] > 28] # Filter rows where Age > 28
print(df["Age"].mean()) # Compute average age
print(df.groupby("Age").count()) # Count rows for each distinct Age value
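Grouping is usually more informative on a categorical column than on a numeric one. The sketch below adds a hypothetical `Team` column (not part of the original data) and computes the mean age per team alongside a simple row filter.

```python
import pandas as pd

# Made-up data; the "Team" column is an assumption for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Team": ["A", "B", "A", "B"],
    "Age": [25, 30, 35, 40],
})

older = df[df["Age"] > 28]                           # Boolean-mask filter
mean_age_by_team = df.groupby("Team")["Age"].mean()  # One mean per team

print(older["Name"].tolist())       # ['Bob', 'Charlie', 'Dana']
print(mean_age_by_team.to_dict())   # {'A': 30.0, 'B': 35.0}
```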
Matplotlib & Seaborn (Data Visualization)
- Matplotlib: Low-level plotting library for creating graphs.
- Seaborn: High-level statistical visualization built on Matplotlib.
Installing Matplotlib & Seaborn
pip install matplotlib seaborn
Creating Basic Plots with Matplotlib
Use case: Creating line graphs, bar charts, scatter plots.
import matplotlib.pyplot as plt
x = [1, 2, 3, 4]
y = [10, 20, 25, 30]
plt.plot(x, y, marker="o", linestyle="-")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Line Plot Example")
plt.show()
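On a server or in a CI job there is often no display for `plt.show()` to open. One possible approach, sketched here under that assumption, is to select the non-interactive `Agg` backend and write the figure out as a PNG. The in-memory buffer stands in for a real file path such as a hypothetical `line_plot.png`.

```python
import io

import matplotlib
matplotlib.use("Agg")  # Non-interactive backend; assumes no display is available
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# Object-oriented API: work with explicit Figure and Axes objects
fig, ax = plt.subplots()
ax.plot(x, y, marker="o", linestyle="-")
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Line Plot Example")

# Write the figure as PNG bytes; a file path would work here as well
png_buffer = io.BytesIO()
fig.savefig(png_buffer, format="png")
plt.close(fig)  # Release the figure's memory
```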
Creating Statistical Plots with Seaborn
Use case: Creating statistical visualizations with simple syntax.
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
df = pd.DataFrame({"Category": ["A", "B", "C"], "Values": [10, 20, 30]})
sns.barplot(x="Category", y="Values", data=df)
plt.show()
Scikit-learn (Machine Learning)
Scikit-learn provides machine learning algorithms for:
- Classification (e.g., decision trees, logistic regression)
- Regression (e.g., linear regression, random forests)
- Clustering (e.g., K-Means, DBSCAN)
- Feature selection and model evaluation
Installing Scikit-learn
pip install scikit-learn
Training a Machine Learning Model (Linear Regression)
Use case: Predicting trends (e.g., house prices, sales forecasting).
from sklearn.linear_model import LinearRegression
import numpy as np
# Training data (X: features, y: target)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])
# Create and train model
model = LinearRegression()
model.fit(X, y)
# Predict new values
predictions = model.predict([[6], [7]])
print(predictions)
# Output:
[12. 14.]
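The predictions follow from the line the model has learned. As a quick check, the fitted slope and intercept can be read from the model's `coef_` and `intercept_` attributes; for this training data they should come out at roughly 2 and 0, up to floating-point error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # Learned weight for the single feature
intercept = model.intercept_  # Learned bias term
r2 = model.score(X, y)        # R^2 on the training data

print(slope, intercept, r2)   # Approximately 2, 0 and 1
```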
Evaluating Model Performance
Use case: Measuring prediction error for a regression model (lower is better).
from sklearn.metrics import mean_squared_error
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
mse = mean_squared_error(y_true, y_pred)
print("Mean Squared Error:", mse)
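Mean squared error is the average of the squared differences between true and predicted values. The sketch below works through the same numbers by hand with NumPy and compares the result with scikit-learn's; the root of the MSE (RMSE) is included because it is expressed in the same units as the target.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

squared_errors = (y_true - y_pred) ** 2  # [0.25, 0.25, 0.0, 1.0]
mse_manual = squared_errors.mean()       # 1.5 / 4 = 0.375
mse_sklearn = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse_manual)               # Same units as the target

print(mse_manual, mse_sklearn)           # 0.375 0.375
```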
Loading Prebuilt Datasets
Use case: Quickly experimenting with prebuilt datasets.
from sklearn.datasets import load_iris
iris = load_iris()
print(iris.data[:5]) # Print first 5 rows
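A prebuilt dataset makes it easy to try a full train-and-evaluate loop. This is one possible sketch, not a recommended pipeline: the choice of logistic regression, the 25% test split, and `random_state=0` are all assumptions made for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
print(iris.data.shape)  # (150, 4): 150 samples, 4 features

# Hold out 25% of the samples for testing, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0, stratify=iris.target
)

clf = LogisticRegression(max_iter=1000)  # Higher max_iter to help convergence
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)     # Fraction of correct test predictions
print("Test accuracy:", accuracy)
```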