Data Science and Machine Learning - CameronAuler/python-devops GitHub Wiki

Python is a leading language for data science and machine learning, offering powerful libraries for numerical computing, data manipulation, visualization, and machine learning algorithms.

NumPy (Numerical Computing)
Pandas (Data Manipulation)
Matplotlib & Seaborn (Data Visualization)
Scikit-learn (Machine Learning)

NumPy (Numerical Computing)

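NumPy (Numerical Python) provides efficient multi-dimensional arrays and mathematical operations. For numerical work it is typically much faster than Python lists because operations are vectorized (executed in compiled code over whole arrays) and elements are stored contiguously with a single data type.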

Installing NumPy

pip install numpy

Creating NumPy Arrays

Use case: Efficiently storing and processing large numerical datasets.

import numpy as np

arr = np.array([1, 2, 3, 4, 5])  # Create 1D array
print(arr)

arr2d = np.array([[1, 2, 3], [4, 5, 6]])  # Create 2D array (a list of rows)
print(arr2d)

# Output:
[1 2 3 4 5]
[[1 2 3]
 [4 5 6]]
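
As a small follow-up to the arrays above, the sketch below inspects their shape and number of dimensions, which is often the first sanity check after creating or loading an array. The variable names mirror the example above; nothing beyond NumPy itself is assumed.

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5])
arr2d = np.array([[1, 2, 3], [4, 5, 6]])

print(arr.shape)    # (5,)   -> one axis with 5 elements
print(arr2d.shape)  # (2, 3) -> 2 rows, 3 columns
print(arr2d.ndim)   # 2
print(arr2d[1, 2])  # 6      -> row index 1, column index 2
```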

NumPy Array Operations

Use case: Fast vectorized operations for scientific computing.

arr = np.array([1, 2, 3, 4])

print(arr + 10)  # Element-wise addition
print(arr * 2)  # Element-wise multiplication
print(np.mean(arr))  # Compute mean
print(np.sqrt(arr))  # Compute square root
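
To make "vectorized" concrete, here is a minimal sketch that computes the same result twice: once with an explicit Python loop over a list, and once with a single NumPy expression. The names `values`, `doubled_list`, and `doubled_arr` are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4]

# Pure-Python approach: an explicit loop over a list
doubled_list = [v * 2 for v in values]

# NumPy approach: one vectorized expression over the whole array
arr = np.array(values)
doubled_arr = arr * 2

print(doubled_list)          # [2, 4, 6, 8]
print(doubled_arr.tolist())  # [2, 4, 6, 8]

# Broadcasting: the scalar 10 is applied to every element
print((arr + 10).tolist())   # [11, 12, 13, 14]
print(float(np.mean(arr)))   # 2.5
```

Both approaches give the same values; the NumPy version avoids the per-element Python loop, which is where the speed advantage on large arrays comes from.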

Pandas (Data Manipulation)

Pandas provides DataFrames, a powerful table-like structure for handling structured data. It supports loading, cleaning, transforming, and analyzing data.

Installing Pandas

pip install pandas

Creating & Viewing a DataFrame

Use case: Organizing tabular data (like spreadsheets & databases).

import pandas as pd

data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 35]}
df = pd.DataFrame(data)

print(df)  # View the DataFrame
print(df.head())  # View the first 5 rows (here, all 3)
df.info()  # Display column types and non-null counts (prints directly, returns None)

Loading & Saving Data

Use case: Handling real-world datasets from files, APIs, and databases.

df = pd.read_csv("data.csv")  # Load from CSV
df.to_csv("output.csv", index=False)  # Save to CSV
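
The snippet above assumes a file named data.csv already exists. As a self-contained sketch of the same round trip, the example below writes a small DataFrame to a temporary directory and reads it back. The file name `people.csv` and the sample rows are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob"], "Age": [25, 30]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "people.csv")  # hypothetical file name
    df.to_csv(path, index=False)  # index=False omits the row-index column
    loaded = pd.read_csv(path)    # read the file back into a new DataFrame

print(list(loaded.columns))    # ['Name', 'Age']
print(loaded["Age"].tolist())  # [25, 30]
```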

Filtering & Aggregating Data

Use case: Cleaning and analyzing structured data.

filtered_df = df[df["Age"] > 28]  # Filter rows where Age > 28
print(df["Age"].mean())  # Compute average age
print(df.groupby("Age").count())  # Count non-null values in each column per Age value
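
Grouping is usually more informative when the grouping column has repeated values. The sketch below adds a hypothetical "City" column (not part of the original example) to show filtering with a boolean mask and a per-group mean.

```python
import pandas as pd

# Hypothetical data: the "City" column is an assumption added for illustration
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie", "Dana"],
    "Age": [25, 30, 35, 40],
    "City": ["Paris", "Paris", "Oslo", "Oslo"],
})

older = df[df["Age"] > 28]                   # boolean mask keeps matching rows
mean_age = df.groupby("City")["Age"].mean()  # one mean per city

print(older["Name"].tolist())  # ['Bob', 'Charlie', 'Dana']
print(mean_age.to_dict())      # {'Oslo': 37.5, 'Paris': 27.5}
```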

Matplotlib & Seaborn (Data Visualization)

Matplotlib: Low-level plotting library for creating graphs.
Seaborn: High-level statistical visualization built on Matplotlib.

Installing Matplotlib & Seaborn

pip install matplotlib seaborn

Creating Basic Plots with Matplotlib

Use case: Creating line graphs, bar charts, scatter plots.

import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

plt.plot(x, y, marker="o", linestyle="-")
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Line Plot Example")
plt.show()
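
The use case above also mentions bar charts and scatter plots. A minimal sketch of both, reusing the same x and y values, might look like the following. It assumes a non-interactive backend ("Agg") and renders to an in-memory buffer so that it runs without a display; in an interactive session, plt.show() could be used instead.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so this runs without a display
import matplotlib.pyplot as plt

x = [1, 2, 3, 4]
y = [10, 20, 25, 30]

# Two side-by-side axes: a bar chart and a scatter plot of the same data
fig, (ax_bar, ax_scatter) = plt.subplots(1, 2, figsize=(8, 3))
ax_bar.bar(x, y)
ax_bar.set_title("Bar Chart")
ax_scatter.scatter(x, y)
ax_scatter.set_title("Scatter Plot")

fig.tight_layout()

# Render the figure as PNG bytes in memory instead of writing a file
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
```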

Creating Statistical Plots with Seaborn

Use case: Creating statistical visualizations with simple syntax.

import matplotlib.pyplot as plt  # Needed for plt.show()
import seaborn as sns
import pandas as pd

df = pd.DataFrame({"Category": ["A", "B", "C"], "Values": [10, 20, 30]})

sns.barplot(x="Category", y="Values", data=df)
plt.show()

Scikit-learn (Machine Learning)

Scikit-learn provides machine learning algorithms for:

Classification (e.g., decision trees, logistic regression)
Regression (e.g., linear regression, random forests)
Clustering (e.g., K-Means, DBSCAN)
Feature selection and model evaluation

Installing Scikit-learn

pip install scikit-learn

Training a Machine Learning Model (Linear Regression)

Use case: Predicting trends (e.g., house prices, sales forecasting).

from sklearn.linear_model import LinearRegression
import numpy as np

# Training data (X: features, y: target)
X = np.array([[1], [2], [3], [4], [5]])  # 2D: one row per sample, one column per feature
y = np.array([2, 4, 6, 8, 10])

# Create and train model
model = LinearRegression()
model.fit(X, y)

# Predict new values
predictions = model.predict([[6], [7]])
print(predictions)

# Output:
[12. 14.]
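
To see why the predictions come out as 12 and 14, it can help to inspect the fitted line. The sketch below refits the same data and reads the learned slope and intercept from the model's `coef_` and `intercept_` attributes; since the training data follows y = 2x exactly, the slope should be close to 2 and the intercept close to 0 (up to floating-point rounding).

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1], [2], [3], [4], [5]])  # shape (5, 1): 5 samples, 1 feature
y = np.array([2, 4, 6, 8, 10])

model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # one coefficient per feature
intercept = model.intercept_

print(X.shape)                              # (5, 1)
print(bool(np.isclose(slope, 2.0)))         # True
print(bool(np.isclose(intercept, 0.0)))     # True
```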

Evaluating Model Performance

Use case: Measuring prediction error for regression models (lower is better).

from sklearn.metrics import mean_squared_error

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

mse = mean_squared_error(y_true, y_pred)
print("Mean Squared Error:", mse)
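
Mean squared error is the average of the squared differences between true and predicted values. As a worked check of the example above, this sketch computes the same quantity by hand with NumPy: the squared errors are 0.25, 0.25, 0 and 1, and their mean is 1.5 / 4 = 0.375.

```python
import numpy as np

y_true = np.array([3, -0.5, 2, 7])
y_pred = np.array([2.5, 0.0, 2, 8])

squared_errors = (y_true - y_pred) ** 2  # element-wise squared differences
mse = squared_errors.mean()              # average over all samples

print(squared_errors.tolist())  # [0.25, 0.25, 0.0, 1.0]
print(float(mse))               # 0.375
```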

Loading Prebuilt Datasets

Use case: Quickly experimenting with prebuilt datasets.

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data[:5])  # Print first 5 rows
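
Beyond the raw feature matrix, the object returned by load_iris also carries the feature names, the target labels, and the class names. A short sketch of inspecting them:

```python
from sklearn.datasets import load_iris

iris = load_iris()

print(iris.data.shape)              # (150, 4): 150 samples, 4 features
print(iris.target.shape)            # (150,): one class label per sample
print(iris.feature_names)           # names of the 4 measurement columns
print(iris.target_names.tolist())   # ['setosa', 'versicolor', 'virginica']
```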