Python Cheatsheet for Data Science

Hi, I'm Elle (ellecoding). Here's a Python cheatsheet I've put together for Data Science and Machine Learning. Hope it helps!

Basic Syntax

  • Assignment: Assign values to variables using =.

    x = 5  # Assigns the value 5 to the variable x
    y = 10  # Assigns the value 10 to the variable y
    
    

Data Structures

  • Lists: Lists are ordered collections that can contain elements of different types. Create a list using square brackets.

    lst = [1, 2, 3, 4]  # Creates a list with elements 1, 2, 3, 4
    
    
  • Tuples: Tuples are ordered, immutable collections. Create a tuple using parentheses.

    tpl = (1, 2, 3, 4)  # Creates a tuple with elements 1, 2, 3, 4
    
    
  • Dictionaries: Dictionaries are collections of key-value pairs. Create a dictionary using curly braces.

    dct = {'a': 1, 'b': 2}  # Creates a dictionary with keys 'a' and 'b' and corresponding values 1 and 2
    
    
  • Sets: Sets are unordered collections of unique elements. Create a set using curly braces.

    st = {1, 2, 3, 4}  # Creates a set with unique elements 1, 2, 3, 4
    
    
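  • Common Operations: A quick usage sketch, reusing the variables defined above, for accessing and updating these structures.

    lst.append(5)  # Adds 5 to the end of the list
    first = lst[0]  # Accesses the first element of the list
    a_val, b_val = tpl[0], tpl[1]  # Reads tuple elements by index (tuples cannot be modified)
    dct['c'] = 3  # Adds a new key-value pair to the dictionary
    value = dct.get('a')  # Looks up the value for key 'a' (returns None if the key is missing)
    st.add(5)  # Adds a new element to the set
    has_three = 3 in st  # Checks whether 3 is a member of the set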

Data Import

  • CSV: Read CSV files using pandas, a powerful data manipulation library.

    import pandas as pd
    df = pd.read_csv("file.csv")  # Reads data from a CSV file into a DataFrame
    
    
  • Excel: Import Excel files using pandas.

    df = pd.read_excel("file.xlsx")  # Reads data from an Excel file into a DataFrame
    
    
  • Database: Connect to a database and query data using SQLAlchemy.

    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///database.sqlite')
    df = pd.read_sql_query("SELECT * FROM table", engine)  # Executes a SQL query and returns the result as a DataFrame
    
    

Data Manipulation

  • pandas: pandas is the core library for data manipulation and analysis. Typical tasks include filtering rows, creating new columns, and sorting.

    import pandas as pd
    df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]})
    df = df[df['column1'] > 1]  # Filters rows where column1 is greater than 1
    df['new_column'] = df['column1'] * 2  # Creates a new column by multiplying column1 by 2
    df = df.sort_values(by='column1', ascending=False)  # Sorts the DataFrame by column1 in descending order
    
    

Data Visualization

  • Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations. Create plots using matplotlib.

    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3], [4, 5, 6])  # Creates a simple line plot
    plt.xlabel('X-axis label')
    plt.ylabel('Y-axis label')
    plt.title('Plot title')
    plt.show()  # Displays the plot
    
    
  • Seaborn: Seaborn is a statistical data visualization library based on matplotlib. Create advanced visualizations using seaborn.

    import seaborn as sns
    sns.scatterplot(x='column1', y='column2', data=df)  # Creates a scatter plot with seaborn
    plt.show()  # Displays the plot
    
    

Statistical Analysis

  • Descriptive Statistics: Summarize and understand your data using pandas.

    df.describe()  # Provides summary statistics of the DataFrame
    
    
  • Correlation: Measure the strength and direction of the relationship between two variables using pandas.

    df.corr()  # Calculates the correlation matrix of the DataFrame
    
    
  • t-test: Compare means between two groups using scipy.stats.

    from scipy.stats import ttest_ind
    t_stat, p_val = ttest_ind(df['column1'], df['column2'])  # Performs a t-test to compare means between two columns
    
    
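  • Interpreting the t-test: A minimal sketch of reading the result; the 0.05 cutoff below is a common convention, not a hard rule.

    alpha = 0.05
    if p_val < alpha:
        print("Reject the null hypothesis: the group means differ significantly")
    else:
        print("Fail to reject the null hypothesis: no significant difference detected")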

Machine Learning

  • Scikit-learn: Scikit-learn is a powerful library for machine learning in Python.
    • Linear Regression: Perform linear regression to understand the relationship between variables.

      from sklearn.linear_model import LinearRegression
      model = LinearRegression()
      model.fit(df[['column2', 'column3']], df['column1'])  # Fits a linear regression model
      print(model.coef_)  # Prints the coefficients of the model
      
      
    • Random Forest: Build a random forest model, an ensemble learning method.

      from sklearn.ensemble import RandomForestClassifier
      model = RandomForestClassifier()
      model.fit(df[['column2', 'column3']], df['column1'])  # Trains a random forest model
      print(model.feature_importances_)  # Prints the feature importances of the model
      
      
    • Cross-Validation: Use scikit-learn for cross-validation, a technique to assess the performance of a model.

      from sklearn.model_selection import cross_val_score
      scores = cross_val_score(model, df[['column2', 'column3']], df['column1'], cv=5)  # Performs 5-fold cross-validation
      print(scores)  # Prints the cross-validation scores
      
      
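    • Train/Test Split: A complementary sketch for evaluating on a held-out test set; it reuses the same hypothetical column names as the examples above.

      from sklearn.model_selection import train_test_split
      from sklearn.ensemble import RandomForestClassifier

      X_train, X_test, y_train, y_test = train_test_split(
          df[['column2', 'column3']], df['column1'], test_size=0.2, random_state=42)
      model = RandomForestClassifier()
      model.fit(X_train, y_train)  # Trains on the training split only
      print(model.score(X_test, y_test))  # Accuracy on the held-out test split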

Deep Learning

  • TensorFlow: TensorFlow is an end-to-end open-source platform for machine learning.

    import tensorflow as tf
    from tensorflow.keras import layers
    
    input_shape = train_data.shape[1]  # Number of input features
    model = tf.keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(input_shape,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the neural network
    
    

Natural Language Processing (NLP)

  • Text Preprocessing: Preprocess text data using nltk and re.

    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    import re
    
    text = "Sample text for NLP"
    text = text.lower()  # Converts text to lowercase
    text = re.sub(r'\W', ' ', text)  # Removes non-alphanumeric characters
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Removes stopwords
    
    
  • Text Representation: Represent text using TF-IDF with scikit-learn.

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)  # Transforms text data into TF-IDF features
    
    
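  • Inspecting TF-IDF Features: A short follow-up sketch; texts is assumed to be a plain list of strings.

    texts = ["sample text for nlp", "another sample document"]
    X = tfidf.fit_transform(texts)
    print(tfidf.get_feature_names_out())  # Vocabulary learned from the corpus (scikit-learn 1.0+)
    print(X.shape)  # (number of documents, number of features)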

Time Series Analysis

  • ARIMA: Fit and forecast time series data using ARIMA models with statsmodels.

    from statsmodels.tsa.arima.model import ARIMA
    model = ARIMA(ts_data, order=(1, 1, 1))
    model_fit = model.fit()  # Fits an ARIMA model to the time series data
    forecast = model_fit.forecast(steps=12)  # Forecasts the next 12 time periods
    
    

Big Data Technologies

  • PySpark: Use PySpark for big data processing and analysis.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("example").getOrCreate()
    df = spark.read.csv("file.csv", header=True, inferSchema=True)  # Reads a CSV file into a Spark DataFrame
    df.show()  # Displays the first few rows of the DataFrame
    
    

Data Engineering

  • ETL Processes: Extract, Transform, Load (ETL) processes using pandas.

    import pandas as pd
    
    # Extract
    df = pd.read_csv("file.csv")
    
    # Transform
    df['new_column'] = df['column1'] * 2
    
    # Load
    df.to_csv("transformed_file.csv", index=False)  # Saves the transformed DataFrame to a new CSV file
    
    

Cloud Computing for Data Science

  • AWS S3: Interact with AWS S3 using boto3.

    import boto3
    
    s3 = boto3.client('s3')
    s3.download_file('bucket_name', 'file_key', 'local_file')  # Downloads a file from an S3 bucket
    s3.upload_file('local_file', 'bucket_name', 'file_key')  # Uploads a file to an S3 bucket
    
    

Data Visualization

  • Tableau: Use tableau-api-lib to interact with Tableau Server.

    from tableau_api_lib import TableauServerConnection
    from tableau_api_lib.utils.querying import get_projects_dataframe
    
    connection = TableauServerConnection()
    connection.sign_in()
    projects_df = get_projects_dataframe(connection)  # Retrieves a DataFrame of projects from Tableau Server
    connection.sign_out()
    
    

Additional Topics

  • Version Control: Use Git for version control.

    git init  # Initializes a new Git repository
    git add .  # Adds all files to the staging area
    git commit -m "Initial commit"  # Commits the files to the repository
    
    
  • Documentation: Use Jupyter Notebooks for interactive coding and documentation.

    # Jupyter Notebook cell
    print("Hello, world!")  # Prints a message
    
    

Helpful Libraries

  • Data Manipulation: pandas, numpy for data manipulation and numerical operations.
  • Data Visualization: matplotlib, seaborn for creating various types of plots.
  • Machine Learning: scikit-learn, xgboost for building and evaluating models.
  • Deep Learning: tensorflow, keras, pytorch for creating neural networks.
  • NLP: nltk, spacy, gensim for processing and analyzing text data.
  • Big Data: pyspark, dask for handling large datasets.
  • Cloud Computing: boto3, google-cloud for interacting with cloud services.
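  • Example (xgboost): xgboost is listed above but not demonstrated elsewhere, so here is a minimal, hedged sketch of a gradient-boosted classifier; it assumes X_train, X_test, y_train, and y_test from an earlier split.

    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    model.fit(X_train, y_train)  # Trains the gradient-boosted tree model
    print(model.score(X_test, y_test))  # Accuracy on the held-out test split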

Python Programming Cheatsheet for Data Scientists - Part 2 (Intermediate/Advanced)

Advanced Data Structures

  • Deque: A double-ended queue optimized for inserting and removing elements from both ends. Use the deque class from the collections module.

    • Code: Demonstrates how to append elements to both ends of a deque.
    from collections import deque
    dq = deque([1, 2, 3, 4])
    dq.appendleft(0)  # Adds 0 to the beginning
    dq.append(5)  # Adds 5 to the end
    
    
  • Named Tuples: Tuples with named fields for better readability. Use the namedtuple class from the collections module.

    • Code: Creates a named tuple Point and initializes it.
    from collections import namedtuple
    Point = namedtuple('Point', ['x', 'y'])
    p = Point(1, 2)  # Creates a named tuple Point with values x=1 and y=2
    
    
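  • Using These Structures: A brief follow-up on the two structures above.

    • Code: Removes elements from both ends of the deque and reads the named tuple fields by name.
    first = dq.popleft()  # Removes and returns the leftmost element (0)
    last = dq.pop()  # Removes and returns the rightmost element (5)
    print(p.x, p.y)  # Accesses named tuple fields by name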

Advanced Data Manipulation

  • GroupBy and Aggregate: Group data and perform aggregate operations using pandas. This is useful for summarizing data.

    • Code: Groups a DataFrame by a column and sums another column.
    import pandas as pd
    df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'values': [1, 2, 3, 4]})
    grouped = df.groupby('category').agg({'values': 'sum'})  # Groups by 'category' and sums 'values'
    
    
  • Merge and Join: Combine DataFrames using various types of joins to enrich data.

    • Code: Demonstrates an inner join between two DataFrames on a common column.
    df1 = pd.DataFrame({'id': [1, 2, 3], 'value1': ['A', 'B', 'C']})
    df2 = pd.DataFrame({'id': [1, 2, 4], 'value2': ['X', 'Y', 'Z']})
    merged_df = pd.merge(df1, df2, on='id', how='inner')  # Inner join on 'id'
    
    
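    • Code (other joins): The same merge with a left and an outer join, for comparison with the inner join above.
    left_df = pd.merge(df1, df2, on='id', how='left')  # Keeps every row from df1; unmatched value2 becomes NaN
    outer_df = pd.merge(df1, df2, on='id', how='outer')  # Keeps all rows from both DataFrames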

Advanced Data Visualization

  • Plotly: Create interactive plots with the Plotly library. It supports a wide range of visualizations.

    • Code: Creates an interactive scatter plot.
    import plotly.express as px
    fig = px.scatter(df, x='column1', y='column2', color='category')  # Creates an interactive scatter plot
    fig.show()  # Displays the plot
    
    
  • Bokeh: Create interactive visualizations for modern web browsers using Bokeh.

    • Code: Generates a simple scatter plot and displays it in a Jupyter Notebook.
    from bokeh.plotting import figure, show
    from bokeh.io import output_notebook
    output_notebook()
    
    p = figure(title="Bokeh Plot", x_axis_label='X', y_axis_label='Y')
    p.circle([1, 2, 3], [4, 5, 6], size=10, color="navy", alpha=0.5)  # Creates a scatter plot
    show(p)  # Displays the plot in a Jupyter Notebook
    
    

Advanced Machine Learning

  • Hyperparameter Tuning: Optimize model parameters using Grid Search and Random Search to improve model performance.

    • Code: Uses GridSearchCV to perform grid search for hyperparameter tuning.
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
    grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)  # Performs grid search to find the best parameters
    print(grid_search.best_params_)  # Prints the best parameters found
    
    
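    • Code (random search): A hedged sketch of the randomized alternative, using the RandomizedSearchCV already imported above; it samples 10 parameter combinations instead of trying every one, and assumes the same X_train and y_train.
    from scipy.stats import randint

    param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 10, 20, 30]}
    random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
    random_search.fit(X_train, y_train)  # Evaluates 10 randomly sampled parameter combinations
    print(random_search.best_params_)  # Prints the best parameters found
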
  • Ensemble Learning: Combine predictions from multiple models to improve overall performance.

    • Code: Demonstrates the use of VotingClassifier to combine predictions from a random forest and gradient boosting model.
    from sklearn.ensemble import VotingClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    
    model1 = RandomForestClassifier(n_estimators=100)
    model2 = GradientBoostingClassifier(n_estimators=100)
    ensemble_model = VotingClassifier(estimators=[('rf', model1), ('gb', model2)], voting='soft')
    ensemble_model.fit(X_train, y_train)  # Trains the ensemble model
    
    

Deep Learning

  • Transfer Learning: Use pre-trained models to leverage existing knowledge for new tasks. This is useful for tasks with limited data.

    • Code: Uses the pre-trained VGG16 model from TensorFlow and adds custom layers for a new task.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, Flatten
    
    base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    x = base_model.output
    x = Flatten()(x)
    x = Dense(1024, activation='relu')(x)
    predictions = Dense(10, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    
    for layer in base_model.layers:
        layer.trainable = False  # Freeze the layers of the base model
    
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the model
    
    

Natural Language Processing (NLP)

  • Word Embeddings: Represent words in a continuous vector space using gensim. This helps in capturing semantic relationships between words.

    • Code: Trains a Word2Vec model and retrieves the vector for a specific word.
    from gensim.models import Word2Vec
    
    sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)  # Trains a Word2Vec model
    vector = model.wv['sentence']  # Gets the vector representation of the word 'sentence'
    
    
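    • Code (similarity): A small follow-up showing the semantic-similarity queries mentioned above; with this tiny toy corpus the neighbors are not meaningful, but the call is the same on real data.
    similar_words = model.wv.most_similar('sentence', topn=3)  # Words closest to 'sentence' in the embedding space
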
  • Transformers: Use transformer models like BERT for advanced NLP tasks with transformers. These models are powerful for understanding context in text.

    • Code: Loads a pre-trained BERT model and tokenizer, and processes input text.
    from transformers import BertTokenizer, BertModel
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model(**inputs)  # Gets the BERT model outputs for the input text
    
    
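    • Code (using the outputs): A hedged sketch of pulling embeddings out of the BERT outputs; mean-pooling the token vectors is one simple, if crude, way to get a sentence-level representation.
    last_hidden = outputs.last_hidden_state  # Token-level embeddings, shape (batch_size, num_tokens, 768)
    sentence_embedding = last_hidden.mean(dim=1)  # Averages the token vectors into a single sentence vector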

Time Series Analysis

  • Seasonal Decomposition: Decompose time series data into trend, seasonal, and residual components using statsmodels. This helps in understanding underlying patterns in the data.

    • Code: Decomposes a time series and plots the components.
    from statsmodels.tsa.seasonal import seasonal_decompose
    
    result = seasonal_decompose(ts_data, model='additive')
    result.plot()  # Plots the decomposed components of the time series
    
    
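    • Code (components): Accesses the individual components mentioned above once the decomposition has run.
    trend = result.trend  # Extracted trend component
    seasonal = result.seasonal  # Extracted seasonal component
    residual = result.resid  # What remains after removing trend and seasonality
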
  • Prophet: Perform time series forecasting with Prophet, a tool designed for reliable forecasting of time series data.

    • Code: Fits a Prophet model to time series data and makes future predictions.
    from prophet import Prophet  # The package was renamed from fbprophet to prophet
    
    df = pd.DataFrame({'ds': dates, 'y': values})  # Prepares the data for Prophet
    model = Prophet()
    model.fit(df)  # Fits the Prophet model to the data
    future = model.make_future_dataframe(periods=365)
    forecast = model.predict(future)  # Predicts future values
    model.plot(forecast)  # Plots the forecast
    
    

Big Data Technologies

  • Dask: Scale data processing workflows using Dask, which allows parallel computing.

    • Code: Reads a large CSV file into a Dask DataFrame and performs filtering.
    import dask.dataframe as dd
    
    df = dd.read_csv('large_file.csv')  # Reads a large CSV file into a Dask DataFrame
    df = df[df['column1'] > 1]  # Filters rows where column1 is greater than 1
    df.compute()  # Triggers the computation and returns a pandas DataFrame
    
    
  • Hadoop with PySpark: Interact with Hadoop using PySpark for distributed data processing.

    • Code: Reads a CSV file from HDFS into a Spark DataFrame and displays it.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("example").getOrCreate()
    df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)  # Reads a CSV file from HDFS
    df.show()  # Displays the first few rows of the DataFrame
    
    

Data Engineering

  • Airflow: Automate and schedule workflows using Apache Airflow. It helps in managing complex data pipelines.

    • Code: Defines a simple Airflow DAG with a Python task.
    from airflow import DAG
    from airflow.operators.python import PythonOperator  # The python_operator module path is deprecated
    from datetime import datetime
    
    def my_function():
        print("Hello from Airflow!")
    
    default_args = {'owner': 'airflow', 'start_date': datetime(2023, 1, 1)}
    dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')
    
    task = PythonOperator(task_id='my_task', python_callable=my_function, dag=dag)
    
    

Cloud Computing for Data Science

  • Google Cloud Platform (GCP): Interact with Google Cloud Storage using google-cloud-storage.

    • Code: Downloads and uploads a file to Google Cloud Storage.
    from google.cloud import storage
    
    client = storage.Client()
    bucket = client.get_bucket('my_bucket')
    blob = bucket.blob('file.txt')
    blob.download_to_filename('local_file.txt')  # Downloads a file from GCS
    blob.upload_from_filename('local_file.txt')  # Uploads a file to GCS
    
    

Data Visualization

  • Interactive Dashboards: Create interactive dashboards with Dash, a web-based application framework.

    • Code: Defines a simple Dash app with a bar plot.
    import dash
    from dash import dcc, html  # dash_core_components and dash_html_components now live inside dash
    
    app = dash.Dash(__name__)
    
    app.layout = html.Div([
        dcc.Graph(
            id='example-graph',
            figure={
                'data': [{'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                         {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': 'NYC'}],
                'layout': {'title': 'Dash Data Visualization'}
            }
        )
    ])
    
    if __name__ == '__main__':
        app.run(debug=True)  # Runs the Dash app (app.run_server is deprecated in recent Dash releases)
    
    

Containerization and Orchestration

  • Docker: Containerize applications for consistent environments using Docker.

    • Code: Example of a Dockerfile to create a Docker image.
    # Dockerfile example
    FROM python:3.8-slim
    
    WORKDIR /app
    COPY requirements.txt requirements.txt
    RUN pip install -r requirements.txt
    COPY . .
    
    CMD ["python", "app.py"]
    
    
  • Kubernetes: Orchestrate containerized applications using kubectl, a command-line tool for Kubernetes.

    • Code: Deploys an application to a Kubernetes cluster and lists running pods.
    kubectl create -f deployment.yaml  # Deploys an application to a Kubernetes cluster
    kubectl get pods  # Lists all running pods in the cluster
    
    

Additional Topics

  • API Development: Create RESTful APIs with Flask, a micro web framework for Python.

    • Code: Defines a simple Flask API with a prediction endpoint.
    from flask import Flask, jsonify, request
    
    app = Flask(__name__)
    
    @app.route('/predict', methods=['POST'])
    def predict():
        data = request.get_json(force=True)
        prediction = model.predict(data['input'])  # 'model' is assumed to be a model trained or loaded elsewhere
        return jsonify({'prediction': prediction.tolist()})
    
    if __name__ == '__main__':
        app.run(debug=True)  # Runs the Flask app
    
    
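    • Code (calling the API): A hedged client-side sketch using the requests library; it assumes the app above is running locally on Flask's default port and that the model expects a list of feature rows.
    import requests

    response = requests.post('http://localhost:5000/predict', json={'input': [[1.0, 2.0, 3.0]]})
    print(response.json())  # e.g. {'prediction': [...]}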

Helpful Libraries

  • Advanced Data Manipulation: pandas, numpy, dask for complex data transformations and handling large datasets.
  • Advanced Data Visualization: plotly, bokeh, dash for creating interactive and complex visualizations.
  • Advanced Machine Learning: scikit-learn, xgboost, lightgbm for building and tuning complex models.
  • Deep Learning: tensorflow, keras, pytorch for advanced neural network architectures and transfer learning.
  • NLP: nltk, spacy, gensim, transformers for sophisticated text processing and language models.
  • Big Data: pyspark, dask, hadoop for big data processing and distributed computing.
  • Cloud Computing: boto3, google-cloud-storage, azure-storage-blob for cloud-based data storage and processing.