# Python Cheatsheet for Data Science

Hi, I'm Elle (ellecoding). Here's a Python cheatsheet for data science and machine learning that I've made. Hope it helps!
## Basic Syntax

- Assignment: Assign values to variables using `=`.

```python
x = 5   # Assigns the value 5 to the variable x
y = 10  # Assigns the value 10 to the variable y
```
## Data Structures

- Lists: Lists are ordered collections that can contain elements of different types. Create a list using square brackets.

```python
lst = [1, 2, 3, 4]  # Creates a list with elements 1, 2, 3, 4
```

- Tuples: Tuples are ordered, immutable collections. Create a tuple using parentheses.

```python
tpl = (1, 2, 3, 4)  # Creates a tuple with elements 1, 2, 3, 4
```

- Dictionaries: Dictionaries are collections of key-value pairs. Create a dictionary using curly braces.

```python
dct = {'a': 1, 'b': 2}  # Creates a dictionary with keys 'a' and 'b' mapped to values 1 and 2
```

- Sets: Sets are unordered collections of unique elements. Create a set using curly braces.

```python
st = {1, 2, 3, 4}  # Creates a set with unique elements 1, 2, 3, 4
```
## Data Import

- CSV: Read CSV files using `pandas`, a powerful data manipulation library.

```python
import pandas as pd

df = pd.read_csv("file.csv")  # Reads data from a CSV file into a DataFrame
```

- Excel: Import Excel files using `pandas`.

```python
df = pd.read_excel("file.xlsx")  # Reads data from an Excel file into a DataFrame
```

- Database: Connect to a database and query data using `SQLAlchemy`.

```python
from sqlalchemy import create_engine

engine = create_engine('sqlite:///database.sqlite')
df = pd.read_sql_query("SELECT * FROM table", engine)  # Executes a SQL query and returns the result as a DataFrame
```
## Data Manipulation

- pandas: `pandas` is a powerful library for data manipulation and analysis. Perform various data manipulation tasks using `pandas`.

```python
import pandas as pd

df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]})
df = df[df['column1'] > 1]                          # Filters rows where column1 is greater than 1
df['new_column'] = df['column1'] * 2                # Creates a new column by multiplying column1 by 2
df = df.sort_values(by='column1', ascending=False)  # Sorts the DataFrame by column1 in descending order
```
## Data Visualization

- Matplotlib: `Matplotlib` is a comprehensive library for creating static, animated, and interactive visualizations. Create plots using `matplotlib`.

```python
import matplotlib.pyplot as plt

plt.plot([1, 2, 3], [4, 5, 6])  # Creates a simple line plot
plt.xlabel('X-axis label')
plt.ylabel('Y-axis label')
plt.title('Plot title')
plt.show()                      # Displays the plot
```

- Seaborn: `Seaborn` is a statistical data visualization library based on `matplotlib`. Create advanced visualizations using `seaborn`.

```python
import matplotlib.pyplot as plt
import seaborn as sns

sns.scatterplot(x='column1', y='column2', data=df)  # Creates a scatter plot with seaborn
plt.show()                                          # Displays the plot
```
## Statistical Analysis

- Descriptive Statistics: Summarize and understand your data using `pandas`.

```python
df.describe()  # Provides summary statistics of the DataFrame
```
- Correlation: Measure the strength and direction of the relationship between two variables using `pandas`.

```python
df.corr()  # Calculates the correlation matrix of the DataFrame
```
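The full matrix is easier to read as a plot. A minimal heatmap sketch with `seaborn`; the `numeric_only` flag (pandas ≥ 1.5) keeps non-numeric columns from raising an error:

```python
import matplotlib.pyplot as plt
import seaborn as sns

corr = df.corr(numeric_only=True)               # Correlation matrix of the numeric columns only
sns.heatmap(corr, annot=True, cmap='coolwarm')  # Renders the matrix as an annotated heatmap
plt.show()
```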
- t-test: Compare means between two groups using `scipy.stats`.

```python
from scipy.stats import ttest_ind

t_stat, p_val = ttest_ind(df['column1'], df['column2'])  # Performs a t-test to compare means between two columns
```
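A common way to read the result is to compare the p-value against a significance level; a short sketch continuing from the `p_val` above:

```python
alpha = 0.05  # Conventional significance level
if p_val < alpha:
    print(f"p = {p_val:.4f}: reject the null hypothesis that the means are equal")
else:
    print(f"p = {p_val:.4f}: fail to reject the null hypothesis")
```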
## Machine Learning

- Scikit-learn: `Scikit-learn` is a powerful library for machine learning in Python.

- Linear Regression: Perform linear regression to understand the relationship between variables.

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression()
model.fit(df[['column2', 'column3']], df['column1'])  # Fits a linear regression model
print(model.coef_)                                    # Prints the coefficients of the model
```
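Once fitted, the model can score new rows. A small sketch with a hypothetical observation (the feature values here are made up):

```python
import pandas as pd

new_data = pd.DataFrame({'column2': [7], 'column3': [8]})  # Hypothetical new observation
print(model.predict(new_data))                             # Predicted value of column1 for the new row
```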
- Random Forest: Build a random forest model, an ensemble learning method.

```python
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier()
model.fit(df[['column2', 'column3']], df['column1'])  # Trains a random forest model
print(model.feature_importances_)                     # Prints the feature importances of the model
```
- Cross-Validation: Use `scikit-learn` for cross-validation, a technique to assess the performance of a model.

```python
from sklearn.model_selection import cross_val_score

scores = cross_val_score(model, df[['column2', 'column3']], df['column1'], cv=5)  # Performs 5-fold cross-validation
print(scores)                                                                     # Prints the cross-validation scores
```
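The fold scores are usually summarized by their mean and spread:

```python
print(f"Mean score: {scores.mean():.3f} (std: {scores.std():.3f})")  # Summarizes the five fold scores
```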
## Deep Learning

- TensorFlow: `TensorFlow` is an end-to-end open-source platform for machine learning.

```python
import tensorflow as tf
from tensorflow.keras import layers

# num_features, train_data, and train_labels are placeholders for your own data
model = tf.keras.Sequential([
    layers.Dense(64, activation='relu', input_shape=(num_features,)),
    layers.Dense(64, activation='relu'),
    layers.Dense(1)
])
model.compile(optimizer='adam', loss='mean_squared_error')
model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the neural network
```
## Natural Language Processing (NLP)

- Text Preprocessing: Preprocess text data using `nltk` and `re`.

```python
import re

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

text = "Sample text for NLP"
text = text.lower()              # Converts text to lowercase
text = re.sub(r'\W', ' ', text)  # Replaces non-alphanumeric characters with spaces
stop_words = set(stopwords.words('english'))
text = ' '.join([word for word in text.split() if word not in stop_words])  # Removes stopwords
```
- Text Representation: Represent text using TF-IDF with `scikit-learn`.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
X = tfidf.fit_transform(texts)  # Transforms a list of documents into a sparse TF-IDF matrix
```
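To inspect what the vectorizer learned (`get_feature_names_out` requires scikit-learn ≥ 1.0):

```python
print(tfidf.get_feature_names_out()[:10])  # First ten vocabulary terms
print(X.shape)                             # (number of documents, vocabulary size)
```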
## Time Series Analysis

- ARIMA: Fit and forecast time series data using ARIMA models with `statsmodels`.

```python
from statsmodels.tsa.arima.model import ARIMA  # Modern API; the old tsa.arima_model module was removed

model = ARIMA(ts_data, order=(1, 1, 1))
model_fit = model.fit()                  # Fits an ARIMA model to the time series data
forecast = model_fit.forecast(steps=12)  # Forecasts the next 12 time periods
```
## Big Data Technologies

- PySpark: Use `PySpark` for big data processing and analysis.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("file.csv", header=True, inferSchema=True)  # Reads a CSV file into a Spark DataFrame
df.show()                                                       # Displays the first few rows of the DataFrame
```
## Data Engineering

- ETL Processes: Extract, Transform, Load (ETL) processes using `pandas`.

```python
import pandas as pd

# Extract
df = pd.read_csv("file.csv")
# Transform
df['new_column'] = df['column1'] * 2
# Load
df.to_csv("transformed_file.csv", index=False)  # Saves the transformed DataFrame to a new CSV file
```
## Cloud Computing for Data Science

- AWS S3: Interact with AWS S3 using `boto3`.

```python
import boto3

s3 = boto3.client('s3')
s3.download_file('bucket_name', 'file_key', 'local_file')  # Downloads a file from an S3 bucket
s3.upload_file('local_file', 'bucket_name', 'file_key')    # Uploads a file to an S3 bucket
```
## Data Visualization

- Tableau: Use `tableau-api-lib` to interact with Tableau Server.

```python
from tableau_api_lib import TableauServerConnection
from tableau_api_lib.utils.querying import get_projects_dataframe

connection = TableauServerConnection()  # Assumes connection details (server URL, credentials) are supplied via a config
connection.sign_in()
projects_df = get_projects_dataframe(connection)  # Retrieves a DataFrame of projects from Tableau Server
connection.sign_out()
```
## Additional Topics

- Version Control: Use `Git` for version control.

```sh
git init                        # Initializes a new Git repository
git add .                       # Adds all files to the staging area
git commit -m "Initial commit"  # Commits the files to the repository
```
- Documentation: Use Jupyter Notebooks for interactive coding and documentation.

```python
# Jupyter Notebook cell
print("Hello, world!")  # Prints a message
```
## Helpful Libraries

- Data Manipulation: `pandas`, `numpy` for data manipulation and numerical operations.
- Data Visualization: `matplotlib`, `seaborn` for creating various types of plots.
- Machine Learning: `scikit-learn`, `xgboost` for building and evaluating models.
- Deep Learning: `tensorflow`, `keras`, `pytorch` for creating neural networks.
- NLP: `nltk`, `spacy`, `gensim` for processing and analyzing text data.
- Big Data: `pyspark`, `dask` for handling large datasets.
- Cloud Computing: `boto3`, `google-cloud` for interacting with cloud services.
# Python Programming Cheatsheet for Data Scientists - Part 2 (Intermediate/Advanced)

## Advanced Data Structures
- Deque: A double-ended queue optimized for inserting and removing elements from both ends. Use the `deque` class from the `collections` module.
  - Code: Demonstrates how to append elements to both ends of a deque.

```python
from collections import deque

dq = deque([1, 2, 3, 4])
dq.appendleft(0)  # Adds 0 to the beginning
dq.append(5)      # Adds 5 to the end
```
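Removal mirrors insertion and is also O(1) at both ends; continuing the block above:

```python
first = dq.popleft()  # Removes and returns the leftmost element (0)
last = dq.pop()       # Removes and returns the rightmost element (5)
```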
- Named Tuples: Tuples with named fields for better readability. Use the `namedtuple` class from the `collections` module.
  - Code: Creates a named tuple `Point` and initializes it.

```python
from collections import namedtuple

Point = namedtuple('Point', ['x', 'y'])
p = Point(1, 2)  # Creates a Point with values x=1 and y=2
```
## Advanced Data Manipulation

- GroupBy and Aggregate: Group data and perform aggregate operations using `pandas`. This is useful for summarizing data.
  - Code: Groups a DataFrame by a column and sums another column.

```python
import pandas as pd

df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'values': [1, 2, 3, 4]})
grouped = df.groupby('category').agg({'values': 'sum'})  # Groups by 'category' and sums 'values'
```
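Several aggregates can be computed in one pass; for example, on the same `df`:

```python
summary = df.groupby('category')['values'].agg(['sum', 'mean', 'count'])  # Sum, mean, and count per category
print(summary)
```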
- Merge and Join: Combine DataFrames using various types of joins to enrich data.
  - Code: Demonstrates an inner join between two DataFrames on a common column.

```python
df1 = pd.DataFrame({'id': [1, 2, 3], 'value1': ['A', 'B', 'C']})
df2 = pd.DataFrame({'id': [1, 2, 4], 'value2': ['X', 'Y', 'Z']})
merged_df = pd.merge(df1, df2, on='id', how='inner')  # Inner join on 'id'; keeps only ids 1 and 2
```
## Advanced Data Visualization

- Plotly: Create interactive plots with the `Plotly` library. It supports a wide range of visualizations.
  - Code: Creates an interactive scatter plot.

```python
import plotly.express as px

fig = px.scatter(df, x='column1', y='column2', color='category')  # Creates an interactive scatter plot
fig.show()                                                        # Displays the plot
```
- Bokeh: Create interactive visualizations for modern web browsers using `Bokeh`.
  - Code: Generates a simple scatter plot and displays it in a Jupyter Notebook.

```python
from bokeh.plotting import figure, show
from bokeh.io import output_notebook

output_notebook()
p = figure(title="Bokeh Plot", x_axis_label='X', y_axis_label='Y')
p.circle([1, 2, 3], [4, 5, 6], size=10, color="navy", alpha=0.5)  # Creates a scatter plot
show(p)                                                           # Displays the plot in a Jupyter Notebook
```
## Advanced Machine Learning

- Hyperparameter Tuning: Optimize model parameters using Grid Search and Random Search to improve model performance.
  - Code: Uses `GridSearchCV` to perform grid search for hyperparameter tuning (a Random Search sketch follows below).

```python
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
grid_search.fit(X_train, y_train)  # Performs grid search to find the best parameters
print(grid_search.best_params_)    # Prints the best parameters found
```
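Since the bullet above also mentions Random Search, here is a minimal `RandomizedSearchCV` counterpart; it reuses the `X_train`/`y_train` placeholders, and the distributions are only illustrative:

```python
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier

param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 10, 20, 30]}
random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=20, cv=5, random_state=42)
random_search.fit(X_train, y_train)  # Samples 20 random parameter combinations
print(random_search.best_params_)    # Prints the best parameters found
```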
- Ensemble Learning: Combine predictions from multiple models to improve overall performance.
  - Code: Demonstrates the use of `VotingClassifier` to combine predictions from a random forest and a gradient boosting model.

```python
from sklearn.ensemble import VotingClassifier, RandomForestClassifier, GradientBoostingClassifier

model1 = RandomForestClassifier(n_estimators=100)
model2 = GradientBoostingClassifier(n_estimators=100)
ensemble_model = VotingClassifier(estimators=[('rf', model1), ('gb', model2)], voting='soft')
ensemble_model.fit(X_train, y_train)  # Trains the ensemble model
```
## Deep Learning

- Transfer Learning: Use pre-trained models to leverage existing knowledge for new tasks. This is useful for tasks with limited data.
  - Code: Uses the pre-trained VGG16 model from `TensorFlow` and adds custom layers for a new task.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Dense, Flatten

base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
x = base_model.output
x = Flatten()(x)
x = Dense(1024, activation='relu')(x)
predictions = Dense(10, activation='softmax')(x)
model = Model(inputs=base_model.input, outputs=predictions)

for layer in base_model.layers:
    layer.trainable = False  # Freezes the layers of the base model

model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the new layers (train_data/train_labels are placeholders)
```
## Natural Language Processing (NLP)

- Word Embeddings: Represent words in a continuous vector space using `gensim`. This helps in capturing semantic relationships between words.
  - Code: Trains a Word2Vec model and retrieves the vector for a specific word.

```python
from gensim.models import Word2Vec

sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]  # Tokenized toy corpus
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)  # Trains a Word2Vec model
vector = model.wv['sentence']  # Gets the vector representation of the word 'sentence'
```
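Embeddings also support similarity queries; continuing the toy example above:

```python
print(model.wv.most_similar('sentence', topn=3))  # Words closest to 'sentence' in the embedding space
```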
- Transformers: Use transformer models like BERT for advanced NLP tasks with `transformers`. These models are powerful for understanding context in text.
  - Code: Loads a pre-trained BERT model and tokenizer, and processes input text.

```python
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
inputs = tokenizer("Hello, world!", return_tensors="pt")
outputs = model(**inputs)  # Gets the BERT model outputs for the input text
```
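One common pattern is to take the final hidden state of the `[CLS]` token as a sentence embedding; continuing from `outputs` above:

```python
last_hidden = outputs.last_hidden_state  # Shape: (batch, tokens, hidden_size), i.e. (1, n, 768) for bert-base
cls_embedding = last_hidden[:, 0, :]     # Hidden state of the [CLS] token
print(cls_embedding.shape)               # torch.Size([1, 768])
```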
## Time Series Analysis

- Seasonal Decomposition: Decompose time series data into trend, seasonal, and residual components using `statsmodels`. This helps in understanding underlying patterns in the data.
  - Code: Decomposes a time series and plots the components.

```python
from statsmodels.tsa.seasonal import seasonal_decompose

result = seasonal_decompose(ts_data, model='additive')  # ts_data is a placeholder series with a DatetimeIndex
result.plot()  # Plots the decomposed components of the time series
```
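The individual components are also available as attributes of the result:

```python
trend = result.trend        # Long-run trend component
seasonal = result.seasonal  # Repeating seasonal component
resid = result.resid        # What remains after removing trend and seasonality
```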
- Prophet: Perform time series forecasting with `Prophet`, a tool designed for reliable forecasting of time series data.
  - Code: Fits a Prophet model to time series data and makes future predictions.

```python
from prophet import Prophet  # The package was renamed from fbprophet to prophet in v1.0

df = pd.DataFrame({'ds': dates, 'y': values})  # Prophet expects columns 'ds' (dates) and 'y' (values)
model = Prophet()
model.fit(df)  # Fits the Prophet model to the data
future = model.make_future_dataframe(periods=365)
forecast = model.predict(future)  # Predicts future values
model.plot(forecast)  # Plots the forecast
```
## Big Data Technologies

- Dask: Scale data processing workflows using `Dask`, which allows parallel computing.
  - Code: Reads a large CSV file into a Dask DataFrame and performs filtering.

```python
import dask.dataframe as dd

df = dd.read_csv('large_file.csv')  # Reads a large CSV file into a Dask DataFrame (lazily)
df = df[df['column1'] > 1]          # Filters rows where column1 is greater than 1
result = df.compute()               # Triggers the computation and returns a pandas DataFrame
```
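Aggregations follow the same lazy pattern; a sketch assuming a hypothetical 'category' column in the file:

```python
means = df.groupby('category')['column1'].mean().compute()  # Plans the groupby lazily, then materializes it
```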
- Hadoop with PySpark: Interact with Hadoop using `PySpark` for distributed data processing.
  - Code: Reads a CSV file from HDFS into a Spark DataFrame and displays it.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("example").getOrCreate()
df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)  # Reads a CSV file from HDFS
df.show()  # Displays the first few rows of the DataFrame
```
## Data Engineering

- Airflow: Automate and schedule workflows using `Apache Airflow`. It helps in managing complex data pipelines.
  - Code: Defines a simple Airflow DAG with a Python task.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path; airflow.operators.python_operator is deprecated

def my_function():
    print("Hello from Airflow!")

default_args = {'owner': 'airflow', 'start_date': datetime(2023, 1, 1)}
dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')
task = PythonOperator(task_id='my_task', python_callable=my_function, dag=dag)
```
## Cloud Computing for Data Science

- Google Cloud Platform (GCP): Interact with Google Cloud Storage using `google-cloud-storage`.
  - Code: Downloads and uploads a file to Google Cloud Storage.

```python
from google.cloud import storage

client = storage.Client()
bucket = client.get_bucket('my_bucket')
blob = bucket.blob('file.txt')
blob.download_to_filename('local_file.txt')  # Downloads a file from GCS
blob.upload_from_filename('local_file.txt')  # Uploads a file to GCS
```
## Data Visualization

- Interactive Dashboards: Create interactive dashboards with `Dash`, a web-based application framework.
  - Code: Defines a simple Dash app with a bar plot.

```python
from dash import Dash, dcc, html  # Modern Dash API; dash_core_components/dash_html_components are deprecated

app = Dash(__name__)
app.layout = html.Div([
    dcc.Graph(
        id='example-graph',
        figure={
            'data': [
                {'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': 'NYC'},
            ],
            'layout': {'title': 'Dash Data Visualization'}
        }
    )
])

if __name__ == '__main__':
    app.run(debug=True)  # Runs the Dash app (use app.run_server on Dash < 2.7)
```
## Version Control and Collaboration

- Docker: Containerize applications for consistent environments using Docker.
  - Code: Example of a Dockerfile to create a Docker image.

```dockerfile
# Dockerfile example
FROM python:3.8-slim
WORKDIR /app
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
```
- Kubernetes: Orchestrate containerized applications using `kubectl`, a command-line tool for Kubernetes.
  - Code: Deploys an application to a Kubernetes cluster and lists running pods.

```sh
kubectl create -f deployment.yaml  # Deploys an application to a Kubernetes cluster
kubectl get pods                   # Lists all running pods in the cluster
```
## Additional Topics

- API Development: Create RESTful APIs with `Flask`, a micro web framework for Python.
  - Code: Defines a simple Flask API with a prediction endpoint (assumes a fitted `model` is in scope).

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json(force=True)
    prediction = model.predict(data['input'])  # `model` is a placeholder for a fitted estimator
    return jsonify({'prediction': prediction.tolist()})

if __name__ == '__main__':
    app.run(debug=True)  # Runs the Flask app
```
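A quick way to exercise the endpoint from Python, assuming the app is running locally and the model accepts a 2-D list of feature rows (the payload is hypothetical):

```python
import requests

response = requests.post('http://127.0.0.1:5000/predict', json={'input': [[1.0, 2.0]]})  # Hypothetical payload
print(response.json())  # e.g. {'prediction': [...]}
```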
## Helpful Libraries

- Advanced Data Manipulation: `pandas`, `numpy`, `dask` for complex data transformations and handling large datasets.
- Advanced Data Visualization: `plotly`, `bokeh`, `dash` for creating interactive and complex visualizations.
- Advanced Machine Learning: `scikit-learn`, `xgboost`, `lightgbm` for building and tuning complex models.
- Deep Learning: `tensorflow`, `keras`, `pytorch` for advanced neural network architectures and transfer learning.
- NLP: `nltk`, `spacy`, `gensim`, `transformers` for sophisticated text processing and language models.
- Big Data: `pyspark`, `dask`, `hadoop` for big data processing and distributed computing.
- Cloud Computing: `boto3`, `google-cloud-storage`, `azure-storage-blob` for cloud-based data storage and processing.