Python Cheatsheet for Data Science

Hi, I'm Elle (ellecoding). Here's a Python cheatsheet I've put together for Data Science and Machine Learning. Hope it helps!

Basic Syntax

  • Assignment: Assign values to variables using =.

    x = 5  # Assigns the value 5 to the variable x
    y = 10  # Assigns the value 10 to the variable y
    
    

Data Structures

  • Lists: Lists are ordered collections that can contain elements of different types. Create a list using square brackets.

    lst = [1, 2, 3, 4]  # Creates a list with elements 1, 2, 3, 4
    
    
  • Tuples: Tuples are ordered, immutable collections. Create a tuple using parentheses.

    tpl = (1, 2, 3, 4)  # Creates a tuple with elements 1, 2, 3, 4
    
    
  • Dictionaries: Dictionaries are collections of key-value pairs. Create a dictionary using curly braces.

    dct = {'a': 1, 'b': 2}  # Creates a dictionary with keys 'a' and 'b' and corresponding values 1 and 2
    
    
  • Sets: Sets are unordered collections of unique elements. Create a set using curly braces.

    st = {1, 2, 3, 4}  # Creates a set with unique elements 1, 2, 3, 4
    
    
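  • Common Operations: A quick usage sketch, reusing the variables defined above, for accessing and updating these structures.

    lst.append(5)  # Adds 5 to the end of the list
    first = lst[0]  # Accesses the first element of the list
    a_val, b_val = tpl[0], tpl[1]  # Reads tuple elements by index (tuples cannot be modified)
    dct['c'] = 3  # Adds a new key-value pair to the dictionary
    value = dct.get('a')  # Looks up the value for key 'a' (returns None if the key is missing)
    st.add(5)  # Adds a new element to the set
    has_three = 3 in st  # Checks whether 3 is a member of the set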

Data Import

  • CSV: Read CSV files using pandas, a powerful data manipulation library.

    import pandas as pd
    df = pd.read_csv("file.csv")  # Reads data from a CSV file into a DataFrame
    
    
  • Excel: Import Excel files using pandas.

    df = pd.read_excel("file.xlsx")  # Reads data from an Excel file into a DataFrame
    
    
  • Database: Connect to a database and query data using SQLAlchemy.

    from sqlalchemy import create_engine
    engine = create_engine('sqlite:///database.sqlite')
    df = pd.read_sql_query("SELECT * FROM table", engine)  # Executes a SQL query and returns the result as a DataFrame
    
    

Data Manipulation

  • pandas: pandas is the core library for data manipulation and analysis. Typical tasks include filtering rows, creating new columns, and sorting.

    import pandas as pd
    df = pd.DataFrame({'column1': [1, 2, 3], 'column2': [4, 5, 6]})
    df = df[df['column1'] > 1]  # Filters rows where column1 is greater than 1
    df['new_column'] = df['column1'] * 2  # Creates a new column by multiplying column1 by 2
    df = df.sort_values(by='column1', ascending=False)  # Sorts the DataFrame by column1 in descending order
    
    

Data Visualization

  • Matplotlib: Matplotlib is a comprehensive library for creating static, animated, and interactive visualizations. Create plots using matplotlib.

    import matplotlib.pyplot as plt
    plt.plot([1, 2, 3], [4, 5, 6])  # Creates a simple line plot
    plt.xlabel('X-axis label')
    plt.ylabel('Y-axis label')
    plt.title('Plot title')
    plt.show()  # Displays the plot
    
    
  • Seaborn: Seaborn is a statistical data visualization library based on matplotlib. Create advanced visualizations using seaborn.

    import seaborn as sns
    sns.scatterplot(x='column1', y='column2', data=df)  # Creates a scatter plot with seaborn
    plt.show()  # Displays the plot
    
    

Statistical Analysis

  • Descriptive Statistics: Summarize and understand your data using pandas.

    df.describe()  # Provides summary statistics of the DataFrame
    
    
  • Correlation: Measure the strength and direction of the relationship between two variables using pandas.

    df.corr()  # Calculates the correlation matrix of the DataFrame
    
    
  • t-test: Compare means between two groups using scipy.stats.

    from scipy.stats import ttest_ind
    t_stat, p_val = ttest_ind(df['column1'], df['column2'])  # Performs a t-test to compare means between two columns
    
    
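  • Interpreting the t-test: A minimal sketch of reading the result; the 0.05 cutoff below is a common convention, not a hard rule.

    alpha = 0.05
    if p_val < alpha:
        print("Reject the null hypothesis: the group means differ significantly")
    else:
        print("Fail to reject the null hypothesis: no significant difference detected")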

Machine Learning

  • Scikit-learn: Scikit-learn is a powerful library for machine learning in Python.
    • Linear Regression: Perform linear regression to understand the relationship between variables.

      from sklearn.linear_model import LinearRegression
      model = LinearRegression()
      model.fit(df[['column2', 'column3']], df['column1'])  # Fits a linear regression model
      print(model.coef_)  # Prints the coefficients of the model
      
      
    • Random Forest: Build a random forest model, an ensemble learning method.

      from sklearn.ensemble import RandomForestClassifier
      model = RandomForestClassifier()
      model.fit(df[['column2', 'column3']], df['column1'])  # Trains a random forest model
      print(model.feature_importances_)  # Prints the feature importances of the model
      
      
    • Cross-Validation: Use scikit-learn for cross-validation, a technique to assess the performance of a model.

      from sklearn.model_selection import cross_val_score
      scores = cross_val_score(model, df[['column2', 'column3']], df['column1'], cv=5)  # Performs 5-fold cross-validation
      print(scores)  # Prints the cross-validation scores
      
      
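    • Train/Test Split: A complementary sketch for evaluating on a held-out test set; it reuses the same hypothetical column names as the examples above.

      from sklearn.model_selection import train_test_split
      from sklearn.ensemble import RandomForestClassifier

      X_train, X_test, y_train, y_test = train_test_split(
          df[['column2', 'column3']], df['column1'], test_size=0.2, random_state=42)
      model = RandomForestClassifier()
      model.fit(X_train, y_train)  # Trains on the training split only
      print(model.score(X_test, y_test))  # Accuracy on the held-out test split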

Deep Learning

  • TensorFlow: TensorFlow is an end-to-end open-source platform for machine learning.

    import tensorflow as tf
    from tensorflow.keras import layers
    
    input_shape = train_data.shape[1]  # Number of input features
    model = tf.keras.Sequential([
        layers.Dense(64, activation='relu', input_shape=(input_shape,)),
        layers.Dense(64, activation='relu'),
        layers.Dense(1)
    ])
    model.compile(optimizer='adam', loss='mean_squared_error')
    model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the neural network
    
    

Natural Language Processing (NLP)

  • Text Preprocessing: Preprocess text data using nltk and re.

    import nltk
    nltk.download('stopwords')
    from nltk.corpus import stopwords
    import re
    
    text = "Sample text for NLP"
    text = text.lower()  # Converts text to lowercase
    text = re.sub(r'\W', ' ', text)  # Removes non-alphanumeric characters
    stop_words = set(stopwords.words('english'))
    text = ' '.join([word for word in text.split() if word not in stop_words])  # Removes stopwords
    
    
  • Text Representation: Represent text using TF-IDF with scikit-learn.

    from sklearn.feature_extraction.text import TfidfVectorizer
    tfidf = TfidfVectorizer()
    X = tfidf.fit_transform(texts)  # Transforms text data into TF-IDF features
    
    
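  • Inspecting TF-IDF Features: A short follow-up sketch; texts is assumed to be a plain list of strings.

    texts = ["sample text for nlp", "another sample document"]
    X = tfidf.fit_transform(texts)
    print(tfidf.get_feature_names_out())  # Vocabulary learned from the corpus (scikit-learn 1.0+)
    print(X.shape)  # (number of documents, number of features)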

Time Series Analysis

  • ARIMA: Fit and forecast time series data using ARIMA models with statsmodels.

    from statsmodels.tsa.arima.model import ARIMA
    model = ARIMA(ts_data, order=(1, 1, 1))
    model_fit = model.fit()  # Fits an ARIMA model to the time series data
    forecast = model_fit.forecast(steps=12)  # Forecasts the next 12 time periods
    
    

Big Data Technologies

  • PySpark: Use PySpark for big data processing and analysis.

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("example").getOrCreate()
    df = spark.read.csv("file.csv", header=True, inferSchema=True)  # Reads a CSV file into a Spark DataFrame
    df.show()  # Displays the first few rows of the DataFrame
    
    

Data Engineering

  • ETL Processes: Extract, Transform, Load (ETL) processes using pandas.

    import pandas as pd
    
    # Extract
    df = pd.read_csv("file.csv")
    
    # Transform
    df['new_column'] = df['column1'] * 2
    
    # Load
    df.to_csv("transformed_file.csv", index=False)  # Saves the transformed DataFrame to a new CSV file
    
    

Cloud Computing for Data Science

  • AWS S3: Interact with AWS S3 using boto3.

    import boto3
    
    s3 = boto3.client('s3')
    s3.download_file('bucket_name', 'file_key', 'local_file')  # Downloads a file from an S3 bucket
    s3.upload_file('local_file', 'bucket_name', 'file_key')  # Uploads a file to an S3 bucket
    
    

Data Visualization

  • Tableau: Use tableau-api-lib to interact with Tableau Server.

    from tableau_api_lib import TableauServerConnection
    from tableau_api_lib.utils.querying import get_projects_dataframe
    
    connection = TableauServerConnection()
    connection.sign_in()
    projects_df = get_projects_dataframe(connection)  # Retrieves a DataFrame of projects from Tableau Server
    connection.sign_out()
    
    

Additional Topics

  • Version Control: Use Git for version control.

    git init  # Initializes a new Git repository
    git add .  # Adds all files to the staging area
    git commit -m "Initial commit"  # Commits the files to the repository
    
    
  • Documentation: Use Jupyter Notebooks for interactive coding and documentation.

    # Jupyter Notebook cell
    print("Hello, world!")  # Prints a message
    
    

Helpful Libraries

  • Data Manipulation: pandas, numpy for data manipulation and numerical operations.
  • Data Visualization: matplotlib, seaborn for creating various types of plots.
  • Machine Learning: scikit-learn, xgboost for building and evaluating models.
  • Deep Learning: tensorflow, keras, pytorch for creating neural networks.
  • NLP: nltk, spacy, gensim for processing and analyzing text data.
  • Big Data: pyspark, dask for handling large datasets.
  • Cloud Computing: boto3, google-cloud for interacting with cloud services.
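  • Example (xgboost): xgboost is listed above but not demonstrated elsewhere, so here is a minimal, hedged sketch of a gradient-boosted classifier; it assumes X_train, X_test, y_train, and y_test from an earlier split.

    from xgboost import XGBClassifier

    model = XGBClassifier(n_estimators=100, max_depth=3, learning_rate=0.1)
    model.fit(X_train, y_train)  # Trains the gradient-boosted tree model
    print(model.score(X_test, y_test))  # Accuracy on the held-out test split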

Python Programming Cheatsheet for Data Scientists - Part 2 (Intermediate/Advanced)

Advanced Data Structures

  • Deque: A double-ended queue optimized for inserting and removing elements from both ends. Use the deque class from the collections module.

    • Code: Demonstrates how to append elements to both ends of a deque.
    from collections import deque
    dq = deque([1, 2, 3, 4])
    dq.appendleft(0)  # Adds 0 to the beginning
    dq.append(5)  # Adds 5 to the end
    
    
  • Named Tuples: Tuples with named fields for better readability. Use the namedtuple class from the collections module.

    • Code: Creates a named tuple Point and initializes it.
    from collections import namedtuple
    Point = namedtuple('Point', ['x', 'y'])
    p = Point(1, 2)  # Creates a named tuple Point with values x=1 and y=2
    
    
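  • Using These Structures: A brief follow-up on the two structures above.

    • Code: Removes elements from both ends of the deque and reads the named tuple fields by name.
    first = dq.popleft()  # Removes and returns the leftmost element (0)
    last = dq.pop()  # Removes and returns the rightmost element (5)
    print(p.x, p.y)  # Accesses named tuple fields by name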

Advanced Data Manipulation

  • GroupBy and Aggregate: Group data and perform aggregate operations using pandas. This is useful for summarizing data.

    • Code: Groups a DataFrame by a column and sums another column.
    import pandas as pd
    df = pd.DataFrame({'category': ['A', 'B', 'A', 'B'], 'values': [1, 2, 3, 4]})
    grouped = df.groupby('category').agg({'values': 'sum'})  # Groups by 'category' and sums 'values'
    
    
  • Merge and Join: Combine DataFrames using various types of joins to enrich data.

    • Code: Demonstrates an inner join between two DataFrames on a common column.
    df1 = pd.DataFrame({'id': [1, 2, 3], 'value1': ['A', 'B', 'C']})
    df2 = pd.DataFrame({'id': [1, 2, 4], 'value2': ['X', 'Y', 'Z']})
    merged_df = pd.merge(df1, df2, on='id', how='inner')  # Inner join on 'id'
    
    
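    • Code (other joins): The same merge with a left and an outer join, for comparison with the inner join above.
    left_df = pd.merge(df1, df2, on='id', how='left')  # Keeps every row from df1; unmatched value2 becomes NaN
    outer_df = pd.merge(df1, df2, on='id', how='outer')  # Keeps all rows from both DataFrames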

Advanced Data Visualization

  • Plotly: Create interactive plots with the Plotly library. It supports a wide range of visualizations.

    • Code: Creates an interactive scatter plot.
    import plotly.express as px
    fig = px.scatter(df, x='column1', y='column2', color='category')  # Creates an interactive scatter plot
    fig.show()  # Displays the plot
    
    
  • Bokeh: Create interactive visualizations for modern web browsers using Bokeh.

    • Code: Generates a simple scatter plot and displays it in a Jupyter Notebook.
    from bokeh.plotting import figure, show
    from bokeh.io import output_notebook
    output_notebook()
    
    p = figure(title="Bokeh Plot", x_axis_label='X', y_axis_label='Y')
    p.circle([1, 2, 3], [4, 5, 6], size=10, color="navy", alpha=0.5)  # Creates a scatter plot
    show(p)  # Displays the plot in a Jupyter Notebook
    
    

Advanced Machine Learning

  • Hyperparameter Tuning: Optimize model parameters using Grid Search and Random Search to improve model performance.

    • Code: Uses GridSearchCV to perform grid search for hyperparameter tuning.
    from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
    from sklearn.ensemble import RandomForestClassifier
    
    param_grid = {'n_estimators': [50, 100, 200], 'max_depth': [None, 10, 20]}
    grid_search = GridSearchCV(RandomForestClassifier(), param_grid, cv=5)
    grid_search.fit(X_train, y_train)  # Performs grid search to find the best parameters
    print(grid_search.best_params_)  # Prints the best parameters found
    
    
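    • Code (random search): A hedged sketch of the randomized alternative, using the RandomizedSearchCV already imported above; it samples 10 parameter combinations instead of trying every one, and assumes the same X_train and y_train.
    from scipy.stats import randint

    param_dist = {'n_estimators': randint(50, 300), 'max_depth': [None, 10, 20, 30]}
    random_search = RandomizedSearchCV(RandomForestClassifier(), param_dist, n_iter=10, cv=5, random_state=42)
    random_search.fit(X_train, y_train)  # Evaluates 10 randomly sampled parameter combinations
    print(random_search.best_params_)  # Prints the best parameters found
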
  • Ensemble Learning: Combine predictions from multiple models to improve overall performance.

    • Code: Demonstrates the use of VotingClassifier to combine predictions from a random forest and gradient boosting model.
    from sklearn.ensemble import VotingClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    
    model1 = RandomForestClassifier(n_estimators=100)
    model2 = GradientBoostingClassifier(n_estimators=100)
    ensemble_model = VotingClassifier(estimators=[('rf', model1), ('gb', model2)], voting='soft')
    ensemble_model.fit(X_train, y_train)  # Trains the ensemble model
    
    

Deep Learning

  • Transfer Learning: Use pre-trained models to leverage existing knowledge for new tasks. This is useful for tasks with limited data.

    • Code: Uses the pre-trained VGG16 model from TensorFlow and adds custom layers for a new task.
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.models import Model
    from tensorflow.keras.layers import Dense, Flatten
    
    base_model = VGG16(weights='imagenet', include_top=False, input_shape=(224, 224, 3))
    x = base_model.output
    x = Flatten()(x)
    x = Dense(1024, activation='relu')(x)
    predictions = Dense(10, activation='softmax')(x)
    model = Model(inputs=base_model.input, outputs=predictions)
    
    for layer in base_model.layers:
        layer.trainable = False  # Freeze the layers of the base model
    
    model.compile(optimizer='adam', loss='categorical_crossentropy')
    model.fit(train_data, train_labels, epochs=10, batch_size=32)  # Trains the model
    
    

Natural Language Processing (NLP)

  • Word Embeddings: Represent words in a continuous vector space using gensim. This helps in capturing semantic relationships between words.

    • Code: Trains a Word2Vec model and retrieves the vector for a specific word.
    from gensim.models import Word2Vec
    
    sentences = [['this', 'is', 'a', 'sentence'], ['another', 'sentence']]
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)  # Trains a Word2Vec model
    vector = model.wv['sentence']  # Gets the vector representation of the word 'sentence'
    
    
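    • Code (similarity): A small follow-up showing the semantic-similarity queries mentioned above; with this tiny toy corpus the neighbors are not meaningful, but the call is the same on real data.
    similar_words = model.wv.most_similar('sentence', topn=3)  # Words closest to 'sentence' in the embedding space
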
  • Transformers: Use transformer models like BERT for advanced NLP tasks with transformers. These models are powerful for understanding context in text.

    • Code: Loads a pre-trained BERT model and tokenizer, and processes input text.
    from transformers import BertTokenizer, BertModel
    
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
    model = BertModel.from_pretrained('bert-base-uncased')
    inputs = tokenizer("Hello, world!", return_tensors="pt")
    outputs = model(**inputs)  # Gets the BERT model outputs for the input text
    
    
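    • Code (using the outputs): A hedged sketch of pulling embeddings out of the BERT outputs; mean-pooling the token vectors is one simple, if crude, way to get a sentence-level representation.
    last_hidden = outputs.last_hidden_state  # Token-level embeddings, shape (batch_size, num_tokens, 768)
    sentence_embedding = last_hidden.mean(dim=1)  # Averages the token vectors into a single sentence vector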

Time Series Analysis

  • Seasonal Decomposition: Decompose time series data into trend, seasonal, and residual components using statsmodels. This helps in understanding underlying patterns in the data.

    • Code: Decomposes a time series and plots the components.
    from statsmodels.tsa.seasonal import seasonal_decompose
    
    result = seasonal_decompose(ts_data, model='additive')
    result.plot()  # Plots the decomposed components of the time series
    
    
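    • Code (components): Accesses the individual components mentioned above once the decomposition has run.
    trend = result.trend  # Extracted trend component
    seasonal = result.seasonal  # Extracted seasonal component
    residual = result.resid  # What remains after removing trend and seasonality
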
  • Prophet: Perform time series forecasting with Prophet, a tool designed for reliable forecasting of time series data.

    • Code: Fits a Prophet model to time series data and makes future predictions.
    from prophet import Prophet  # The package was renamed from fbprophet to prophet
    
    df = pd.DataFrame({'ds': dates, 'y': values})  # Prepares the data for Prophet
    model = Prophet()
    model.fit(df)  # Fits the Prophet model to the data
    future = model.make_future_dataframe(periods=365)
    forecast = model.predict(future)  # Predicts future values
    model.plot(forecast)  # Plots the forecast
    
    

Big Data Technologies

  • Dask: Scale data processing workflows using Dask, which allows parallel computing.

    • Code: Reads a large CSV file into a Dask DataFrame and performs filtering.
    import dask.dataframe as dd
    
    df = dd.read_csv('large_file.csv')  # Reads a large CSV file into a Dask DataFrame
    df = df[df['column1'] > 1]  # Filters rows where column1 is greater than 1
    df.compute()  # Triggers the computation and returns a pandas DataFrame
    
    
  • Hadoop with PySpark: Interact with Hadoop using PySpark for distributed data processing.

    • Code: Reads a CSV file from HDFS into a Spark DataFrame and displays it.
    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("example").getOrCreate()
    df = spark.read.csv("hdfs:///path/to/file.csv", header=True, inferSchema=True)  # Reads a CSV file from HDFS
    df.show()  # Displays the first few rows of the DataFrame
    
    

Data Engineering

  • Airflow: Automate and schedule workflows using Apache Airflow. It helps in managing complex data pipelines.

    • Code: Defines a simple Airflow DAG with a Python task.
    from airflow import DAG
    from airflow.operators.python import PythonOperator  # The python_operator module path is deprecated
    from datetime import datetime
    
    def my_function():
        print("Hello from Airflow!")
    
    default_args = {'owner': 'airflow', 'start_date': datetime(2023, 1, 1)}
    dag = DAG('my_dag', default_args=default_args, schedule_interval='@daily')
    
    task = PythonOperator(task_id='my_task', python_callable=my_function, dag=dag)
    
    

Cloud Computing for Data Science

  • Google Cloud Platform (GCP): Interact with Google Cloud Storage using google-cloud-storage.

    • Code: Downloads and uploads a file to Google Cloud Storage.
    from google.cloud import storage
    
    client = storage.Client()
    bucket = client.get_bucket('my_bucket')
    blob = bucket.blob('file.txt')
    blob.download_to_filename('local_file.txt')  # Downloads a file from GCS
    blob.upload_from_filename('local_file.txt')  # Uploads a file to GCS
    
    

Data Visualization

  • Interactive Dashboards: Create interactive dashboards with Dash, a web-based application framework.

    • Code: Defines a simple Dash app with a bar plot.
    import dash
    from dash import dcc, html  # dash_core_components and dash_html_components now live inside dash
    
    app = dash.Dash(__name__)
    
    app.layout = html.Div([
        dcc.Graph(
            id='example-graph',
            figure={
                'data': [{'x': [1, 2, 3], 'y': [4, 1, 2], 'type': 'bar', 'name': 'SF'},
                         {'x': [1, 2, 3], 'y': [2, 4, 5], 'type': 'bar', 'name': 'NYC'}],
                'layout': {'title': 'Dash Data Visualization'}
            }
        )
    ])
    
    if __name__ == '__main__':
        app.run(debug=True)  # Runs the Dash app (app.run_server is deprecated in recent Dash releases)
    
    

Containerization and Orchestration

  • Docker: Containerize applications for consistent environments using Docker.

    • Code: Example of a Dockerfile to create a Docker image.
    # Dockerfile example
    FROM python:3.8-slim
    
    WORKDIR /app
    COPY requirements.txt requirements.txt
    RUN pip install -r requirements.txt
    COPY . .
    
    CMD ["python", "app.py"]
    
    
  • Kubernetes: Orchestrate containerized applications using kubectl, a command-line tool for Kubernetes.

    • Code: Deploys an application to a Kubernetes cluster and lists running pods.
    kubectl create -f deployment.yaml  # Deploys an application to a Kubernetes cluster
    kubectl get pods  # Lists all running pods in the cluster
    
    

Additional Topics

  • API Development: Create RESTful APIs with Flask, a micro web framework for Python.

    • Code: Defines a simple Flask API with a prediction endpoint.
    from flask import Flask, jsonify, request
    
    app = Flask(__name__)
    
    @app.route('/predict', methods=['POST'])
    def predict():
        data = request.get_json(force=True)
        prediction = model.predict(data['input'])  # 'model' is assumed to be a model trained or loaded elsewhere
        return jsonify({'prediction': prediction.tolist()})
    
    if __name__ == '__main__':
        app.run(debug=True)  # Runs the Flask app
    
    
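    • Code (calling the API): A hedged client-side sketch using the requests library; it assumes the app above is running locally on Flask's default port and that the model expects a list of feature rows.
    import requests

    response = requests.post('http://localhost:5000/predict', json={'input': [[1.0, 2.0, 3.0]]})
    print(response.json())  # e.g. {'prediction': [...]}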

Helpful Libraries

  • Advanced Data Manipulation: pandas, numpy, dask for complex data transformations and handling large datasets.
  • Advanced Data Visualization: plotly, bokeh, dash for creating interactive and complex visualizations.
  • Advanced Machine Learning: scikit-learn, xgboost, lightgbm for building and tuning complex models.
  • Deep Learning: tensorflow, keras, pytorch for advanced neural network architectures and transfer learning.
  • NLP: nltk, spacy, gensim, transformers for sophisticated text processing and language models.
  • Big Data: pyspark, dask, hadoop for big data processing and distributed computing.
  • Cloud Computing: boto3, google-cloud-storage, azure-storage-blob for cloud-based data storage and processing.