7.1.1. Importing Datasets - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

The Problem

Why Data Analysis?

  • Data is everywhere
  • Data analysis/data science helps us answer questions from data
  • Data analysis plays an important role in:
    • Discovering useful information
    • Answering questions
    • Predicting the future or the unknown

Python Packages for Data Science

  1. Scientific Computing Libraries
    1. Pandas (Data structures & tools)
    2. NumPy (Arrays & matrices)
    3. SciPy (Integrals, solving differential equations, optimizations)
  2. Visualization Libraries
    1. Matplotlib (plots & graphs, most popular)
    2. Seaborn (plots: heat maps, time series, violin plots)
  3. Algorithmic Libraries
    1. Scikit-learn (Machine Learning: regression, classification, ... )
    2. Statsmodels (Explore data, estimate statistical models, and perform statistical tests)

Question

What description best describes the library Pandas?

  • Uses arrays as its inputs and outputs. It can be extended to objects for matrices, and with a few coding changes, developers can perform fast array processing.
  • Includes functions for some advanced math problems, as listed in the slide, as well as data visualization.
  • Offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table with column and row labels, called a DataFrame. It is designed to provide easy indexing.

Correct answer: the third description.

Importing and Exporting Data in Python

Importing a CSV into Python

import pandas as pd
url = "https://www......"
df = pd.read_csv(url)

Importing a CSV without a header

import pandas as pd
url = "https://www......"
df = pd.read_csv(url, header = None)
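The effect of header = None is easiest to see side by side. The sketch below is only an illustration: it reads a small in-memory string instead of the elided URL, and the CSV content is made up for the example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for the remote file (no header row).
csv_text = "3,alfa-romero,gas\n1,audi,gas\n2,bmw,diesel\n"

# Default behaviour: the first row is consumed as the header.
df_with_header = pd.read_csv(io.StringIO(csv_text))

# header=None: every row is treated as data and columns are numbered 0, 1, 2, ...
df_no_header = pd.read_csv(io.StringIO(csv_text), header=None)

print(df_with_header.shape)        # one data row fewer than the file has lines
print(df_no_header.shape)          # all rows kept as data
print(list(df_no_header.columns))  # integer column labels
```

If the file has no header row and header = None is omitted, the first record silently becomes the column names, so one row of data is lost.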

Printing the dataframe in Python

  • df prints the entire dataframe (not recommended for large datasets)
  • df.head(n) shows the first n rows of the dataframe
  • df.tail(n) shows the last n rows of the dataframe
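A quick sketch of head and tail on a small, made-up dataframe (the column names and values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical ten-row dataframe.
df = pd.DataFrame({"id": range(1, 11), "price": [p * 100 for p in range(1, 11)]})

first_three = df.head(3)  # first 3 rows
last_two = df.tail(2)     # last 2 rows

print(first_three)
print(last_two)
```

When n is omitted, both methods default to 5 rows.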

Adding headers

  • Replace default header (by df.columns = headers)
headers = ["header_1", "header_2", ... "header_n"]
df.columns = headers
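As a concrete sketch of the header replacement, the example below uses an in-memory CSV and three invented column names; both are assumptions, not part of the course dataset. It also shows the names parameter of read_csv, which is an alternative way to get the same result in a single step.

```python
import io
import pandas as pd

# Hypothetical headerless CSV content.
csv_text = "3,alfa-romero,gas\n1,audi,gas\n"

# Illustrative header names (assumed for this example).
headers = ["symboling", "make", "fuel_type"]

# Option 1: read without a header, then replace the default 0, 1, 2 labels.
df = pd.read_csv(io.StringIO(csv_text), header=None)
df.columns = headers

# Option 2: pass the names while reading.
df_named = pd.read_csv(io.StringIO(csv_text), names=headers)

print(df.columns.tolist())
```

The list assigned to df.columns must have exactly one entry per column, otherwise pandas raises a ValueError.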

Exporting a Pandas dataframe to CSV

  • Preserve your progress at any time by saving the modified dataset with df.to_csv(path)
path = "c:/Windows/.../data_file.csv"
df.to_csv(path)
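A minimal round-trip sketch, writing to a temporary directory rather than the elided Windows path. The dataframe contents are invented for the example. Passing index=False is a common choice, not something the notes prescribe: it stops the row index from being written as an extra unnamed column.

```python
import os
import tempfile
import pandas as pd

# Hypothetical dataframe to save.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.csv")

    # index=False leaves the row index out of the file.
    df.to_csv(path, index=False)

    # Read the file back to confirm the data survived the round trip.
    restored = pd.read_csv(path)

print(restored)
```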

Exporting to different formats in Python

Data Format    Read               Save
csv            pd.read_csv()      df.to_csv()
json           pd.read_json()     df.to_json()
Excel          pd.read_excel()    df.to_excel()
sql            pd.read_sql()      df.to_sql()
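The read/save pairs follow the same pattern across formats. The sketch below exercises the JSON and SQL pairs on a made-up dataframe. It assumes an in-memory SQLite database and a table name ("cars") chosen purely for illustration. Excel is left out because it needs an extra engine package installed.

```python
import io
import sqlite3
import pandas as pd

# Hypothetical dataframe used for both round trips.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# JSON: to_json returns a string when no path is given.
json_text = df.to_json(orient="records")
from_json = pd.read_json(io.StringIO(json_text), orient="records")

# SQL: pandas accepts a sqlite3 connection directly.
connection = sqlite3.connect(":memory:")
df.to_sql("cars", connection, index=False)
from_sql = pd.read_sql("SELECT make, price FROM cars", connection)
connection.close()

print(from_json)
print(from_sql)
```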

Getting Started Analyzing Data in Python

Basic insights from the data

  • Understand your data before you begin any analysis
  • Should check:
    • Data Types
      • Why check data types?
        1. Potential information and type mismatches
        2. Compatibility with Python methods
    • Data Distribution
  • Locate potential issues with the data

Basic Insights of Dataset - Data Types

  • In pandas, we use dataframe.dtypes to check data types
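The sketch below shows the kind of type mismatch that dtypes can reveal. The data is invented: a price column stored as text, with "?" assumed as the missing-value marker. Converting it with pd.to_numeric is one possible fix, shown here only as an illustration.

```python
import pandas as pd

# Hypothetical data: "price" holds numbers stored as text, with "?" for a missing value.
df = pd.DataFrame({
    "make": ["audi", "bmw", "volvo"],
    "price": ["13950", "16430", "?"],
    "doors": [4, 2, 4],
})

# "price" shows up as a text column even though it should be numeric.
print(df.dtypes)

# errors="coerce" turns unparseable entries such as "?" into NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)
```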

dataframe.describe()

  • Returns a statistical summary
    • count, mean, std, min, 25%, 50%, 75%, max ...

dataframe.describe(include="all")

  • Provides full summary statistics
    • count, unique, top, freq, mean, std, min, 25%, 50%, 75%, max ...
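To compare the two calls, here is a small sketch on a made-up dataframe with one text column and one numeric column (names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical dataframe with mixed column types.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [13950.0, 16430.0, 17450.0, 12940.0],
})

# Default: only numeric columns are summarised.
numeric_summary = df.describe()

# include="all": text columns are summarised too (unique, top, freq).
full_summary = df.describe(include="all")

print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of a text column) appear as NaN.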

Basic Insights of Dataset - Info

  • dataframe.info() prints a concise summary of the dataframe: the index range, each column's name, non-null count and data type, and the memory usage
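A short sketch of info() on a made-up dataframe with missing values. Because info() prints rather than returning a value, the example captures the output in a buffer through the buf parameter so it can be inspected as text.

```python
import io
import pandas as pd

# Hypothetical dataframe containing missing values.
df = pd.DataFrame({
    "make": ["audi", None, "volvo"],
    "price": [13950.0, 16430.0, None],
})

# info() writes to stdout by default; buf redirects it to a text buffer.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```

The non-null counts make it easy to spot which columns have missing entries.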

Accessing Databases with Python

What is a SQL API?

What is a DB-API?

Concepts of the Python DB API

  • Connection Objects
    • Database connections
    • Manage transaction
  • Cursor Objects
    • Database Queries

What are Connection methods?

  • cursor(): returns a new cursor object using the connection
  • commit(): is used to commit any pending transaction to the database
  • rollback(): causes the database to roll back to the start of any pending transaction
  • close(): is used to close a database connection.
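The four connection methods can be tried with sqlite3 from the Python standard library, which follows the DB-API. This is a sketch under assumptions: the in-memory database, the table name, and the rows are invented for the example.

```python
import sqlite3

# In-memory database, used here only for illustration.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()  # cursor(): new cursor bound to this connection

cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")

# commit(): make the pending insert permanent.
cursor.execute("INSERT INTO mytable VALUES (1, 'kept')")
connection.commit()

# rollback(): discard the pending insert.
cursor.execute("INSERT INTO mytable VALUES (2, 'discarded')")
connection.rollback()

cursor.execute("SELECT id, name FROM mytable")
rows = cursor.fetchall()
print(rows)

# close(): release the cursor and the connection.
cursor.close()
connection.close()
```

Only the committed row remains, because the second insert was rolled back before it was committed.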

Writing code using DB-API

from dbmodule import connect

# Create connection object
connection = connect('databasename', 'username', 'pswd')

# Create a cursor object
cursor = connection.cursor()

# Run queries
cursor.execute('select * from mytable')
results = cursor.fetchall()

# Free resources
cursor.close()
connection.close()
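The dbmodule import above is a placeholder, so that snippet does not run as written. A runnable sketch of the same steps, assuming sqlite3 as the DB-API module and an invented table, could look like this. It uses contextlib.closing so that the cursor and the connection are closed even if a query fails, and ? placeholders so that values are passed separately from the SQL text.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, even when an exception is raised.
with closing(sqlite3.connect(":memory:")) as connection:
    with closing(connection.cursor()) as cursor:
        # Hypothetical table and rows for the example.
        cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
        cursor.executemany(
            "INSERT INTO mytable VALUES (?, ?)",
            [(1, "alpha"), (2, "beta")],
        )
        connection.commit()

        # Run a parameterised query and fetch every matching row.
        cursor.execute(
            "SELECT id, name FROM mytable WHERE id >= ? ORDER BY id", (1,)
        )
        results = cursor.fetchall()

print(results)
```

Other DB-API modules take different connect() arguments and may use a different placeholder style, so both need adapting to the database in use.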