7.1.1.Importing Datasets - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

The Problem

Why Data Analysis?

Data is everywhere
Data analysis/data science helps us answer questions from data
Data analysis plays an important role in:
- Discovering useful information
- Answering questions
- Predicting the future or the unknown

Python Packages for Data Science

Scientific Computing Libraries
1. Pandas (Data structures & tools)
2. NumPy (Arrays & matrices)
3. SciPy (Integrals, solving differential equations, optimizations)
Visualization Libraries
1. Matplotlib (plots & graphs, most popular)
2. Seaborn (plots: heat maps, time series, violin plots)
Algorithmic Libraries
1. Scikit-learn (Machine Learning: regression, classification, ... )
2. Statsmodels (Explore data, estimate statistical models, and perform statistical tests)

Question

Which description best describes the Pandas library?

~~Uses arrays as their inputs and outputs. It can be extended to objects for matrices, and with a little change of coding, developers perform fast array processing.~~
~~Includes functions for some advanced math problems as listed in the slide as well as data visualization.~~
Offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is a two-dimensional table with labeled rows and columns, called a DataFrame. It is designed to provide easy indexing.

Correct.

Importing and Exporting Data in Python

Importing a CSV into Python

import pandas as pd
url = "https://www......"
df = pd.read_csv(url)

Importing a CSV without a header

import pandas as pd
url = "https://www......"
df = pd.read_csv(url, header = None)
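
As a minimal sketch of what `header=None` changes, the example below reads a small in-memory CSV instead of a URL. The sample rows and the `raw_csv` name are assumptions made for illustration; they are not part of the course dataset.

```python
import io

import pandas as pd

# Hypothetical CSV content with no header line (stands in for a remote file)
raw_csv = "3,alfa-romero,13495\n1,audi,13950\n"

# header=None tells pandas that the first line is data, not column names
df = pd.read_csv(io.StringIO(raw_csv), header=None)

# pandas falls back to integer column labels
print(df.columns.tolist())  # [0, 1, 2]
print(df.shape)             # (2, 3)
```

Without `header=None`, the first data row would be consumed as the column names and the dataframe would have one row fewer.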

Printing the dataframe in Python

df prints the entire dataframe (not recommended for large datasets)
df.head(n) shows the first n rows of the dataframe
df.tail(n) shows the last n rows of the dataframe
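
The calls above can be tried on a small dataframe. This is only a sketch: the ten-row dataframe and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical 10-row dataframe used only for illustration
df = pd.DataFrame({"id": range(10), "value": [x * 1.5 for x in range(10)]})

first_three = df.head(3)  # rows with index 0, 1, 2
last_two = df.tail(2)     # rows with index 8, 9
default_head = df.head()  # n defaults to 5 when omitted

print(first_three)
print(last_two)
```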

Adding headers

Replace the default header by assigning a list of column names to df.columns:

headers = ["header_1", "header_2", ... "header_n"]
df.columns = headers
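
A small end-to-end sketch of the same step, assuming a three-column CSV without a header line. The column names `symboling`, `fuel_type` and `price` are hypothetical and chosen only for this example.

```python
import io

import pandas as pd

# Hypothetical headerless CSV content
raw_csv = "3,gas,13495\n1,diesel,16500\n"
df = pd.read_csv(io.StringIO(raw_csv), header=None)

# The list must contain exactly one name per column,
# otherwise pandas raises a ValueError
headers = ["symboling", "fuel_type", "price"]
df.columns = headers

print(df.columns.tolist())  # ['symboling', 'fuel_type', 'price']
```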

Exporting a Pandas dataframe to CSV

Preserve your progress at any time by saving the modified dataset with df.to_csv:

path = "c:/Windows/.../data_file.csv"
df.to_csv(path)
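
The following sketch writes a dataframe to a temporary file and reads it back, to check that the export round-trips. The sample data and the temporary directory are assumptions for illustration. Passing `index=False` is a choice made here so the row index is not written as an extra column; the call shown above keeps it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical dataframe standing in for a modified dataset
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.csv")
    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)
    # Read the file back while the temporary directory still exists
    restored = pd.read_csv(path)

print(restored)
```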

Exporting to different formats in Python

| Data Format | Read | Save |
| --- | --- | --- |
| CSV | `pd.read_csv()` | `df.to_csv()` |
| JSON | `pd.read_json()` | `df.to_json()` |
| Excel | `pd.read_excel()` | `df.to_excel()` |
| SQL | `pd.read_sql()` | `df.to_sql()` |
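
The read and save functions come in pairs, so the pattern is the same for each format. As a hedged illustration, here is a JSON round trip done entirely in memory; the sample dataframe and the `orient="records"` layout are choices made for this sketch.

```python
import io

import pandas as pd

# Hypothetical dataframe used only for illustration
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path argument, to_json returns the JSON text as a string
json_text = df.to_json(orient="records")

# Wrap the string in StringIO so read_json treats it as file-like input
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored)
```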

Getting Started Analyzing Data in Python

Basic insights from the data

Understand your data before you begin any analysis
You should check:
- Data Types
  - Why check data types?
    1. Pandas may assign a type that does not match the information in a column
    2. Some Python methods only work with specific data types
- Data Distribution
Locate potential issues with the data

Basic Insights of Dataset - Data Types

In pandas, we use dataframe.dtypes to check data types
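
A minimal sketch of a type mismatch that `dtypes` can reveal. The dataframe is invented for illustration: the `price` column holds numbers stored as text, with `"?"` standing in for a missing value. Converting with `pd.to_numeric` is one possible fix, not the only one.

```python
import pandas as pd

# Hypothetical data: price holds numeric information stored as text
df = pd.DataFrame({
    "make": ["audi", "bmw"],
    "doors": [4, 2],
    "price": ["13950", "?"],
})

print(df.dtypes)  # price is reported as a text column, not a number
price_dtype_before = df["price"].dtype

# errors="coerce" turns values that cannot be parsed into NaN
df["price"] = pd.to_numeric(df["price"], errors="coerce")

print(df.dtypes)  # price is now a floating-point column
```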

`dataframe.describe()`

Returns a statistical summary
- count, mean, std, min, 25%, 50%, 75%, max ...
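
As a sketch on made-up data, `describe()` with no arguments summarizes only the numeric columns, so the text column `make` below does not appear in the result.

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

summary = df.describe()

print(summary)
print(summary.loc["mean", "price"])  # 25.0
```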

`dataframe.describe(include="all")`

Provides full summary statistics
- count, unique, top, freq, mean, std, min, 25%, 50%, 75%, max ...
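
With `include="all"`, text columns are summarized too. In this sketch on made-up data, statistics that do not apply to a column are filled with NaN: for example, `mean` for a text column and `unique` for a numeric one.

```python
import pandas as pd

# Hypothetical dataframe with one text column and one numeric column
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "volvo"],
    "price": [10.0, 20.0, 30.0, 40.0],
})

summary_all = df.describe(include="all")

print(summary_all)
print(summary_all.loc["top", "make"])   # most frequent value: audi
print(summary_all.loc["freq", "make"])  # how often it occurs: 2
```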

Basic Insights of Dataset - Info

dataframe.info() prints a concise summary of the data frame: the index range, each column's name, non-null count and data type, and the memory usage
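
A short sketch of `info()` on made-up data. `info()` prints its report rather than returning it, so the example passes a buffer through the `buf` parameter to capture the text. The report lists the index range, the columns with their non-null counts and dtypes, and the memory usage.

```python
import io

import pandas as pd

# Hypothetical dataframe with one missing value in the make column
df = pd.DataFrame({
    "make": ["audi", "bmw", None],
    "price": [13950.0, 16430.0, 17450.0],
})

# info() writes to stdout by default; buf redirects the report
buffer = io.StringIO()
df.info(buf=buffer)
info_text = buffer.getvalue()

print(info_text)
```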

Accessing Databases with Python

What is a SQL API?

What is a DB-API?

Concepts of the Python DB API

Connection Objects
- Database connections
- Manage transactions
Cursor Objects
- Database Queries

What are Connection methods?

cursor(): returns a new cursor object using the connection
commit(): is used to commit any pending transaction to the database
rollback(): causes the database to roll back to the start of any pending transaction
close(): is used to close a database connection.
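
The four methods can be seen together in a small sketch. It uses `sqlite3` from the Python standard library, which follows the DB-API, with an in-memory database; the table name and rows are assumptions for illustration. Other database drivers may differ in how they start transactions.

```python
import sqlite3

# An in-memory SQLite database stands in for a real server
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()

cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
connection.commit()

# This insert is made permanent by commit()
cursor.execute("INSERT INTO mytable VALUES (1, 'kept')")
connection.commit()

# This insert is undone by rollback()
cursor.execute("INSERT INTO mytable VALUES (2, 'discarded')")
connection.rollback()

cursor.execute("SELECT id, name FROM mytable")
rows = cursor.fetchall()
print(rows)  # [(1, 'kept')]

# Free resources
cursor.close()
connection.close()
```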

Writing code using DB-API

# dbmodule is a placeholder for any DB-API compliant driver module
from dbmodule import connect

# Create connection object
connection = connect('databasename', 'username', 'pswd')

# Create a cursor object
cursor = connection.cursor()

# Run queries
cursor.execute('select * from mytable')
results = cursor.fetchall()

# Free resources
cursor.close()
connection.close()
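
Since `dbmodule` is a placeholder, the pattern above cannot be run as written. Below is a runnable sketch of the same steps, assuming `sqlite3` from the standard library as the driver. The table contents are invented, and `contextlib.closing` is a choice made here so that `close()` is called even if a query raises an error.

```python
import sqlite3
from contextlib import closing

# closing() calls .close() on exit, replacing the manual cleanup step
with closing(sqlite3.connect(":memory:")) as connection:
    with closing(connection.cursor()) as cursor:
        # Set up a hypothetical table so the query has something to return
        cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
        cursor.executemany(
            "INSERT INTO mytable VALUES (?, ?)",
            [(1, "alpha"), (2, "beta")],
        )
        connection.commit()

        # Run a query; the ? placeholder passes the value safely
        cursor.execute(
            "SELECT * FROM mytable WHERE id >= ? ORDER BY id", (1,)
        )
        results = cursor.fetchall()

print(results)  # [(1, 'alpha'), (2, 'beta')]
```

The placeholder style is driver specific: `sqlite3` uses `?`, while other DB-API modules may use a different marker.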