7.1.1. Importing Datasets
The Problem
Why Data Analysis?
- Data is everywhere
- Data analysis/data science helps us answer questions from data
- Data analysis plays an important role in:
  - Discovering useful information
  - Answering questions
  - Predicting the future or the unknown
Python Packages for Data Science
- Scientific Computing Libraries
  - Pandas (Data structures & tools)
  - NumPy (Arrays & matrices)
  - SciPy (Integrals, solving differential equations, optimization)
- Visualization Libraries
  - Matplotlib (plots & graphs, most popular)
  - Seaborn (plots: heat maps, time series, violin plots)
- Algorithmic Libraries
  - Scikit-learn (Machine Learning: regression, classification, ...)
  - Statsmodels (Explore data, estimate statistical models, and perform statistical tests)
Question
Which description best fits the Pandas library?
- Uses arrays as inputs and outputs. It can be extended to objects for matrices, and with a few code changes, developers can perform fast array processing.
- Includes functions for some advanced math problems, as listed in the slide, as well as data visualization.
- Offers data structures and tools for effective data manipulation and analysis. It provides fast access to structured data. The primary instrument of Pandas is the DataFrame, a two-dimensional table with row and column labels. It is designed to provide easy indexing.
Correct: the third option.
Importing and Exporting Data in Python
Importing a CSV into Python
import pandas as pd
url = "https://www......"
df = pd.read_csv(url)
Importing a CSV without a header
import pandas as pd
url = "https://www......"
df = pd.read_csv(url, header = None)
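
To see what `header=None` changes, here is a minimal sketch. It assumes a small in-memory CSV (the `raw` string and its values are made up for illustration) in place of the file behind `url`.

```python
import io
import pandas as pd

# Hypothetical two-line CSV with no header row (illustrative values only).
raw = "3,alfa-romero,13495\n1,audi,13950\n"

# Default behaviour: the first line is used as column names,
# so only one data row remains.
df_default = pd.read_csv(io.StringIO(raw))

# header=None: every line is treated as data and the columns
# receive integer labels 0, 1, 2.
df_no_header = pd.read_csv(io.StringIO(raw), header=None)

print(len(df_default), list(df_default.columns))
print(len(df_no_header), list(df_no_header.columns))
```

If the file really has no header row, the default call silently turns the first record into column names, which is why `header=None` matters here.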
Printing the dataframe in Python
- df: prints the entire dataframe (not recommended for large datasets)
- df.head(n): shows the first n rows of the dataframe
- df.tail(n): shows the last n rows of the dataframe
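
A short sketch of `head` and `tail` on a made-up 100-row dataframe (the `row_id` column is a hypothetical name used only for this example):

```python
import pandas as pd

# Hypothetical dataframe with 100 rows, numbered 0..99.
df = pd.DataFrame({"row_id": range(100)})

first_three = df.head(3)   # rows 0, 1, 2
last_three = df.tail(3)    # rows 97, 98, 99

print(first_three)
print(last_three)
```

Calling `df.head()` or `df.tail()` without an argument returns 5 rows.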
Adding headers
- Replace the default header by assigning a list of names to df.columns:
headers = ["header_1", "header_2", ... "header_n"]
df.columns = headers
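
As a concrete sketch, assuming a headerless three-column CSV held in memory (the data and the column names `symboling`, `make`, `price` are illustrative, not taken from the course file):

```python
import io
import pandas as pd

# Hypothetical headerless CSV with three columns.
raw = "3,alfa-romero,13495\n1,audi,13950\n"
df = pd.read_csv(io.StringIO(raw), header=None)

# One name per column; the list length must match the number of columns.
headers = ["symboling", "make", "price"]
df.columns = headers

print(df)
```

Assigning a list whose length differs from the number of columns raises a `ValueError`.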
Exporting a Pandas dataframe to CSV
- Preserve your progress at any time by saving the modified dataset with df.to_csv():
path = "c:/Windows/.../data_file.csv"
df.to_csv(path)
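
A minimal round-trip sketch, assuming a temporary directory instead of a fixed path and a small made-up dataframe:

```python
import os
import tempfile
import pandas as pd

# Hypothetical dataframe to save.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "data_file.csv")
    # index=False leaves the row index out of the file; without it,
    # reading the file back yields an extra "Unnamed: 0" column.
    df.to_csv(path, index=False)
    restored = pd.read_csv(path)

print(restored)
```

Whether to write the index depends on the dataset; for a default 0..n-1 index it usually carries no information.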
Exporting to different formats in Python
Data Format | Read | Save |
---|---|---|
csv | pd.read_csv() | df.to_csv() |
json | pd.read_json() | df.to_json() |
Excel | pd.read_excel() | df.to_excel() |
sql | pd.read_sql() | df.to_sql() |
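
The other formats follow the same read/save pattern. As one possible illustration, here is a JSON round trip done in memory; the dataframe and the choice of `orient="records"` are assumptions made for this sketch.

```python
import io
import pandas as pd

# Hypothetical dataframe.
df = pd.DataFrame({"make": ["audi", "bmw"], "price": [13950, 16430]})

# With no path argument, to_json returns the JSON text as a string.
json_text = df.to_json(orient="records")

# Wrap the text in StringIO so read_json treats it as a file-like object.
restored = pd.read_json(io.StringIO(json_text), orient="records")

print(json_text)
print(restored)
```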
Getting Started Analyzing Data in Python
Basic insights from the data
- Understand your data before you begin any analysis
- Should check:
  - Data types
    - Why check data types?
      - Potential info and type mismatch
      - Compatibility with Python methods
  - Data distribution
- Locate potential issues with the data
Basic Insights of Dataset - Data Types
- In pandas, use dataframe.dtypes to check the data type of each column
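
A sketch of the kind of type mismatch that checking `dtypes` can reveal. It assumes a made-up CSV in which a missing price is recorded as "?", so the whole column is read as text; converting with `pd.to_numeric` is one possible fix, not necessarily the one used in the course.

```python
import io
import pandas as pd

# Hypothetical CSV where "?" marks a missing price.
raw = "make,price\naudi,13950\nbmw,?\n"
df = pd.read_csv(io.StringIO(raw))

# Because of the "?", price is read as a non-numeric (text) column.
price_dtype_before = df["price"].dtype
print(df.dtypes)

# Convert to numbers; entries that cannot be parsed become NaN.
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df.dtypes)
```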
dataframe.describe()
- Returns a statistical summary of the numeric columns
- count, mean, std, min, 25%, 50%, 75%, max ...
dataframe.describe(include="all")
- Provides full summary statistics for all columns, including non-numeric ones
- count, unique, top, freq, mean, std, min, 25%, 50%, 75%, max ...
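
The difference between the two calls can be sketched on a small made-up dataframe (column names and values are illustrative):

```python
import pandas as pd

# Hypothetical dataframe with one text column and two numeric columns.
df = pd.DataFrame({
    "make": ["audi", "bmw", "audi"],
    "price": [13950.0, 16430.0, 17450.0],
    "doors": [4, 2, 4],
})

column_types = df.dtypes                    # one dtype per column
numeric_summary = df.describe()             # numeric columns only
full_summary = df.describe(include="all")   # adds unique, top, freq rows

print(column_types)
print(numeric_summary)
print(full_summary)
```

In the full summary, statistics that do not apply to a column (for example the mean of `make`) appear as NaN.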
Basic Insights of Dataset - Info
dataframe.info()
provides a concise summary of the dataframe: the index range, each column's name, non-null count, and data type, and the memory usage
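
A small sketch of `info()` on a made-up dataframe. `info()` prints its report rather than returning it, so the example passes a buffer to capture the text.

```python
import io
import pandas as pd

# Hypothetical dataframe with one missing value in "make".
df = pd.DataFrame({
    "make": ["audi", "bmw", None],
    "price": [13950.0, 16430.0, 17450.0],
})

# Capture the report in a buffer instead of printing straight to stdout.
buffer = io.StringIO()
df.info(buf=buffer)
summary_text = buffer.getvalue()

print(summary_text)
```

The non-null counts in the report are a quick way to spot columns with missing values.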
Accessing Databases with Python
What is a SQL API?
What is a DB-API?
Concepts of the Python DB API
- Connection Objects
  - Database connections
  - Manage transactions
- Cursor Objects
  - Database queries
What are Connection methods?
- cursor(): returns a new cursor object using the connection
- commit(): commits any pending transaction to the database
- rollback(): rolls the database back to the start of any pending transaction
- close(): closes the database connection
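
These four methods can be tried with Python's built-in `sqlite3` module, which follows the DB-API. The sketch below assumes an in-memory database and a made-up `cars` table; other database drivers may handle transactions slightly differently.

```python
import sqlite3

# In-memory database: nothing is written to disk.
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute("CREATE TABLE cars (make TEXT, price REAL)")

# This insert is made permanent by commit().
cursor.execute("INSERT INTO cars VALUES ('audi', 13950)")
connection.commit()

# This insert is undone by rollback().
cursor.execute("INSERT INTO cars VALUES ('bmw', 16430)")
connection.rollback()

cursor.execute("SELECT make FROM cars")
rows = cursor.fetchall()
print(rows)

# Free resources.
cursor.close()
connection.close()
```

Only the committed row survives; the second insert is discarded when the pending transaction is rolled back.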
Writing code using DB-API
from dbmodule import connect
# Create connection object
connection = connect('databasename', 'username', 'pswd')
# Create a cursor object
cursor = connection.cursor()
# Run queries
cursor.execute('select * from mytable')
results = cursor.fetchall()
# Free resources
cursor.close()
connection.close()
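
`dbmodule` above is a placeholder for a real driver. As a runnable sketch of the same flow, the example below swaps in the standard-library `sqlite3` module with an in-memory database; the table contents are made up for illustration, and a real driver would need its own connection arguments.

```python
import sqlite3

# Create connection object (in-memory database, assumed for this sketch).
connection = sqlite3.connect(":memory:")

# Create a cursor object.
cursor = connection.cursor()

# Set up a small hypothetical table so the query has something to return.
cursor.execute("CREATE TABLE mytable (id INTEGER, name TEXT)")
cursor.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(1, "first"), (2, "second")],
)
connection.commit()

# Run queries.
cursor.execute("select * from mytable order by id")
results = cursor.fetchall()
print(results)

# Free resources.
cursor.close()
connection.close()
```

`fetchall()` returns a list of tuples, one tuple per row.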