7.1.2.Lab 1: Importing Datasets - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki
Data Acquisition
There are various formats for a dataset: .csv, .json, .xlsx etc. The dataset can be stored in different places, on your local machine or sometimes online.
In this section, you will learn how to load a dataset into our Jupyter Notebook.
In our case, the Automobile Dataset is an online source, and it is in a CSV (comma separated value) format. Let's use this dataset as an example to practice data reading.
- Data source: https://archive.ics.uci.edu/ml/machine-learning-databases/autos/imports-85.data
- Data type: csv
The Pandas Library is a useful tool that enables us to read various datasets into a dataframe; our Jupyter notebook platforms have a built-in
Pandas Library
so that all we need to do is import Pandas without installing.
*# import pandas library*
**import** pandas **as** pd
**import** numpy **as** np
Read Data
We use pandas.read_csv()
function to read the csv file. In the brackets, we put the file path along with a quotation mark so that pandas will read the file into a dataframe from that address. The file path can be either an URL or your local file address.
Because the data does not include headers, we can add an argument headers = None
inside the read_csv()
method so that pandas will not automatically set the first row as a header.
You can also assign the dataset to any variable you create.
This dataset was hosted on IBM Cloud object. Click HERE for free storage.
*# Import pandas library*
**import** pandas **as** pd
*# Read the online file by the URL provides above, and assign it to variable "df"*
other_path **=** "https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-DA0101EN-SkillsNetwork/labs/Data%20files/auto.csv"
df **=** pd.read_csv(other_path, header**=None**)
After reading the dataset, we can use the dataframe.head(n)
method to check the top n rows of the dataframe, where n is an integer. Contrary to dataframe.head(n)
, dataframe.tail(n)
will show you the bottom n rows of the dataframe.
*# show the first 5 rows using dataframe.head() method*
print("The first 5 rows of the dataframe")
df.head(5)
Question #1: Check the bottom 10 rows of data frame "df".
*# Write your code below and press Shift+Enter to execute*
print("The last 10 rows of the dataframe\n")
df.tail(10)
Add Headers
Take a look at our dataset. Pandas automatically set the header with an integer starting from 0.
To better describe our data, we can introduce a header. This information is available at: https://archive.ics.uci.edu/ml/datasets/Automobile.
Thus, we have to add headers manually.
First, we create a list "headers" that include all column names in order. Then, we use dataframe.columns = headers
to replace the headers with the list we created.
*# create headers list*
headers **=** ["symboling","normalized-losses","make","fuel-type","aspiration", "num-of-doors","body-style",
"drive-wheels","engine-location","wheel-base", "length","width","height","curb-weight","engine-type",
"num-of-cylinders", "engine-size","fuel-system","bore","stroke","compression-ratio","horsepower",
"peak-rpm","city-mpg","highway-mpg","price"]
print("headers\n", headers)
headers
['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors',
'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width',
'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system',
'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg', 'price']
We replace headers and recheck our dataframe:
df.columns **=** headers
df.head(10)
We need to replace the "?" symbol with NaN so the dropna() can remove the missing values:
df1**=**df.replace('?',np.NaN)
We can drop missing values along the column "price" as follows:
df**=**df1.dropna(subset**=**["price"], axis**=**0)
df.head(20)
Now, we have successfully read the raw dataset and added the correct headers into the dataframe.
Question #2: Find the name of the columns of the dataframe.
*# Write your code below and press Shift+Enter to execute*
df.columns
Index(['symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration',
'num-of-doors', 'body-style', 'drive-wheels', 'engine-location',
'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type',
'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke',
'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg',
'highway-mpg', 'price'],
dtype='object')
Save Dataset
Correspondingly, Pandas enables us to save the dataset to csv. By using the dataframe.to_csv()
method, you can add the file path and name along with quotation marks in the brackets.
For example, if you would save the dataframe df as automobile.csv to your local machine, you may use the syntax below, where index = False
means the row names will not be written.
df.to_csv("automobile.csv", index=False)
We can also read and save other file formats. We can use similar functions like pd.read_csv()
and df.to_csv()
for other data formats. The functions are listed in the following table:
Basic Insight of Dataset
After reading data into Pandas dataframe, it is time for us to explore the dataset.
There are several ways to obtain essential insights of the data to help us better understand our dataset.
Data Types
Data has a variety of types.
The main types stored in Pandas dataframes are object, float, int, bool and datetime64. In order to better learn about each attribute, it is always good for us to know the data type of each column. In Pandas:
df.dtypes
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
A series with the data type of each column is returned.
*# check the data type of data frame "df" by .dtypes*
print(df.dtypes)
symboling int64
normalized-losses object
make object
fuel-type object
aspiration object
num-of-doors object
body-style object
drive-wheels object
engine-location object
wheel-base float64
length float64
width float64
height float64
curb-weight int64
engine-type object
num-of-cylinders object
engine-size int64
fuel-system object
bore object
stroke object
compression-ratio float64
horsepower object
peak-rpm object
city-mpg int64
highway-mpg int64
price object
dtype: object
As shown above, it is clear to see that the data type of "symboling" and "curb-weight" are int64
, "normalized-losses" is object
, and "wheel-base" is float64
, etc.
These data types can be changed; we will learn how to accomplish this in a later module.
Describe
If we would like to get a statistical summary of each column e.g. count, column mean value, column standard deviation, etc., we use the describe method:
dataframe.describe()
This method will provide various summary statistics, excluding NaN
(Not a Number) values.
df.describe()
This shows the statistical summary of all numeric-typed (int, float) columns.
For example, the attribute "symboling" has 205 counts, the mean value of this column is 0.83, the standard deviation is 1.25, the minimum value is -2, 25th percentile is 0, 50th percentile is 1, 75th percentile is 2, and the maximum value is 3.
However, what if we would also like to check all the columns including those that are of type object?
You can add an argument include = "all"
inside the bracket. Let's try it again.
*# describe all the columns in "df"*
df.describe(include **=** "all")
Now it provides the statistical summary of all the columns, including object-typed attributes.
We can now see how many unique values there, which one is the top value and the frequency of top value in the object-typed columns.
Some values in the table above show as "NaN". This is because those numbers are not available regarding a particular column type.
Question #3:
You can select the columns of a dataframe by indicating the name of each column. For example, you can select the three columns as follows:
dataframe[' column 1 ',column 2', 'column 3'](/sj50179/IBM-Data-Science-Professional-Certificate/wiki/'-column-1-',column-2',-'column-3')
Where "column" is the name of the column, you can apply the method ".describe()" to get the statistics of those columns as follows:
dataframe[[' column 1 ',column 2', 'column 3'] ].describe()
Apply the method to ".describe()" to the columns 'length' and 'compression-ratio'.
*# Write your code below and press Shift+Enter to execute*
df['length', 'compression-ratio'](/sj50179/IBM-Data-Science-Professional-Certificate/wiki/'length',-'compression-ratio').describe()
Info
Another method you can use to check your dataset is:
dataframe.info()
It provides a concise summary of your DataFrame.
This method prints information about a DataFrame including the index dtype and columns, non-null values and memory usage.
*# look at the info of "df"*
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 201 entries, 0 to 204
Data columns (total 26 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 symboling 201 non-null int64
1 normalized-losses 164 non-null object
2 make 201 non-null object
3 fuel-type 201 non-null object
4 aspiration 201 non-null object
5 num-of-doors 199 non-null object
6 body-style 201 non-null object
7 drive-wheels 201 non-null object
8 engine-location 201 non-null object
9 wheel-base 201 non-null float64
10 length 201 non-null float64
11 width 201 non-null float64
12 height 201 non-null float64
13 curb-weight 201 non-null int64
14 engine-type 201 non-null object
15 num-of-cylinders 201 non-null object
16 engine-size 201 non-null int64
17 fuel-system 201 non-null object
18 bore 197 non-null object
19 stroke 197 non-null object
20 compression-ratio 201 non-null float64
21 horsepower 199 non-null object
22 peak-rpm 199 non-null object
23 city-mpg 201 non-null int64
24 highway-mpg 201 non-null int64
25 price 201 non-null object
dtypes: float64(5), int64(5), object(16)
memory usage: 52.4+ KB