explanation each code line by line experiment 2 - FarhaKousar1601/DATA-SCIENCE-AND-ITS-APPLICATION-LABORATORY-21AD62- GitHub Wiki
Experiment - 2
Aim
Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books), which contains information about books. Write a program to demonstrate the following:
- Import the data into a DataFrame.
- Find and drop the columns which are irrelevant for the book information.
- Change the Index of the DataFrame.
- Tidy up fields in the data such as the date of publication with the help of a simple regular expression.
- Combine str methods with NumPy to clean columns.
Code Explanation
import pandas as pd
import numpy as np
# Import the data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')
# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())
# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)
# Change the Index of the DataFrame
df.set_index('Identifier', inplace=True)
# Tidy up fields in the data such as date of publication with the help of simple regular expression
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
# Combine str methods with NumPy to clean columns
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' '))
# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())
Explanation of Each Step
Importing the Required Libraries
import pandas as pd
import numpy as np
- Pandas (pd) is a powerful data manipulation and analysis library for Python.
- NumPy (np) is a fundamental package for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices.
Import the Data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')
- pd.read_csv() reads a CSV file and loads its contents into a DataFrame.
- 'BL-Flickr-Images-Book.csv' is the filename of the dataset.
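To try the loading step without downloading the file, the sketch below feeds pd.read_csv() a small in-memory CSV. The three rows and the names csv_text and sample are hypothetical stand-ins, not values taken from the real dataset.

```python
import io

import pandas as pd

# Hypothetical three-row stand-in for BL-Flickr-Images-Book.csv.
csv_text = """Identifier,Title,Place of Publication
206,Book A,London
216,Book B,London; Virtue & Yorston
218,Book C,Newcastle-upon-Tyne
"""

# read_csv accepts a file path or any file-like object such as StringIO.
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)           # (number of rows, number of columns)
print(list(sample.columns))   # column names taken from the header row
```

With the real file, the same call takes the filename instead of the StringIO object.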
Display the First Few Rows of the DataFrame
print("Original DataFrame:")
print(df.head())
- print(df.head()) displays the first 5 rows of the DataFrame to give an overview of the data.
Find and Drop Irrelevant Columns
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)
- irrelevant_columns is a list of column names that are not relevant for the book information.
- df.drop(columns=irrelevant_columns, inplace=True) removes these columns from the DataFrame. Because inplace=True is set, df itself is modified and no new DataFrame is returned.
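The following sketch shows the same drop on a toy DataFrame, assuming made-up values and the hypothetical names sample, to_drop, and trimmed. It omits inplace=True so the original frame stays available for comparison, and it adds errors="ignore", which is not part of the lab code, to skip labels that are missing.

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the dataset.
sample = pd.DataFrame({
    "Identifier": [206, 216],
    "Title": ["Book A", "Book B"],
    "Engraver": [None, None],
    "Shelfmarks": ["shelf-1", "shelf-2"],
})

# 'Former owner' is deliberately absent from this toy frame.
to_drop = ["Engraver", "Shelfmarks", "Former owner"]

# Without inplace=True, drop returns a new DataFrame and leaves `sample` as is.
# errors="ignore" skips labels that are not present instead of raising KeyError.
trimmed = sample.drop(columns=to_drop, errors="ignore")

print(list(trimmed.columns))
print(list(sample.columns))
```

Without errors="ignore", a label that is not in the DataFrame makes drop raise a KeyError, which can be useful for catching a misspelled column name.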
Change the Index of the DataFrame
df.set_index('Identifier', inplace=True)
- df.set_index('Identifier', inplace=True) sets the 'Identifier' column as the new index of the DataFrame, replacing the default integer index.
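One reason to index by 'Identifier' is that rows can then be looked up by label with .loc. The sketch below illustrates this on a toy frame. The values and the names sample and indexed are hypothetical, and the uniqueness check is an extra step not present in the lab code.

```python
import pandas as pd

# Toy frame with made-up identifiers and titles.
sample = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Book A", "Book B", "Book C"],
})

# An index works best for lookups when its values are unique.
print(sample["Identifier"].is_unique)

# set_index moves the column into the index and returns a new DataFrame.
indexed = sample.set_index("Identifier")

# Label-based lookup: the row whose Identifier is 216.
print(indexed.loc[216, "Title"])
```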
Tidy Up Fields in the Data
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)
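- df['Date of Publication'].str.extract(r'^(\d{4})', expand=False) extracts a four-digit year from the start of each 'Date of Publication' value using a regular expression. Values that do not begin with four digits become NaN, and the extracted years are still strings rather than numbers.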
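To see what the pattern ^(\d{4}) does and does not match, here is a minimal sketch on a few hypothetical date strings. The names dates, years, and numeric_years are illustrative, and the pd.to_numeric conversion is an optional extra step that the lab code does not perform.

```python
import pandas as pd

# Hypothetical date strings in the messy styles a catalogue might contain.
dates = pd.Series(["1879 [1878]", "1868", "1839, 38-54", "[1897?]", None])

# ^ anchors the match to the start of the string; (\d{4}) captures four digits.
# expand=False returns a Series instead of a one-column DataFrame.
years = dates.str.extract(r'^(\d{4})', expand=False)
print(years)

# "[1897?]" starts with a bracket, so it does not match and becomes NaN.
# Optional: convert the extracted strings to numbers (NaN stays NaN).
numeric_years = pd.to_numeric(years)
print(numeric_years)
```

Because of the ^ anchor, a value such as "[1897?]" is lost even though it contains a year. Dropping the anchor would capture it, at the cost of matching any four consecutive digits anywhere in the string.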
Combine str Methods with NumPy to Clean Columns
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' '))
- np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' ')) checks whether each 'Place of Publication' value contains 'London'. If it does, the value is replaced with 'London'; otherwise, hyphens in the value are replaced with spaces.
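The sketch below applies the same np.where pattern to a few hypothetical place names, using the illustrative names places, is_london, and cleaned. It assumes missing values may occur and therefore adds na=False and regex=False, two arguments that are not in the lab code. Without na=False, str.contains returns NaN for a missing value, and np.where treats NaN as true.

```python
import numpy as np
import pandas as pd

# Hypothetical place names, including a missing value.
places = pd.Series(["London", "London; Virtue & Yorston", "Newcastle-upon-Tyne", None])

# Boolean mask: True where the text contains 'London'; missing values count as False.
is_london = places.str.contains("London", na=False)

# np.where(condition, value_if_true, value_if_false), evaluated element by element.
# regex=False treats '-' as a literal character rather than a pattern.
cleaned = np.where(is_london, "London", places.str.replace("-", " ", regex=False))

# np.where returns a NumPy array, so wrap it to get a Series with the original index.
cleaned = pd.Series(cleaned, index=places.index)
print(cleaned)
```

Assigning the np.where result straight back to a DataFrame column, as the lab code does, also works because pandas aligns the array by position.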
Display the Cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())
- print(df.head()) displays the first 5 rows of the cleaned DataFrame to show the results after cleaning.
Questions and Answers
What does import pandas as pd do?
- It imports the Pandas library and assigns it the alias pd for easier usage in the code.
What does import numpy as np do?
- It imports the NumPy library and assigns it the alias np for easier usage in the code.
How do we load the dataset into a DataFrame?
- By using pd.read_csv('BL-Flickr-Images-Book.csv').
How do we display the first few rows of the DataFrame?
- By using print(df.head()).
How do we drop irrelevant columns from the DataFrame?
- By using df.drop(columns=irrelevant_columns, inplace=True).
How do we change the index of the DataFrame?
- By using df.set_index('Identifier', inplace=True).
How do we tidy up the 'Date of Publication' field?
- By using df['Date of Publication'].str.extract(r'^(\d{4})', expand=False).
How do we clean the 'Place of Publication' field?
- By using np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' ')).
How do we display the cleaned DataFrame?
- By using print(df.head()).