explanation each code line by line experiment 2 - FarhaKousar1601/DATA-SCIENCE-AND-ITS-APPLICATION-LABORATORY-21AD62- GitHub Wiki

Experiment - 2

Aim

Consider the books dataset BL-Flickr-Images-Book.csv from Kaggle (https://www.kaggle.com/adeyoyintemidayo/publication-of-books) which contains information about books. Write a program to demonstrate the following:

Import the data into a DataFrame.
Find and drop the columns which are irrelevant for the book information.
Change the Index of the DataFrame.
Tidy up fields in the data such as the date of publication with the help of a simple regular expression.
Combine str methods with NumPy to clean columns.

Code Explanation

import pandas as pd
import numpy as np

# Import the data into a DataFrame
df = pd.read_csv('BL-Flickr-Images-Book.csv')

# Display the first few rows of the DataFrame
print("Original DataFrame:")
print(df.head())

# Find and drop the columns which are irrelevant for the book information
irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)

# Change the Index of the DataFrame
df.set_index('Identifier', inplace=True)

# Tidy up fields in the data such as date of publication with the help of simple regular expression
df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

# Combine str methods with NumPy to clean columns
df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' '))

# Display the cleaned DataFrame
print("\nCleaned DataFrame:")
print(df.head())

Explanation of Each Step

Importing the Required Libraries

import pandas as pd
import numpy as np

Pandas (pd) is a data manipulation and analysis library for Python; it provides the DataFrame used throughout this experiment.
NumPy (np) is a fundamental package for scientific computing in Python, providing support for large, multi-dimensional arrays and matrices; here it supplies np.where for conditional cleaning.

Import the Data into a DataFrame

df = pd.read_csv('BL-Flickr-Images-Book.csv')

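pd.read_csv() reads a CSV file and loads its contents into a DataFrame, using the first row as the column names by default.
'BL-Flickr-Images-Book.csv' is the filename of the dataset; because it is given as a relative path, the file must be in the current working directory.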
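
To try read_csv without downloading the dataset, a minimal sketch can read from an in-memory buffer instead of a file. The two rows and the name sample_csv below are hypothetical stand-ins, not actual rows from BL-Flickr-Images-Book.csv.

```python
import io
import pandas as pd

# Hypothetical two-row sample standing in for the real CSV file.
sample_csv = io.StringIO(
    "Identifier,Place of Publication,Date of Publication,Title\n"
    "206,London,1879 [1878],Book A\n"
    "216,London; Virtue & Yorston,1868,Book B\n"
)

# read_csv accepts a path or any file-like object; the first row becomes the header.
sample_df = pd.read_csv(sample_csv)

print(sample_df.shape)
print(sample_df.columns.tolist())
```

Replacing sample_csv with the path to the downloaded file gives the same call used in the experiment.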

Display the First Few Rows of the DataFrame

print("Original DataFrame:")
print(df.head())

print(df.head()) displays the first 5 rows of the DataFrame to give an overview of the data.
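
As a small illustration of how head() chooses rows, the sketch below uses a hypothetical seven-row frame named demo. It only shows that the row count can be passed explicitly when five rows are too many or too few.

```python
import pandas as pd

# Hypothetical frame with seven rows, used only to show how head() slices.
demo = pd.DataFrame({"Identifier": range(1, 8), "Title": list("ABCDEFG")})

first_default = demo.head()   # no argument: the first 5 rows
first_two = demo.head(2)      # explicit n: the first 2 rows

print(len(first_default), len(first_two))
```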

Find and Drop Irrelevant Columns

irrelevant_columns = ['Edition Statement', 'Corporate Author', 'Corporate Contributors', 'Former owner', 'Engraver', 'Contributors', 'Issuance type', 'Shelfmarks']
df.drop(columns=irrelevant_columns, inplace=True)

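irrelevant_columns is a list of column names that are not relevant for the book information.
df.drop(columns=irrelevant_columns, inplace=True) removes these columns from the DataFrame. With inplace=True, df itself is modified and the call returns None; a KeyError is raised if any listed name is not a column of the DataFrame.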
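
The experiment lists the irrelevant columns by hand. One possible way to help find such columns is to look at how much of each column is missing, sketched below. The miniature frame, the engraver name, and the 0.5 cutoff are assumptions made for illustration, not part of the original program.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame; only the column names echo the real dataset.
books = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Book A", "Book B", "Book C"],
    "Edition Statement": [np.nan, np.nan, np.nan],
    "Engraver": [np.nan, np.nan, "Example Engraver"],
})

# Fraction of missing values per column, highest first.
missing_share = books.isna().mean().sort_values(ascending=False)
print(missing_share)

# Assumed cutoff for this sketch: flag columns that are more than half empty.
candidates = missing_share[missing_share > 0.5].index.tolist()

# Without inplace=True, drop() returns a new DataFrame and leaves `books` untouched.
trimmed = books.drop(columns=candidates)
print(trimmed.columns.tolist())
```

A high share of missing values is only a hint; whether a column matters for the book information still has to be decided by reading what it contains.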

Change the Index of the DataFrame

df.set_index('Identifier', inplace=True)

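df.set_index('Identifier', inplace=True) sets the 'Identifier' column as the new index of the DataFrame, replacing the default integer index. The column is moved out of the regular columns, so rows can then be selected by identifier with df.loc[...].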
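
A short sketch of what the new index allows, using a hypothetical three-row frame. It assumes the identifiers are unique, which is worth checking before relying on label lookups.

```python
import pandas as pd

# Hypothetical rows; the identifiers and titles are placeholders.
books = pd.DataFrame({
    "Identifier": [206, 216, 218],
    "Title": ["Book A", "Book B", "Book C"],
})

# Check uniqueness first so that a label lookup returns a single row.
print(books["Identifier"].is_unique)

# Without inplace=True, set_index returns a new frame and `books` is unchanged.
indexed = books.set_index("Identifier")

# .loc selects by index label, here the identifier 216 ...
print(indexed.loc[216, "Title"])
# ... while .iloc still selects by position.
print(indexed.iloc[0]["Title"])
```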

Tidy Up Fields in the Data

df['Date of Publication'] = df['Date of Publication'].str.extract(r'^(\d{4})', expand=False)

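df['Date of Publication'].str.extract(r'^(\d{4})', expand=False) uses a regular expression to capture a four-digit year at the start of each 'Date of Publication' value. ^ anchors the match to the beginning of the string, \d{4} matches exactly four digits, and the parentheses mark the part to keep. Values that do not begin with four digits become NaN, and the extracted years are still strings rather than numbers. expand=False returns a Series instead of a one-column DataFrame.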
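
To see which values the pattern keeps and which it discards, the sketch below applies the same regular expression to a handful of hypothetical date strings. The conversion with pd.to_numeric is an optional follow-up that the original program does not perform.

```python
import pandas as pd

# Hypothetical date strings in the messy styles a catalogue field might contain.
dates = pd.Series(["1879 [1878]", "1868", "1839, 38-54", "[1897?]", None])

# ^ anchors at the start of the string; (\d{4}) captures exactly four digits.
years = dates.str.extract(r"^(\d{4})", expand=False)
print(years.tolist())

# Optional follow-up: turn the captured strings into numbers; missing stays missing.
years_numeric = pd.to_numeric(years)
print(years_numeric.tolist())
```

The bracketed value is lost because it starts with "[" rather than a digit. Dropping the ^ anchor would recover it, at the cost of also matching four-digit runs that are not years.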

Combine `str` Methods with NumPy to Clean Columns

df['Place of Publication'] = np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' '))

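np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' ')) checks, row by row, whether 'Place of Publication' contains the substring 'London'. Where it does, the value becomes 'London'; otherwise the original value is kept with its hyphens replaced by spaces. np.where returns a NumPy array, which is then assigned back to the column.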
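
The sketch below runs the same idea on a few hypothetical place names, including a missing one. The na=False and regex=False arguments are additions that the original line does not use. They are included on the assumption that a column may contain missing values, and to make the hyphen replacement literal regardless of pandas version.

```python
import numpy as np
import pandas as pd

# Hypothetical place strings, including one missing value.
places = pd.Series(["London", "London; Virtue & Yorston", "Newcastle-upon-Tyne", None])

# na=False makes missing entries count as "does not contain London".
is_london = places.str.contains("London", na=False)

# regex=False treats "-" as a literal character rather than a pattern.
cleaned = np.where(is_london, "London", places.str.replace("-", " ", regex=False))
print(cleaned.tolist())
```

Without na=False, a missing value can come through the mask as NaN, which np.where treats as true, so the row would be labelled 'London' by mistake.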

Display the Cleaned DataFrame

print("\nCleaned DataFrame:")
print(df.head())

print(df.head()) displays the first 5 rows of the cleaned DataFrame to show the results after cleaning.

Questions and Answers

What does `import pandas as pd` do?

It imports the Pandas library and assigns it the alias pd for easier usage in the code.

What does `import numpy as np` do?

It imports the NumPy library and assigns it the alias np for easier usage in the code.

How do we load the dataset into a DataFrame?

By using pd.read_csv('BL-Flickr-Images-Book.csv').

How do we display the first few rows of the DataFrame?

By using print(df.head()).

How do we drop irrelevant columns from the DataFrame?

By using df.drop(columns=irrelevant_columns, inplace=True).

How do we change the index of the DataFrame?

By using df.set_index('Identifier', inplace=True).

How do we tidy up the 'Date of Publication' field?

By using df['Date of Publication'].str.extract(r'^(\d{4})', expand=False).

How do we clean the 'Place of Publication' field?

By using np.where(df['Place of Publication'].str.contains('London'), 'London', df['Place of Publication'].str.replace('-', ' ')).

How do we display the cleaned DataFrame?

By using print(df.head()).