
Lab Assignment 1

Submitted by: Pavankumar Manchala, Class ID: 16, Team 5

Objectives:

  1. Download the dataset related to the project theme from the list of Project Datasets in the Google Spreadsheet.
  2. Perform NLP operations (tokenization and lemmatization) on the data.
  3. Report image statistics from the caption dataset extracted from the downloaded dataset.
  4. Apply the SIFT algorithm to the extracted images to extract features.

Platform used:

PyCharm IDE

Packages installed:

  1. nltk
  2. numpy
  3. matplotlib
  4. opencv-python
  5. tensorflow

Dataset:

Several datasets are available online; the one used here is the SBU Image Captions dataset, which consists of image URLs and a caption for each image. The image URLs and captions are extracted into separate .txt files.
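A minimal sketch of pairing each image URL with its caption is shown below; the file names follow the public SBU release and are assumptions, so they may differ from the ones actually used here.

```python
# Sketch: pair each SBU image URL with its caption line by line.
# File names follow the public SBU release and are assumptions here.
url_file = "SBU_captioned_photo_dataset_urls.txt"
caption_file = "SBU_captioned_photo_dataset_captions.txt"

with open(url_file, encoding="utf-8") as fu, open(caption_file, encoding="utf-8") as fc:
    urls = [line.strip() for line in fu]
    captions = [line.strip() for line in fc]

# The URL and caption on the same line number describe the same image.
pairs = list(zip(urls, captions))
print(len(pairs), "image/caption pairs loaded")
```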

NLP operations:

1. Tokenization:

    Tokenization is performed on the caption data to split each image caption into sentences or words. Sentence splitting is done with NLTK's sent_tokenize and word splitting with word_tokenize. We use word_tokenize to extract the keywords that identify captions related to the project theme; the matching captions are written to a new text file using the linecache module, and the image URLs corresponding to those captions are saved to a separate file.
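The sketch below illustrates this keyword-based filtering with word_tokenize and linecache; the keyword list and file names are illustrative placeholders, not the exact ones used in the lab.

```python
# Sketch of keyword-based caption filtering with NLTK word tokenization.
# Keywords and file names below are illustrative placeholders.
import linecache

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)

keywords = {"dog", "car", "beach"}        # hypothetical project keywords
caption_file = "captions.txt"             # one caption per line
url_file = "urls.txt"                     # matching image URL per line

with open(caption_file, encoding="utf-8") as caps, \
     open("filtered_captions.txt", "w", encoding="utf-8") as out_cap, \
     open("filtered_urls.txt", "w", encoding="utf-8") as out_url:
    for i, caption in enumerate(caps, start=1):
        tokens = {t.lower() for t in word_tokenize(caption)}
        if tokens & keywords:
            out_cap.write(caption)
            # linecache fetches the URL on the same line number as the caption
            out_url.write(linecache.getline(url_file, i))
```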

2. Lemmatization:

    Lemmatization is performed to reduce each keyword to its root form (lemma), which helps in pre-processing the data. Stop words are removed after lemmatization.
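A short sketch of lemmatization with stop-word removal using NLTK's WordNetLemmatizer is given below; the sample caption is only illustrative.

```python
# Sketch: lemmatize caption tokens and remove English stop words.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("stopwords", quiet=True)

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))

caption = "Two dogs were running along the sandy beaches"   # illustrative caption
lemmas = [lemmatizer.lemmatize(t.lower()) for t in word_tokenize(caption)]
content_words = [w for w in lemmas if w.isalpha() and w not in stop_words]
print(content_words)   # e.g. 'dogs' -> 'dog', 'beaches' -> 'beach'; stop words dropped
```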

A word count is maintained to track keyword frequencies, which are then used for plotting.
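A rough sketch of the counting and plotting step is shown below, assuming the content words from the previous step have been collected into a single list (the values here are illustrative).

```python
# Sketch: count keyword frequencies across captions and plot them.
from collections import Counter

import matplotlib.pyplot as plt

# `all_content_words` stands for the lemmatized, stop-word-free tokens
# collected from every filtered caption (illustrative values here).
all_content_words = ["dog", "beach", "dog", "car", "beach", "dog"]

counts = Counter(all_content_words)
words, freqs = zip(*counts.most_common(10))

plt.bar(words, freqs)
plt.xlabel("Keyword")
plt.ylabel("Count")
plt.title("Keyword frequency in filtered captions")
plt.show()
```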

SIFT Algorithm:

The SIFT (Scale-Invariant Feature Transform) algorithm is used to extract keypoints and descriptors from the images. The output of applying SIFT to the given input images is shown below.
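The sketch below shows a typical way to run SIFT with opencv-python; the input and output file names are placeholders. Note that cv2.SIFT_create() requires opencv-python 4.4 or later, while older builds expose it as cv2.xfeatures2d.SIFT_create() in opencv-contrib-python.

```python
# Sketch: detect and draw SIFT keypoints with OpenCV.
# cv2.SIFT_create() needs opencv-python >= 4.4; older builds use
# cv2.xfeatures2d.SIFT_create() from opencv-contrib-python instead.
import cv2

image = cv2.imread("input.jpg")                      # placeholder image path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

sift = cv2.SIFT_create()
keypoints, descriptors = sift.detectAndCompute(gray, None)
print(f"{len(keypoints)} keypoints, descriptor shape {descriptors.shape}")

# Draw rich keypoints (size and orientation) on top of the grayscale image
output = cv2.drawKeypoints(
    gray, keypoints, None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)
cv2.imwrite("sift_output.jpg", output)
```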

Input Images:

Output Images: