LAB ASSIGNMENT 3 - Sreelakshmi-N/CS5560SreelakshmiLabAssignment GitHub Wiki

CS5560 Knowledge Discovery Management

Name: Sreelakshmi Nandanamudi

Classid: 17

Tf-idf

Tf-idf stands for term frequency-inverse document frequency. The tf-idf weight is a statistical measure, widely used in information retrieval and text mining, of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times the word appears in the document but is offset by how frequently the word occurs across the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
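The weighting described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the formula of any particular library: it assumes term frequency is the raw count divided by document length and inverse document frequency is log(N / df), which is only one of several common variants (many libraries apply smoothing). The function name tf_idf and the toy corpus are made up for this example.

```python
import math
from collections import Counter

def tf_idf(corpus):
    """corpus: list of token lists. Returns one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(corpus)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in corpus for term in set(doc))
    weights = []
    for doc in corpus:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical toy corpus of already-tokenized documents.
corpus = [
    ["go", "sample", "text"],
    ["run", "good", "text"],
    ["computer", "science", "text"],
]
weights = tf_idf(corpus)
print(weights[0])
```

Because "text" occurs in every document, its idf is log(3 / 3) = 0 and its weight vanishes, while "go" occurs in only one document and keeps a positive weight. This is the "offset by the frequency of the word in the corpus" behaviour in action.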

N-Gram

An n-gram is a sequence of n tokens (typically words) for some integer n.

• The NGram class can be used to transform input features into n-grams.

• NGram takes as input a sequence of strings (e.g. the output of a Tokenizer).

• The output will consist of a sequence of n-grams, where each n-gram is represented by a space-delimited string of n consecutive words.

Example:

[input, output] for n = 2:

[(go, sample, text),(go sample, sample text)]

[(run, good, health),(run good, good health)]

[(computer, science, study),(computer science, science study)]
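The three examples above can be reproduced with a short plain-Python sketch. It imitates the behaviour described for NGram (space-delimited strings of n consecutive tokens) but does not use the NGram class itself; the helper name ngrams is hypothetical.

```python
def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens as one space-delimited string."""
    # When there are fewer than n tokens the range is empty, so the result is [].
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

examples = [
    ["go", "sample", "text"],
    ["run", "good", "health"],
    ["computer", "science", "study"],
]
for tokens in examples:
    print(tokens, "->", ngrams(tokens, n=2))
```

A sequence of k tokens yields k - n + 1 n-grams, so each three-word input produces two bigrams.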

In Class Question

Generate the output (the changes or transformations in the data) manually when the following tasks are applied to the input text. Show your output in detail.