NLP Notes - doraithodla/notes GitHub Wiki

Source: NLTK Cookbook

These POS tags will be referenced more in the Using WordNet for tagging recipe in Chapter 4, Part-of-speech Tagging.

Wup-Similarity

The wup_similarity method is short for Wu-Palmer Similarity, which is a scoring method based on how similar the word senses are and where the Synsets occur relative to each other in the hypernym tree. One of the core metrics used to calculate similarity is the shortest path distance between the two Synsets and their common hypernym.
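To make the idea concrete, here is a minimal sketch of the Wu-Palmer score on a toy hypernym tree. It uses the usual formulation 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest hypernym the two nodes share. The tree, the node names, and the helper functions are all made up for illustration; this is not NLTK's implementation, and real WordNet synsets can have several hypernym paths, which this sketch ignores.

```python
# Toy hypernym tree: each node maps to its parent (hypernym); the root maps to None.
# The node names are illustrative only and are not taken from WordNet.
HYPERNYM = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "artifact": "entity",
    "car": "artifact",
}


def path_to_root(node):
    """Return the nodes from `node` up to the root, inclusive."""
    path = []
    while node is not None:
        path.append(node)
        node = HYPERNYM[node]
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_hypernym(a, b):
    """Deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward from b, so the first hit is the deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes do not share a root")


def wup_similarity(a, b):
    """Wu-Palmer score: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_hypernym(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup_similarity("dog", "cat"))  # share "animal", so relatively close
print(wup_similarity("dog", "car"))  # share only the root "entity", so further apart
```

With NLTK and the WordNet corpus installed, the equivalent call is the wup_similarity method on a Synset, passing the other Synset as the argument.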

Embeddings (https://supabase.com/blog/openai-embeddings-postgres-vector)

Embeddings capture the “relatedness” of text, images, video, or other types of information. This relatedness is most commonly used for:

  • Search: how similar is a search term to a body of text?

  • Recommendations: how similar are two products?

  • Classifications: how do we categorize a body of text?

  • Clustering: how do we identify trends?
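A rough sketch of the search use case: rank documents by how close their embedding is to the query embedding. Cosine similarity is assumed here as the relatedness measure (one common choice, not the only one), and the three-dimensional vectors and document names are invented for illustration; a real embedding model returns vectors with hundreds or thousands of dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Hypothetical embeddings, hand-written for this example.
documents = {
    "postgres indexing guide": [0.9, 0.1, 0.0],
    "vector search tutorial": [0.7, 0.6, 0.1],
    "banana bread recipe": [0.0, 0.1, 0.9],
}
query = [0.8, 0.5, 0.0]  # hypothetical embedding of the search term


def rank(query_vec, docs):
    """Return document names ordered from most to least similar to the query."""
    return sorted(
        docs,
        key=lambda name: cosine_similarity(query_vec, docs[name]),
        reverse=True,
    )


ranking = rank(query, documents)
print(ranking)
```

The same scoring function covers the recommendation case (compare two product vectors) and can feed a clustering or classification step.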

Vectorization produces high-dimensional data, which is hard to explore and visualize. A technique that reduces the number of variables in a data set while preserving as much information as possible is called Principal Component Analysis (PCA). https://statisticsglobe.com/principal-component-analysis-pca
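A minimal sketch of the PCA idea in plain Python, reduced to finding only the first principal component: center the data, build the covariance matrix, and use power iteration to find the direction of greatest variance. The function names, the sample points, and the choice of power iteration are assumptions made for this example; in practice a library routine based on SVD would normally be used, and the data must not be constant or the normalisation step divides by zero.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (mean, unit direction) for the first principal component of `rows`."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    # Power iteration: repeatedly multiply by the covariance matrix and renormalise.
    # Starting from the all-ones vector is an arbitrary choice for this sketch.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return mean, v


def project(rows, mean, direction):
    """Reduce each row to a single number: its coordinate along `direction`."""
    return [
        sum((r[j] - mean[j]) * direction[j] for j in range(len(mean)))
        for r in rows
    ]


# Three variables per point, but all the variation lies along one line.
points = [[1.0, 2.0, 0.0], [2.0, 4.0, 0.0], [3.0, 6.0, 0.0], [4.0, 8.0, 0.0]]
mean, direction = first_principal_component(points)
scores = project(points, mean, direction)
print(direction)
print(scores)
```

Here three variables collapse to one score per point without losing information, because the sample points were chosen to lie exactly on a line; real embeddings lose some variance when reduced to two or three components for plotting.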