Short review of VSM based systems from STS'12 - STS-NTNU/STS13 GitHub Wiki
Proceedings: http://ixa2.si.ehu.es/starsem/proc/program.semeval.html
BUAP: A First Approximation to Relational Similarity Measuring
(STARSEM-SEMEVAL071.pdf)
Task #: 2 - Measuring Degrees of Relational Similarity (not STS!).
Rank: Low, below the gold standard (...).
Method(s) for building the VSM(s): For each word pair, contextual features are extracted from sentences containing both words, found using Google search (100 samples). Many features are extracted (NE, distance, order, etc.). WordNet is also used to extract features for word pairs (meronymy, hyponymy, hypernymy, and more). In total 42 features are extracted, then reduced to 25.
Other similarity measures:
Vector similarity measure: Cosine similarity function, between the test feature vectors and the average semantic prototype vector for each class.
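The classification step can be sketched as follows; the feature extraction itself is omitted, and averaging each class's training vectors into a prototype is an assumption about the exact prototype construction:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    if na == 0 or nb == 0:
        return 0.0
    return float(np.dot(a, b) / (na * nb))

def classify(test_vec, class_examples):
    """Assign the class whose average semantic prototype vector is
    most similar (by cosine) to the test feature vector.
    class_examples maps a class label to its training feature vectors."""
    prototypes = {label: np.mean(vecs, axis=0)
                  for label, vecs in class_examples.items()}
    return max(prototypes, key=lambda c: cosine(test_vec, prototypes[c]))
```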
LIMSI: Learning Semantic Similarity by Selecting Random Word Subsets
(STARSEM-SEMEVAL078.pdf)
Task #: 6, STS.
Rank: 22 (of 89).
Method(s) for building the VSM(s):
Random Indexing – word context vectors through sliding window.
Using JavaSDM.
Window size = 4+4.
Constant weighting scheme.
Dimensionality = 100; a subset of 10 dimensions is then used for the similarity measure.
Random degree = 10.
Sentence vectors (s) are built from word (w) context vectors by summing them into a sentence vector s, then dividing by its length |s| (normalization).
Sentence pairs (s1, s2) are combined into a single feature vector x, which in turn is used as input to the feature selection function H.
x was created using different combination methods:
sumdiff, concat, product, crossprod, crossdiff, absdiff.
Two different algorithms (RankBoost and RtRank, tested separately) were used in the H function to include only the subset of features from the combined sentence vectors that were of importance to a given sentence comparison.
Text resources for training VSM: GigaWord English corpus.
Preprocessing: Stripping tags. Removing punctuation. Lowercasing.
Other similarity measures: No
Vector similarity measure: Cosine similarity function, limiting to the features selected by H.
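The sentence-vector construction and pair-combination steps can be sketched as below; the exact definitions of the combination methods are assumptions based on their names (crossprod and crossdiff are omitted):

```python
import numpy as np

def sentence_vector(words, context_vectors):
    """Sum the word context vectors into a sentence vector s,
    then divide by its length |s|."""
    s = np.sum([context_vectors[w] for w in words if w in context_vectors], axis=0)
    norm = np.linalg.norm(s)
    return s / norm if norm > 0 else s

def combine(s1, s2, method):
    """Combine a sentence pair into a single feature vector x."""
    if method == "sumdiff":
        return np.concatenate([s1 + s2, s1 - s2])
    if method == "concat":
        return np.concatenate([s1, s2])
    if method == "product":
        return s1 * s2
    if method == "absdiff":
        return np.abs(s1 - s2)
    raise ValueError(f"unknown combination method: {method}")
```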
sranjans : Semantic Textual Similarity using Maximal Weighted Bipartite Graph Matching
(STARSEM-SEMEVAL085.pdf)
Task #: 6, STS.
Rank (best of their three systems): 24 with the All rank (of 89); 8 with the Weighted Mean rank (of 89).
Concept: Find the maximal weighted bipartite match between the tokens (terms) of two sentences.
Method(s) for building the VSM(s): DISCO tool; term-term matrix --> second-order co-occurrence ("LSI-like behaviour").
Text resources for training VSM: DISCO tool with a pre-trained word space built from English Wikipedia.
Preprocessing: NER (entities, times, dates, monetary, etc.). Stopword removal using the list from the NLTK Toolkit. Lemmatization using Stanford CoreNLP Toolkit.
Other similarity measures: WordNet + VSM – edge weight between every token/term pair in the two sentences. WordNet similarity measure: Lin word-sense similarity measure. The maximal weighted bipartite match is found for the bipartite graph using the Hungarian Algorithm (Kuhn, 1955), with normalization: sim(s1, s2) = MaximalBipartiteMatchSum(s1, s2) / max(|tokens(s1)|, |tokens(s2)|)
Vector similarity measure: Cosine similarity function.
Other notes: WordNet performs poorly on cross-PoS similarity matching.
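The matching-and-normalization step can be sketched as follows. The paper uses the Hungarian algorithm (Kuhn, 1955); the brute-force search over assignments below is equivalent for short sentences, just slower, and `word_sim` stands in for whatever token-level similarity (Lin, DISCO cosine) is plugged in:

```python
from itertools import permutations

def sts_score(tokens1, tokens2, word_sim):
    """Maximal weighted bipartite match between the tokens of two
    sentences, normalized by the longer sentence:
        sim(s1, s2) = MaximalBipartiteMatchSum(s1, s2) / max(|s1|, |s2|)
    word_sim(t1, t2) gives the edge weight between two tokens."""
    if len(tokens1) > len(tokens2):  # keep tokens1 as the shorter side
        tokens1, tokens2 = tokens2, tokens1
    match_sum = max(
        sum(word_sim(a, b) for a, b in zip(tokens1, perm))
        for perm in permutations(tokens2, len(tokens1))
    )
    return match_sum / max(len(tokens1), len(tokens2))
```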
Weiwei: A Simple Unsupervised Latent Semantic based Approach for Sentence Similarity
(STARSEM-SEMEVAL086.pdf)
Task #: 6, STS.
Rank: 20 (of 89).
Concept: Using BOW – exploits the fact that the words missing from a sentence tell something about what the sentence is not about.
Method(s) for building the VSM(s): Weighted matrix factorization approach (Srebro and Jaakkola, 2003), extended into a novel model: Weighted Textual Matrix Factorization (WTMF). TF-IDF is applied to the word-sentence matrix; the TF-IDF value estimates the importance of a word in a sentence. Sentence vectors for the task-related sentences are created using a specific transformation and weighting equation.
Text resources for training VSM: WordNet. Wiktionary. Brown corpus.
Preprocessing: Tokenization. PoS tagging. Lemmatization. Then the lemma that is most frequent according to WordNet::Data is used.
Other similarity measures:
Vector similarity measure: Cosine similarity function.
Other notes: Dataset for sentence similarity evaluation: LI06 (Li et al., 2006).
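The missing-word intuition corresponds to a weight matrix over the word-sentence TF-IDF matrix: observed cells get full weight, while missing-word cells get a small but non-zero weight instead of being ignored. A minimal sketch (the value of w_m is an assumption):

```python
import numpy as np

W_M = 0.01  # small weight for missing words (value is an assumption)

def wtmf_weights(X):
    """Weight matrix over the word-sentence TF-IDF matrix X:
    observed (non-zero) cells get weight 1; missing-word cells get
    the small weight W_M rather than being dropped, so that what a
    sentence is NOT about also constrains its latent vector."""
    return np.where(X != 0, 1.0, W_M)
```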
UNIBA: Distributional Semantics for Textual Similarity
(STARSEM-SEMEVAL087.pdf)
Task #: 6, STS.
Rank: 41 (RI only) (of 89).
Concept: Comparing RI, RI + permutations (RP) and LSA. Sentence vectors are built by summing word context vectors.
Method(s) for building the VSM(s):
RI.
RI + RP, using RP to find words that a target word is related to by inverting the shifting operation. This applies only to words defined as dependent words (output from dependency parsing)?
LSA.
RI package: built on top of the Semantic Vectors package (Java code).
LSA package: SVDLIBC (C code).
Parameters: dimensionality = 500; non-zeros = 10 (5 +1's and 5 -1's).
Sentence vectors are created by summing word context vectors.
Text resources for training VSM: WaCkypedia_EN corpus (http://wacky.sslmit.unibo.it/doku.php?id=corpora)
Preprocessing: Dependency parsing.
Other similarity measures:
Vector similarity measure: Cosine similarity function on sentence vectors, multiplied by 5.
Other notes:
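A minimal sketch of the RI setup with the stated parameters (dimensionality 500, 10 non-zeros as 5 +1's and 5 -1's), summing word context vectors into sentence vectors and scaling cosine to the 0-5 STS range. The training step that accumulates co-occurring words' index vectors into context vectors is omitted:

```python
import numpy as np

DIM = 500      # dimensionality
NONZERO = 10   # 5 +1's and 5 -1's
rng = np.random.default_rng(0)

def index_vector():
    """Sparse random index vector with 5 entries +1 and 5 entries -1."""
    v = np.zeros(DIM)
    pos = rng.choice(DIM, size=NONZERO, replace=False)
    v[pos[:NONZERO // 2]] = 1.0
    v[pos[NONZERO // 2:]] = -1.0
    return v

def sentence_vector(words, context_vectors):
    """Sum the context vectors of the sentence's words."""
    return np.sum([context_vectors[w] for w in words if w in context_vectors], axis=0)

def sts_score(v1, v2):
    """Cosine similarity, multiplied by 5 for the 0-5 STS scale."""
    return 5.0 * float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
```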
Saarland: Vector-based models of semantic textual similarity
(STARSEM-SEMEVAL089.pdf)
Task #: 6, STS.
Rank: 43 (of 89).
Concept: Using a combination of a few (two?) VSMs.
Method(s) for building the VSM(s): BoW – word-word matrix (window of 5+5). Thater et al. (2011) (slightly simplified), limiting the word-word vector to words that are deemed dependent (from a dependency parser). Word context vectors are summed into sentence vectors, separately for both models. The two sentence vectors for each sentence are aligned, and the similarity between two sentences is computed for each vector set separately. Then least-squares regression.
Text resources for training VSM: GigaWord corpus.
Preprocessing:
Other similarity measures:
Vector similarity measure:
Other notes:
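The final combination step, fitting a least-squares regression over the two models' per-pair similarities, can be sketched as follows (the exact feature set and the intercept are assumptions):

```python
import numpy as np

def fit_combiner(sims_a, sims_b, gold):
    """Least-squares regression mapping the two per-model cosine
    similarities to the gold STS score (two weights + intercept)."""
    X = np.column_stack([sims_a, sims_b, np.ones(len(gold))])
    w, *_ = np.linalg.lstsq(X, np.asarray(gold), rcond=None)
    return w

def predict(w, sim_a, sim_b):
    """Final similarity score from the two models' similarities."""
    return w[0] * sim_a + w[1] * sim_b + w[2]
```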
BUAP: Three Approaches for Semantic Textual Similarity
(STARSEM-SEMEVAL093.pdf)
Task #: 6, STS.
Rank: 25 (RI) (of 89).
Concept: Three different systems: (1) Jaccard coefficient with term expansion through synonyms (captured from a dictionary); (2) the semantic similarity method of Mihalcea (Mihalcea et al., 2006) (using NLTK); (3) Random Indexing (bag of concepts --> document index vectors).
Method(s) for building the VSM(s): Random Indexing. Parameters: dimensionality = 2048; non-zeros: 10 +1's and 10 -1's; window size: no sliding window used, only document index vectors. Sentence/document vectors are created by summing term/word context vectors multiplied by their TF*IDF weight.
Text resources for training VSM: Only the STS test dataset...(?)
Preprocessing: Stemming (Porter).
Other similarity measures:
Vector similarity measure: Cosine similarity value * 5
Other notes: Dictionary used to find synonyms for words.
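The TF*IDF-weighted summation of term index vectors can be sketched as below; the exact TF-IDF variant is an assumption:

```python
import math
from collections import Counter

import numpy as np

def weighted_sentence_vector(tokens, index_vectors, doc_freq, n_docs):
    """Sentence/document vector: sum of term index vectors, each
    multiplied by the term's TF*IDF weight."""
    tf = Counter(tokens)
    dim = len(next(iter(index_vectors.values())))
    s = np.zeros(dim)
    for term, freq in tf.items():
        if term in index_vectors:
            idf = math.log(n_docs / (1 + doc_freq.get(term, 0)))
            s += freq * idf * index_vectors[term]
    return s
```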
UNT: A Supervised Synergistic Approach to Semantic Text Similarity
(STARSEM-SEMEVAL094.pdf)
Task #: 6, STS.
Rank: (three systems) 5, 9 and 14 (of 89).
Concept: Dependency graph, enriched with 64 features, of which 32 are based on BoW semantic similarity. Features are assigned to dependency subgraphs, and similarity is calculated between each node in the two graphs representing the two sentences.
Method(s) for building the VSM(s): LSA
Text resources for training VSM: Wikipedia (from October 2008).
Preprocessing: Stopword removal.
Other similarity measures: See Table 1 for an overview of the scoring by individual features. LSA scored best of all features on the MSRvid data set.
Vector similarity measure: Cosine similarity function. Alignment: see the subsubsubsection "Best Alignment Strategy". For non-identical word pairs across the two sentences, the single strongest semantic similarity is calculated (a word can belong to only one cross-sentence word pair). The final score is the sum of identical words plus the semantic similarity scores of each word pair.
Other notes: A support vector regression with a Pearson VII function-based kernel was used on all the different features/similarities produced by the system. The authors find it interesting that data-driven methods maintain strong results across all training datasets compared to knowledge-based methods.
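The "Best Alignment Strategy" scoring can be sketched as follows; the greedy strongest-pair-first selection is an assumption about the paper's exact procedure:

```python
def align_score(tokens1, tokens2, word_sim):
    """Sum of identical words plus the semantic similarity scores of
    the strongest non-identical word pairs, with each word belonging
    to at most one cross-sentence pair."""
    t1, t2 = list(tokens1), list(tokens2)
    score = 0.0
    for w in list(t1):             # identical words first
        if w in t2:
            score += 1.0
            t1.remove(w)
            t2.remove(w)
    # then the strongest non-identical pairs, one pair per word
    pairs = sorted(((word_sim(a, b), a, b) for a in t1 for b in t2), reverse=True)
    used1, used2 = set(), set()
    for s, a, b in pairs:
        if a not in used1 and b not in used2:
            score += s
            used1.add(a)
            used2.add(b)
    return score
```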
University_Of_Sheffield: Two Approaches to Semantic Text Similarity
(STARSEM-SEMEVAL097.pdf)
Task #: 6, STS.
Rank: (three systems) 17, 34 and 48 (of 89).
Concept: One approach combines VSM + WordNet. The second approach uses supervised ML on n-grams + WordNet.
Method(s) for building the VSM(s): Binary vectors are created for each sentence. No higher order co-occurrence information was extracted/used.
Text resources for training VSM:
Preprocessing: (NLTK package) Tokenization. Stopword removal. Lemmatization.
Other similarity measures: Also explores additional similarity measures that are combined with the cosine similarity between the boolean vectors. One feature used is the corpus label of the test data (is this relevant when looking at the evaluation data?).
Vector similarity measure: Cosine similarity * 5.
Other notes: N-gram overlap performed better than the standard VSM model used.
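Since only term presence matters, the cosine between the binary sentence vectors reduces to a set computation; a minimal sketch:

```python
import math

def binary_cosine(tokens1, tokens2):
    """Cosine between the binary term-presence vectors of two
    sentences; no higher-order co-occurrence information is used."""
    s1, s2 = set(tokens1), set(tokens2)
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / math.sqrt(len(s1) * len(s2))

def sts_score(tokens1, tokens2):
    """Cosine similarity * 5 for the 0-5 STS scale."""
    return 5.0 * binary_cosine(tokens1, tokens2)
```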
Penn: Using Word Similarities to better Estimate Sentence Similarity
(STARSEM-SEMEVAL0101.pdf)
Task #: 6, STS.
Rank: ? (of 89).
Concept: Explores the contribution of three different vector models: Collobert and Weston embeddings – Neural models of word representation. Eigenwords. Selectors.
Method(s) for building the VSM(s): Their models perform operations seemingly very similar to those used in (1) LSA (SVD), (2) a term-term matrix, e.g. as input to LSA, and (3) something that looks very similar to what one obtains in a VSM trained using a sliding window (e.g. RI). ...
To be continued ...
Text resources for training VSM:
Preprocessing:
Other similarity measures:
Vector similarity measure:
Other notes: