Latent Semantic Analysis
The objective of this approach is to identify whether two documents are similar based on their contents. Two similar documents can be two articles treating the same subject, or a clumsy student's presentation and a Wikipedia page.
What is a Magic card?
In this project, the documents are Magic cards. A card carries several pieces of information:
- Title
- Cost
- Types
- Extension
- Effects in game
- Descriptive text
- Attack / Defense score
- Artist
Which information to use?
Not all the information on a card is interesting. To recommend cards for building a deck, the descriptive text and the artist may not be relevant.
To build a deck, a user is more interested, for example, in knowing whether the recommended cards have similar effects, the same types, or an identical cost as the cards he has already selected.
Title
The title may seem like interesting information to work with. However, card titles need to sound fantastic, heroic, ... A piece of a title can sometimes refer to a family of cards, but most of the time it does not. Finally, to avoid introducing more text carrying misleading information, the title is not used in this approach.
Types / Subtypes
The types and subtypes of a card are categories. These categories can have special rules applied to them or can group a set of cards together. These types are usually referenced in game effects to apply common behaviors.
This data is well codified in Magic. It brings a lot of information, and two cards sharing some common types will have a higher similarity score.
Extension
To avoid grouping too many cards from the same extension together, the extension is not taken into account.
Cost
The cost is defined by two pieces of data: the mana color and the quantity of mana to pay. This information is pretty well codified. There may be some subtleties, such as 'X' meaning the player can pay an arbitrary amount.
Cost is important information to integrate. It is codified (so short), and it is very meaningful for a player. A player may search for cards of the same color, or with an equivalent price to pay.
Attack / Defense
The same goes as for the cost: it is a well-codified score. However, not all cards have this information, meaning that cards with this field filled will have more in common between them than with a card without this field.
Game effect
This part is the biggest and most heterogeneous one. It usually takes the form of a block of text describing special effects applied to the game.
Fortunately, some keywords have been introduced over time. These keywords refer to specific rules, which avoids having to write everything on the card. They are really well codified, and if two cards share a same keyword, their similarity score should be higher.
The game effect can also provide special Attack / Defense information (ex: +1/+1). As seen before, this is well-codified information, so two cards showing this kind of pattern should have a higher score.
There are other patterns that may have an influence on the similarity score (rules applying to a specific family of cards, mention of a card's title, special rules using mana, ...).
However, since it is a block of text, it contains a lot of words that are less meaningful for us, such as articles, punctuation, ... But in this kind of game, it is important to stay careful, because a single word can change everything about a rule.
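As an illustration, a pattern like +1/+1 can be spotted with a simple regular expression. This is a hypothetical check, not necessarily how the project detects it:

```python
import re

# Hypothetical pattern: attack/defense modifiers such as "+1/+1" or "-2/-0"
# inside an effect text. Cards sharing this pattern should score higher.
PT_MODIFIER = re.compile(r'[+-]\d+/[+-]\d+')

effect = "Target creature gets +1/+1 until end of turn."
print(PT_MODIFIER.findall(effect))  # ['+1/+1']
```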
Global approach
Principle
The idea behind LSA is to encode each text into a vector of descriptors, where a descriptor is a float value. These vectors are then compared pairwise with a dot product. A score of 1.0 means the two vectors are identical, and therefore the two texts are considered the same.
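As a minimal sketch of this principle, assuming plain NumPy and already-normalized descriptor vectors (the real descriptors come from the encoding described below):

```python
import numpy as np

# Three hypothetical descriptor vectors, one per card, already L2-normalized.
a = np.array([0.6, 0.8, 0.0])
b = np.array([0.6, 0.8, 0.0])
c = np.array([0.0, 0.6, 0.8])

# The pairwise comparison is a dot product: 1.0 for identical encodings,
# lower values as the descriptors diverge.
print(a @ b)  # ~1.0  -> same encoded text
print(a @ c)  # ~0.48 -> partially similar texts
```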
Text pre-processing
The text that will be encoded is the concatenation of different pieces of information from a card: the cost, the types and subtypes, the game effect, and the attack/defense score when available. The encoding is done through the bag-of-words model.
To apply the bag-of-words model, some pre-processing needs to be done (a code sketch follows the list):
- The first step is to remove all specific formatting applied to the text. For example, in the Magic cards JSON used in this project, mana colors are encoded with curly braces ({G} for Green). The G letter can be kept as such, but the curly braces are cumbersome.
- The second step isolates all the words from one another. This is called tokenization. It is one of many possible approaches for LSA, and as a first attempt it is pretty straightforward. The main drawback is the loss of sentence structure. Since Magic relies heavily on keywords and the concatenated text remains relatively short, sentence structure has less impact here than in fully written press articles.
- In the third step, stop words and punctuation are removed, since there is no longer a sentence structure. The remaining words are stemmed: all affixes are removed from the word (plural form, past form, ...).
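A minimal sketch of these three steps, assuming NLTK with its punkt and stopwords data downloaded; the project's actual tooling may differ:

```python
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words('english'))
STEMMER = PorterStemmer()

def preprocess(card_text):
    # Step 1: strip JSON-specific formatting, e.g. mana symbols {G} -> G.
    text = re.sub(r'[{}]', '', card_text)
    # Step 2: tokenize, deliberately giving up sentence structure.
    tokens = word_tokenize(text.lower())
    # Step 3: drop stop words and punctuation, then stem what remains.
    return [STEMMER.stem(t) for t in tokens
            if t.isalnum() and t not in STOP_WORDS]

print(preprocess("{G} {G} Trample. When this creature dies, draw a card."))
```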
Encoding
The words from the texts of all Magic cards are collected into a set called the vocabulary.
The encoding consists in reducing the dimensionality of the text to X descriptors (here 50) based on the common vocabulary. To do so, the Truncated SVD (singular value decomposition) model is used.
The resulting vector containing the 50 descriptors is finally normalized, to ensure that the similarity score computed later is no greater than 1.0.
This encoding is only valid for a given vocabulary. If new cards are added (ex: a new Magic extension), the common vocabulary changes, and the encoding has to be performed again over all existing cards.
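A minimal sketch of the encoding with scikit-learn, assuming CountVectorizer as the bag-of-words step and a toy corpus in place of the full card set (hence far fewer than 50 components):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer
from sklearn.pipeline import make_pipeline

# Toy corpus standing in for the pre-processed text of every card.
card_texts = [
    "white creatur bird fli 1 2",
    "white creatur bird fli vigil 1 1",
    "red red creatur goblin hast 2 1",
    "land tap discard card draw card",
]

# Bag of words over the shared vocabulary, reduced to k descriptors
# (50 in the project; 3 here because the toy corpus is tiny), then
# L2-normalized so later dot products never exceed 1.0.
lsa = make_pipeline(
    CountVectorizer(),
    TruncatedSVD(n_components=3),
    Normalizer(copy=False),
)
vectors = lsa.fit_transform(card_texts)

# Cosine similarity between the first two cards is now a plain dot product.
print(vectors[0] @ vectors[1])
```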
Results
First example: two cards sharing 96% similarity. The cards are of different colors (White and Red), but their cost in colorless mana to be played and their effects are really similar (special rule and attack/defense score):
Second example: two cards sharing 87% similarity. Both cards need one white mana and share the same type (Bird). The difference comes from the scope of the special rule (all creatures versus a specific creature) and from the difference in attack/defense.
Last example: two cards with 86% similarity. The cards have different types (Land versus Artifact) and different costs. Their similarity comes from their rule effects about discarding and drawing new cards. The bag-of-words approach does not take into account the order in which words appear on a card, which makes it possible to find similarities even when the effects do not appear in the same order.
Conclusion
LSA makes it possible to identify similar cards based on their content. However, Magic players will have a hard time using these raw results, since cards of different colors, types, ... are put together as long as their content is similar.
To be useful to Magic players, additional filters are needed. The role of these filters is to reduce the set of considered cards to only the ones meaningful to the player.
For example, a player wishing to create a deck based only on red sorcery cards will not be interested in similar cards using blue mana, or in red creature cards. The filters on the search user interface are there to constrain the results to red sorcery cards only, even if they are not the most similar.
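A minimal sketch of such a filter, with hypothetical card records and similarity scores (the project's real data model and UI filters may differ):

```python
# Hypothetical candidate cards, already ranked by LSA similarity score.
candidates = [
    {"name": "Lava Spike", "types": {"Sorcery"}, "colors": {"R"}, "score": 0.93},
    {"name": "Counterspell", "types": {"Instant"}, "colors": {"U"}, "score": 0.91},
    {"name": "Goblin Guide", "types": {"Creature"}, "colors": {"R"}, "score": 0.89},
]

def filter_candidates(cards, wanted_types, wanted_colors):
    """Keep only the similar cards matching the player's deck constraints."""
    return [c for c in cards
            if c["types"] & wanted_types and c["colors"] <= wanted_colors]

# A red-sorcery deck: only red sorceries survive, however similar the rest are.
print(filter_candidates(candidates, {"Sorcery"}, {"R"}))
```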
LSA is based only on the content. Information about whether a card is usually played together with another card is not taken into account. As such, LSA should be used to suggest alternative cards to a particular card. It will help the player discover new cards that may replace the ones he usually plays.