Association Rules - xavierfeltin/mtg_data_mining GitHub Wiki

Association rules

Association rules are a field of study on their own. A lot of research are done on this field (still now).

This family of algorithms determines relations between set of items depending of their frequencies in the transactions of the database. It is a statistic approach. Example: A + B ---> C, if the user has choosen A and B, he will probably choose C as well.

Approach

Step 1: find frequent items

The different approaches are working with frequent items. Frequent items are simply items (or set of items) appearing often enough to be considered to determine new rules. This frequency threshold is determined by developers. A smaller frequency will induce more rules since the algorithm will consider more frequent items.

The main differences between approaches are concerning usually:

The memory required
The processing time

The following approaches have been implemented to find frequent items:

A Priori (very common approach, very straightforward)
FPGrowth(very common approach, using a tree with common nodes to reduce memory usage to find frequent items)
GenClose (based on generators to generate frequent items. The use of generators reduce the memory usage and they are useful to generate association rules in a second step)

Step 2: discover and generate association rules

Once the frequent items are found (or their generators), it is possible to build association rules. All the association rules are not interesting. Statistic indicators can be processed to determine if we can have enough condifence into a rule to use it.

The main differences between approaches are concerning usually:

The memory required
The processing time
The unicity of each rule
A rule format (small left hand side, big left hand side, ...)

The following approaches have been implemented to generate association rules:

A Priori
Algorithms based on generators from the work of Ahn Tran and Tin Truong from the university of Dalat:
- MinMax (generate rules with a minimum left side and a maximum right side)
- MinMin (generate rules with a minimum left side and a minimum right side)
- Minimals (generate basic rules and deduced all the rules from them)
- With constraint on items (generate rules within a set of predetermined items)

The algorithms based on generators are more complex to understand but they limit redundancy between the rules. They are faster than A Priori.

In Magic ?

In a Magic context, a deck will be assimilated to a transaction and a card to an item. The objective is to find rules to help to choose the new card depending of the ones already selected.

Limitations and complexity

Despite being attractive, approaches based on association rules have many pitfalls. Below some of them met during this particular project.

Card repartition among several decks

The repartition of cards in Magic the Gathering does not synergize really well with Association Rules. Association Rules are based on frequent items. To be frequent an item (or a set of items) need to be present a certain number of times across the records of the database (its support).

In the Magic world, most of the cards are seldom present across several deck. Here a graph of cards repartition across a small deck database (around 2 500 decks matching 2 400 cards given by MTGDecks). Card repartition

In this database, 1400 cards are present only in less than 10% of the decks. Only a few cards (around 30, maybe less) are present in more than 30% of the decks. This particularity is due to the depth of Magic with more than 20 000 published cards.

A few cards are often selected for a particular decks colors. They are like the frame of a house. But the remaining cards, the interesting ones, can be very different from a deck to another depending of extensions used, current meta-game, player preference, cards available, ...

This particularity requires to set a very low support (between 0.01 and 0.05) in order to extract rules linked to these cards of interest. A low support implies a lot of processing for first finding the frequent itemsets then second generating the rules. In a production environment, this is troublesome. A webmaster may want to get external ressources (like AWS, ...)

Quantity and quality of association rules

Association rules approaches tend to generate a lot of rules (more than hundred of thousands).

It is particular true in a Magic's decks building context. Indeed, a deck can be considered to have a numerous items (40-100 cards by deck) and the support of most of cards is very low (less than 10%).

These two specifities implies:

The presence of many frequent itemsets with a big number of items (all low frequent items are grouped around the same support)
Rules with a lot of items present on the left and/or the right side

Algorithms generating rules (espcially the ones from Dalat university) generates unique basic rules with a set of items deduced from the frequent itemsets (more exactly their generators) found in the database. From these basic rules, all consequent rules are deduced by swapping side the different items and by making sure there are no duplicates. With rules containing sometimes 20 to 40 items, the number of permutations for each rule can be important.

Turnarounds

Limit the number of frequent itemsets

One way to limit the number of frequent itemsets is to define a range on support instead of only a minimum value. This will avoid including high frequency items.

High frequency items are the backbone of your deck but it is the low frequency items that are really containing the player experience.

Sadly, there may be interactions between high frequency and low frequency items. To do so, will cause to lose these interections when generating the association rules.

However, this turnaround is delicate to configure due to the data that are really concentrate around a small range of support values.

Limit the number and length of rules

To limit the number of rules, as for the frequent itemsets, there is the possibility to set a range for support and a range for the confidence. On Magic, players are looking for interactions between cards with a low support but with a high confidence (if a card is present it is often played with another card). A range of support between like [0.01, 0.05] and a confidence of [0.8, 0.9] can be a beginning.

Length of rules is harder to implement. Some algorithms from Dalat university like minMax, minMin, maxMin generates rules considering the minimum or the maximum itemsets on the left or on the right side. These algorithms depending of the frequent itemsets may still generate long rules.

Another solution is to force a maximum length on association rules (should be done with A Priori algorithm easier to adapt). For example, no more than three items on the left and no more than two on the right side. It is necessary to be careful to avoid losing or duplicating rules when splitting them into smaller rules.

How to parameter the number of items on each side of a rule can be troublesome as well. Do I need rules under the format:

1 card ---> 1 card
1 card ---> several cards
several cards --> 1 card
several cards --> several cards

Results

These methods generates rules based on the association of cards in the database's decks.

For example, this is a rule with:

A support of 0.05 (the right hand side of the rule appears in 5% of the decks)
A confidence of 0.78 (when the right hand side cards appears, 78% of time they are combined with the left hand side) :

Another example, the generated rule contains more cards on the left and on the right side. The rule has a supportof 0.06 and a confidence of 0.98: Example 2

Conclusion

Association rules method looks like a promising approach. Indeed, it put in relation several cards to find new suggestions.

Yet, how to exploit these rules in a recommendation system for helping to build Magic decks is less obvious. The length of rules and their numbers are making this approach delicate to exploit.

APriori and FPGrowth are well documented algorithms so it is not difficult to start generating rules on a small database. When the database is growing, the processing time and the number of rules tend to explode.

Webography

Data-Mining:

Machine Learning In Action, Peter Harrington, Manning, 2012
http://www.borgelt.net

Association rules:

https://en.wikipedia.org/wiki/Lift_(data_mining)

Apriori:

FPGrowth: - http://blog.khaledtannir.net/2012/07/lalgorithme-fp-growth-les-bases-13/ - https://www.singularities.com/blog/2015/08/apriori-vs-fpgrowth-for-frequent-item-set-mining - http://www.borgelt.net/doc/fpgrowth/fpgrowth.html

Publications: