3. Clustering - adriannaziel/EmojiProject GitHub Wiki

Clustering

With clustering, we wanted to analyse how similar emojis are in terms of their usage in tweets. Emojis that are used in similar contexts can be seen as synonyms. Clusters were built from the embedding vectors created in the previous step.

The clustering results show that some groups of emojis can be seen as synonyms, for example ['💕', '💞', '🌸', '💗', '💓', '💖'], because they often end up together in one cluster. In general, however, emojis are not strongly grouped: many are marked as outliers by DBSCAN or placed in one large general group by k-means.
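One way to make "often together in one cluster" concrete is to count, for each pair of emojis, how many clusterings put them in the same cluster. The sketch below is only an illustration: the helper names `co_cluster_counts` and `together` are hypothetical, and the input is a handful of clusters copied from the results listed on this page, not the full output.

```python
from collections import Counter
from itertools import combinations

# A few clusters copied from three of the clusterings on this page:
# k-means with 9 clusters, k-means with 10 clusters, and
# DBSCAN with max dst 0.9 and min samples 2.
clusterings = [
    [['💕', '💞', '🌸', '💗', '💓', '💖'], ['❤', '💜', '✨', '💚', '💛']],
    [['💕', '💞', '💗', '💓', '💖'], ['💜', '💚', '💛'], ['🌹', '🌸']],
    [['💕', '💞', '💗', '💓', '💖']],
]


def co_cluster_counts(clusterings):
    """Count, for every emoji pair, in how many clusterings they share a cluster."""
    counts = Counter()
    for clusters in clusterings:
        for cluster in clusters:
            # Sort so that each pair is always stored in the same order.
            for a, b in combinations(sorted(cluster), 2):
                counts[(a, b)] += 1
    return counts


def together(counts, a, b):
    """Look up a pair regardless of the order in which it is given."""
    return counts[tuple(sorted((a, b)))]


counts = co_cluster_counts(clusterings)
for pair, n in counts.most_common(5):
    print(pair, n)
```

Pairs with a count close to the number of clusterings are the strongest synonym candidates; pairs that share a cluster only once are more likely an artefact of one parameter setting.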

Here we present some examples of emoji clusters obtained with the k-means and DBSCAN methods for various parameters:

K-Means

Clustering with the k-means method for different numbers of clusters.
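The page does not say which implementation was used, so the following is a dependency-free sketch of the k-means idea (Lloyd's algorithm) rather than the project's code. The 2-D vectors are made up for illustration; the real embeddings come from the previous step. To keep the result reproducible, the sketch uses a deterministic farthest-point initialisation, which is an assumption of this example; libraries usually pick the starting centroids randomly.

```python
import math


def init_centroids(points, k):
    """Farthest-point initialisation: start at points[0], then repeatedly
    add the point whose nearest centroid is farthest away."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        far = max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        centroids.append(list(far))
    return centroids


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm: returns a cluster index for every point."""
    centroids = init_centroids(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments no longer change: converged
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(x) / len(members) for x in zip(*members)]
    return labels


def group(emojis, labels):
    """Turn per-emoji labels into lists of emojis, one list per cluster."""
    clusters = {}
    for e, l in zip(emojis, labels):
        clusters.setdefault(l, []).append(e)
    return list(clusters.values())


# Hypothetical 2-D "embeddings", chosen only to form two obvious groups.
emojis = ['😭', '😢', '💔', '💕', '💖', '💗']
vectors = [(0.0, 0.1), (0.1, 0.0), (0.3, 0.3),
           (5.0, 5.1), (5.1, 5.0), (5.2, 5.2)]

for k in (2, 3):
    print("number of clusters:", k, group(emojis, kmeans(vectors, k)))
```

Running the same data with several values of `k` mirrors how the lists below were produced: the number of clusters is fixed in advance, and every emoji is forced into exactly one of them.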

number of clusters: 4

  • ['😭', '😔', '💔', '😢']
  • ['❤', '😊', '🙏', '😘', '😎', '😇', '🌹', '🎉', '✌', '😌', '🙌', '😋', '🙂', '👍', '👏', '😁', '🔥', '💪', '😉', '👌', '🤗']
  • ['💕', '💜', '💞', '✨', '🌸', '💗', '💚', '💛', '💓', '💖']
  • ['😂', '🤣', '🤦', '🤷', '😱', '😏', '😅', '🤔', '😆', '🙄']

number of clusters: 5

  • ['😂', '🤣', '😊', '😘', '😎', '😇', '✌', '😱', '😌', '😋', '😏', '🙂', '👍', '😅', '😁', '🤔', '😆', '🙄', '😉', '👌', '🤗']
  • ['😭', '😔', '💔', '😢']
  • ['❤', '💕', '💜', '🌹', '💞', '✨', '🌸', '💗', '💚', '💛', '💓', '💖']
  • ['🤦', '🤷']
  • ['🙏', '🎉', '🙌', '👏', '🔥', '💪']

number of clusters: 6

  • ['😊', '😘', '😇', '😌', '😋', '😏', '🙂', '😁', '😉', '🤗']
  • ['❤', '💕', '💜', '🌹', '🎉', '💞', '✨', '🌸', '💗', '💚', '💛', '💓', '💖']
  • ['🤦', '🤷', '😱', '🤔', '🙄']
  • ['😭', '😔', '💔', '😢']
  • ['😂', '🤣', '😅', '😆']
  • ['🙏', '😎', '✌', '🙌', '👍', '👏', '🔥', '💪', '👌']

number of clusters: 7

  • ['😊', '😘', '😎', '😇', '✌', '😌', '😋', '😏', '🙂', '👍', '😁', '🤔', '🙄', '😉', '👌', '🤗']
  • ['🤦', '🤷']
  • ['❤', '💕', '🌹', '💞', '✨', '🌸', '💗', '💓', '💖']
  • ['🙏', '🎉', '🙌', '👏', '🔥', '💪']
  • ['😭', '😔', '💔', '😢']
  • ['💜', '💚', '💛']
  • ['😂', '🤣', '😱', '😅', '😆']

number of clusters: 8

  • ['💕', '💜', '💞', '✨', '🌸', '💗', '💚', '💛', '💓', '💖']
  • ['😂', '🤣', '😅', '😆']
  • ['😭', '😔', '💔', '😢']
  • ['❤', '😊', '🙏', '😘', '😇', '🌹', '🙂', '😁', '😉', '🤗']
  • ['🤦', '🤷']
  • ['😎', '✌', '🙌', '👍', '👏', '🔥', '💪', '👌']
  • ['🎉']
  • ['😱', '😌', '😋', '😏', '🤔', '🙄']

number of clusters: 9

  • ['😂', '🤣']
  • ['😊', '😘', '😎', '😇', '✌', '😌', '😋', '🙂', '👍', '😁', '😉', '👌', '🤗']
  • ['💕', '💞', '🌸', '💗', '💓', '💖']
  • ['🎉', '🙌', '👏', '🔥', '💪']
  • ['😱', '😏', '😅', '🤔', '😆', '🙄']
  • ['😭', '😔', '💔', '😢']
  • ['🤦', '🤷']
  • ['🙏', '🌹']
  • ['❤', '💜', '✨', '💚', '💛']

number of clusters: 10

  • ['💜', '💚', '💛']
  • ['😎', '✌', '👍', '🔥', '💪', '👌']
  • ['❤', '😊', '😘', '😇', '✨', '😌', '🤗']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['😋', '😏', '🙂', '😁', '🤔', '🙄', '😉']
  • ['🤦', '🤷']
  • ['😭', '😔', '💔', '😢']
  • ['😂', '🤣', '😱', '😅', '😆']
  • ['🌹', '🌸']
  • ['🙏', '🎉', '🙌', '👏']

![k-means metrics](https://github.com/gabirelik/emoji/blob/master/images/kmeans_metrics.png)
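The text does not name the metrics plotted in the figure. One widely used measure for comparing different numbers of clusters is the silhouette coefficient, so the sketch below shows how it can be computed; treating it as one of the plotted metrics is an assumption. For each point it compares the mean distance to its own cluster (`a`) with the mean distance to the nearest other cluster (`b`).

```python
import math


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all points (Euclidean distance)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the point's own cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: mean distance to the members of the nearest other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for m, other in clusters.items() if m != l
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Made-up points: two tight pairs that are far apart.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_score(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_score(points, [0, 1, 0, 1])   # pairs split up
print(good, bad)
```

Values near 1 indicate compact, well-separated clusters; values near 0 or below indicate overlapping or badly assigned clusters.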

DBSCAN

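Clustering with the DBSCAN method, using the Euclidean distance metric and different values for the maximum neighbour distance (max dst) and the minimum number of samples (min samples).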
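To show how the two parameters interact, here is a dependency-free sketch of DBSCAN; it is not the project's code, and the 2-D vectors are invented for illustration. It assumes that min samples counts the point itself, which is consistent with the two-emoji clusters reported for min samples: 2.

```python
import math


def dbscan(points, max_dst, min_samples):
    """Return one label per point: 0, 1, ... for clusters, -1 for outliers."""
    n = len(points)
    # Neighbourhood of i: all points within max_dst, including i itself.
    neighbours = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= max_dst]
        for i in range(n)
    ]
    labels = [None] * n  # None = not visited yet
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbours[i]) < min_samples:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbours[j]) >= min_samples:
                queue.extend(neighbours[j])  # j is also core: keep expanding
    return labels


# Hypothetical 2-D "embeddings": a close pair, a chain of three, one loner.
emojis = ['😂', '🤣', '💕', '💖', '💗', '🔥']
vectors = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.0, 3.6), (3.0, 4.2), (10.0, 10.0)]

for min_samples in (2, 3):
    labels = dbscan(vectors, max_dst=0.9, min_samples=min_samples)
    print("min samples:", min_samples, list(zip(emojis, labels)))
```

Unlike k-means, nothing forces every emoji into a cluster: points without enough close neighbours are labelled as outliers, and raising min samples dissolves small clusters such as pairs.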

max dst: 0.9, min samples: 2

  • ['😂', '🤣']
  • ['😊', '🙂']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['🤦', '🤷']
  • ['😁', '😆', '😉']
  • ['💔', '😢']
  • outliers: ['❤', '🙏', '😭', '😘', '💜', '😔', '😎', '😇', '🌹', '🎉', '✌', '✨', '😱', '😌', '🌸', '🙌', '😋', '💚', '😏', '💛', '👍', '😅', '👏', '🔥', '🤔', '🙄', '💪', '👌', '🤗']

max dst: 0.95, min samples: 2

  • ['😂', '🤣']
  • ['😊', '🙂', '🤗']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['🤦', '🤷']
  • ['😅', '😁', '😆', '😉']
  • ['💔', '😢']
  • outliers: ['❤', '🙏', '😭', '😘', '💜', '😔', '😎', '😇', '🌹', '🎉', '✌', '✨', '😱', '😌', '🌸', '🙌', '😋', '💚', '😏', '💛', '👍', '👏', '🔥', '🤔', '🙄', '💪', '👌']

max dst: 0.93, min samples: 2

  • ['😂', '🤣']
  • ['😊', '🙂']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['🤦', '🤷']
  • ['😁', '😆', '😉']
  • ['💔', '😢']
  • outliers: ['❤', '🙏', '😭', '😘', '💜', '😔', '😎', '😇', '🌹', '🎉', '✌', '✨', '😱', '😌', '🌸', '🙌', '😋', '💚', '😏', '💛', '👍', '😅', '👏', '🔥', '🤔', '🙄', '💪', '👌', '🤗']

max dst: 0.95, min samples: 3

  • ['😊', '🙂', '🤗']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['😅', '😁', '😆', '😉']
  • outliers: ['😂', '❤', '🤣', '🙏', '😭', '😘', '💜', '😔', '😎', '😇', '🌹', '🤦', '🎉', '✌', '✨', '🤷', '😱', '😌', '🌸', '🙌', '😋', '💚', '😏', '💛', '👍', '👏', '🔥', '💔', '😢', '🤔', '🙄', '💪', '👌']

max dst: 1, min samples: 2

  • ['😂', '🤣', '😅', '😁', '😆', '😉']
  • ['😊', '🙂', '🤗']
  • ['💕', '💞', '💗', '💓', '💖']
  • ['🤦', '🤷']
  • ['👍', '👌']
  • ['💔', '😢']
  • outliers: ['❤', '🙏', '😭', '😘', '💜', '😔', '😎', '😇', '🌹', '🎉', '✌', '✨', '😱', '😌', '🌸', '🙌', '😋', '💚', '😏', '💛', '👏', '🔥', '🤔', '🙄', '💪']