Comparison between Human‐Derived and Modern Transformer-Based Folklore Motif Identification - minalee-research/cs257-students GitHub Wiki

#task #application #analysis

Jennifer Spinoglio

Abstract

Motif identification offers an opportunity to compare topic identification approaches and results between human cognition and machine models on significant literary texts. In the 20th century, folklore scholars created an index of folklore motifs and categorized tales according to the key motifs they contain. The goal of this project is to determine the extent to which several models (tf-idf, L-LDA, and BERTopic) can recreate the motif identification and classification of this earlier human work. Initial findings indicate that BERTopic, a transformer-based model, performs better than the others. Further analysis reaffirms this finding, indicating that BERTopic comes closest to mimicking the human-generated motif classifications of the ATU Index. However, future work is needed to optimize results and further explore the potential of BERTopic and other transformer-based models on literary tasks such as this one.

Introduction

This project asks how well modern transformer-based language models perform on motif recognition in a corpus of international folktales. Motifs in folktales are recurring themes, characters, and story elements that appear across many cultures and tales. In the twentieth century, folklore researchers attempted to take stock of typical folklore motifs and tale types, culminating in the Thompson Motif Index (TMI), a hierarchical catalog of motifs and themes that appear in folklore across cultures and time periods. The TMI was later refined into the Aarne-Thompson-Uther (ATU) Index. Motif identification offers interesting insight into how modern language models designed for topic modeling perform on literary texts. This project's secondary goal is to compare the motifs recognized by the models to those of the ATU Index, a human-generated catalog of motifs. This comparison of human and machine learning outcomes is relevant to both cognitive science and computational linguistics. The modern model, BERTopic, will be compared to Labeled Latent Dirichlet Allocation (L-LDA) and a tf-idf model. The data consist of a dataset originally developed by Ashliman and titled "Annotated Folk Tales" that contains the text of each folktale along with the associated ATU (Aarne-Thompson-Uther Index) codes for each motif present in it. The performance of the models on the folktales will be quantitatively assessed using clustering accuracy metrics (homogeneity, completeness, and v-measure) and qualitatively assessed by comparing the clusters of generated motifs to the motifs that make up the ATU Index.
If large discrepancies arise between the clusters created by any of the models and the motif categories described by humans in the ATU Index, they will be analyzed further.

Approach

This project aims to build upon the work conducted by Karsdorp and Bosch in their 2013 paper "Identifying motifs in folktales using topic models", which can be accessed here.

The project uses topic modeling and clustering techniques to measure the differences between machine-generated and human-generated motifs in global folklore. The three models are tf-idf, L-LDA, and BERTopic. Each model is first used to vectorize the corpus and perform feature extraction. Next, k-means clustering groups the features into larger categories of motifs. Finally, three metrics (homogeneity, completeness, and v-measure) are computed to evaluate the similarity between the model-identified motifs and the human-identified ATU motifs.
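The vectorize → cluster → score pipeline described above can be sketched in a few lines of scikit-learn. This is a toy illustration only: the texts and labels below are hypothetical stand-ins, not the actual corpus.

```python
# Sketch of the tf-idf baseline pipeline: vectorize, cluster with k-means,
# then score the clustering against the human labels. Toy data only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics import homogeneity_completeness_v_measure

texts = [
    "a wolf tricks a girl in the woods",
    "a girl visits her grandmother in the woods",
    "a hunter turns an animal inside out",
    "an animal is turned inside out by a hunter",
]
true_motifs = [0, 0, 1, 1]  # hypothetical human-assigned motif labels

X = TfidfVectorizer().fit_transform(texts)   # vectorization / feature extraction
pred = KMeans(n_clusters=2, n_init=5, max_iter=100,
              random_state=0).fit_predict(X)  # group features into motif clusters

h, c, v = homogeneity_completeness_v_measure(true_motifs, pred)
print(h, c, v)
```

The real experiments use the same shape of pipeline, with the number of clusters set to the number of ATU motifs in the dataset.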

The tf-idf and L-LDA models serve as baselines. Tf-idf provides a good baseline because it relies on simple statistics: it uses pure word frequency to determine topic relevance and does no learning. L-LDA provides a good additional baseline because it was the standard for topic modeling when Karsdorp and Bosch conducted their research, which established that topic modeling could be used for this task. These two methods will be contrasted with the performance of the modern BERTopic model. By doing so, we will gain a better understanding of how modern, more advanced models perform on motif identification and classification tasks.

This project modifies the work done by Karsdorp and Bosch, extending it to include the transformer-based topic models that had not yet been invented when the earlier research was conducted. At the time, Karsdorp and Bosch used only tf-idf and L-LDA models to identify literary motifs in a corpus of Dutch and Frisian folktales. I have chosen a different dataset, "Annotated Folk Tales," compiled by Hagedorn and Darányi from a collection originally developed by Ashliman; it contains the text of a number of global folktales along with the associated ATU (Aarne-Thompson-Uther Index) codes for each motif present in them.

Experiments

Data: I used the folklore dataset developed by Hagedorn and Darányi (2022). The authors updated a dataset originally developed by Ashliman and titled "Annotated Folk Tales," which contains the text of each folktale along with the associated ATU (Aarne-Thompson-Uther Index) codes for each motif present in it. The dataset can be found here. This is not the dataset used by Karsdorp and Bosch; it instead contains folktales from different cultures. No other data was collected or used. The dataset is used to compare the model-identified motif clusters to those determined by human folklore scholars in the ATU Index. In the figure below, we can see that most motifs contain between 1 and 10 tales, with a few containing more than 10, up to around 30.
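The motif-size distribution shown in Figure 1 amounts to counting tales per ATU code. A minimal sketch, using hypothetical ATU codes rather than the real dataset:

```python
# Count how many tales carry each ATU motif code (toy data; the real
# dataset assigns codes to each folktale's text).
from collections import Counter

tale_motifs = ["333", "333", "410", "1889B", "333", "410"]  # one code per tale
sizes = Counter(tale_motifs)  # motif -> number of tales
print(sizes)
```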


Figure 1: The distribution of motifs contained in the Annotated Folk Tales dataset by the number of tales associated with each motif.

Evaluation method: The evaluation metrics are homogeneity, completeness, and v-measure, the same metrics used by Karsdorp and Bosch in their 2013 experiments on the same subject. Homogeneity is the extent to which all the data points (folktales) in a cluster share the same label (motif). Completeness is the extent to which all data points that share the same label can be found in the same cluster. V-measure, introduced by Rosenberg and Hirschberg in "V-measure: A conditional entropy-based external cluster evaluation measure" (2007), which can be found here, is the harmonic mean of homogeneity and completeness. A qualitative comparison of the clusters produced by each model against the labeled motifs is also used to further assess the accuracy and performance of the models on specific topics.

Experimental details: The tf-idf model and the Labeled Latent Dirichlet Allocation (L-LDA) model serve as baselines for comparison with BERTopic; these were the two methods used by Karsdorp and Bosch in 2013. L-LDA, created by Ramage et al. in 2009, was the standard for topic modeling at the time of their research, while tf-idf uses pure word frequency to determine topic relevance and does no learning. The tf-idf model uses scikit-learn's TfidfVectorizer, which was fit and transformed on the corpus. All default parameters were used, to include as many tokens as possible for the sake of exploration. To determine the clusters following vectorization, scikit-learn's KMeans function was used, with the number of clusters equal to the number of true motifs in the dataset, a maximum of 100 iterations, and 5 runs with different centroid initializations. The L-LDA model uses scikit-learn's LDA implementation, which was fit and transformed on the corpus after vectorization with scikit-learn's CountVectorizer. The number of topics was set to the number of motifs identified by humans in the dataset (182), and clustering was again performed with scikit-learn's KMeans using the same settings. The BERTopic model, downloaded here, was used to determine clusters of motifs from the folklore corpus. BERTopic is a BERT-based topic model developed by Grootendorst in 2022.
BERTopic combines BERT embeddings with tf-idf to create clusters, making it ideal for testing whether modern models perform better on motif identification than the L-LDA and pure tf-idf models tested by Karsdorp and Bosch. All default BERTopic parameters were used, except that clustering was conducted with k-means, with the number of clusters equal to the number of motifs identified by the ATU Index in the corpus (182). All coding was conducted in Google Colab. Once all the clusters had been calculated, homogeneity, completeness, and v-measure were computed using scikit-learn for tf-idf and L-LDA, and BERTopic's built-in metrics feature for BERTopic.
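The k-means substitution described above could be configured roughly as follows. This is a sketch, assuming the `bertopic` package's documented option of passing a scikit-learn clustering model in place of its default HDBSCAN clusterer; `docs` is a hypothetical placeholder for the list of folktale texts.

```python
# Configure BERTopic to cluster with k-means instead of HDBSCAN, so the
# number of clusters matches the 182 ATU motifs present in the corpus.
from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=182, n_init=5, max_iter=100)
topic_model = BERTopic(hdbscan_model=cluster_model)

# `docs` stands for the list of folktale texts (not included here):
# topics, probs = topic_model.fit_transform(docs)
```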

Results and Analysis

Quantitative Results: The resulting scores for each model appear in the table below. The L-LDA model never outperformed the tf-idf model on homogeneity, completeness, or v-measure, indicating that tf-idf does a better job of matching human motif identification, even though tf-idf does not take word order or context into account while L-LDA is designed to. The BERTopic model, which captures the relationships between words, word order, and word meaning, outperformed both older models. The only score that came somewhat close was tf-idf's completeness, suggesting the two models behave similarly when grouping all members of a label into a shared cluster. This provides strong evidence that modern, transformer-based topic models do a better job of matching human motif identifications. Still, BERTopic's scores are all around 0.69, indicating that there is room for improvement and that the model could come closer to human motif identification.

| Model Name | Homogeneity Score | Completeness Score | V-Measure Score |
|------------|-------------------|--------------------|-----------------|
| tf-idf     | 0.532             | 0.642              | 0.582           |
| L-LDA      | 0.532             | 0.562              | 0.547           |
| BERTopic   | 0.695             | 0.684              | 0.690           |

Qualitative Results: Further analysis of the models was undertaken through visualization and qualitative inspection. Topic maps of the L-LDA model and the BERTopic model were created to enable visual analysis (see Figures 2 and 3 below). Overall, the L-LDA model performed much worse than the BERTopic model, as its topics tend to be dominated by common words, though several distinct motifs can still be identified in it, including the Little Red Riding Hood archetype (ATU 333). Upon closer viewing of Figure 3, several large clusters are dominated by common words, reaffirming the L-LDA model's difficulty with clustering effectively and accurately.

Topic Distance Analysis: For additional analysis, we looked at which human-identified motifs each model correctly sorted into exactly one cluster; in other words, the motifs that a model always identified and clustered as a distinct motif, rather than grouping them with tales carrying different labels. The L-LDA model identifies no such motifs that are not also identified by tf-idf or BERTopic. BERTopic identifies seven motifs not found by the other two models, including "Hunter Turns Animal Inside Out" (ATU 1889B). Tf-idf identifies two motifs not detected by the other models, including "Sleeping Beauty" (ATU 410); in the BERTopic model, this motif is split across many clusters containing other tales of princesses and royals.
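The "one motif, one cluster" check above can be sketched as follows: a motif counts as perfectly identified when all of its tales land in a single cluster and that cluster contains no other motif. The labels below are hypothetical toy data, not the real assignments.

```python
# Find motifs that map one-to-one onto a predicted cluster (toy data).
from collections import defaultdict

labels_true = ["333", "333", "410", "410", "1889B"]  # human ATU labels
labels_pred = [0, 0, 1, 2, 3]                        # model cluster ids

clusters_per_motif = defaultdict(set)
motifs_per_cluster = defaultdict(set)
for motif, cluster in zip(labels_true, labels_pred):
    clusters_per_motif[motif].add(cluster)
    motifs_per_cluster[cluster].add(motif)

# Keep motifs whose tales all share one cluster, and whose cluster is pure.
perfect = {m for m, cs in clusters_per_motif.items()
           if len(cs) == 1 and motifs_per_cluster[next(iter(cs))] == {m}}
print(perfect)  # "410" is split across clusters 1 and 2, so it is excluded
```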


Figure 2: An intertopic distance map of the BERTopic model. Here, topics form distinct clusters, indicating relative closeness and similarity between them.

Figure 3: An intertopic distance map of the L-LDA model. Here, large clusters are dominated by common words, indicating issues with recognizing what is important to a topic and what is not.

Document Distribution Analysis: Upon further examination of the makeup of the clusters generated by the three models, we can see that BERTopic yields a distribution of documents per motif most similar to that of the original ATU classification (compare Figures 1 and 4). This further reaffirms the strength of the BERTopic model in replicating the motif classification of the ATU Index. The baseline tf-idf model produces a large number of motifs with only one document and a few very large motifs with over 100 documents, indicating that it over-generalizes, creating overly large motifs with a variety of outliers. The baseline L-LDA model shows similar results, with most motifs having very few documents and a few having almost 50. Once again, the model over-generalizes some motifs, producing clusters double the size of the largest ATU categories, and misses others, instead creating many groups with few members. However, the BERTopic distribution does not perfectly match the ATU classification either, further emphasizing the need for additional fine-tuning.


Figure 4: A comparison of the distribution of number of documents per motif across the three models. The graphs were created using matplotlib.

Conclusions

From these results, it is apparent that topic modeling can be used to identify folktale motifs and classify tales according to them. All three models returned results indicating that they work to some extent for this task. Overall, the BERTopic model clearly outperforms the tf-idf and L-LDA models at motif identification and classification in folktales: no metric saw any other model outperform BERTopic, and it beat both baseline models on homogeneity and completeness alike. However, these evaluation metrics still come out to around 0.69, indicating that there is room for improvement. Further analysis affirms that BERTopic best matches the ATU Index in the distribution of documents per motif, and that it outperforms the two baseline models at correctly sorting individual motifs into single clusters, indicating a better grasp of the motifs and the distinctions between them.

Future Work

In future iterations of this work, it might be valuable to experiment with additional pre-processing, especially removing common words and phrases that contribute little to the meaning of a sentence or piece of text. However, the presence of common words only seems to have affected the results of the tf-idf and L-LDA models, indicating that modern models may handle these situations more appropriately. Future iterations of this project might focus on increasing the scores through fine-tuning or additional training of the BERTopic model, or on emerging topic modeling techniques. Additional work on the ethical behavior of the model could also be interesting. Some folklore motifs may be culture-specific; for example, the motif of "Sunken Bells" is found only in tales originating from regions of Great Britain. As an additional analysis, I would be interested in whether cultural aspects of a tale, including region-specific names or other proper nouns, affect how the model identifies motifs (e.g., assigning tales with German names to motifs more common in German folklore).

Code

See Code Here (In Google CoLab Notebook)