Feature Engineering - rudydesplan/book_rating GitHub Wiki
We dedicated a significant amount of time to feature engineering on our dataset. This process allowed us to gain a deeper understanding of our data, uncover the internal structure of the dataset, and provide our models with powerful new tools for learning. We primarily used the OpenLibrary API to enrich our dataset.
In this section, we briefly discuss our most interesting findings. For the complete code, detailed explanations, and interactive plots, please refer to the accompanying notebook.
Works
By examining the work ID through OpenLibrary's API, we gained new insights into the underlying structure of the dataset. A work is a collection of editions related to a unique and original book written by an author. For instance, "Harry Potter and the Philosopher's Stone" has over a hundred different editions. Each edition shares the same author, but other attributes (title, language, publisher, number of ratings, etc.) may vary. One such edition is "Harry Potter à l'école des sorciers," a French version published by Gallimard with x number of reviews and y number of ratings.
The entries in our dataset represent these editions. However, an issue arises because `average_rating`, our target variable, is associated with the work rather than the individual edition. As a result, all editions of the same work share the same `average_rating`.
Addressing this problem was not straightforward:
- Removing all duplicates from the same work would lead to a substantial loss of information.
- We considered using the work ID as a grouping variable, encoded so that the model can distinguish these groups and learn that their members share the same rating. We hypothesized that this would make it easier to predict the average rating of a new edition when the model has already seen another edition of the same work. The downside is that the information carried by the other features is "diluted" within the group: because the rating is computed at the work level, editions with quite different feature values all map to the same target, which could confuse the model when it learns patterns that are not directly correlated with the target. For instance, the various editions of a Harry Potter book may have distinct features, yet they all share a single target value.
Due to time constraints, we chose the latter option.
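The work-level structure described above can be checked directly. Below is a minimal sketch, assuming a pandas DataFrame with hypothetical `work_id` and `average_rating` columns (the names and values are illustrative, not from the project's actual schema):

```python
import pandas as pd

# Hypothetical editions table: several editions per work, one shared rating.
editions = pd.DataFrame({
    "work_id": ["OL82563W", "OL82563W", "OL82563W", "OL27448W"],
    "title": [
        "Harry Potter and the Philosopher's Stone",
        "Harry Potter à l'école des sorciers",
        "Harry Potter and the Sorcerer's Stone",
        "The Lord of the Rings",
    ],
    "average_rating": [4.47, 4.47, 4.47, 4.50],
})

# Each work maps to exactly one rating value, confirming the target
# lives at the work level rather than the edition level.
ratings_per_work = editions.groupby("work_id")["average_rating"].nunique()
assert (ratings_per_work == 1).all()
```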
Categorical Variables
Our dataset contains a wealth of information that can be extracted from categorical features, particularly the author, publisher, and work ID. Capturing this information was a crucial challenge.
- One method to encode this information was a label encoder. However, based on our research, this approach is not recommended because it imposes an artificial numerical ordering on a nominal feature.
- One-hot encoding was not feasible due to the high cardinality of these features.
- An alternative approach involved capturing information about these variables through other numerical features. For example, we could represent them using the number of text reviews or the number of ratings. An author with a higher number of text reviews or ratings compared to another would convey more "meaningful" information than an arbitrary encoding.
- Lastly, there is the possibility of using CatBoost or LightGBM, which are supposedly able to handle categorical features directly.
We opted for the third option.
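The third option, representing a high-cardinality categorical by numeric aggregates, can be sketched as follows. This is a hedged illustration assuming hypothetical column names (`authors`, `text_reviews_count`, `ratings_count`), not the project's exact code:

```python
import pandas as pd

# Hypothetical books table with a high-cardinality author column.
books = pd.DataFrame({
    "authors": ["A", "A", "B", "C", "C", "C"],
    "text_reviews_count": [120, 80, 5, 40, 60, 20],
    "ratings_count": [3000, 2500, 90, 800, 1200, 400],
})

# Replace the categorical author with numeric aggregates: total text
# reviews and total ratings accumulated by each author.
author_stats = (
    books.groupby("authors")
    .agg(
        author_text_reviews=("text_reviews_count", "sum"),
        author_ratings=("ratings_count", "sum"),
    )
    .reset_index()
)
books = books.merge(author_stats, on="authors")
```

The same aggregation can be applied to the publisher and the work ID.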
Popularity of an Author Over Time
We conducted a brief analysis to see whether an author's popularity affected the ratings of their books. We defined an author's popularity as the cumulative sum of the ratings counts they received over time. We did not observe a clear trend in the evolution of an author's book ratings as their popularity increased. In other words, regardless of an author's popularity, the ratings of their books depend more on the books' intrinsic qualities than on the author themselves.
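A cumulative popularity measure of this kind can be computed as below. This is a sketch under assumed column names (`authors`, `publication_year`, `ratings_count`); the data is made up for illustration:

```python
import pandas as pd

# Hypothetical per-book data, one row per book.
books = pd.DataFrame({
    "authors": ["A", "A", "A", "B", "B"],
    "publication_year": [1999, 2001, 2004, 2000, 2003],
    "ratings_count": [500, 2000, 8000, 300, 400],
})

# An author's popularity at a given point in time: cumulative ratings
# received across all of their books published so far.
books = books.sort_values(["authors", "publication_year"])
books["author_popularity"] = books.groupby("authors")["ratings_count"].cumsum()
```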
Series
We noticed that the series a book belongs to can be extracted from its title, which follows the format `Title (Series #Series number)`.
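Extracting the series from that title format can be sketched with a regular expression (the function name and exact pattern are an assumption, not the project's code):

```python
import re

# Matches titles of the form "Title (Series #N)".
SERIES_RE = re.compile(r"^(?P<title>.*?)\s*\((?P<series>.+?)\s*#(?P<number>\d+)\)\s*$")

def extract_series(title):
    """Return (clean_title, series, number), or (title, None, None) if no series."""
    match = SERIES_RE.match(title)
    if not match:
        return title, None, None
    return match["title"], match["series"], int(match["number"])
```

For example, `extract_series("Harry Potter and the Chamber of Secrets (Harry Potter #2)")` yields the clean title, the series name, and the number 2.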
Engagement
We computed an "engagement" score: the number of text reviews divided by the number of ratings. We observed that the higher the engagement score, the lower the ratings count, indicating a less popular book; conversely, the more popular a book, the lower its engagement score. In other words, the most popular books attract many readers who rate them, but comparatively few who also leave a text review.
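The engagement score is a simple ratio. A minimal sketch, assuming hypothetical `text_reviews_count` and `ratings_count` columns and made-up values:

```python
import pandas as pd

# Illustrative data: a very popular book and a niche one.
books = pd.DataFrame({
    "title": ["Blockbuster", "Niche"],
    "text_reviews_count": [1000, 50],
    "ratings_count": [100000, 500],
})

# Engagement: share of raters who also wrote a text review.
# Guard against division by zero for books with no ratings.
books["engagement"] = (
    books["text_reviews_count"] / books["ratings_count"].replace(0, pd.NA)
)
```

Here the popular book gets an engagement of 0.01 versus 0.1 for the niche one, matching the observation above.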
Tags
Retrieving the work identifier also allowed us to fetch the tags associated with a work via the OpenLibrary API (refer to the get_data_openlibrary.ipynb notebook).
The tags associated with a work are of four types: themes, people, places, and times. The full procedure of retrieving and processing the tags is complex and can be found in the process_tags.ipynb notebook.
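Pulling the four tag types out of a fetched work record can be sketched as below. This assumes the work JSON exposes the `subjects`, `subject_people`, `subject_places`, and `subject_times` fields of the OpenLibrary works schema; the function name and mapping are illustrative, not the project's code:

```python
# Map our four tag types to the corresponding OpenLibrary work fields.
TAG_FIELDS = {
    "themes": "subjects",
    "people": "subject_people",
    "places": "subject_places",
    "times": "subject_times",
}

def extract_tags(work_record):
    """Map each tag type to its (possibly empty) list of tags."""
    return {tag_type: work_record.get(field, [])
            for tag_type, field in TAG_FIELDS.items()}
```

Applied to a work record parsed from the API response, this yields one tag list per type, with missing fields defaulting to an empty list.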
Here is a brief explanation:
- There can be numerous tags associated with a single work, which posed some computational challenges. We ignored tags that were not present in other works since we were interested in learning from the proximity between works that share similar tags.
- Once all the tags were fetched for each work, we first used one-hot encoding and then the tf-idf procedure to better capture the relevance of the tags for a work.
- We then applied a dimensionality reduction algorithm, UMAP, to each tag type separately. We ended up with 8 features: 2 per tag type across the 4 types.
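The TF-IDF-then-reduce pipeline for one tag type can be sketched as follows. The tag corpus is made up, and `TruncatedSVD` stands in for UMAP so the sketch only needs scikit-learn (the project itself used UMAP, which requires the separate umap-learn package):

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus: each work's tags of one type, joined into a "document".
works_themes = [
    "magic wizards school fantasy",
    "magic wizards fantasy quest",
    "space war empire science-fiction",
]

# Weight tags by TF-IDF so common, uninformative tags count less.
tfidf = TfidfVectorizer()
weights = tfidf.fit_transform(works_themes)

# Reduce each tag type to 2 dimensions (TruncatedSVD as a UMAP stand-in).
reducer = TruncatedSVD(n_components=2, random_state=0)
embedding = reducer.fit_transform(weights)  # shape: (n_works, 2)
```

Repeating this per tag type gives the 8 final features (2 × 4 types).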
We conducted various visualizations to observe the proximity of some works through their associated tags.