Data cleaning - rudydesplan/book_rating GitHub Wiki

We give a brief overview of our most interesting findings and the methods we used to clean the dataset. More in-depth explanations, along with the code, can be found in the notebook cleaning_eda_fe.ipynb.

Duplicates

Checking for duplicates was straightforward, since every book has a unique identifier: its ISBN (10- or 13-digit). We found no duplicates.
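The check can be sketched in a few lines of pandas. The column name `isbn` and the sample rows below are assumptions for illustration, not the dataset's actual contents:

```python
import pandas as pd

# Hypothetical sample mirroring the dataset's shape: one row per book,
# identified by its ISBN.
books = pd.DataFrame({
    "isbn": ["0439785960", "0439358078", "0439554934"],
    "title": ["Book A", "Book B", "Book C"],
})

# Number of rows whose ISBN already appeared earlier; 0 means no duplicates.
dup_count = books["isbn"].duplicated().sum()
print(dup_count)
```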

Authors

We noticed that the authors column regularly contained several names. It turned out that only the first name listed is the actual author of the work; the following names belong to illustrators, voice actors, translators, and so on.

We believed that these extra names hindered the model's ability to generalize what it learns about an author, so we retained only the first name.
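A minimal sketch of that step, assuming the names are separated by a "/" (the separator in the raw data may differ):

```python
import pandas as pd

# Hypothetical rows: the authors field lists several contributors,
# with the actual author first.
books = pd.DataFrame({
    "authors": ["J.K. Rowling/Mary GrandPré", "Douglas Adams"],
})

# Keep only the first name, i.e. the actual author of the work.
books["authors"] = books["authors"].str.split("/").str[0]
print(books["authors"].tolist())
```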

Publishers

We observed that many publisher names referred to the same entity but with slight variations. We believed that this inconsistency hindered the model's ability to learn effectively.

To address this issue, we employed an algorithm based on the Levenshtein distance to merge names representing the same publisher. After several experiments, we found that a 90% similarity threshold destroyed valuable information: it merged distinct "specialized brands" of the same publishing house, such as Gallimard and Gallimard Jeunesse, into a single entity. Additionally, some publishers have separate "brands" for audio content, like HarperAudio.

To preserve this crucial information while still consolidating publishers with minor spelling differences, we opted for a 93% similarity threshold.
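The idea can be illustrated with a small pure-Python sketch. The similarity definition below (Levenshtein distance normalized by the longer name's length) and the greedy first-match merging are assumptions for illustration; the notebook's actual algorithm and similarity measure may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Normalized, case-insensitive similarity score in [0, 100]."""
    if not a and not b:
        return 100.0
    dist = levenshtein(a.lower(), b.lower())
    return 100.0 * (1 - dist / max(len(a), len(b)))

def merge_publishers(names, threshold=93.0):
    """Map each name to the first earlier name scoring above the threshold."""
    canonical, mapping = [], {}
    for name in names:
        match = next((c for c in canonical
                      if similarity(name, c) >= threshold), None)
        if match is None:
            canonical.append(name)
            mapping[name] = name
        else:
            mapping[name] = match
    return mapping

publishers = ["HarperCollins", "Harpercollins",
              "Gallimard", "Gallimard Jeunesse"]
print(merge_publishers(publishers))
```

With this metric, the case variant "Harpercollins" collapses into "HarperCollins", while "Gallimard Jeunesse" stays separate from "Gallimard", which is the behavior the 93% threshold is meant to preserve.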