Final report
When we first started this project, I knew almost nothing about readability. I had the idea that it could be relevant, though: the simple count of characters per word, or words per sentence. These numbers say nothing about the content of a text, but they have one advantage as a grading tool for literature - they are objective. They are values we can calculate for any book, article or piece of text, and use to compare those texts fairly. So, at first, I was in awe of how much has actually been done in the field. I saw that there are many more or less complex formulas, some very old and some improved, but they all take into account two or maybe three parameters, which are actually the ones I naively thought of in the beginning - just multiplied by coefficients which were, I suppose, calculated in a very complicated way (one such formula is sketched below). However, since these formulas are man-made, I now see how much more can be done in this field using linear regression and machine learning algorithms. With enough data, a machine learning algorithm could, I think, produce a far better and more accurate formula than the ones we use now. This actually makes me quite excited to see what could be done next.

During this research, I came across a few articles on scientific attempts to compare and classify literature - based on genre, main topics, or the pace of the plot. So far, this research is inconclusive, because there is a limit to what a computer is capable of doing. Simply put, some things only a human can do. But slowly, these boundaries are expanding: with artificial intelligence and machine learning we can build algorithms that could, in the end, do more than any human could and see regularities in literature that we cannot perceive. At the moment, there are some commercial tools that calculate a grade of text complexity for you (for example https://lexile.com/), and I suppose they use more advanced algorithms and formulas, but unfortunately they are a 'black box', so we don't know how they produce the number you get.
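For illustration, here is a minimal sketch of one such formula, the Flesch-Kincaid Grade Level, which combines just two ratios with fixed coefficients (the example counts below are invented, not taken from our data):

```r
# Flesch-Kincaid Grade Level: average sentence length and average syllables per
# word, each weighted by a fixed, empirically derived coefficient
flesch_kincaid_grade <- function(n_words, n_sentences, n_syllables) {
  0.39 * (n_words / n_sentences) + 11.8 * (n_syllables / n_words) - 15.59
}

# Invented example: a text of 100 words, 5 sentences and 130 syllables
flesch_kincaid_grade(100, 5, 130)   # about 7.5, i.e. roughly an 8th-grade text
```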
In this research, my part was mostly 'tech support', since I am the only one with an IT background: first with installing all the required components (most issues were with installing TreeTagger), and later with parsing and analyzing the data. I actually had a lot of fun while doing this research. All the programming we did amounts to a few simple lines of code, but it took time to get to them. Using the koRpus package was quite simple, with the right manual pages of course. It was a bit slow: analysing one book with one formula took about 15 minutes. In the end, the grades we got for the books were checked for correlation using the cor function in R, and those results were also tabulated (a rough sketch of this workflow is included at the end of this section).

I was often pleasantly surprised by how simple things can be in R. If there is no function to do something, you simply install the right package and there you have it. Adding new packages and functionality worked more smoothly than in previous programming languages I've used. So it was easy to install a package that converts a .csv file into an R data frame and then export it as a nicely formatted table in a .pdf, in a few lines of code. These packages also easily overcame the differences in our setups: I work in a Linux environment while Nicole and Linda have Windows, so my tables were stored not as .csv but as .ods files, which was no problem with the right R package.

In the end, when we saw which of the formulas correlated more strongly with the success parameters, we decided to demonstrate some of the stronger correlations by, again, plotting graphs in R. The graph and the correlation table show a relatively strong connection between the number of editions and the Dale-Chall value of a book. In my opinion, that doesn't have to be a coincidence: the 'simpler' a book is, with fewer unfamiliar words, the higher the chances that a person will recommend it to someone with lower reading skills, such as a child or someone learning English as a foreign language, so more copies are sold and more editions are printed. This, however, has no connection with the book's ratings: once people start reading a book, they care more about style and content than about the mere percentage of difficult words. On the other hand, the Flesch-Kincaid formula, for example, which doesn't use a word list, showed no correlation with the number of reviews or editions, but a slight correlation with the rating of the book - which could mean that people do take into account the length of words and sentences when they judge the overall style of a book.
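To make this concrete, here is a rough sketch of the kind of R code the analysis boils down to. It is not our actual script: the TreeTagger path, file names and table columns (Editions, Rating, Reviews) are placeholders, and newer versions of koRpus keep the English language support in the separate koRpus.lang.en package.

```r
library(koRpus)
library(koRpus.lang.en)   # English support (a separate package in newer koRpus versions)

# Tag one book with TreeTagger; the installation path is a placeholder
tagged <- treetag("books/oliver_twist.txt", lang = "en",
                  treetagger = "manual",
                  TT.options = list(path = "~/treetagger", preset = "en"))

# Compute readability scores; Dale-Chall additionally needs a word list file
scores <- readability(tagged,
                      index = c("Flesch.Kincaid", "Dale.Chall"),
                      word.lists = list(Dale.Chall = "wordlists/dale_chall.txt"))
summary(scores)

# Load the table of scores and success parameters collected for all books;
# read.csv for the Windows exports, readODS::read_ods for my .ods files on Linux
books <- read.csv("results/books.csv")
# books <- readODS::read_ods("results/books.ods")

# Correlate readability with the success parameters (column names are placeholders)
cor(books$Dale.Chall, books$Editions)
cor(books[, c("Flesch.Kincaid", "Dale.Chall", "Editions", "Rating", "Reviews")])

# Plot one of the stronger correlations
plot(books$Editions, books$Dale.Chall,
     xlab = "Number of editions", ylab = "Dale-Chall score")
```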
I have added excerpts of the code as .txt files in our group project; they show the technical side of what we've done (and what I did personally). As for my first idea for a research question, I decided to check whether a particular writer wrote more or less complexly over time. Since Jane Austen wrote only a few novels and died young, I instead made a few graphs that represent different readability formulas for Charles Dickens' novels, sorted by their year of publication. It is clear, however, that Mr. Dickens' style did not change in terms of readability as he got older.
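A minimal sketch of how such a graph can be produced, assuming the per-novel scores have already been computed as above and saved together with the year of publication (the file name and column names below are placeholders):

```r
# Hypothetical table of readability scores for Dickens' novels, one row per book
dickens <- read.csv("results/dickens_scores.csv")
dickens <- dickens[order(dickens$year), ]   # sort by year of publication

# Plot one formula's grade against the year each novel was published
plot(dickens$year, dickens$flesch_kincaid, type = "b",
     xlab = "Year of publication", ylab = "Flesch-Kincaid grade",
     main = "Readability of Dickens' novels over time")
```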