Big Data Manipulation for Fun and Profit Part 1
These days everyone seems to be speaking about Big Data – but what does it really mean? The term is used quite ambiguously in a variety of situations. For the purposes of this article and the series, we will refer to big data whenever we mean ‘a large amount of textual data, in any format (for example plain ASCII text, XML, HTML, or any other human-readable or semi-human-readable format)’. Some techniques shown may also work well for binary data, when used with care and knowledge.
So, why fun (ref title)?
Handling gigabytes of raw textual data in a quick and efficient script, or even using a one-liner command (see Linux Complex Bash One Liner Examples to learn more about one-liners in general), can be quite fun, especially when you get things to work well and are able to automate things. We can never learn enough about how to handle big data; the next challenging text parse will always be around the corner.
And, why profit?
Much of the world’s data is stored in large textual flat files. For example, did you know you can download the full Wikipedia database? The problem is that this data often comes in some other format like HTML, XML or JSON, or even in proprietary data formats! How do you get it from one system to another? Knowing how to parse big data, and parse it well, puts all the power at your fingertips to change data from one format to another. Simple? Often the answer is ‘No’, and thus it helps if you know what you are doing. Straightforward? Idem. Profitable? Regularly, yes, especially if you become good at handling and using big data.
Handling big data is also referred to as ‘data wrangling’. I started working with big data over 17 years ago, so hopefully there is a thing or two you can pick up from this series. In general, data transformation as a topic is semi-endless (hundreds of third-party tools are available for each particular text format), but I will focus on one specific aspect which applies to textual data parsing: using the Bash command line to parse any type of data. At times, this may not be the best solution (i.e. a pre-created tool may do a better job), but this series is specifically for all those (many) other times when no tool is available to get your data ‘just right’.
Let us assume that you have the following ready:
– A: Your source data (textual) input file, in any format (JSON, HTML, MD, XML, TEXT, TXT, CSV, or similar)
– B: An idea of how the target data should look for your target application or direct use
You have already researched any available tools relevant to the source data format, and have not located any pre-existing tool which may help you to get from A to B.
For many online entrepreneurs, this is the point where often, perhaps regrettably, the adventure ends. For people experienced with big data handling, this is the point where the fun big data manipulation adventure begins :-).
It is important to understand which tool may help you do what, and how you can use each tool to achieve your next step in the data transformation process, so to kick off this series, I will be looking, one by one, at many of the tools available in Bash which may help. We’ll do this in the form of examples. We’ll start with easy examples, so if you have some experience already, you may want to skim over these and move ahead to further articles in this series.
Example 1: file, cat, head and tail
I did say we would start simple, so let’s get the basics right first. We need to understand how our source data is structured. For this, we use the tools file, cat, head and tail. For this example, I downloaded a random part of the Wikipedia database.
```
$ ls
enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442.bz2
$ bzip2 -d enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442.bz2
$ ls
enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ file enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442: UTF-8 Unicode text
```
After decompressing the downloaded bz2 (bzip2) file, we use the file command to analyze the contents of the file. The file is text based, in UTF-8 Unicode format, as confirmed by the UTF-8 Unicode text output after the filename. Great, we can work with this; it’s ‘text’ and that is all we need to know for the moment. Let’s have a look at the contents using cat, head and tail:
```
$ cat enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | head -n296016 | tail -n1
269019710:31197816:Linux Is My Friend
```
I wanted to exemplify how to use cat, but this command could have also been constructed more simply as:
```
$ head -n296016 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tail -n1
269019710:31197816:Linux Is My Friend
```
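If all you need is one specific line, a hedged alternative (assuming the same file and line number) is to let sed or awk jump straight to the line and stop reading as soon as it has been printed, which can save considerable time on multi-gigabyte files:

```
$ # Print only line 296016 and quit immediately afterwards (GNU sed syntax)
$ sed -n '296016{p;q}' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
269019710:31197816:Linux Is My Friend
$ # The awk equivalent; 'exit' stops awk from scanning the rest of the file
$ awk 'NR==296016 {print; exit}' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
269019710:31197816:Linux Is My Friend
```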
We sampled a, ehrm, random… (or not so random for those that know me ;)… line from the file to see what sort of text is there. We can see that there seem to be 3 fields, separated by :. The first two look numeric, the third one text based. This is a good moment to raise the point that one has to be careful with these sorts of assumptions. Assumption (and/or presumption) is the mother of all error. It often makes sense to take the following steps, especially if you are less familiar with the data:
1. Research the data structure online – is there some official data legend or data structure definition?
2. Research an example online if the source data is available online. For the above example, one could search Wikipedia for ‘269019710’, ‘31197816’ and ‘Linux Is My Friend’. Are there references to these numbers? Are these numbers used in URLs and/or article IDs, or do they refer to something else?

The reason for these steps is basically to learn more about the data, and specifically its structure. With this example, all looks fairly easy, but if we’re honest with ourselves, we do not know what the first two numbers mean, and we do not know if the text ‘Linux Is My Friend’ refers to an article title, DVD title, or book cover etc. You can start to see how big data handling can be quite an adventure, and data structures can and do get a whole lot more complex than this.
Let us say for a moment that we actioned items 1 and 2 above and we learned more about the data and its structure. We learned (fictively) that the first number is a classification group for all literary works, and the second is a specific and unique article ID. We also learned from our research that : is indeed a clear and established field separator which cannot be used except for field separation. Finally, the text in the third field lists the actual title of the literary work. Again, these are made-up definitions, which will help us to continue exploring tools we can use for big data handling.
If no documentation is available on the data or its structure, you can start by making some assumptions about the data (through research), write them down, and then verify the assumptions against all data available to see whether they stand. Regularly, if not often, this is the only way to really start processing big data. At times, a combination of both is available: some lightweight syntax description combined with research and lightweight assumptions about the data, for example field separators, termination strings (often \n, \r, \r\n, \0) etc. The more right you get it, the easier and more accurate your data wrangling work will be!
Next, we will be verifying how accurate our discovered rules are. Always verify your work with the actual data!
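As a concrete, hedged illustration of such verification (assuming : really is the field separator and \n the line terminator), commands along these lines could test both assumptions against every line of the file:

```
$ # Count the ':'-separated fields on each line; if every line reports '3', the three-field assumption holds
$ awk -F':' '{print NF}' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | sort | uniq -c
$ # Show non-printing characters on two sample lines (GNU coreutils cat); a trailing '$' indicates plain \n endings
$ head -n2 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | cat -A
```

Any field count other than 3 in the first command would hint at : characters appearing inside titles, which would immediately weaken the ‘clear and established field separator’ assumption.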
Example 2: grep and wc
In example 1, we concluded that the first field was the classification group for all literary works. Let’s logically try to check this…
```
$ grep '269019710' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | wc -l
100
$ wc -l enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
329956 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
```
Hmmm. We have 100 literary works in total, in a file with about 330k lines. That doesn’t seem quite right. Still, as we downloaded only a small portion of the Wikipedia database, it’s still possible… Let’s check the next item: a unique ID in the second field.
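Before we do, one caveat about the grep above, worth making explicit: it matches ‘269019710’ anywhere on a line, not only in the first field, so a second field or title containing that number would also be counted. A slightly stricter, hedged variant anchors the match to the start of the line:

```
$ # Count only the lines whose first field is exactly 269019710
$ grep -c '^269019710:' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ # The awk equivalent, comparing the first ':'-separated field as a string
$ awk -F':' '$1 == "269019710"' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | wc -l
```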
```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
269019710:31197816:Linux Is My Friend
```
Very cool. At first glance, that would seem to be accurate as there is only a single line which matches.
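A single grep only proves uniqueness for this one ID, though. To test the ‘unique article ID’ assumption across the whole file, a hedged whole-file check could list any second-field value that occurs more than once:

```
$ # Extract field 2 from every line and print only duplicated values; empty output means the field looks unique
$ cut -d':' -f2 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | sort | uniq -d | head
```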
The third field would not be so easy to verify, though we could check if the text is unique at least:
```
$ grep --binary-files=text 'Linux Is My Friend' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
269019710:31197816:Linux Is My Friend
```
OK, so the title seems unique.
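The same whole-file approach can be applied to the titles, with the small twist (another assumption) that a title might itself contain a :, so we take everything from the third field onwards:

```
$ # '-f3-' keeps field 3 and everything after it, so titles containing ':' stay intact; empty output means no duplicate titles
$ cut -d':' -f3- enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | sort | uniq -d | head
```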
Note also that a new option was added to the grep command, namely --binary-files=text, which is a very important option to use on every grep command you write from here on, in all your data mangling (yet another applicable term) work. I did not use it in the previous grep commands to save on complexity. So why is it so important, you may ask? The reason is that often, when textual files contain special characters, tools like grep may see the data as binary whereas it is actually text.
At times, this leads to grep not working correctly, and the results become undefined. Whenever I write a grep, --binary-files=text will almost always be included (unless I am quite confident the data is not binary). It simply ensures that if the data looks binary, or even at times is binary, the grep will still work correctly. Note that this is less of a concern for some other tools like sed, which seem to be more aware/capable by default. In summary: always use --binary-files=text for your grep commands.
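To see why the option matters, here is a small, purely hypothetical demonstration using a throwaway file with an embedded NUL byte (the exact ‘binary file matches’ message, and whether it goes to stdout or stderr, varies between grep versions):

```
$ # Create a tiny test file containing a NUL byte (hypothetical example)
$ printf 'first line\nsecond\0line\n' > /tmp/nul-demo.txt
$ # Without the option, GNU grep treats the file as binary and hides the matching line
$ grep 'first' /tmp/nul-demo.txt
Binary file /tmp/nul-demo.txt matches
$ # With --binary-files=text (shorthand: -a), the match is printed as normal text
$ grep --binary-files=text 'first' /tmp/nul-demo.txt
first line
```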
In summary, we have found a concern with our research: the 100 works matching the first-field classification group do not by any means seem to cover all literary works listed on Wikipedia, even allowing for the fact that this is only a subset of the total data, though it remains possible.
This then highlights the need for a back-and-forth process which is often part of big data munging (yes… another term!). We could refer to this as ‘big data mapping’, and introduce yet another term for more or less the same overall process: manipulating big data. In summary, the process of going back and forth between the actual data, the tools you’re working with, and the data definition, legend or syntax is an integral part of the data manipulation process.
The better we understand our data, the better we can handle it. At some point, the learning curve towards new tools gradually declines, and the learning curve towards better understanding each new data set being handled increases. This is the point at which you know you’re a big data transformation expert, as your focus is no longer on the tools – which you know by now – but on the data itself, leading to faster and better end results overall!
In the next part of the series (of which this is the first article), we will look at more tools which you can use for big data manipulation.
You may also be interested in reading our short semi-related Retrieving Webpages Using Wget Curl and Lynx article, which shows how to retrieve webpages in both HTML and TEXT/TXT based format. Always use this knowledge responsibly (i.e. don’t overload servers and only retrieve public domain, no copyright, or CC-0 etc. data/pages), and always check if there is a downloadable database/dataset of the data you are interested in, which is much preferred to individually retrieving webpages.
Conclusion
In this first article in the series, we defined big data manipulation in as far as it relates to our article series and discovered why big data manipulation can be both fun and rewarding. One could, for example, take – within applicable legal boundaries! – a large public domain textual dataset, and use Bash utilities to transform it into the desired format and publish the same online. We started looking at various Bash tools which may be used for big data manipulation and explored examples based on the publicly available Wikipedia database.
Enjoy the journey, but always remember that big data has two sides to it; a side where you’re in control, and… well… a side where the data is in control. Keep some valuable time available for family, friends and more (31197816!), before getting lost parsing the myriads of big data out there!