Big Data Manipulation for Fun and Profit Part 3
There have been two previous articles in this series, which you may want to read first if you have not read them yet: Big Data Manipulation for Fun and Profit Part 1 and Big Data Manipulation for Fun and Profit Part 2.
In this series, we discuss various ideas and practical approaches for handling big data, or more specifically for handling, transforming, mangling, munging, parsing, wrangling and manipulating data at the Linux command line.
This third article in the series will continue exploring Bash tools which can help us when processing and manipulating text-based (or in some cases binary) big data. As mentioned in the previous articles, data transformation in general is a semi-endless topic, as there are hundreds of tools for each particular text format. Remember that at times using Bash tools may not be the best solution, as an off-the-shelf tool may do a better job. That said, this series is specifically for all those (many) other times when no tool is available to get your data into the format of your choice.
Finally, if you want to learn more about why big data manipulation can be both fun and profitable… please read Part 1 first.
In this tutorial you will learn:
- Additional big data wrangling / parsing / handling / manipulation / transformation techniques
- What Bash tools are available to assist you, specifically for text-based applications
- Various examples, showing different methods and approaches
Example 1: wc, head and vi – exploring data

For this example, we will work with a JSON status file, created by Wikipedia as part of their Data Dumps (ref any folder in https://dumps.wikimedia.org/enwiki/):
```
$ wget https://dumps.wikimedia.org/enwiki/20201020/dumpstatus.json
$ head -c100 dumpstatus.json
{"version": "0.8", "jobs": {"pagerestrictionstable": {"status": "done", "files": {"enwiki-20201020-p
$ wc -l dumpstatus.json
1
```
The wget command retrieves the file for us (this command is also handy if you have to download a large set of data files and want to automate it on your command line; a small sketch of this follows below), and the head -c100 command shows the first 100 characters of the file. This is a great way to quickly check the top of the file.
If the file somehow contained binary data, using the head -c100 command would not make too much of a mess in your terminal, and if the lines are very long (as is the case for this file), this command ensures we are not going to see many pages of scrolling text passing by.
The wc -l command shows us the number of lines.
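Coming back to the wget automation note above: a minimal, hypothetical sketch for fetching the status file of several dumps in one loop could look like this (the dump dates used here are placeholders only; adjust them to dumps which actually exist):

```
#!/bin/bash
# Hypothetical example: fetch the dump status file for a few dump dates.
# The dates below are placeholders only.
for date in 20201001 20201020; do
  wget -q "https://dumps.wikimedia.org/enwiki/${date}/dumpstatus.json" \
       -O "dumpstatus-${date}.json"
done
```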
Before starting to work with any big data, it is always a good idea to check out the contents of the file you are working with. I personally use and prefer vi, but you can use any text editor which feels comfortable to you. One of the benefits of vi is that it is excellent at opening and editing very large files. Open the file and have a look around: how long are the lines, what sort of data is this, etc.?
It is interesting to note here that vi, even though it has a steep learning curve, is also very powerful when it comes to bulk operations. For example, it can be quicker to generate a one million line file by simply executing a few vi commands inside vi than to write a little script to do the same. One great aspect about the learning curve of vi is that it tends to grow with you, as and when you require additional methods or procedures.
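As a rough sketch of that claim (keystrokes shown for vi/vim in normal mode; the filename is just an example): insert one sample line, yank it, paste it 999,999 times, then write the file:

```
i some sample line<Esc>
yy
999999p
:w millionlines.txt
```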
Also, using just two commands (head -c100 and wc -l), noting the filename, and checking quickly with vi we have already learned a myriad of things:
- This is a JSON file (.json extension)
- This file has very long line(s) (in vi, press the End key and note the counter at the bottom right, present on many vi installations): 110365 characters (an awk-based check is also sketched below)
- This file has a single line (wc -l)
- The file is highly structured (head -c100)

While this is a simple example, the idea is to highlight that if we spend a little time researching our source data, we can more easily work with it and understand how to transform or manipulate it better into the format we would like it to be in. This approach or methodology should become second nature for the data engineer.
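Incidentally, if you prefer to confirm the line length without opening vi, an awk one-liner (standard awk assumed) prints the length of every line; for this file we would expect a single number close to the 110365 characters noted above:

```
$ awk '{ print length }' dumpstatus.json
```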
The next important part of the big data manipulation process is to discern which tool will help most with the task at hand. If we were making generic extractions from or manipulations to this data, we would likely want to first search for a JSON compatible tool, or even a tool specifically made for JSON. There are many such tools, including many free and open source ones.
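For instance, if a ready-made JSON tool such as jq happens to be installed (just one possible choice; this series does not depend on it), a first structured look at the file could be as simple as:

```
$ jq . dumpstatus.json                 # pretty-print the whole file
$ jq '.jobs | keys' dumpstatus.json    # list the top-level job names under "jobs"
```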
Two good starting places are the search on github.com (for example type ‘JSON edit’ to see what generic tools are out there, or something more specific like ‘JSON tree’ to find a tool specific to JSON tree revision), and any major search engine. There are more than 100 million repositories on GitHub and you will almost always find at least one or two things which directly relate to, and potentially help with, your task or project at hand.
For GitHub specifically, you will want to keep the keywords short and generic to have the maximum number of relevant matches. Remember that while GitHub indeed has more than 100 million repositories, it is very small when compared with major search engines, and thus too specific a search (more than 2-3 words, or detailed words to any extent) will often result in poor or no results.
‘JSON’ (for a generic impression of the free ‘marketplace’), ‘JSON edit’ and ‘JSON tree’ are all good examples. ‘JSON tree builder’ and ‘JSON tree edit’ are borderline, and anything more specific than this may return no helpful results.
For this project, we will pretend to have analyzed all available JSON tools and found none to be suitable for what we wanted to do: we want to change all { to _ and " to =, and remove all spaces. We will then feed this data to our fictive AI robot who is programmed to fix mistakes in JSON. We want to have broken JSON to see if the robot performs well.
Let’s next transform some of this data and modify the JSON syntax using sed.
Example 2: sed

The Stream Editor (sed) is a powerful utility which can be used for a wide variety of big data manipulation tasks, especially by using Regular Expressions (RegEx). I propose to start by reading our article Advanced Bash RegEx With Examples, or Bash RegExps for Beginners With Examples if you are just starting out with sed and regular expressions. To learn a bit more about regular expressions in general, you may also find Python Regular Expressions with Examples to be of interest.
As per our plan of approach, we will change all { to _ and " to =, and remove all spaces. This is easy to do with sed. To start off, we will take a small sample from the larger data file to test our solution. This is a common practice when handling large amounts of data, as one wants to make sure the solution works accurately before changing the file at hand. Let’s test:
$ echo ' {"status": "done' | sed 's|{|_|g;s|"|=|g' _=status=: =done Great, it looks like our solution partially works. We have changed { to _ and " to =, but have not as yet removed the spaces. Let’s look at the sed instruction first. The s command in the overall sed command (encapsulated by single quotes) substitutes one bit of text with another, and it is regular expression aware. We thus changed the two characters we wanted to change in a from-to based approach. We also made the change across the whole input using the g (global) option to sed.
In other words one could write this sed instruction as: substitute|from|to|global, or s|f|t|g (in which case f would be replaced by t). Let’s next test the removal of spaces:
$ echo ' {"status": "done' | sed 's|{|_|g;s|"|=|g;s| *||g' _=status=:=done
Our final substitute command (`s| *||g`) includes a regular expression which will match any number of spaces, thanks to the * quantifier, and replace them with ‘nothing’ (corresponding to the empty ‘to’ field).
We now know our solution works correctly, and we can use this on the full file. Let’s go ahead and do so:
```
$ sed -i 's|{|_|g;s|"|=|g;s| *||g' dumpstatus.json
```

Here we use the -i option to sed, and pass the file (dumpstatus.json) as the final argument on the line. This will do an inline (-i) sed command execution directly on the file. No temporary or in-between files are required. We can then use vi again to verify that our solution worked correctly. Our data is now ready for our fictive AI robot to show its JSON mending skills!
It is also often a good idea to quickly grab a copy of the file before you start working on it, or to work with a temporary file if necessary, though in that case you may prefer a sed 's|...|...|' infile > outfile based command.
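A minimal sketch of both safety nets mentioned above (GNU sed also accepts a backup suffix directly after -i, which combines the two ideas; the output filename in the last variant is just an example):

```
$ cp dumpstatus.json dumpstatus.json.bak                  # keep a copy before editing in place
$ sed -i 's|{|_|g;s|"|=|g;s| *||g' dumpstatus.json

$ sed -i.bak 's|{|_|g;s|"|=|g;s| *||g' dumpstatus.json    # GNU sed: edit in place, keep a .bak copy

$ sed 's|{|_|g;s|"|=|g;s| *||g' dumpstatus.json > dumpstatus_modified.json   # leave the original untouched
```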
Learning how to use sed and regular expressions well has many benefits, and one of the main benefits is that you will be able to more easily handle big textual data by using sed to transform / manipulate it.
Conclusion

If you haven’t read our previous two articles in this series, and find the topic interesting, I strongly encourage you to do so. The links for these are in the introduction above. One reason for this is the warning highlighted in the first two articles to manage your time and engagement with technology when it comes to handling big data, and/or other complex IT topics in general, like complex AI systems. Straining the mind on a continual basis tends to yield poor long-term results, and (overly) complex projects tend towards this. Reviewing these articles, you can also learn about other tools to use for big data manipulation.
In this article, we explained how data engineers should seek to understand the data they are working on well, so that transformation and mangling are easier and more straightforward. We also looked at various tools which may help us learn more about the data as well as transform it.
Have you found interesting large data sets or developed great big data handling strategies (technical and/or lifestyle/approach)? If so, leave us a comment!