Big Data Manipulation for Fun and Profit Part 2

In the first part of this big data manipulation series (Big Data Manipulation for Fun and Profit Part 1, which you may want to read first if you haven't done so yet), we discussed at some length the various terminologies and some of the ideas surrounding big data, or more specifically as it relates to handling, transforming, mangling, munging, parsing, wrangling and manipulating the data. Often these terms are used interchangeably, and often their use overlaps. We also looked at the first set of Bash tools which may help us with work related to these terms.

This article will explore a further set of Bash tools which can help us when processing and manipulating text-based (or in some cases binary) big data. As mentioned in the previous article, data transformation in general is a semi-endless topic, as there are hundreds of tools for each particular text format. Remember that at times using Bash tools may not be the best solution, as an off-the-shelf tool may do a better job. That said, this series is specifically for all those (many) other times when no tool is available to get your data into the format of your choice.

And, if you want to learn why big data manipulation can be both profitable and fun… please read Part 1 first.

In this tutorial you will learn:

- More big data wrangling / parsing / handling / manipulation / transformation techniques
- What Bash tools are available to help you, specifically for text-based applications
- Examples showing different methods and approaches


Example 1: awk

Let's change the separator by using the -F option to awk:

```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk -F':' '{print $2}'
31197816
```

Exactly what we need. -F is described in the awk manual as the input field separator. You can see how awk can be used to print various columns perceived in the data (you can simply swap the $2 to $3 to print the third column, and so on), so that we can process it further into the format we like. Let's, to round things off, change the order of the fields and drop one field we don't think we need:

```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | awk -F':' '{print $3"\t"$2}' > out
$ cat out
Linux Is My Friend	31197816
```

Great! We changed the order of columns 2 and 3, sent the output to a new file, and changed the separator to a tab (thanks to the "\t" insert in the print statement). If we now simply process the whole file:

```
$ awk -F':' '{print $3"\t"$2}' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 > out
$
```

The whole input data is structurally changed to the new format! Welcome to the fun world of big data manipulation. You can see how with a few simple Bash commands we are able to substantially restructure/change the file as we deem fit. I have always found that Bash, combined with some off-the-shelf tools and perhaps a little Python coding, comes closest to the ideal toolset for big data manipulation. One of the main reasons for this is the multitude of tools available in Bash which make big data manipulation easier.
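By the way, if you would like to experiment with the field swap above without downloading the Wikipedia index file, a minimal sketch is to hand-feed the single record we have been working with into the same awk command:

```
$ # One-line stand-in for the index file, piped through the identical field swap
$ echo '269019710:31197816:Linux Is My Friend' | awk -F':' '{print $3"\t"$2}'
Linux Is My Friend	31197816
```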

Let's next verify our work:

```
$ wc -l enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
329956 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ wc -l out
329956 out
$ grep '31197816' out
Linux Is My Friend	31197816
```

Great – the same number of lines is present in the original and the modified file, and the specific example we used previously is still there. All good. If you like, you can dig a little further with commands like head and tail against both files to verify the lines look correctly changed across the board.
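As a sketch of such a spot check (assuming the out file generated above is still in place), you could eyeball the first few lines of each file side by side, and optionally have awk flag any output line that does not contain exactly two tab-separated fields:

```
$ # Compare the first lines of the original (colon-separated) and new (tab-separated) files
$ head -n 3 enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442
$ head -n 3 out
$ # Flag any line in the transformed file that does not have exactly two tab-separated fields
$ awk -F'\t' 'NF != 2 {print "check line " NR ": " $0}' out
```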

You could even try and open the file in your favorite text editor, but I would personally recommend vi as the number of lines may be large, and not all text editors deal well with this. vi takes a while to learn, but it’s a journey well worth taking. Once you get good with vi, you’ll never look back – it grows on you so to speak.

Example 2: tr

We can use the tr utility to translate or delete some characters:

```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t'
269019710	31197816	Linux Is My Friend
```

Here we change our field separator from a colon (:) to a tab (\t). Easy and straightforward, and the syntax speaks for itself.
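Keep in mind that tr maps characters positionally, first-to-first, second-to-second, so several separators can be translated in one pass. A small sketch on a made-up variant of our line (with underscores standing in for spaces, purely for illustration):

```
$ # ':' becomes a tab and '_' becomes a space, in a single tr invocation
$ echo '269019710:31197816:Linux_Is_My_Friend' | tr ':_' '\t '
269019710	31197816	Linux Is My Friend
```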

You can also use tr to delete any character:

```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr -d ':' | tr -d '[0-9]'
Linux Is My Friend
```

You can see how we first removed : from the output by using the delete (-d) option to tr, and next we removed every digit in the 0-9 range ([0-9]). Note that tr works with character sets and ranges rather than regular expressions, so the brackets are not strictly needed here: '0-9' on its own does the same job, whereas '[0-9]' additionally deletes any literal [ or ] characters (harmless in the line above).
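For reference, here is the same deletion exercised on just our single line, alongside the equivalent POSIX character class form which tr also accepts; both can be pasted straight into a terminal:

```
$ # The plain range and the POSIX class are equivalent here; both strip every digit
$ echo '269019710:31197816:Linux Is My Friend' | tr -d ':' | tr -d '0-9'
Linux Is My Friend
$ echo '269019710:31197816:Linux Is My Friend' | tr -d ':' | tr -d '[:digit:]'
Linux Is My Friend
```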

Note how changing the : to \t still does not enable us to use awk without changing the field separator, as there are now both tabs (\t) and spaces in the output, and both are seen by default (in awk) as field separators. So printing $3 with awk leads to just the first word (before a space is seen):

```
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t' | awk '{print $3}'
Linux
```

This also highlights why it is always very important to test, retest and test again all your regular expressions and data transforming/manipulating command statements.
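One way around this particular pitfall, sketched below on the same pipeline, is to tell awk explicitly that only a tab separates fields, so that the multi-word title stays together in a single field:

```
$ # Restricting the field separator to a tab keeps 'Linux Is My Friend' in one field
$ grep '31197816' enwiki-latest-pages-articles-multistream-index19.txt-p30121851p31308442 | tr ':' '\t' | awk -F'\t' '{print $3}'
Linux Is My Friend
```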

Conclusion

The multitude of tools in Bash makes big data manipulation fun and in some cases very easy. In this second article in the series, we continued to explore Bash tools which may help us with big data manipulation.

Enjoy the journey, but remember the warning given at the end of the first article… Big data can seem to have a mind of its own, and there are inherent dangers in working with a lot of data (or with input overload, as in daily life). These are (mainly) perception overload, perfection overreach, time lost, and overuse of the prefrontal cortex (and other brain areas). The more complex the project, source data or target format, the larger the risk. Speaking from plenty of experience here.

A good way to counteract these dangers is to set strict time limits on working with complex and large data sets. For example, two hours (at most) per day. You'll be surprised what you can achieve if you commit to a dedicated two hours and consistently don't go over it. Don't say I didn't warn you 🙂

Let us know your thoughts below – interesting large data sets, strategies (both technical and lifestyle/approach), and other ideas are welcome!