
Useful Command Lines

Introduction

(...)

pipes

Something that makes terminal shell commands powerful is the ability to chain commands with the pipe |. Let's see it in action.

Let's look at the contents of a meta.xml file. First we move to some dr location:

cd /data/biocache-load/dr603

and we can show the contents with cat:

cat meta.xml 

But if we want to see the contents of an occurrence.txt file, and that file is very long, it may be more useful to just count its lines:

cat occurrence.txt | wc
3540297 134473483 1548283140

The output shows the number of lines (3.5M), words, and bytes in that file. Here we pipe the output of cat into wc (the word count command) with |.
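If all we need is the line count, wc can produce it on its own with the -l flag (a minimal sketch):

wc -l occurrence.txt
3540297 occurrence.txt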

We can keep chaining commands until we get exactly what we need. For instance, this command gets only the 10th column (the ids), sorts the ids, and redirects the output to a file in the /tmp directory:

cat occurrence.txt | awk -F $'\t' '{print $10}' | sort > /tmp/dr-603-ids-load.txt

But let's explain this step by step.

Useful shell commands

head and tail

Instead of using cat, it is often useful to work with only a part of a file. In our previous example, with a 3.5M-line file, this is quite handy. So if we do:

head -50 occurrence.txt 

we'll see the first 50 lines of that file. This is handy for inspecting the header of a file.
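A related trick: printing only the header row and turning the TABs into newlines gives a numbered list of column names, which helps to find the column index we will pick later with awk. A sketch (the tr and cat -n combination is an extra suggestion, not part of the original walkthrough):

head -1 occurrence.txt | tr '\t' '\n' | cat -n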

The same works with tail:

tail -50 occurrence.txt 

we'll see the last 50 lines, i.e. the end of that file.
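tail can also count from the top instead of from the bottom: with -n +2 it prints everything from the second line onwards, i.e. the file without its header row (a sketch):

tail -n +2 occurrence.txt | wc -l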

head and tail are also useful for testing commands on a portion of a big file. In the previous long cat pipeline we can test our command with head instead of cat:

head -5 occurrence.txt | awk -F $'\t' '{print $10}'

This will print the 10th column of the first 5 lines of occurrence.txt, with columns separated by TABs (\t) (explanation in detail). The output looks something like:

occurrenceID
3084007342
1090938898
1090938908
3015196328

This is useful to check that we are selecting the column we are interested in, without having to process all 3.5M lines.

cat, sort, uniq

When we are sure this is what we want, we can keep piping the output into other commands with |, and replace head with cat to process the whole file:

cat occurrence.txt | awk -F $'\t' '{print $10}' | sort -n | uniq > /tmp/dr-603-ids-load-sorted.txt

In this case we take the occurrenceID of all 3.5M records, sort them, remove the duplicates, and write the output to the file /tmp/dr-603-ids-load-sorted.txt. More details of that command.
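If we want to know which ids are duplicated, uniq -c prefixes each id with the number of times it appears; a sketch that lists the most repeated occurrenceIDs first:

cat occurrence.txt | awk -F $'\t' '{print $10}' | sort | uniq -c | sort -rn | head

Note also that sort | uniq can be shortened to sort -u.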

wget, curl

Now let's compare this with a download from biocache. We can download a file using wget https://Some-URL or curl https://Some-URL:

curl -o /tmp/records-2021-06-30.zip https://registros-ws.gbif.es/biocache-download/0cb552f1-6421-3df8-a8bc-7573e6a584f9/1625070232900/records-2021-06-30.zip

With -o we indicate where to save the download. Now we move (cd) to the /tmp directory and we can unzip the output:

cd /tmp/
unzip /tmp/records-2021-06-30.zip
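wget would do the same job; a sketch assuming the same URL, where -O (capital letter) plays the role of curl's -o:

wget -O /tmp/records-2021-06-30.zip https://registros-ws.gbif.es/biocache-download/0cb552f1-6421-3df8-a8bc-7573e6a584f9/1625070232900/records-2021-06-30.zip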

Let's get the occurrenceID column of that download as well. As this CSV is separated by commas, we can do it like this:

cat records-2021-06-30.csv | awk -F '","' '{print $17}' | sort -n | uniq > /tmp/dr-603-ids-reg-sorted.txt 

we use '","' as separator, as all fields are double quoted in the CSV.

comparing two files of ids

Let's compare the two files of ids we have generated previously:

comm -23 /tmp/dr-603-ids-reg-sorted.txt /tmp/dr-603-ids-load-sorted.txt > /tmp/ids-only-in-reg.txt

This compares the two files: -23 suppresses the ids that appear in both files and those that appear only in the second file, leaving the ids that are only in the first file (the biocache download) and not in the loaded dr. More details.
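The other flag combinations of comm are useful too: -13 keeps the ids that are only in the second file, and -12 keeps the ids common to both. A sketch (the output file names are just illustrative):

comm -13 /tmp/dr-603-ids-reg-sorted.txt /tmp/dr-603-ids-load-sorted.txt > /tmp/ids-only-in-load.txt
comm -12 /tmp/dr-603-ids-reg-sorted.txt /tmp/dr-603-ids-load-sorted.txt > /tmp/ids-in-both.txt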

grep

We want to remove the old records whose ids are no longer present in the dr loaded from our IPT. But for that we need the LA uuid instead of the occurrenceID.

First we'll add the double quotes back around the ids:

cat /tmp/ids-only-in-reg.txt | sed 's/^/"/g' | sed 's/$/"/g' > /tmp/ids-only-in-reg-quoted.txt 
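The same quoting can be done with a single sed substitution, where & stands for the whole matched line (an equivalent sketch, not the command used above):

cat /tmp/ids-only-in-reg.txt | sed 's/.*/"&"/' > /tmp/ids-only-in-reg-quoted.txt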

The quoting step is explained in more detail here. Later we can use those quoted ids to search again in our records CSV; for this we use grep:

grep -Ff /tmp/ids-only-in-reg-quoted.txt /tmp/records-2021-06-30.csv > /tmp/to-delete.csv

More details here.
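A quick sanity check at this point is to compare the number of ids we searched for with the number of matching rows found (a sketch):

wc -l /tmp/ids-only-in-reg-quoted.txt /tmp/to-delete.csv

The two counts will not necessarily be identical (an id could match more than one row), but a large mismatch usually means the wrong field or separator was used.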

Now we can get the ids (the 8th field):

cat /tmp/to-delete.csv | awk -F '","' '{print $8}' > /tmp/dr603-ids-to-delete.txt
head /tmp/dr603-ids-to-delete.txt

We obtain:

8a863029-f435-446a-821e-275f4f641165
264e6a66-9c9e-4115-9aec-29d694c68097
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
8a863029-f435-446a-821e-275f4f641165
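As the sample above shows, the same uuid can appear several times, so it may be worth deduplicating the list first; a sketch that uses sort -u and writes back to the same file with -o:

sort -u -o /tmp/dr603-ids-to-delete.txt /tmp/dr603-ids-to-delete.txt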

And now we can delete those ids from Solr and Cassandra with biocache:

biocache delete-records -f /tmp/dr603-ids-to-delete.txt

ls

TODO

More

Useful command line Cheat-Sheet from Git Tower