Spark and Hadoop commands for pipelines

Intro

Here we describe some HDFS and Spark commands that are useful for pipelines.

Make a directory

sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
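
To confirm the directory was created you can simply list the HDFS root:

sudo -u spark /data/hadoop/bin/hdfs dfs -ls /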

Copy a file or directory to the local filesystem

sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
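
Uploading works the same way with -copyFromLocal; for example, a sketch for pushing a local archive into the /dwca-imports directory created above (dr251.zip is just an illustrative file name):

sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal -f /tmp/dr251.zip /dwca-imports/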

Delete some dr

sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*

or, for a bigger clean:

sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
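
Before running either delete you can check how much data sits under each of those paths (plain hdfs -du, as a sanity check):

sudo -u spark /data/hadoop/bin/hdfs dfs -du -h /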

If you are trying to remove everything, it may be simpler to:

  1. shut down the hadoop cluster
  2. reformat HDFS with:
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format

(this was suggested by Dave in Slack; a combined sketch follows).
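
Put together, and assuming the same install paths used in the restart section further down this page, the sequence would look roughly like this (a sketch, not a tested procedure; formatting the namenode irreversibly wipes all HDFS metadata):

# stop HDFS
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
# wipe the filesystem metadata (irreversible)
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format
# start HDFS again
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh

Datanodes may also need their local data directories cleared afterwards if they refuse to start with a cluster ID mismatch.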

Copy all duplicateKeys.csv files into /tmp

If a dr has duplicate keys it cannot be indexed, and you will see a log message like:

The dataset can not be indexed. See logs for more details: HAS_DUPLICATES

In this case a duplicateKeys.csv file is generated with details of the duplicate records. You can copy these files into the local filesystem with:

for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v "     0 " | cut -d "/" -f 3` ; do sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv; done
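
To inspect one of those files directly in HDFS before copying them all, you can cat it (dr251 is used here purely as an example):

sudo -u spark /data/hadoop/bin/hdfs dfs -cat /pipelines-data/dr251/1/validation/duplicateKeys.csv | head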

Remove orphan occurrences from biocache-store

During the migration of uuids you can find occurrences from drs that no longer exist in your collectory. In this case you will get indexing errors for those missing drs with the message NOT_AVAILABLE.

And in hdfs those drs only have their uuids under identifiers, with no other data.

So we'll delete them from biocache-store.

You have to install jq and avro-tools and follow these steps (a quick sanity check of the resulting uuid list is sketched after the list):

  • Create a file listing all these drs; let's call it /tmp/missing
  • Copy the avro identifier files of those drs:
for i in `cat /tmp/missing` ; do mkdir -p /tmp/missing-uuids/$i/; sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/identifiers/ala_uuid/* /tmp/missing-uuids/$i/; done
  • Join all the uuids to delete into a single file:
for i in `ls /tmp/missing-uuids/dr*/*avro`; do avrocat $i | jq .uuid.string | sed 's/"//g' >> /tmp/del_uuids; done
  • scp the /tmp/del_uuids file to your biocache-store server.
  • Delete the records in biocache-store with biocache-store delete-records -f /tmp/del_uuids.
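
Before running the delete it is worth a quick sanity check on the uuid list, deduplicating it and counting how many records will be removed (a minimal sketch using standard tools):

# deduplicate in place, then count the uuids that will be deleted
sort -u /tmp/del_uuids -o /tmp/del_uuids
wc -l /tmp/del_uuids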

Restart spark & hadoop

sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh 
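
After the restart it is worth checking that HDFS came back healthy and that all datanodes re-registered. Both commands below are standard HDFS tooling, run here as the hdfs user to match the stop/start commands above:

# summary of capacity and live/dead datanodes
sudo -u hdfs /data/hadoop/bin/hdfs dfsadmin -report
# check the filesystem for missing or corrupt blocks
sudo -u hdfs /data/hadoop/bin/hdfs fsck /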

Remove old backup of identifiers

⚠️ Double check this before executing it (for instance, put an echo in front of the rm command to verify what it will do; the listing sketched after the script shows which paths would match).

#!/bin/bash

# Find and delete all 'ala_uuid_backup' in any sub-directory of '/pipelines-data/*/1/identifiers/'
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}' | while read -r file
do
    if [ -n "$file" ]; then
        sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$file"
    fi
done
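
To see beforehand exactly which paths the script would remove, you can run just the listing part of the pipeline (the same commands as above, minus the rm):

sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}'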

Remove all except identifiers/uuids

Use this if you want to delete everything except the identifiers info, so that the same uuids are preserved.

⚠️ Double check this before executing it (for instance, put an echo in front of the rm commands to verify what they will do).

#!/bin/bash

# List all the base directories in /pipelines-data
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data | grep '^d' | awk '{print $8}' | while read -r base_dir
do
    # Ensure the base_dir variable is not empty
    if [ -z "$base_dir" ]; then
        #echo "The base_dir variable is empty, skipping deletion."
        continue
    fi

    # List all subdirectories except the '1' directory which should contain the 'identifiers'
    sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir" | grep -v '^d.*\/1$' | awk '{print $8}' | while read -r sub_dir_to_delete
    do
        # Ensure the sub_dir_to_delete variable is not empty
        if [ -z "$sub_dir_to_delete" ]; then
            continue
        fi
        # Delete the subdirectory
        sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$sub_dir_to_delete"
    done

    # If the '1' directory exists, list and delete all contents except the 'identifiers' subdirectory
    if sudo -u spark /data/hadoop/bin/hdfs dfs -test -e "$base_dir/1"; then
        sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir/1" | grep -v 'identifiers' | awk '{print $8}' | while read -r content_to_delete
        do
            # Ensure the content_to_delete variable is not empty
            if [ -z "$content_to_delete" ]; then
                continue
            fi
            # Delete the content
            sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$content_to_delete"
        done
    fi
done
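
Afterwards you can check that only the identifiers directories survived; listing each dr's version-1 directory should show nothing but identifiers (a quick sanity check, nothing more):

sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1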