Spark and Hadoop commands for pipelines
Intro
This page describes some HDFS and Spark commands that are useful when working with pipelines.
Make a directory
sudo -u spark /data/hadoop/bin/hdfs dfs -mkdir -p /dwca-imports
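To confirm the directory was created you can list it back (standard HDFS -ls):
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /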
Copy a file or directory to the local filesystem
sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/dr251/ /tmp
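The opposite direction works with -copyFromLocal; for example, to push a local archive into /dwca-imports (the local path here is only an illustration):
sudo -u spark /data/hadoop/bin/hdfs dfs -copyFromLocal /tmp/dr251.zip /dwca-imports/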
Delete some drs
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr*
or, for a bigger clean-up:
sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r /pipelines-data/dr* /pipelines-all-datasets/* /pipelines-clustering/* /pipelines-species/* /dwca-exports/* /pipelines-jackknife/*
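Before and after a big clean-up it can be useful to check how much space each area uses (standard HDFS -du):
sudo -u spark /data/hadoop/bin/hdfs dfs -du -s -h /pipelines-data /pipelines-all-datasets /pipelines-species /dwca-exports /pipelines-jackknife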
If you are trying to remove everything, it may be simpler to:
- shut down the Hadoop cluster
- reformat HDFS with:
hdfs namenode -format
(this was suggested by Dave in Slack; a fuller sequence is sketched below).
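A possible full sequence, assuming the same installation paths used elsewhere on this page (this wipes the whole of HDFS, so double-check before running):
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/bin/hdfs namenode -format
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh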
Copy all duplicateKeys.csv files into /tmp
If a dr has duplicate keys it cannot be indexed and you will see a log message like:
The dataset can not be indexed. See logs for more details: HAS_DUPLICATES
In this case a duplicateKeys.csv file is generated with details of the duplicate records. You can copy these files to the local filesystem with:
for i in `sudo -u spark /data/hadoop/bin/hdfs dfs -ls -S /pipelines-data/dr*/1/validation/duplicateKeys.csv | grep -v " 0 " | cut -d "/" -f 3` ; do sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/validation/duplicateKeys.csv /tmp/duplicateKeys-$i.csv; done
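Once copied, a quick way to see how many duplicate rows each dataset has (plain local commands, nothing pipelines-specific):
wc -l /tmp/duplicateKeys-*.csv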
Remove orphan occurrences from biocache-store
During the migration of uuids you may find occurrences belonging to drs that no longer exist in your collectory. In that case you will get indexing errors for those missing drs with the message NOT_AVAILABLE, and in HDFS you only have those uuids under identifiers, so we will delete them from biocache-store.
You have to install jq and avro-tools and follow these steps:
- Create a file listing all these drs, let's call it /tmp/missing
- Copy the avro files of those drs:
for i in `cat /tmp/missing` ; do mkdir -p /tmp/missing-uuids/$i/; sudo -u spark /data/hadoop/bin/hdfs dfs -copyToLocal -f /pipelines-data/$i/1/identifiers/ala_uuid/* /tmp/missing-uuids/$i/; done
- Join all the uuids to delete into a single file:
for i in `ls /tmp/missing-uuids/dr*/*avro`; do avrocat $i | jq .uuid.string | sed 's/"//g' >> /tmp/del_uuids; done
- scp that /tmp/del_uuids file to your biocache-store server.
- Delete in biocache-store (it is worth sanity-checking the file first, see below) with:
biocache-store delete-records -f /tmp/del_uuids
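Before running the delete, a quick sanity check of the uuid list (plain shell, nothing biocache-specific):
wc -l /tmp/del_uuids
head /tmp/del_uuids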
Restart Spark & Hadoop
sudo -u spark /data/spark/sbin/stop-slaves.sh
sudo -u spark /data/spark/sbin/stop-master.sh
sudo -u spark rm -Rf /data/spark-tmp/*
sudo -u hdfs /data/hadoop/sbin/stop-dfs.sh
sudo -u hdfs /data/hadoop/sbin/start-dfs.sh
sudo -u spark /data/spark/sbin/start-master.sh
sudo -u spark /data/spark/sbin/start-slaves.sh
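After the restart you can check that HDFS came back healthy with the standard dfsadmin report (assuming the same installation path):
sudo -u hdfs /data/hadoop/bin/hdfs dfsadmin -report | head -n 30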
Remove old backup of identifiers
⚠️ Double-check this before executing it (for instance, put an echo in front of the rm command to verify what it will do).
#!/bin/bash
# Find and delete all 'ala_uuid_backup' in any sub-directory of '/pipelines-data/*/1/identifiers/'
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}' | while read -r file
do
  if [ -n "$file" ]; then
    sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$file"
  fi
done
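As suggested in the warning above, you can do a dry run first by only listing what would be removed:
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/identifiers/ | grep 'ala_uuid_backup' | awk '{print $8}'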
Remove all except identifiers/uuids
Use this if you want to delete everything except the identifiers info, in order to preserve the same uuids.
⚠️ Double-check this before executing it (for instance, put an echo in front of the rm commands to verify what they will do).
#!/bin/bash
# List all the base directories in /pipelines-data
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data | grep '^d' | awk '{print $8}' | while read -r base_dir
do
  # Ensure the base_dir variable is not empty
  if [ -z "$base_dir" ]; then
    #echo "The base_dir variable is empty, skipping deletion."
    continue
  fi
  # List all subdirectories except the '1' directory which should contain the 'identifiers'
  sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir" | grep -v '^d.*\/1$' | awk '{print $8}' | while read -r sub_dir_to_delete
  do
    # Ensure the sub_dir_to_delete variable is not empty
    if [ -z "$sub_dir_to_delete" ]; then
      continue
    fi
    # Delete the subdirectory
    sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$sub_dir_to_delete"
  done
  # If the '1' directory exists, list and delete all contents except the 'identifiers' subdirectory
  if sudo -u spark /data/hadoop/bin/hdfs dfs -test -e "$base_dir/1"; then
    sudo -u spark /data/hadoop/bin/hdfs dfs -ls "$base_dir/1" | grep -v 'identifiers' | awk '{print $8}' | while read -r content_to_delete
    do
      # Ensure the content_to_delete variable is not empty
      if [ -z "$content_to_delete" ]; then
        continue
      fi
      # Delete the content
      sudo -u spark /data/hadoop/bin/hdfs dfs -rm -r "$content_to_delete"
    done
  fi
done
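Afterwards you can verify that only the identifiers directory remains under each dataset:
sudo -u spark /data/hadoop/bin/hdfs dfs -ls /pipelines-data/*/1/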