Data Transfer - shawfdong/hyades GitHub Wiki
I've installed a variety of tools on Hyades, to facilitate large data transfer between Hyades and other Supercomputing centers[1]. Please note that due to limited resources, I've only installed the client parts of those tools on Hyades. For example, I only installed GridFTP client tools (globus-data-management-client), but not the servers. As a result, we can't use Globus Online to move data to/from Hyades. But this setup is more than sufficient for our main goal, which is to move large data between Hyades and other Supercomputing centers, as detailed in this guide.
There are a few Data Transfer Nodes at NERSC, dedicated to performing transfers between NERSC data storage resources and storage resources at other sites. For smaller files, you can use Secure Copy (SCP) or Secure FTP (SFTP) to transfer them to/from NERSC nodes. For larger files, GridFTP and BBCP provide better transfer rates. Here are a few illustrative examples of transferring data between Hyades and NERSC nodes. NOTE my NERSC username is shawdong, but my Hyades username is dong; and those are used in the examples. Please adjust them accordingly.
On Hyades, you must first load the globus module:
$ module load globus
Get a short-lived NERSC certificate[2]
For the very first time you request a NERSC certificate, issue the following command:
$ myproxy-logon -T -l shawdong -s nerscca.nersc.gov
NOTE shawdong is my NERSC username. Change it to yours; but if your usernames are the same on Hyades and NERSC, it is unnecessary to use the -l username option. The -T flag will pick up the necessary trust anchors, to be stored in $HOME/.globus/certificates/, so that the GridFTP client on Hyades can trust NERSC certificates.
When prompted Enter MyProxy pass phrase:, enter your NERSC NIM/LDAP password. A short-lived NERSC certificate, along with its private key, will be generated and saved at /tmp/x509up_u${UID}. By default, the certificate is valid for 12 hours. When it expires, you can request a new one with:
$ myproxy-logon -l shawdong -s nerscca.nersc.gov
Now we can use the certificate to transfer files to/from NERSC during its lifetime, without the hassle of a password!
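Since the certificate expires after 12 hours, it can help to check the remaining lifetime before starting a long transfer. The following is only a minimal sketch: it assumes grid-proxy-info (part of the Globus proxy utilities) is on your PATH after loading the globus module, and the function names and the one-hour default threshold are hypothetical choices, not part of the Hyades setup.

```shell
# Hypothetical helper: succeed (exit 0) when the remaining proxy lifetime,
# given in seconds, is below a threshold (default: 3600 seconds).
proxy_needs_renewal() {
    timeleft=$1
    threshold=${2:-3600}
    [ "$timeleft" -lt "$threshold" ]
}

# Hypothetical wrapper: ask grid-proxy-info for the seconds left and request
# a new NERSC certificate only when needed. Assumes grid-proxy-info is
# installed; a missing or unreadable proxy is treated as 0 seconds left.
renew_nersc_proxy_if_needed() {
    nersc_user=$1
    timeleft=$(grid-proxy-info -timeleft 2>/dev/null) || timeleft=0
    if proxy_needs_renewal "${timeleft:-0}"; then
        myproxy-logon -l "$nersc_user" -s nerscca.nersc.gov
    fi
}
```

For example, running renew_nersc_proxy_if_needed shawdong before a batch of transfers would prompt for the password only when less than an hour of lifetime remains.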
List the files in my Global scratch directory at NERSC[3]:
$ globus-url-copy -list \
    gsiftp://[email protected]/global/scratch2/sd/shawdong/
Copy a 10GB file from Hyades to dtn01 at NERSC:
$ globus-url-copy -vb -fast -p 8 \
    file:/scratch/tmp/10G.dat \
    gsiftp://[email protected]/global/scratch2/sd/shawdong/10G.dat
NOTE here we employ the option -p 8 to use 8 parallel data connections.
Copy a 100GB file from dtn01 at NERSC to Hyades:
$ globus-url-copy -vb -fast -p 8 \
    gsiftp://[email protected]/global/scratch2/sd/shawdong/100G.dat \
    file:/scratch/tmp/100G.dat
Copy all files in a directory, using the -r flag:
$ globus-url-copy -vb -fast -p 8 -r \
    file:/scratch/ \
    gsiftp://[email protected]/global/scratch2/sd/shawdong/scratch/
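The globus-url-copy examples above share the same options, so a small wrapper can cut down on typing mistakes. This is a hedged sketch rather than a tool installed on Hyades: gridftp_cmd is a hypothetical name, and it only prints the command line so that you can inspect it before running it yourself. It adds -r when the source URL ends with a slash, matching the directory example above.

```shell
# Hypothetical helper: print (not run) a globus-url-copy command line.
# Arguments: source URL, destination URL, optional number of parallel
# data connections (default 8, as in the examples above).
gridftp_cmd() {
    src=$1
    dst=$2
    streams=${3:-8}
    opts="-vb -fast -p $streams"
    # A source ending in "/" is treated as a directory: copy recursively.
    case $src in
        */) opts="$opts -r" ;;
    esac
    printf 'globus-url-copy %s %s %s\n' "$opts" "$src" "$dst"
}
```

Calling gridftp_cmd file:/scratch/tmp/10G.dat followed by a gsiftp:// destination URL prints the equivalent of the 10GB example; paste the output into your shell once it looks right.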
The systems at NERSC on which GridFTP is available are listed at http://www.nersc.gov/users/software/grid/data-transfer/. Among them, a particularly noteworthy one is garchive.nersc.gov, for access to High Performance Storage System (HPSS) archive at NERSC.
List my files on HPSS:
$ globus-url-copy -list \
    gsiftp://[email protected]/home/s/shawdong/
Copy a 10GB file from Hyades directly to HPSS:
$ globus-url-copy -vb -fast -p 8 \
    10G.dat \
    gsiftp://[email protected]/home/s/shawdong/10G.dat
BBCP is a point-to-point network file copy application written by Andy Hanushevsky at SLAC. It can transfer files at close to line speed over the WAN.
On Hyades, send a file to dtn01 at NERSC, using BBCP[4]:
$ bbcp -T "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
    /scratch/tmp/10G.dat \
    [email protected]:/global/scratch2/sd/shawdong/
NOTE because bbcp is at a nonstandard location (/usr/common/usg/bin/) on dtn01, the -T flag is required to specify its location on the target of the file transfer.
Get a file from dtn01 at NERSC, using BBCP:
$ bbcp -z -S "ssh -x -a -oFallBackToRsh=no %I -l %U %H /usr/common/usg/bin/bbcp" \
    [email protected]:/global/scratch2/sd/shawdong/10G.dat \
    /scratch/tmp/
NOTE the -S option is required to specify the location of bbcp on the source of the file transfer; the -z option is used to reverse connection initiation, in order to get around the firewall on Hyades. We can also add the option -P sec (e.g., -P 2) to show progress messages every sec seconds[5].
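The -T and -S arguments above differ only in the path of the remote bbcp binary, so the long ssh template can be generated instead of retyped. The sketch below is illustrative only: the function names are hypothetical, and the functions print the bbcp command line rather than running it.

```shell
# Hypothetical helper: build the ssh command template that bbcp takes via
# -S (source) or -T (target), given the remote path of the bbcp binary.
bbcp_ssh_template() {
    printf 'ssh -x -a -oFallBackToRsh=no %%I -l %%U %%H %s\n' "$1"
}

# Print a "push" command: the remote host is the target, so -T is used.
# Arguments: local file, remote host:path, remote bbcp path.
bbcp_push_cmd() {
    local_file=$1
    remote=$2
    remote_bbcp=$3
    printf 'bbcp -T "%s" %s %s\n' \
        "$(bbcp_ssh_template "$remote_bbcp")" "$local_file" "$remote"
}

# Print a "pull" command: the remote host is the source, so -S is used,
# together with -z to reverse connection initiation.
# Arguments: remote host:path, local destination, remote bbcp path.
bbcp_pull_cmd() {
    remote=$1
    local_dest=$2
    remote_bbcp=$3
    printf 'bbcp -z -S "%s" %s %s\n' \
        "$(bbcp_ssh_template "$remote_bbcp")" "$remote" "$local_dest"
}
```

For NERSC, the third argument would be /usr/common/usg/bin/bbcp, the location given above.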
I've installed all the recommended Remote File Transfer tools, for data transfer between Hyades and the NASA Advanced Supercomputing (NAS) systems. However, I no longer have an account at NAS, so I am unable to test the commands. You are encouraged to follow their instructions. For your convenience, links to relevant instructions are listed below. Please do let me know if you run into any problem.
The Secure Unattended Proxy (SUP) allows users to perform remote operations on specific hosts within the NAS enclave (currently the Pleiades front-ends/bridge nodes and Lou) without the use of SecurID at the time of the operation. SUP is a Perl script. To use it, load the perl module first:
$ module load perl
shift is a framework for Self-Healing Independent File Transfer that provides high performance and resilience for local and remote transfers through a variety of techniques. shift is a Perl script. To use it, the perl module must be loaded first:
$ module load perl
bbFTP is file transfer software optimized for large files (larger than 2GB). Only the bbFTP client is installed on Hyades. It is in your PATH, so you can simply invoke it.
bbSCP is a NAS-developed wrapper of bbFTP that provides an SCP-like command line interface. bbSCP only encrypts usernames and passwords; it does not encrypt the data being transferred. bbSCP is a Perl script. To use it, the perl module must be loaded first:
$ module load perl
Ranch (ranch.tacc.utexas.edu) is the long-term mass storage solution at the Texas Advanced Computing Center (TACC). TACC supports three transfer mechanisms: scp, rsync and GridFTP (globus-url-copy)[6]. The first two options, scp and rsync, are too slow for transferring large amounts of data over long distances, e.g., between Hyades and TACC, and should thus be avoided. globus-url-copy is the tool of choice. To use it, first load the globus module on Hyades:
$ module load globus
Unfortunately, I don't have an account at TACC, so the following instructions are based on my educated guess. Please give me your feedback, and let me know what works and what doesn't.
Ranch is an XSEDE GridFTP Endpoint[7], the URL of which is gsiftp://gridftp.ranch.tacc.utexas.edu:2811/. The MyProxy server for the endpoint is myproxy.teragrid.org, from which you'll get a short-lived proxy certificate.
For the very first time you request a proxy certificate, issue the following command on Hyades:
$ myproxy-logon -T -l TACCusername -s myproxy.teragrid.org
Replace TACCusername with your real TACC/XSEDE username; but if your usernames are the same on Hyades and TACC, it is unnecessary to use the -l username option. The -T flag will pick up the necessary trust anchors, to be stored in $HOME/.globus/certificates/, so that the GridFTP client on Hyades can trust XSEDE certificates.
When prompted Enter MyProxy pass phrase:, enter your TACC/XSEDE password. A short-lived proxy certificate, along with its private key, will be generated and saved at /tmp/x509up_u${UID}. By default, the certificate is valid for 12 hours. When it expires, you can request a new one with:
$ myproxy-logon -l TACCusername -s myproxy.teragrid.org
Now we can use the certificate to transfer files to/from Ranch during its lifetime, without the hassle of a password!
List the files in a directory:
$ globus-url-copy -list \
    gsiftp://[email protected]:2811/YourDestinationDirectory/
Copy a 10GB file from Hyades to Ranch:
$ globus-url-copy -vb -fast -p 8 -stripe -tcp-bs 8M \
    file:/scratch/tmp/10G.dat \
    gsiftp://[email protected]:2811/YourDestinationDirectory/10G.dat
NOTE here we employ the option -p 8 to use 8 parallel data connections; -stripe to use multiple service nodes; and -tcp-bs 8M to set ftp data channel buffer size to 8MB.
For more usage examples of globus-url-copy, see the GridFTP subsection above.
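A common rule of thumb for picking a -tcp-bs value is the bandwidth-delay product (BDP) of the network path, i.e., bandwidth multiplied by round-trip time. The sketch below just does that arithmetic; bdp_bytes is a hypothetical helper, and the 1 Gb/s bandwidth and 70 ms round-trip time are made-up illustrative figures, not measurements of the path between Hyades and TACC.

```shell
# Hypothetical helper: bandwidth-delay product in bytes.
# Arguments: bandwidth in Mbit/s, round-trip time in milliseconds.
# (mbps * 10^6 / 8) bytes/s * (rtt_ms / 1000) s simplifies to
# mbps * 125 * rtt_ms.
bdp_bytes() {
    mbps=$1
    rtt_ms=$2
    echo $(( mbps * 125 * rtt_ms ))
}

# Illustrative figures only: 1000 Mbit/s and a 70 ms round trip.
bdp_bytes 1000 70   # prints 8750000
```

With those assumed numbers the BDP is about 8.75 million bytes, which is in the same range as the 8M buffer used in the example above. Measure the actual round-trip time (e.g., with ping) before settling on a value.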
OLCF (Oak Ridge Leadership Computing Facility) provides nodes dedicated to data transfer that are available via dtn.ccs.ornl.gov[8]. These nodes have been tuned specifically for wide-area data transfers, and also perform well on local-area transfers. There are currently four interactive nodes, dtn01 through dtn04, each with the .ccs.ornl.gov suffix. Only dtn03 and dtn04 are set up to use Open Science Grid (OSG) certificate authentication, and only dtn01 and dtn02 are set up to use DOE grid certificate authentication.
NOTE only dtn03 and dtn04 appear to be working.
We can use the SSHFTP (GridFTP-over-SSH) protocol to transfer files between Hyades and OLCF. The scheme is sshftp, which leverages SSH to form control channel connections (the data channel is not authenticated)[9].
First, make sure the globus module is loaded on Hyades:
$ module load globus
List the files in my Work directory at OLCF[10]:
$ globus-url-copy -list \
    sshftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/
When prompted Enter PASSCODE:, enter your PIN + 6-digit token code shown on your SecurID key fob.
NOTE if your OLCF username is different from your Hyades one, you must specify it, e.g.:
$ globus-url-copy -list \
    sshftp://[email protected]/lustre/atlas/scratch/OLCFusername/
On Hyades, transfer a 10GB file from Hyades to dtn at OLCF[11]:
$ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb \
    file:/scratch/tmp/10G.dat \
    sshftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/10G.dat
NOTE here we employ the option -p 4 to use 4 parallel data connections; and -tcp-bs 12M to set ftp data channel buffer size to 12MB.
On Hyades, transfer a 100GB file from dtn at OLCF to Hyades:
$ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb \
    sshftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/100G.dat \
    file:/scratch/tmp/100G.dat
GridFTP-over-SSH works fine, albeit with a little inconvenience that you must enter your OTP (one-time password) every time you run globus-url-copy.
GridFTP with GSI (Grid Security Infrastructure) allows us to use a proxy certificate for authentication, which saves us from typing a password for every transfer. However, it takes a few convoluted steps to set up GSI at OLCF. Please follow the instructions in OLCF's guide on Obtaining an Open Science Grid Certificate. The whole procedure may take a few days; fortunately we only need to do it once.
OLCF's security is fairly hardened. Even after your certificate has been successfully registered with OLCF, if you simply try to obtain a short-lived proxy certificate on Hyades, without taking the extra step below, you'll get an error like the following:
$ myproxy-logon -s myproxy.ccs.ornl.gov
Enter MyProxy pass phrase:
Failed to receive credentials.
ERROR from myproxy-server: No credentials exist for username "dong".
You must first generate a proxy certificate from your OSG certificate at OLCF. Log on to dtn03 or dtn04 at OLCF, then run:
[dtn03]$ module load globus
[dtn03]$ myproxy-init -n
Your identity: /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=People/CN=Shawfeng Dong 2826
Enter GRID pass phrase for this identity:
Creating proxy .......................................................................................................................................................... Done
Proxy Verify OK
Your proxy is valid until: Mon Nov 3 18:33:46 2014
A proxy valid for 168 hours (7.0 days) for user dong now exists on myproxy.ccs.ornl.gov.
NOTE here the GRID pass phrase is the pass phrase for your OSG certificate. This step generates a 7-day proxy certificate from your OSG certificate and stores it on the OLCF MyProxy server.
Now you are finally ready to get a proxy certificate on Hyades:
$ myproxy-logon -s myproxy.ccs.ornl.gov
Enter MyProxy pass phrase:
A credential has been received for user dong in /tmp/x509up_u1000.
NOTE here the MyProxy pass phrase is your OTP (PIN + 6-digit token code shown on your SecurID key fob). By default, you'll get a 12-hour proxy certificate, derived from your 7-day proxy certificate at OLCF. Now we can use the proxy certificate to move data between Hyades and OLCF, without the hassle of passwords.
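Before launching a transfer, a quick check that the credential file is actually in place can save a failed run. The following is a small sketch under the assumption, taken from the output above, that the proxy lives at /tmp/x509up_u followed by your numeric user ID; proxy_path and have_proxy_file are hypothetical helper names.

```shell
# Hypothetical helper: print the proxy location used in this guide,
# /tmp/x509up_u<uid>. With no argument, the current user's numeric ID
# (from id -u) is used.
proxy_path() {
    uid=${1:-$(id -u)}
    printf '/tmp/x509up_u%s\n' "$uid"
}

# Hypothetical helper: succeed when the proxy file exists and is non-empty.
# It does not check whether the certificate inside has expired.
have_proxy_file() {
    [ -s "$(proxy_path "$@")" ]
}
```

A typical use would be: have_proxy_file || myproxy-logon -s myproxy.ccs.ornl.gov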
Don't forget to load the globus module on Hyades:
$ module load globus
List the files in my Work directory at OLCF:
$ globus-url-copy -list \
    gsiftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/
NOTE if your OLCF username is different from your Hyades one, you must specify it, e.g.:
$ globus-url-copy -list \
    gsiftp://[email protected]/lustre/atlas/scratch/OLCFusername/
On Hyades, transfer a 10GB file from Hyades to dtn at OLCF:
$ globus-url-copy -tcp-bs 12M -bs 12M -p 4 -v -vb \
    file:/scratch/tmp/10G.dat \
    gsiftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/10G.dat
NOTE here we employ the option -p 4 to use 4 parallel data connections; and -tcp-bs 12M to set ftp data channel buffer size to 12MB.
On Hyades, transfer a 10GB file from dtn at OLCF to Hyades:
$ globus-url-copy -tcp-bs 12M -bs 12M -p 8 -v -vb \
    gsiftp://dtn.ccs.ornl.gov/lustre/atlas1/ast006/scratch/dong/10G.dat \
    file:/tmp/10G.dat
For moving larger files, OLCF also supports the multistreaming transfer utility BBCP[12].
On Hyades, transfer a local 10GB file to dtn03 at OLCF:
$ bbcp -P 2 -V -w 8m -s 16 \
    /scratch/tmp/10G.dat \
    dtn03.ccs.ornl.gov:/lustre/atlas1/ast006/scratch/dong/10G.dat
where
- -P 2 produces progress messages every 2 seconds.
- -V produces verbose output, including detailed transfer-speed statistics.
- -w 8m sets the TCP window (buffer) size to 8MB.
- -s 16 sets the number of parallel network streams to 16.
On Hyades, transfer the 10GB file from dtn03 at OLCF back to Hyades:
$ bbcp -z -P 2 -V -w 8m -s 16 \
    dtn03.ccs.ornl.gov:/lustre/atlas1/ast006/scratch/dong/10G.dat \
    /tmp/10G.dat
where
- -z reverses connection initiation, in order to get around the firewall on Hyades.
On Hyades, recursively copy a local directory to dtn03 at OLCF:
$ bbcp -r -P 2 -V -w 8m -s 16 \
    /scratch/tmp/ \
    dtn03.ccs.ornl.gov:/lustre/atlas1/ast006/scratch/dong/
where
- -r performs a recursive copy.
Although Globus Online is the officially supported and preferred method at NCSA (National Center for Supercomputing Applications) for moving files of significant size to/from the Blue Waters system[13], we can still use GridFTP command-line tools to move data between Hyades and Blue Waters.
On Hyades, you must first load the globus module:
$ module load globus
Get a proxy certificate for Blue Waters[14]
For the very first time you request a proxy certificate for Blue Waters, issue the following command:
$ myproxy-logon -T -l shawdong -s tfca.ncsa.illinois.edu
NOTE shawdong is my Blue Waters username. Change it to yours; but if your usernames are the same on Hyades and Blue Waters, it is unnecessary to use the -l username option. The -T flag will pick up the necessary trust anchors, to be stored in $HOME/.globus/certificates/, so that the GridFTP client on Hyades can trust NCSA certificates.
When prompted Enter MyProxy pass phrase:, enter your PIN + 6-digit token code shown on your SecurID key fob. A short-lived proxy certificate, along with its private key, will be generated and saved at /tmp/x509up_u${UID}. By default, the certificate is valid for 12 hours.
There is a minor issue. The signing policy of one of the CA certificates is so restrictive that we'll encounter an error when running the GridFTP client. Let's fix it™! Edit $HOME/.globus/certificates/ba240aa8.signing_policy and replace the line:
cond_subjects globus '"/DC=org/DC=incommon/C=US/ST=IL/L=Urbana/O=University of Illinois/OU=NCSA/OU=IGTF Server/CN=tfca.ncsa.illinois.edu"'
with
cond_subjects globus '"/DC=org/DC=incommon/C=US/ST=IL/L=Urbana/O=University of Illinois/OU=NCSA/CN=*.ncsa.illinois.edu"'
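If you would rather script the signing-policy edit than make it by hand, something along these lines should work. It is a sketch, not a tested part of the Hyades setup: relax_signing_policy is a hypothetical helper, and it relies on sed accepting -i with an attached backup suffix, which both GNU and BSD sed do. The original file is kept with a .bak extension.

```shell
# Hypothetical helper: rewrite the restrictive subject in a signing_policy
# file in place, keeping a backup copy with a .bak suffix.
# Argument: path to the signing_policy file.
relax_signing_policy() {
    policy=$1
    # "|" is used as the sed delimiter because the subject contains "/".
    sed -i.bak \
        's|OU=NCSA/OU=IGTF Server/CN=tfca\.ncsa\.illinois\.edu|OU=NCSA/CN=*.ncsa.illinois.edu|' \
        "$policy"
}
```

Usage: relax_signing_policy "$HOME/.globus/certificates/ba240aa8.signing_policy". Afterwards, compare the file with its .bak copy to confirm that only the cond_subjects line changed.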
In the future, if the proxy certificate has expired, you can request a new one with:
$ myproxy-logon -l shawdong -s tfca.ncsa.illinois.edu
List my Lustre Scratch directory on Blue Waters[15]:
$ globus-url-copy -nodcau -list \
    gsiftp://[email protected]/scratch/sciteam/shawdong/
NOTE the option -nodcau turns off data channel authentication for ftp transfers. Without the option, you'll get the error: The GSI XIO driver failed to establish a secure connection; likely because the default at NCSA is to enable data channel authentication.
NOTE You can choose a data mover between 1 and 28 for Blue Waters. The hostnames are ie[01-28].ncsa.illinois.edu.
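Since the data mover hostnames follow a fixed pattern, the name for a given number can be generated rather than typed. A minimal sketch, assuming only the ie[01-28].ncsa.illinois.edu pattern stated above; bw_mover is a hypothetical helper name.

```shell
# Hypothetical helper: print the hostname of Blue Waters data mover N,
# zero-padded to two digits. N must be a plain integer from 1 to 28
# (written without a leading zero); anything else is rejected.
bw_mover() {
    n=$1
    if [ "$n" -ge 1 ] && [ "$n" -le 28 ]; then
        printf 'ie%02d.ncsa.illinois.edu\n' "$n"
    else
        echo "data mover must be between 1 and 28" >&2
        return 1
    fi
}

bw_mover 7   # prints ie07.ncsa.illinois.edu
```

To spread transfers over the movers in a crude way, you could pick the number from the clock, e.g., bw_mover $(( $(date +%s) % 28 + 1 )).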
List my home directory on Nearline tape storage system:
$ globus-url-copy -nodcau -list \
    gsiftp://[email protected]/u/sciteam/shawdong/
NOTE You can choose a data mover between 1 and 28 for Nearline. The hostnames are hpss-md[02-50].ncsa.illinois.edu.
Copy a 10GB file from Hyades to my Lustre Scratch directory on Blue Waters:
$ globus-url-copy -nodcau -vb -fast -p 8 \
    file:/scratch/tmp/10G.dat \
    gsiftp://[email protected]/scratch/sciteam/shawdong/10G.dat
NOTE here we employ the option -p 8 to use 8 parallel data connections.
Copy a 100GB file directly from Hyades to Nearline tape storage system:
$ globus-url-copy -nodcau -vb -fast -p 8 \
    file:/scratch/tmp/100G.dat \
    gsiftp://[email protected]/u/sciteam/shawdong/100G.dat
Transfer a 4GB file from Blue Waters to Hyades:
$ globus-url-copy -nodcau -vb -fast -p 8 \
    gsiftp://[email protected]/u/sciteam/shawdong/4G.dat \
    /tmp/4G.dat
Copy all files in a directory, using the -r flag, to my home directory on Blue Waters:
$ globus-url-copy -vb -fast -p 8 -r \
    file:/scratch/ \
    gsiftp://[email protected]/u/sciteam/shawdong/scratch/
1. How to transfer large amounts of data via network
2. NERSC certificates
3. NERSC file systems
4. Using bbcp at NERSC
5. BBCP
6. Ranch User Guide
7. XSEDE - Data Transfers & Management
8. OLCF - Employing Data Transfer Nodes
9. SSHFTP (GridFTP-over-SSH)
10. OLCF - Data Management User Guide
11. OLCF - Transferring Data with GridFTP
12. OLCF - Transferring Data with BBCP
13. Blue Waters - Data Transfer
14. Blue Waters - Getting Started Guide
15. Blue Waters - Storage