# Large File Transfers
This page describes best practices and suggestions for when you are moving more than 100 GB of data:

- Between filesystems on the cluster (e.g. from `/data` to `/lfs`);
- Into or out of the cluster.
> [!CAUTION]
> If you want to move massive amounts of data (>10 TB), please always let the admins know. If you are not confident in exactly what you're doing, please let us handle such transfers. Please try to run large transfers at times when the network is quiet (nights and weekends).
## Between filesystems on the CICA cluster
> [!IMPORTANT]
> Always use `rsync` for large file transfers. Never, ever use `mv`.
The principle is to first copy the data, using the `rsync` tool, and then delete the original once the copy has finished.
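A minimal sketch of that sequence (the paths are placeholders; the recommended options for each kind of transfer are discussed below):

```
# 1. Copy.
rsync -a --progress /data/someuser/files/ /lfs/data/someuser/files/

# 2. Optional extra check: a checksum-only dry run. Any file that differs
#    between the two trees is listed; no output means they match.
rsync -anic /data/someuser/files/ /lfs/data/someuser/files/

# 3. Only once you are satisfied, delete the original.
rm -rf /data/someuser/files/
```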
If you are not already familiar with `rsync`, please take some time to learn about it and practice before starting a long-running copy.
`rsync` is much, much better than `cp` or `scp`. File transfers with `rsync` can be stopped and resumed without having to start again from the beginning. Each file copied will be automatically checksummed to make sure it is identical to the original. There are many other advantages and options.
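For example, if a long copy dies partway through, re-running the identical command resumes it (paths here are illustrative):

```
rsync -a --progress /data/someuser/bigdir/ /lfs/data/someuser/bigdir/
# ... interrupted (Ctrl-C, dropped connection, reboot) ...
rsync -a --progress /data/someuser/bigdir/ /lfs/data/someuser/bigdir/
# picks up where it left off: completed files are skipped; only the file
# that was in flight is redone (see --partial below for keeping partial files)
```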
First, log on to the node `s01`. The only purpose of this node is to manage filesystem operations. Never run jobs on this node that are not related to moving files around. The reason for using `s01` is that it is connected directly to the disks. Using any other node will result in your data doing an extra round-trip across the network for no good reason.
You probably want to start a `screen` session if the copy is going to take a long time:

```
screen -S bigcopy
```
Then launch `rsync` with the `--no-compress` and `-WS` options, plus whatever other options you want. Usually you want `-a` and `--progress`.

An example command:

```
rsync -a -WS --no-compress --progress -R /data/someuser/files/./source /lfs/data/someuser/newfiles
```
This will create a copy of all files and directories under `/data/someuser/files/./source` in `/lfs/data/someuser/newfiles`.
Do not just copy this command blindly without taking the time to learn what it does, and if necessary, adapt it to what you want. For example, ask yourself -- do you understand what the `-R` option does, and why there is a `/./` in the middle of the source path?
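If not, here is a brief illustration (same example paths as above): with `-R` (`--relative`), the `/./` marker tells `rsync` where the preserved directory structure begins on the destination side.

```
# Marker after "files": only "source" is recreated under the destination
rsync -a -R /data/someuser/files/./source /lfs/data/someuser/newfiles
# result: /lfs/data/someuser/newfiles/source/...

# Marker moved earlier: more of the path is preserved
rsync -a -R /data/./someuser/files/source /lfs/data/someuser/newfiles
# result: /lfs/data/someuser/newfiles/someuser/files/source/...
```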
> [!IMPORTANT]
> If you are still learning `rsync`, practice with small transfers using the same paths first. It is very easy to get the `rsync` path specification wrong. In particular, a trailing slash on `rsync` paths is meaningful -- read the manual!
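A safe way to practice is a dry run: with `-n` (`--dry-run`), `rsync` lists what it would transfer without copying anything. For instance (the destination here is an illustrative scratch directory):

```
# No files are copied; the listing shows where things would land
rsync -anv /data/someuser/files/source /tmp/rsync-test
# Same command with a trailing slash on the source -- compare the listed
# paths to see why the trailing slash matters
rsync -anv /data/someuser/files/source/ /tmp/rsync-test
```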
While the files are copying, you can disconnect/close your `screen` session and reconnect to it later with `screen -dr bigcopy` (remember, it's running on `s01`).
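A typical `screen` round trip looks like this (session name as in the example above):

```
# Detach from the running session: press Ctrl-a, then d
# Later, from a fresh login:
ssh s01              # the session lives on s01
screen -ls           # list sessions; look for "bigcopy"
screen -dr bigcopy   # reattach (detaching it elsewhere first if needed)
```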
> [!NOTE]
> Within the cluster you can expect transfer speeds of around 120 MB/s if the system is quiet. This is limited by the disk read/write speed, not the network.
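> As a rough yardstick, at that rate 1 TB takes about 2.3 hours (10^6 MB / 120 MB/s ≈ 8300 s), so plan multi-terabyte transfers accordingly.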
## Into/out of the CICA cluster
The same principles apply:
- Log on to `s01` and issue all data-moving commands from that node;
- Use `rsync` whenever possible.
In this case a suitable `rsync` command would be:

```
rsync -avz --partial --progress /data/someuser/files/./source otheruser@remote.host:/data/otheruser/dest
```
The differences with internal transfers are `-z` (turn on compression) and `--partial` (keep partially transferred files, so an interrupted transfer can be resumed). You should know what these options do and whether they are appropriate for you.
Note that `rsync` has an option `--bwlimit` to set a maximum bandwidth. Sometimes this will help you as well as other users.
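For instance (the limit value and hostname are illustrative; on `rsync` versions older than 3.1 the value must be given as a plain number of KiB/s, e.g. `--bwlimit=51200`):

```
# Cap the transfer at about 50 MB/s, leaving headroom for other users
rsync -avz --partial --progress --bwlimit=50M \
    /data/someuser/files/./source otheruser@remote.host:/data/otheruser/dest
```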
The maximum bandwidth of our connection to the outside world is 1 GB/s, but it is highly unlikely a single off-site transfer will be able to use all of that.
> [!CAUTION]
> Do not run anything apart from data management jobs on `s01`. If you crash this node the consequences will be serious.