Copying Large Data Collections

Bolter and Chainsword

Copying large data sets

The challenge is to copy a very large amount of data (files in a file system) from one host to another.

There are many options, but we want one that minimizes re-copying as the date of the site upgrade nears. We want to be able to make a large initial copy, and then later, as files change, copy only the files that changed.

There's a tool (among a few) that I choose to use. I don't want to use scp because it's too slow, and it can't detect which files to skip the next time we want to synchronize the source and destination directories. I don't want to use tar either (well, I do, since it's fast and reliable), but like scp it can't cull out the files that don't need to be copied again.

We want to use a command that can figure out what was already copied, and what needs to be copied. We want rsync.

About rsync

Use rsync

/usr/bin/screen -dmS FOO rsync -avz -e "ssh -p PORT" USER@REMOTE_HOST:/PATH/ name-of-dump

An explanation.

Forget the command screen for a moment.

We're running rsync with the flags -avz:

-a archive mode: copy recursively and preserve permissions, ownership, timestamps, and symbolic links

-v be verbose in the output

-z compress the data during the transfer

-e defines the remote shell to use. Here we're using ssh and passing -p PORT (the port number on the remote host that listens for ssh)

The next two arguments are SOURCE and then DESTINATION.

The SOURCE is remote, in the syntax remote-user@remote-host:/path/to/directory/. The trailing slash on the source path tells rsync to copy the contents of the directory rather than the directory itself. The DESTINATION is the local directory (created if it doesn't exist) that will hold the copied files.
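Put together, a run might look like this (the host, port, user, and paths are made-up placeholders, not the real values):

$ rsync -avz -e "ssh -p 2222" deploy@forum.example.com:/var/www/site/ site-dump

This pulls /var/www/site/ from the remote host over ssh on port 2222 and writes it into the local directory site-dump.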

rsync is efficient. Re-running it later will be effective for transferring the files that have changed.
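If you want to see what a later run would do before it does anything, rsync's -n (--dry-run) flag lists the files it would transfer without copying them (same made-up values as above):

$ rsync -avzn -e "ssh -p 2222" deploy@forum.example.com:/var/www/site/ site-dump

Only files that are new or have changed since the last run should show up in that list.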

About screen

It would be a terrible waste if the command were interrupted during the transfer. The shell can end abruptly, or some other event on the local system where you spawned the shell can kill the rsync command.

So to protect against that event, run the whole command within screen.

screen creates a sub-shell, detached from the controlling terminal, and runs the command given to it in that sub-shell.
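That's what the -dmS part of the command above does: -d -m starts screen already detached, and -S gives the session a name. As a general sketch, with long-command standing in for any long-running job:

$ screen -dmS FOO long-command
$ screen -ls

screen -ls lists the sessions that exist and shows whether each one is attached or detached.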

It's detached from the controlling terminal. The quickest way to explain "detached" is this.

Suppose you run a command from the shell. If you type the command and wait for the response, the command runs in a shell that is attached to the controlling terminal (it's controlling because you're using it; it's active).

Now run the same command in the background:

$ long-command &

You get the shell prompt but the process long-command is still running, in the background, but the shell is still attached to the controlling terminal because you still have the shell open.
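While that shell is open, you still have job control over the background job. For example (long-command is just a placeholder):

$ jobs
$ fg %1

jobs lists the background jobs of the current shell, and fg %1 would bring the first one back into the foreground.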

Now if you did this again:

$ long-command &

Again, it's running in the background, but then you decide to LOGOUT.

$ exit

The process running long-command is now DETACHED from a controlling terminal. It has no controlling terminal.

But if you log back in, you have no job control over the job that you put in the background. Yes, you can kill the process. But you've lost shell job control, because there is no longer a controlling terminal attached to the shell that started long-command. Also a bummer.
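The most you can do from the new login is hunt the process down and signal it, roughly like this (substitute the real PID that ps reports):

$ ps -ef | grep long-command
$ kill PID

That kills it, but it doesn't give you back a terminal attached to it.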

Here comes screen to the rescue.

Run the whole thing in screen. Then if you log out, or lose the controlling terminal, you can still regain (attach to) the sub-shell that started the command, even if you log out and back in. The classic example: you lose power and your internet connection goes down. But you started the command in screen, so no worries. The host you were logged into is still running long-command inside the screen session, and it's safe; it's still running.

And best of all you can re-attach to the screen:

$ screen -r FOO

FOO is just a name you assign to the screen in case you have more than one screen.
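If you forget the name, list the sessions first:

$ screen -ls

The output shows each session and whether it is attached or detached, and screen -r FOO re-attaches to the one named FOO.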

And you can re-detach while it runs:

CTRL-A d

You can flip back and forth between attaching (-r) and detaching (CTRL-A d) as much as you want.