Backuptool

Introduction

The backuptool consists of an SQLite database holding meta-information about the files under its control, and the tool itself, which is a simple Ruby script.

The tool is created for Linux systems. It was tested on an Ubuntu 16.04 LTS server and an Ubuntu 18.04 LTS desktop.

The test hardware consisted of 2 HP LTO-4 fibre channel drives and a QLE2564 fibre channel adapter. Tests were done with one tape drive writing while the other was verifying. It is possible to run multiple writes at the same time if your server can sustain the data rates needed by both tape drives in parallel.

Intended use and goals

The backuptool is intended to make backing up to LTO tape from Linux systems easy. Windows/Mac support is not a goal of the project. It is targeted at people who have a lot of data and no good way of backing it up to the cloud. Cloud backup becomes troublesome when data volumes exceed 2 TB.

As a secondary goal, the tapes created should be fully usable even if the tool that generated them is lost. To that end, GNU tar is used in the background. To make restores as fast as possible, the tool does not allow data to span multiple tapes. Each tape starts with a manifest.txt file that lists the files on the tape together with their SHA1 checksums. If a tape is found and the user has no idea what is on it, popping it in the drive and running:

tar --occurrence=1 -xvf /dev/nst0 manifest.txt

should be a quick operation. The user can then scan the manifest to see what is on the tape.

The tool supports verifying the tape by reading the manifest.txt first and then matching all extracted files to the SHA1 checksum in the manifest.

The database (called catalog from now on) contains the following information about the files:

Absolute path
mtime
sha1 checksum
size
on_disk flag

The entry of note here is the on_disk flag. A file can remain in the catalog even after it is deleted from the server. As long as the file is on one of the tapes, the catalog will know about it, so it is not deleted from the catalog. Its on_disk flag is set to 0 instead.

Installation

Installation is painless on Ubuntu Linux. Since the tool is built to be as lightweight as possible, you will only need Ruby and the ruby sqlite3 bindings. It is however recommended to pull the tool via git, so you can get easy updates. The recommended installation procedure is outlined below:

sudo apt install ruby ruby-sqlite3 git
cd /usr/local/src
sudo git clone https://github.com/erikvanhamme/filetools.git
cd /usr/local/bin
sudo ln -s /usr/local/src/filetools/backuptool backuptool
mkdir -p ~/.local/share/backuptool

The catalog is stored in ~/.local/share/backuptool/backuptool.db. It is created upon first execution of the tool.

Update

If you followed the above installation, updating the tool is as easy as:

cd /usr/local/src/filetools
sudo git pull

Catalog

The catalog is the database backing the backuptool script. It is located in ~/.local/share/backuptool/backuptool.db. It is critical that you do NOT lose this file. Since it is much smaller than the actual data to be backed up, consider backing it up to Google Drive, Dropbox, Crashplan or a similar service.

Entities

There are 3 entities in the catalog:

Files: the files in the dataset.
Tapes: the tapes the user owns and uses to backup the dataset.
Tapesets: collections of tapes containing files of a common type.

Tapes are identified by their standard LTO tape labels (see the IBM tape label specs). These types of tapes are currently supported: LTO-1 through LTO-8, including LTO-M8.

Tapesets are intended to be used to collect files of a certain type in the dataset. It is possible to make a single tapeset for the entire dataset, or make a tapeset for movies, music, photos, ...

Command line

A command line invocation of the backuptool contains 3 kinds of tokens:

One, and only one, action.
Zero to N modifiers.
Zero to N action arguments.

The order in which the kinds appear is not important; however, the order of the action arguments is. This means that:

backuptool --action --modifier1 ARG1 --modifier2 ARG2

Is the same as:

backuptool --modifier2 --modifier1 --action ARG1 ARG2

But is not the same as:

backuptool --modifier2 --modifier1 --action ARG2 ARG1

Use whatever ordering you like, as long as you keep the action arguments in order.
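The ordering rule can be illustrated with a minimal Ruby sketch. This is not the tool's actual parser: the ACTIONS and MODIFIERS lists are small hypothetical subsets of the real options, the classify helper is made up for illustration, and only long options are handled.

```ruby
# Hypothetical subsets of the real option lists, for illustration only.
ACTIONS   = %w[--help --add-paths --add-tapes --write --rename].freeze
MODIFIERS = %w[--verbose --quiet --rewind --offline --double].freeze

# Splits a command line into its action, its modifiers and its action
# arguments. Modifiers are sorted because their order does not matter;
# action arguments keep the order in which they were given.
def classify(argv)
  options   = argv.select { |token| token.start_with?('--') }
  unknown   = options - ACTIONS - MODIFIERS
  actions   = options.select { |token| ACTIONS.include?(token) }
  modifiers = options.select { |token| MODIFIERS.include?(token) }
  arguments = argv.reject { |token| token.start_with?('--') }

  raise ArgumentError, "unknown option(s): #{unknown.join(' ')}" unless unknown.empty?
  raise ArgumentError, 'exactly one action is required' unless actions.size == 1

  { action: actions.first, modifiers: modifiers.sort, arguments: arguments }
end
```

With this sketch, two invocations that differ only in where the action and modifiers appear classify identically, while swapping two action arguments gives a different result.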

Command line help is available via -h or --help, which produces this output:

Backuptool (version: 2, db_version: 3)
Usage help:
===========
Actions:
--------
Misc:
-h        --help                             Displays this help.
Paths:
          --add-paths PATH...                Adds paths to database.
          --update PATH...                   Updates paths in database.
          --remove-paths PATH...             Removes paths from database.
          --prune PATH...                    Checks database paths for removed files.
          --check PATH...                    Checks database paths for duplicates.
Tapes:
          --add-tapes LABEL...               Adds tapes to database.
          --remove-tapes LABEL...            Removes tapes from database.
          --list-tapes <SEARCHTERM...>       Lists tapes in database.
          --info-tapes <LABEL...>            Shows info about tapes in database.
          --manifest LABEL                   Shows the files on a tape.
          --mark-erased LABEL...             Marks tapes as erased.
          --mark-written LABEL...            Marks tapes as written.
          --mark-verified LABEL...           Marks tapes as verified.
          --erase LABEL DEVICE               Erases the tape on the device.
          --write LABEL DEVICE               Writes the tape on the device.
          --verify LABEL DEVICE              Verifies the tape on the device.
          --write-verify LABEL DEVICE        Writes and verifies the tape on the device.
          --replace LABEL LABEL              Replaces the tape in the database.
          --stage LABEL PATH                 Stages the tarball in the given path.
Tapesets:
          --add-tapeset NAME PATH...         Adds a tapeset to the database.
          --remove-tapesets NAME...          Removes tapesets from the database.
          --rename NAME NAME                 Renames a tapeset in the database.
          --list-tapesets <SEARCHTERM...>    Lists tapesets in the database.
          --info-tapesets <NAME...>          Shows info about tapesets in the database.
Modifiers:
----------
Misc:
-v        --verbose                          Run verbosely.
-q        --quiet                            Run quietly.
          --sql                              Shows SQL queries. Useful for debugging.
          --async                            Database in async mode. Unsafe if PC unstable or power cut.
Tapes:
          --rewind                           Rewinds tape after write or verify.
          --offline                          Offlines (ejects) tape after write or verify.
          --compression                      Enables the tape drives hardware compression.
Tapesets:
-2        --double                           Double redundancy for --add-tapeset.
-3        --triple                           Triple redundancy for --add-tapeset.
-i        --incremental                      Incremental tapeset creation for --add-tapeset.
          --ffd                              First Fit Decreasing bin packing for --add-tapeset.
          --no-tape-level                    Fill tape to max instead of leveling for --add-tapeset.

Actions

--help

Displays usage help.

--add-paths

This action recursively adds paths to the catalog. Each argument is treated as a path which will be scanned recursively for files. All the files that are not in the database already will be hashed (SHA1) and added to the database. If a nonexistent path is supplied, the backuptool will complain.
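As a rough idea of what such a scan involves, here is a minimal Ruby sketch using only the standard library. The scan_new_files helper and the known_paths array (a stand-in for a catalog lookup) are assumptions made for illustration; the real script stores its records in the SQLite catalog.

```ruby
require 'digest/sha1'
require 'find'

# Returns one record per regular file under root that is not listed in
# known_paths. known_paths is a hypothetical stand-in for a catalog lookup.
def scan_new_files(root, known_paths = [])
  raise ArgumentError, "no such path: #{root}" unless File.exist?(root)

  records = []
  Find.find(root) do |path|
    next unless File.file?(path)

    absolute = File.expand_path(path)
    next if known_paths.include?(absolute) # already known, so not hashed again

    stat = File.stat(absolute)
    records << {
      path: absolute,
      mtime: stat.mtime.to_i,
      sha1: Digest::SHA1.file(absolute).hexdigest,
      size: stat.size,
      on_disk: 1
    }
  end
  records
end
```

The record fields mirror the catalog fields listed in the introduction. Hashing is the slow part, which is why files that are already known are skipped before the SHA1 is computed.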

--update

This action recursively updates paths in the catalog. Each argument is treated as a path which will be scanned recursively for files. The found files are then compared to the catalog. If a file with the exact same absolute path is in the catalog, the mtime and size will be compared.

If the mtime and size are the same, no further action is taken.
If the mtime or size differs, the file is hashed (SHA1). After the hashing, the tool checks if the old file is written on a tape. If it is written on a tape, the old file is flagged as not on disk (on_disk=0) and a new file is added to the catalog. If the old file was not on a tape, the record in the catalog is updated with the new size, mtime and sha1.
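The decision described above can be summarised in a few lines of Ruby. This is only a sketch of the rules as stated here: the update_action name, the symbols it returns and the shape of the record hash are all made up for illustration.

```ruby
# Returns the catalog change --update would make for one file, as a symbol.
# record is the existing catalog entry; on_tape says whether that entry has
# already been written to a tape. All names here are illustrative.
def update_action(record, mtime:, size:, on_tape:)
  return :no_change if record[:mtime] == mtime && record[:size] == size

  # The file changed, so it is rehashed. A version that is on tape has to
  # stay in the catalog: it is flagged on_disk=0 and a new entry is added.
  on_tape ? :flag_old_and_add_new : :update_in_place
end
```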

--remove-paths

This action recursively scans the given paths and removes the files it finds in them from the catalog. For each file that is removed, the tool will check if the file is on a tape or not. If the file is not on a tape, the record is simply removed from the catalog. If the file is on a tape, its on_disk flag will be set to 0.

--prune

This action prunes the catalog of files that are no longer on disk. Each path to prune is given as an argument. The backuptool finds all the files in the catalog that are under the supplied paths and checks whether each one is still present on disk. If it is present, no changes are made to the catalog. If it is NOT present, the tool checks whether the file is on a tape. If the file is not on a tape, the record is deleted from the catalog. If the file is on a tape, the on_disk flag is set to 0 in the catalog.
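A minimal Ruby sketch of this pruning rule follows, with the catalog modelled as an array of hashes. The prune helper, the tape_paths list and the exists: parameter are hypothetical; the real tool works on the SQLite catalog.

```ruby
# Applies the prune rule to a list of catalog records and returns the
# records that remain. tape_paths lists the paths that are on some tape.
# exists: is injectable so the rule can be tried without touching the disk.
def prune(records, tape_paths, exists: File.method(:exist?))
  records.each_with_object([]) do |record, kept|
    if exists.call(record[:path])
      kept << record                    # still on disk: unchanged
    elsif tape_paths.include?(record[:path])
      kept << record.merge(on_disk: 0)  # gone from disk but on tape: flag it
    end
    # gone from disk and not on any tape: dropped from the catalog
  end
end
```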

--check

This action will find all the files recursively in the paths given as arguments. All found files will then be hashed (SHA1). The tool checks in the catalog if there are other files with the same SHA1 hash as the found file. If it finds the same SHA1 in the catalog, it will print a statement about the match. If not found there is no output.

This action is useful to check for duplicates. If you are adding a bunch of files to the dataset, you can scan first for duplicates. The output of the --check action can be easily changed to a remove script with the use of sed and awk.
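The duplicate check itself boils down to a hash lookup, as in the following Ruby sketch. The find_duplicates helper and the catalog_by_sha1 hash (SHA1 digest to cataloged path) are assumptions for illustration, and the sketch makes no claim about the exact output format of --check.

```ruby
require 'digest/sha1'
require 'find'

# Returns [found_path, cataloged_path] pairs for every file under root whose
# SHA1 is already present in catalog_by_sha1, a hypothetical
# { sha1 => path } view of the catalog.
def find_duplicates(root, catalog_by_sha1)
  matches = []
  Find.find(root) do |path|
    next unless File.file?(path)

    known = catalog_by_sha1[Digest::SHA1.file(path).hexdigest]
    matches << [path, known] if known
  end
  matches
end
```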

--add-tapes

This action adds tapes to the catalog. It can be used in 2 ways:

Each tape label supplied as a separate argument.
Multiple tape labels grouped together in an argument. Each label is then preceded and succeeded by a *.

The second form is useful if you have many tapes and have bought a cheap barcode reader on eBay. A possible invocation:

backuptool --add-tapes *000000L1**000001L1*
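How the two argument forms could be normalised into a plain list of labels is sketched below in Ruby. The parse_labels and valid_label? helpers are hypothetical, and the label pattern (6 characters followed by L1 to L8 or M8) is an assumption based on the supported tape types, not taken from the tool's source.

```ruby
# Turns a mix of plain labels and barcode-style groups such as
# "*000000L1**000001L1*" into a flat list of labels.
def parse_labels(args)
  args.flat_map do |arg|
    arg.include?('*') ? arg.scan(/\*([^*]+)\*/).flatten : [arg]
  end
end

# Assumed label shape: 6 uppercase letters or digits plus a media
# identifier (L1 to L8, or M8).
def valid_label?(label)
  /\A[A-Z0-9]{6}(?:L[1-8]|M8)\z/.match?(label)
end
```

Both forms can be mixed in this sketch, so parse_labels(['000002L1', '*000000L1**000001L1*']) yields three labels.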

--remove-tapes

This action removes tapes from the catalog. It can be useful if you are upgrading tapes (to remove the old ones).

The tool will refuse to remove tapes that are part of a tapeset.

The tapes arguments can be supplied in the same 2 ways as --add-tapes.

--list-tapes

Lists the tape labels known to the catalog. Can be run with or without arguments. When run without arguments, the tool will list all the tapes.

When run with arguments, the tape labels are filtered against the given search terms. A search term supports the % wildcard character, which can be used multiple times.

For example, to list all LTO-4 tapes starting with 0 and all LTO-6 tapes:

backuptool --list-tapes 0%L4 %L6
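The wildcard matching can be mimicked in Ruby as shown below. The sketch assumes that % matches any run of characters, including none, the way it does in an SQL LIKE pattern, and that a label is listed when it matches any of the search terms. The helper names are made up for illustration.

```ruby
# Converts a search term with % wildcards into an anchored Regexp.
# Assumes % matches any run of characters, as in an SQL LIKE pattern.
def like_to_regexp(term)
  parts = term.split('%', -1).map { |part| Regexp.escape(part) }
  Regexp.new("\\A#{parts.join('.*')}\\z")
end

# Keeps the labels that match at least one search term. Without search
# terms, all labels are returned.
def filter_labels(labels, terms)
  return labels if terms.empty?

  patterns = terms.map { |term| like_to_regexp(term) }
  labels.select { |label| patterns.any? { |pattern| pattern.match?(label) } }
end
```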

--info-tapes

Gives detailed info about the tapes known to the catalog. Can be run with a set of labels to restrict the output only to the tapes with the given labels.

Supplying the --verbose modifier will show more columns.

--manifest

Prints the manifest of the tape with the given label. Only takes one tape label as argument. The output is the content of the manifest.txt file that is (or will be) on the tape.

--mark-erased --mark-written --mark-verified

These actions can be used to update the catalog in case you have written, verified or erased a tape externally (without calling on backuptool to do it).

External operations can be:

write: use of dd to write a tape from a staged tarball (see --stage).
verify: extracted the tape on a different system, and verified the files with sha1sum -c manifest.txt in the extraction directory.
erase: tape erased in external eraser, or by calling mt-st -f /dev/nst0 erase

Use the --mark-x actions to bring the catalog in line with reality.

--erase

Erases the tape using long erase (takes 2 hours on an LTO-4). Marks the tape as erased afterwards.

Takes the tape label and the device to erase it on as arguments.

--write

Writes the tape using tar. Marks the tape as written afterwards.

Takes the tape label and the device to write it on as arguments.

--verify

Reads the tape from the given device and matches all the files to the checksums in the manifest.txt file. Running verify is essentially the same as running tar xvf /dev/nst0 && sha1sum -c manifest.txt, except that it does not need one tape's worth of disk space to unpack first. The extracted data is filtered on the fly through the SHA1 algorithm to verify it and is not saved on disk.
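The idea of hashing the stream without unpacking it can be sketched in Ruby. This is not how the tool itself does it (the tool relies on GNU tar); the sketch uses the tar reader bundled with RubyGems as a stand-in, assumes the manifest uses the sha1sum line format and is the first entry in the archive, and may not cope with GNU tar extensions such as very long file names. The helper names are made up.

```ruby
require 'digest/sha1'
require 'rubygems/package'

# Parses sha1sum-style lines ("<hex digest>  <path>") into { path => digest }.
def parse_manifest(text)
  text.each_line.with_object({}) do |line, sums|
    digest, path = line.chomp.split(/\s+/, 2)
    sums[path] = digest if path
  end
end

# Streams a tar archive from io and returns the names of the files whose
# SHA1 does not match the manifest. Nothing is written to disk.
def verify_stream(io, chunk_size = 1 << 20)
  sums = nil
  failed = []
  Gem::Package::TarReader.new(io).each do |entry|
    next unless entry.file?

    if sums.nil?
      raise 'manifest.txt must be the first file' unless entry.full_name == 'manifest.txt'

      sums = parse_manifest(entry.read.to_s)
      next
    end

    # Hash the entry chunk by chunk so memory use stays bounded.
    sha1 = Digest::SHA1.new
    while (chunk = entry.read(chunk_size))
      sha1.update(chunk)
    end
    failed << entry.full_name unless sums[entry.full_name] == sha1.hexdigest
  end
  failed
end
```

Because the manifest comes first, every later file can be checked as soon as it has streamed past, which is why no scratch space is needed.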

Takes the tape label and the device to verify it on as arguments.

--write-verify

First writes and then verifies the tape on the device. See --write and --verify for more details.

--replace

Replaces a tape with another one. Use this in case a tape was lost, or dropped and you want to replace the old tape with a new one in the tapeset.

Replacing a tape will require a new tape that is not in the catalog yet. It will mark the new tape as erased since it has not been written yet.

--stage

The stage action stages the tarball to be put on the tape on disk somewhere. This is useful if the tarball will contain many small files and the server cannot meet the data rate required by the tape deck.

To avoid shoe-shining the tape, you can stage the tarball on disk first (assuming you have the free disk space) and then use dd to transfer it to the actual tape. After the staging, you have one large, sequentially written file that the system can transfer to tape at a high enough data rate with much less stress.

Please note: The backuptool assumes a blocksize of 4MB in all the operations. Make sure to use bs=4M with dd when transferring to tape.

--add-tapeset

Adds a tapeset to the catalog. It takes the name of the tapeset and a set of paths as arguments.

The paths are used as search terms to find all the files in the catalog that are descendants of the given paths. These files are collected, the number of tapes needed is calculated, and the files are distributed over the tapes with a bin packing algorithm.

You can add multiple copies of a new tapeset at once by using the --double and --triple modifiers.

Adding tapesets will try to level the files over the tapes, preferring to have 2 tapes each loaded at 51% over one tape at 100% and one tape at 2%. This behavior can be overridden with the --no-tape-level modifier.

The bin packing algorithm used to divide the files over the tapes is 'First Fit'. To use 'First Fit Decreasing' instead, use the --ffd modifier. More info can be found in the Wikipedia article on bin packing.
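For readers unfamiliar with the two strategies, here is a minimal Ruby sketch of the textbook versions. It works on bare sizes and ignores tape leveling, so it is an illustration of the algorithms rather than the tool's implementation.

```ruby
# First Fit: each item goes into the first bin that still has room for it;
# a new bin is opened when none has. Items are never split, just as the
# tool never spans a file over multiple tapes.
def first_fit(sizes, capacity)
  bins = []
  sizes.each do |size|
    raise ArgumentError, "item of size #{size} exceeds the capacity" if size > capacity

    bin = bins.find { |candidate| candidate.sum + size <= capacity }
    if bin
      bin << size
    else
      bins << [size]
    end
  end
  bins
end

# First Fit Decreasing: the same, after sorting the items from large to small.
def first_fit_decreasing(sizes, capacity)
  first_fit(sizes.sort.reverse, capacity)
end
```

For the sizes 5, 6, 4, 5 and a capacity of 10, First Fit needs 3 bins ([5, 4], [6], [5]) while First Fit Decreasing needs only 2 ([6, 4], [5, 5]). Sorting first often helps, at the cost of no longer keeping files in their original order.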

These things should be considered when adding tapesets:

The selection algorithm will sort the tapes by label and take the first available one(s) from the list. If you have 8 free LTO-1 tapes in the catalog with low-numbered labels, these will be selected over a free LTO-4 tape with a higher label number. You can control this by estimating how much data will be in the tapeset and only adding the free tapes needed. This forces the add tapeset feature to use only the tapes you want.
The add tapeset feature cannot deal with tapes of different sizes. It will assume that all selected tapes have the same size as the first selected tape.

The 2 points above will be fixed in a future release. See the Real world usage examples section below for more info on how to create tapesets without getting burned by the current limitations.

--remove-tapesets

Removes the tapesets given as arguments. This will delete the file-to-tape allocations. Files that are in the catalog and on tape, but not on disk (on_disk flag is 0), will be removed from the catalog.

All the tapes that were members of the tapesets will be marked as erased.

There is no undo. Be careful.

--rename

Renames a tapeset from old_name to new_name. Old_name and new_name are the arguments.

--list-tapesets

Lists the tapesets known in the catalog. Can be used without arguments, or with search terms as arguments. The search terms can contain the % wildcard character. See --list-tapes for more info on the wildcard.

When run without arguments, will list all the tapesets in the catalog.

--info-tapesets

Displays information about the tapesets with the given names. The names are supplied as arguments. When no name is supplied as an argument, will give info about all the tapesets in the catalog.

Modifiers

--verbose --quiet

These modifiers increase and reduce the output of data on the command line. --verbose will increase the output. --quiet will reduce it to the point of no output unless there are errors.

Enabling the --quiet modifier will automatically disable the --verbose modifier.

It makes no sense to use --quiet together with the --list-x and --info-x actions.

--sql

The --sql modifier is a debug option for the database that prints all executed SQL queries on the terminal. Normal users should not need it, but if you are curious what the DB is doing, you can run with --sql.

--async

The --async modifier puts the database in async mode. This greatly increases the performance (useful when adding and removing tapesets) but is dangerous in case your PC crashes or you have a power cut.

If you run on a server with ECC memory and a UPS, you can enable it and not worry. If you run on a 10 year old PC on an unreliable power grid, leave it off.

--rewind --offline

These modifiers control if the tape will be rewound or offlined after the action is completed. It only applies to tape actions that take a device argument.

--rewind will tell the tool to rewind the tape.

--offline will tell the tool to offline the tape. (Usually offline ejects the tape, but in some cases with tape libraries, offline will not eject the tape from the machine.)

--compression

Enables the hardware compression if the tape drive has the feature.

Compression is not useful for movies and music, but if you have terabytes of documents, enabling the compression may speed up the restore.

Please note that the add tapeset algorithm assumes no compression is applied to the data. Therefore, only the pre-compression file size is considered. If your data compresses at a ratio of 2:1, you may end up with 2 tapes that are each 50% full while the tapeset thinks both tapes are 100% full.

--double --triple

Instructs the --add-tapeset algorithm to make multiple copies of the tapeset being created. With --double there will be 2 copies of each tape in the tapeset; with --triple there will be 3.

The tapesets have ' (primary)', ' (secondary)' or ' (tertiary)' attached to their name when created.

--incremental

Will make an incremental tapeset as an overlay to an existing one. This modifier is currently not implemented.

--ffd

Uses First Fit Decreasing bin packing algorithm. See --add-tapeset.

--no-tape-level

Fills tapes to 100% without leveling. See --add-tapeset.

Real world usage examples

Stage 1, getting started

Consider a server with 1 TB of pictures, 2 TB of music and 3 TB of movies on it in folder /media. The backuptool has been installed, but never been run.

To start the catalog, run the tool once. Use action help because it does nothing, enable verbose and sql to see what is going on under the hood.

backuptool --help --verbose --sql

This creates the catalog under ~/.local/share/backuptool/backuptool.db.

Now your partner says 'I will murder you if you lose the baby/wedding pictures.' Time to back them up. Add them to the catalog first:

backuptool --add-paths /media/pictures

This takes a long time because it hashes all the files and adds them to the catalog.

We will need 2 LTO-4 tapes to back up the pictures because 1 TB won't fit on an 800 GB cartridge. We want 2 copies (one offsite in case the building burns). That means we need to add 4 LTO-4 tapes.
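The tape count is a simple ceiling division, sketched here in Ruby. The 800 GB figure is the native LTO-4 capacity mentioned above; the tapes_needed helper is made up for illustration. The result is a lower bound: since files are never split over tapes, the bin packing may need an extra tape in unlucky cases.

```ruby
LTO4_NATIVE_BYTES = 800 * 10**9 # 800 GB, uncompressed

# Minimum number of tapes for data_bytes of data, times the number of copies.
def tapes_needed(data_bytes, copies: 1, capacity: LTO4_NATIVE_BYTES)
  tapes_per_copy = (data_bytes + capacity - 1) / capacity # integer ceiling
  tapes_per_copy * copies
end

puts tapes_needed(1 * 10**12, copies: 2) # pictures, two copies
puts tapes_needed(2 * 10**12)            # music, one copy
puts tapes_needed(3 * 10**12)            # movies, one copy
```

For the example server this gives 4 tapes for two copies of the pictures, 3 tapes for the music and 4 tapes for the movies.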

backuptool --add-tapes 000001L4 000002L4 000003L4 000004L4

Check that the tapes are in:

backuptool --list-tapes
backuptool --info-tapes --verbose

Cool, now make the tapesets:

backuptool --add-tapeset 'Pictures @ date' /media/pictures --double

This makes 2 tapesets, using 2 tapes each, which you can check with:

backuptool --list-tapesets
backuptool --info-tapesets

Time to write the first tape, we will verify it after the write. Pop in the tape and execute:

backuptool --write-verify 000001L4 /dev/nst0 --rewind --offline

Now go walk the dog for 4 hours until the tape is done, and repeat for the other tapes. (Omit the walking for the subsequent tapes if the dog is tired.)

After all tapes have been written and verified, you can check the status of the tapes with:

backuptool --info-tapes --verbose

This will tell you all you need to know about the tapes.

Stage 2, expansion

You found cheap tapes on eBay, and you want to back up the rest of your files.

Start by adding them to the catalog:

backuptool --add-paths /media

Please note: we are adding /media here. Since /media/pictures is already in the database, its files will not be hashed again.

Then we add the tapes needed. 3 tapes for /media/music and 4 for /media/movies:

backuptool --add-tapes *000005L4**000006L4* ...

Please note the * before and after each tape label. We are now using the cheap barcode scanner (15 bucks on eBay) so that we do not have to type the labels.

Now we can add the tapesets:

backuptool --add-tapeset 'Music @ later_date' /media/music
backuptool --add-tapeset 'Movies @ later_date' /media/movies
backuptool --info-tapesets

We only make one copy of each this time.

Then we can write all the tapes again.

Stage 3, addiction

You have been invited (reluctantly) to your brother's wedding, where you took 500 new pictures. You have also been removing the red eyes from the existing pictures since you forgot to turn on your camera's red-eye filter. There were also some duplicate pictures in the /media/pictures folder which you have eliminated. Your backup is no longer up to date and should be properly redone to avoid the murder scenario described above.

Add the new pictures to the catalog first:

backuptool --add-paths /media/pictures

Now update the modified pictures in the catalog:

backuptool --update /media/pictures

Finally remove the duplicate files from the catalog:

backuptool --prune /media/pictures

Please note that the files you pruned from the /media/pictures folder remain in the catalog since they are on the tapes. The --prune operation only flags them as not on disk anymore (on_disk=0).

Next step is to drop the old tapesets:

backuptool --remove-tapesets 'Pictures @ date (primary)' 'Pictures @ date (secondary)'

Please note: removing the tapesets marks the 4 tapes as erased and deletes all the files linked to the tapes that have on_disk=0 from the catalog.

Now we can make new tapesets for the updated /media/pictures folder.

backuptool --add-tapeset 'Pictures @ most_recent_date' /media/pictures --double

Then we write and verify the tapes again.

backuptool --write-verify 000001L4 /dev/nst0 --rewind --offline

Repeat this for all tapes. Each LTO-4 tape takes about 4 hours.

Summary

Steps to create a backup:

--add-paths to add the files to the catalog
--add-tapes to add sufficient tapes for the desired tapeset configuration
--add-tapeset to add the tapeset
--write all tapes
--verify all tapes

Steps to update a backup:

--add-paths to add the new files to the catalog
--update to update the updated files in the catalog
--prune to update the removed files in the catalog
--remove-tapesets of the superseded tapesets
--add-tapes if needed
--add-tapeset to create the new tapeset
--write all tapes
--verify all tapes

Full restore from offsite tapes in case the house did burn down. For each tape:

mkdir -p /extract_tape && cd /extract_tape   # makes and enters the work folder for the extraction
tar xvf /dev/nst0                            # extracts the tape in the work folder
sha1sum -c manifest.txt                      # integrity check of the extracted files (remove all that fail, as their contents cannot be trusted)
rm manifest.txt                              # removes the manifest of the tape
mv * /media                                  # moves the correct files to /media
mt-st -f /dev/nst0 rewoffl                   # rewinds and ejects the tape