The server hosting these Trac sites and the Subversion repository is currently backed up nightly. Backups are written to a robotic tape library whose drives are connected via Fibre Channel. This being a FreeBSD machine, the drives and library work very reliably. We wrote a simple backup catalog management system based on shell scripts and an sqlite3 database. The system is hosted by the pseudo-user named "librarian", and all the code for the system is found at http://dev.regenstrief.org/svn/ctools.

Features of this backup system are:

  • scalable to back up multiple machines directly through the same Fibre Channel network, with no redundant IP-based communication to a master server
  • control of resources through the sqlite3 database on a master librarian
    • Benefit: sqlite3 database is easy to save and easy to read, no dependency on a complicated system that's hard to reconstitute in the case of a disaster
  • can distinguish different jobgroups and keep them on separate tapes
  • automatically assigns fresh tapes to backup tasks ("jobgroups") and allows reserving tapes for jobgroups
  • tracks the status and success (or failure) of backup jobs
  • can span a file job across multiple tapes
    • Missing feature: avoid ejecting and reloading a tape between subsequent jobgroups meant for the same tape
  • easily configurable to use different backup programs, including BSD dump (preferred for normal filesystem backup), tar, or direct copy (e.g., for large archive files)
  • completely transparent for recovery, i.e., if the library management system should be absent in a disaster recovery situation, one can use the reliable and ubiquitous basic Unix tools to restore
    • BUG: no continuation record is written at the end of a tape with a file job which spans across multiple tapes
    • Missing feature: the database should be saved multiple times on every tape.
  • controlled with standard cron jobs
  • apart from the shell scripts no dependency on complicated configuration files, easy to use. See example *-backup.sh scripts.

Some outdated information, which is still informative in detail, follows:

Backup Management of Aurora

This Trac and Subversion service, as well as many other project-specific Trac sites, is hosted on Aurora. The system is backed up on a rigorous daily schedule. Aurora has its own DLT IV tape drive unit, and at present I, Gunther Schadow, manage it. But anyone can (and should) keep an eye on things. This is how it's done.

We use the standard, trusty BSD backup and tape management tools (so simple and yet so powerful), and you are invited to read the manual pages for all of them:

  • /etc/crontab - the crontab file which has an entry for the nightly backup
  • /var/log/dump.log - the log file, first place to look (every few days)
  • /root/dump.sh - the shell script which manages the backup schedule, called by cron as well as by hand
  • mt(1) - the tape drive management program, used by the administrator to rewind, seek, and monitor status
  • dump(8) - the backup program itself
  • restore(8) - the restore counterpart of dump(8)
  • /dev/nsa1 - the device node for the tape drive (upon the next reboot that might change to /dev/nsa0)
  • /etc/dumpdates - the dump inventory managed by the dump program
  • camcontrol(8) - the low level SCSI bus control program, sometimes needed

That's all the tools used.

General Approach of Incremental Backup

Always start with a level 0 backup. This should be done at set intervals, say once a month or once every two months, and on a set of fresh tapes that is saved forever.

After a level 0, dumps of active file systems (file systems with files that change; depending on your partition layout, some file systems may contain only data that does not change) are taken on a daily basis, using a modified Tower of Hanoi algorithm, with this sequence of dump levels:

3 2 5 4 7 6 9 8 9 9 ...

For the daily dumps, it should be possible to use a fixed number of tapes for each day, used on a weekly basis. Each week, a level 1 dump is taken, and the daily Hanoi sequence repeats beginning with 3. For weekly dumps, another fixed set of tapes per dumped file system is used, also on a cyclical basis.
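The daily sequence can be expressed as a small lookup. The following is only a sketch: hanoi_level is a hypothetical helper name, not something taken from dump.sh, and it counts days from 0 starting on the first day after the level 0 dump.

```shell
#!/bin/sh
# Sketch: dump level for day N (0-based) after a level 0, following the
# sequence 3 2 5 4 7 6 9 8 9 9 ... quoted above.
# hanoi_level is a hypothetical helper, not part of dump.sh.
hanoi_level() {
    day=$1
    set -- 3 2 5 4 7 6 9 8
    if [ "$day" -lt $# ]; then
        shift "$day"      # drop the first $day entries
        echo "$1"
    else
        echo 9            # the sequence stays at 9 once the list is used up
    fi
}

# Print the first ten levels: 3 2 5 4 7 6 9 8 9 9
for d in 0 1 2 3 4 5 6 7 8 9; do
    printf '%s ' "$(hanoi_level "$d")"
done
echo
```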

After several months or so, the daily and weekly tapes should get rotated out of the dump cycle and fresh tapes brought in.

Specific Procedures on Aurora

Backup is done nightly at about 2 am on all 3 of Aurora's volumes except for tmp. The ~root/dump.sh script is used, which implements the 7-day schedule described above, 3 2 5 4 7 6 1, starting on Monday. Since backups run at 2 am, the level 3 dump runs on Monday morning and contains most of Sunday's changes, while the level 1 dump runs on Sunday morning and contains all changes since the last level 0 (and therefore everything since the previous Sunday).
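As an illustration of the 7-day schedule, the weekday-to-level mapping might look like the sketch below. weekly_level is a hypothetical name, and the real dump.sh may be organised differently; the only things taken from the text are the levels 3 2 5 4 7 6 1 and the Monday start.

```shell
#!/bin/sh
# Sketch of the weekly level lookup (hypothetical helper, not dump.sh itself).
# Argument: ISO weekday as printed by `date +%u` (1 = Monday ... 7 = Sunday).
weekly_level() {
    case $1 in
        1) echo 3 ;;   # Monday
        2) echo 2 ;;   # Tuesday
        3) echo 5 ;;   # Wednesday
        4) echo 4 ;;   # Thursday
        5) echo 7 ;;   # Friday
        6) echo 6 ;;   # Saturday
        7) echo 1 ;;   # Sunday: the weekly level 1
        *) echo "weekly_level: bad weekday: $1" >&2; return 1 ;;
    esac
}

echo "dump level for today: $(weekly_level "$(date +%u)")"
```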

Since a tape lasts for approximately 10 days, we simply keep the tape in until it is full. That means that some backups fail when the tape runs out. To keep it simple, when inserting a new tape, we now always run a level 1 backup. That way, one can usually recover from the last level 0 plus only one additional tape.

All tape changes should be recorded in the dump.log file like this:

echo REMOVE TAPE AURORA-1 $(date) >>/var/log/dump.log
echo INSERT TAPE AURORA-2 $(date) >>/var/log/dump.log

etc.
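To keep these entries uniform, the two echo lines could be wrapped in a small function. This is only a sketch: log_tape_change and the DUMP_LOG variable are hypothetical names introduced here, and the line format simply copies the examples above.

```shell
#!/bin/sh
# Sketch: record a tape change in the dump log in the format shown above.
# DUMP_LOG defaults to the log file named in this page; it is a variable
# only so that the function can be tried against a scratch file.
DUMP_LOG=${DUMP_LOG:-/var/log/dump.log}

log_tape_change() {
    action=$1
    label=$2
    case $action in
        INSERT|REMOVE) ;;
        *) echo "usage: log_tape_change INSERT|REMOVE label" >&2; return 2 ;;
    esac
    echo "$action TAPE $label $(date)" >>"$DUMP_LOG"
}

# Example (as root):
#   log_tape_change REMOVE AURORA-1
#   log_tape_change INSERT AURORA-2
```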

Run a Manual Backup

All backups should be run through the script in this way:

/root/dump.sh >> /var/log/dump.log 2>&1 &

These backups usually take less than an hour, so it's no big deal. It is critical to redirect the output to the /var/log/dump.log file so we have a good record of what was done when. This may be redundant with /etc/dumpdates, but it is a bit more tangible.

Usually, when I check /var/log/dump.log and find that the backup failed because the tape was full, I rewind the tape, remove it, pop a new one in, and redo the current day's backup using the above command.

If I want to make a specific backup, such as running a sporadic level 0 backup, I can say:

/root/dump.sh full >> /var/log/dump.log 2>&1 &

or

/root/dump.sh -l 0 >> /var/log/dump.log 2>&1 &

The latter form, using the -l n option, allows making a manual backup at any level. But usually only level 0 needs to be done that way.

Tape Drive Management

Replacing the Current Tape

Rewind the tape and take it offline:

mt -f /dev/nsa1 rewoffl

and swap it with another tape. Right now I use a new tape each time, but soon I will start rotating. After the tape is placed into the machine, it is put in position, and I can check with

mt -f /dev/nsa1 status
mt -f /dev/nsa1 rdhpos

I like the latter command particularly, because that is what I can use to estimate remaining capacity.

Multiple Files

It's important to note that these backups require far, far less space than one DLT IV tape can hold. We can typically go for about 10 days on a single tape, so each tape holds many dump files. That means that in order to restore data you have to fast-forward and rewind, and you have to understand that as soon as you write, everything on the tape after the point where you write is lost.

mt -f /dev/nsa1 fsf 5

fast-forwards by 5 files.

mt -f /dev/nsa1 eom

fast-forwards to the end of the written data. Always do that before trying to append to a tape. That is, should you ever rewind in order to check or restore something, immediately run mt eom afterwards, so that no other write action (such as the nightly backup) causes a loss of data at the end of the tape.
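One way to make the "always go back to the end" rule hard to forget is to wrap any inspection in a function that repositions the tape afterwards. This is a sketch only: with_tape_check and the TAPE and MT variables are hypothetical names introduced here, and it uses nothing beyond the mt rewind and eom commands.

```shell
#!/bin/sh
# Sketch: rewind, run an inspection command, then always return to the end
# of the recorded data so that the next dump appends instead of overwriting.
# with_tape_check, TAPE and MT are hypothetical names, not part of dump.sh.
TAPE=${TAPE:-/dev/nsa1}
MT=${MT:-mt}

with_tape_check() {
    "$MT" -f "$TAPE" rewind || return 1
    "$@"                      # the inspection command, e.g. restore -t ...
    rc=$?
    "$MT" -f "$TAPE" eom || return 1
    return $rc                # report the inspection command's own status
}

# Example: list the contents of the first dump file, then go back to the end.
#   with_tape_check restore -t -b 102400 -f "$TAPE"
```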

Estimating Remaining Tape Capacity

You can read the current position of the tape using the following command:

mt -f /dev/nsa1 rdhpos

With my standard block size of 102400 bytes (i.e., 100 KiB), the number returned is almost exactly twice the number of such blocks written. That gives me a good idea of how much of the tape is left.
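The arithmetic can be sketched as follows. It assumes only the rule of thumb above (position is roughly twice the number of 100 KiB blocks written); the function names are hypothetical, and the tape capacity is passed in as an argument because it depends on the drive and on compression.

```shell
#!/bin/sh
# Sketch: estimate tape usage from the position reported by
# `mt -f /dev/nsa1 rdhpos`, assuming position ~= 2 x (blocks of 102400 bytes).
# tape_used_mib and tape_left_mib are hypothetical helper names.
BLOCK_BYTES=102400

# $1 = position as reported by rdhpos; prints MiB written (rounded down)
tape_used_mib() {
    echo $(( $1 / 2 * BLOCK_BYTES / 1048576 ))
}

# $1 = position, $2 = assumed tape capacity in MiB; prints MiB remaining
tape_left_mib() {
    echo $(( $2 - $(tape_used_mib "$1") ))
}

# Example: a position of 20000 means about 10000 blocks, i.e. roughly 976 MiB.
#   tape_used_mib 20000
```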

Restore

Read the restore(8) manual. Use the same flags as for dump(8); usually you want the interactive mode:

restore iabf 102400 /dev/nsa1

or new style:

restore -i -a -b 102400 -f /dev/nsa1

It comes up quickly with a prompt telling you which file system you're looking at, and the date and level of the incremental backup. You can use the usual file system commands, such as cd and ls, to browse around and see if the lost item can be found. Usually you'll have to search a few of these backup files. If you don't find the lost item, you can exit, go to the next file with mt fsf 1, and try again.
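When you already know which dump file on the tape you want, positioning can be done in one step. A minimal sketch, assuming files are counted from 1 at the start of the tape; tape_goto_file and the TAPE and MT variables are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch: position the tape at the start of the Nth dump file (1-based),
# using only mt rewind and mt fsf. Hypothetical helper, not part of dump.sh.
TAPE=${TAPE:-/dev/nsa1}
MT=${MT:-mt}

tape_goto_file() {
    n=$1
    [ "$n" -ge 1 ] 2>/dev/null || {
        echo "usage: tape_goto_file N (N >= 1)" >&2
        return 2
    }
    "$MT" -f "$TAPE" rewind || return 1
    if [ "$n" -gt 1 ]; then
        "$MT" -f "$TAPE" fsf $((n - 1)) || return 1   # skip the first N-1 files
    fi
}

# Example: browse the third dump file, then return to the end of the data.
#   tape_goto_file 3 && restore -i -a -b 102400 -f "$TAPE"
#   mt -f "$TAPE" eom
```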

On The Other Hand

I am contemplating making it easier. I can get one week of backups on one tape starting with level 0. The problem is that a level 0 takes time, and I won't be back to change the tape for a level 1 dump. If I set up the schedule to simply go incrementally:

0 1 2 3 4 5 6 7 8 9 9 9 9

then I might just get all 9 days on the tape. Let's try this now. I modified the dump.sh script to add an -s flag, which runs the simple series. The file /etc/dumplevel contains the next level to use (0-9); it is never advanced beyond 9. One can call dump.sh -S 0, with a capital-S argument, to set the next dump level to that level (i.e., 0 or any other level specified).
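The bookkeeping around /etc/dumplevel might look roughly like this. It is a sketch of the behaviour described above, not the actual dump.sh code: next_level, set_level, and the LEVEL_FILE variable are hypothetical names, and treating a missing or malformed file as level 0 is an assumption.

```shell
#!/bin/sh
# Sketch of the simple series 0 1 2 ... 9 9 9 driven by a state file.
# LEVEL_FILE defaults to the file named above; it is a variable only so the
# functions can be tried against a scratch file.
LEVEL_FILE=${LEVEL_FILE:-/etc/dumplevel}

# Print the level to use now and store the following one (never beyond 9).
next_level() {
    level=$(cat "$LEVEL_FILE" 2>/dev/null) || level=0
    case $level in
        [0-9]) ;;
        *) level=0 ;;          # assumption: unreadable state means start over
    esac
    echo "$level"
    if [ "$level" -lt 9 ]; then
        echo $((level + 1)) >"$LEVEL_FILE"
    fi
}

# Counterpart of `dump.sh -S n`: set the level used by the next run.
set_level() {
    case $1 in
        [0-9]) echo "$1" >"$LEVEL_FILE" ;;
        *) echo "set_level: level must be 0-9" >&2; return 2 ;;
    esac
}
```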

It is easy, then, to recover from a crash: simply insert the last tape and run unattended restore operations till the end of the tape.