Fast and Reliable Subversion Incremental Backups
Subversion has the ability to “dump” an entire repository into a flat, semi-parseable file. This is done using the svnadmin tool. The really cool feature, however, is the ability to dump revisions incrementally. Here I present an automated, reliable solution for performing daily backups of multiple repositories.
Appending to existing dump files
The --incremental switch on svnadmin enables incremental dumping. What this means is that you can append to an existing dump file instead of dumping the entire revision history. This is especially useful for a daily backup script as it is significantly faster to dump only the new revisions. All we need to know is the range from the last backed up revision to the newest (HEAD) revision.
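As a rough illustration of that revision range, here is a minimal Ruby sketch that builds the svnadmin invocation and appends its output to an existing dump file. The helper names (dump_command, head_revision, append_dump) are hypothetical and not necessarily what the script itself uses. It assumes svnadmin and svnlook are on the PATH.

```ruby
# Sketch only: build the svnadmin invocation for an incremental dump.
# last_rev is the last revision already stored in the dump file (nil if none).
def dump_command(repo_path, last_rev, head_rev)
  first = last_rev.nil? ? 0 : last_rev + 1
  return nil if first > head_rev # nothing new to dump
  ["svnadmin", "dump", repo_path, "-r", "#{first}:#{head_rev}", "--incremental"]
end

# Ask Subversion for the newest (HEAD) revision number.
def head_revision(repo_path)
  IO.popen(["svnlook", "youngest", repo_path], &:read).to_i
end

# Append the new revisions to the dump file; returns false if there is
# nothing to dump or if svnadmin fails.
def append_dump(repo_path, dump_file, last_rev)
  cmd = dump_command(repo_path, last_rev, head_revision(repo_path))
  return false unless cmd
  File.open(dump_file, "ab") { |f| system(*cmd, out: f) }
end
```

With last_rev at 41 and HEAD at 57, only revisions 42 through 57 are dumped.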
There is a catch, however. SVN dump files are somewhat opaque in their nature, and telling which revisions they actually contain is not trivial. Sure, you could parse them with regexes and retrieve the revision numbers, but once your dump file reaches several hundred megabytes, this is not very practical.
Therefore, we need to keep an external “state” file that tells us the last backup’s revision numbers for every repository. My solution is to use Ruby’s YAML serialization and simply save the Hash as a YAML file.
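A state file of this kind could look roughly like the following sketch. The layout of the Hash (string keys, a "rev" field) and the function names are assumptions made for illustration, not the script's actual format.

```ruby
require "yaml"

# Load the state Hash, or start with an empty one on the first run.
def load_state(path)
  return {} unless File.exist?(path)
  YAML.load_file(path) || {}
end

# Serialize the whole Hash back to disk as YAML.
def save_state(path, state)
  File.open(path, "w") { |f| f.write(state.to_yaml) }
end
```

Plain strings and integers keep the file readable and easy to fix by hand if something goes wrong.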
Multiple repositories via a parent dir path
Our goal is to handle several repositories at once using only a “parent dir” path. We’ll be producing one dump file per repository found in that parent dir. We would also like to detect when new repositories are added to the parent dir so that we can start backing them up. This also handles our first run of the script, in which case all repositories are “new”.
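One way to scan the parent dir is sketched below. The detection heuristic (a "format" file next to a "db" directory, which is what svnadmin create lays out) and the function names are assumptions, not necessarily how the script decides.

```ruby
# Heuristic (an assumption): a directory is a repository if it contains
# a "format" file and a "db" subdirectory.
def find_repositories(parent_dir)
  Dir.glob(File.join(parent_dir, "*")).sort.select do |path|
    File.file?(File.join(path, "format")) &&
      File.directory?(File.join(path, "db"))
  end
end

# Anything not yet present in the state Hash counts as "new".
# On the very first run the state is empty, so everything is new.
def new_repositories(keys, state)
  keys.reject { |key| state.key?(key) }
end
```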
Repository names vs UUIDs
In Subversion, repositories don’t really have a “name”. Their name is usually dictated by the URL at which they are accessed. This normally matches the name of the physical repository directory, but not necessarily. Repositories can be renamed at any time without altering the repository itself.
For this reason, repository names are not good identifiers for a robust backup system. Instead, we’ll use the repositories’ UUIDs which, by nature, are guaranteed to be unique and immutable. The state file will provide a UUID-name mapping, but the name will not be used for mapping repositories to dump files.
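A sketch of UUID-keyed bookkeeping follows. It reads the UUID with svnlook uuid, which must be on the PATH. Naming the dump file after the UUID and the shape of the state entries are assumptions for illustration.

```ruby
# Ask Subversion for the repository UUID.
def repository_uuid(repo_path)
  IO.popen(["svnlook", "uuid", repo_path], &:read).strip
end

# Assumed naming scheme: the dump file is named after the UUID, so renaming
# the repository directory keeps it attached to the same dump.
def dump_file_for(dump_dir, uuid)
  File.join(dump_dir, "#{uuid}.svndump")
end

# Record the current directory name for humans; the UUID stays the key.
def register(state, uuid, repo_path)
  entry = (state[uuid] ||= { "rev" => nil })
  entry["name"] = File.basename(repo_path)
  entry
end
```

If a repository directory is renamed between two runs, register updates the name in place and the recorded revision is left alone.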
Maintaining integrity
A dump file is only useful if it can actually be used to restore a repository. Because we’re appending little chunks every day, however, many things could go wrong and result in a corrupted dump which cannot be easily recovered. For this reason, the state file also saves a checksum of the dump file. On the next run, it verifies that the dump file still matches that checksum. If it doesn’t, it means the previous backup session failed for some reason and that we can’t confidently append to the dump file.
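The check described above might be sketched like this. MD5 is an assumption here (any digest would do), and the function names are hypothetical.

```ruby
require "digest/md5"

# Checksum of the dump file as it currently sits on disk.
def dump_checksum(dump_file)
  Digest::MD5.file(dump_file).hexdigest
end

# True when it is safe to append: either the file matches the checksum
# recorded after the last successful run, or there is no file and no
# recorded checksum (a brand new repository).
def dump_intact?(dump_file, expected_md5)
  return expected_md5.nil? unless File.exist?(dump_file)
  dump_checksum(dump_file) == expected_md5
end
```

After a successful append, the new checksum is stored in the state file for the next run to compare against.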
The last trick is to ensure that repository dumps are atomic. This means saving the state file after each svnadmin operation and not only at the end of the session.
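Here is a minimal sketch of that checkpointing idea. Writing to a temporary file and renaming it over the old one is an extra precaution assumed here, not something the post states. The caller supplies the actual dump step as a block, and all names are hypothetical.

```ruby
require "yaml"

# Write to a temporary file, then rename it over the old state file, so a
# crash in the middle of a write does not leave a truncated file behind.
def save_state_atomically(path, state)
  tmp = "#{path}.tmp"
  File.open(tmp, "w") do |f|
    f.write(state.to_yaml)
    f.fsync
  end
  File.rename(tmp, path)
end

# repos maps a state key to a repository path. The block performs the dump
# and returns the updated state entry, or nil if the dump failed.
def backup_all(repos, state, state_path)
  repos.each do |key, repo_path|
    entry = yield(key, repo_path, state[key])
    next unless entry                          # failed: keep the old entry
    state[key] = entry
    save_state_atomically(state_path, state)   # checkpoint after every repo
  end
  state
end
```

If the session dies halfway through, the state file still describes every dump that completed before the failure.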
Pushing the dumps to a remote server using rsync
Saving the dump files on the same server as the repositories is stupid for obvious reasons. To solve this, you could mount a remote server via NFS and use that mount point as your dump directory. My server is alone on its network so I opted for a simple rsync invocation at the end of the session. This is optional so you can turn it off if you want; just set USE_RSYNC to false.
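The rsync step could look something like this sketch. USE_RSYNC comes from the script; RSYNC_DEST, its placeholder host and path, the -az flags and the function names are all assumptions for illustration.

```ruby
USE_RSYNC  = true
RSYNC_DEST = "backup@backup.example.com:/srv/svn-dumps/" # hypothetical target

# Build the rsync invocation. The trailing slash on the source tells rsync
# to copy the directory's contents rather than the directory itself.
def rsync_command(dump_dir, dest)
  ["rsync", "-az", File.join(dump_dir, ""), dest]
end

# Push the dumps at the end of the session; a no-op when USE_RSYNC is false.
def push_dumps(dump_dir, dest = RSYNC_DEST)
  return true unless USE_RSYNC
  system(*rsync_command(dump_dir, dest))
end
```

Since rsync only transfers what changed, pushing a dump that grew by a few revisions stays cheap.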
Running it periodically
You can run the script manually as much as you want, set it up in crontab, launchd or even as a post-commit hook if you’re really paranoid. I recommend saving the log output (STDOUT) somewhere in /var/log.
Download
Finally, here’s the Ruby script. Please leave some comments if you find it useful or have ideas on how to improve it!
See also
The closest script I found is a Perl script by Pierrick Le Gall. It shares many of the ideas in this post, but only supports a single repository. It also dumps all the chunks into different files, requiring you to reassemble the pieces if you ever need to restore.
hi
thanks for the article! I’ve got the script running. Could you explain what the best way to restore the *.svndump files is?
thanks
benzo
benzo: you will find everything you need in the SVN book. The command is “svnadmin load [REPO_PATH]” with the dump file passed in via STDIN. See http://svnbook.red-bean.com/en/1.4/svn.ref.svnadmin.c.load.html for details.
Hi,
your script looks very interesting, under which licence (GPL, BSD, Artistic, …) can i use it?
Peet
peet: the script is available under a BSD license
how can i run the script manually?
Hi,
Your script is really awesome. I was searching for something like this and this script saved me. Thanks so much.
This works fine on a folder with a single repo. When I tried the script on a directory with multiple repos (162 projects), the script listed all the 162 projects as ‘ found repo:’. But only 20 projects were listed for action and added to the YAML file, so the dumps were created only for these projects. I am using Debian Squeeze with Ruby 1.8.7. I have not installed md5. Please help me.
Thanks
Anpl: is there anything special about the 20 projects that did get dumped? Can you run svnadmin dump manually on them? Can you post the log of the script?