Using Rsync and Hard Linked Files to Store Backup Snapshots.

Thursday, 20. August 2009

Who’s really behind the idea.

First off, let me be clear that this was not my idea. I was told about this whole concept by Mike Rubel at Caltech. He’s the rocket scientist behind it. Now that we have our “Credit Where Credit is Due” portion out of the way, let’s get into the rsync backup concept.

Rsync, what does it do?

From the rsync man page:

Rsync is a fast and extraordinarily versatile file copying tool. It can copy locally, to/from another host over any remote shell, or to/from a remote rsync daemon. It offers a large number of options that control every aspect of its behavior and permit very flexible specification of the set of files to be copied. It is famous for its delta-transfer algorithm, which reduces the amount of data sent over the network by sending only the differences between the source files and the existing files in the destination. Rsync is widely used for backups and mirroring and as an improved copy command for everyday use.

In short, rsync rocks! It has all the tools you need to do replication of data between different systems. What does that mean? Well, it means you can use rsync to keep a remote site “mirrored” to your local site in case your local site goes down or crashes. This is known in the business as “Disaster Recovery”.

I looked at the rsync man page, and it’s too complicated.

As with most high-powered utilities, the command interface can be a bit complicated. But if all you are going to do is back up a single directory from another system, here is all you need:

backup:~# rsync -av --delete --numeric-ids -e ssh \
> <source_host>:<path_to_data> <local_destination>

In simple English:

  • -av = Archive mode and be Verbose
  • --delete = Delete files on the target that are no longer on the source
  • --numeric-ids = Do not map uid/gid values by user and group name; keep the numeric IDs from the source (ownership is only preserved if you are logged in as root)
  • -e ssh = Use ssh as the remote shell (this is the default on newer versions of rsync)

So, here is an actual command:

backup:~# rsync -av --delete --numeric-ids -e ssh \
> root@snoopy:/var/ /home/snoopy-var

This will synchronize all the files in the /var directory on snoopy, to the /home/snoopy-var directory on the host I’m logged on to. It’s that simple!

But I need to keep at least a week of daily backups. How do I do that and not have to use 7 times the disk space?

Well, this is where hard links come in. First you will need to make sure that your target partition supports POSIX hard links and that they work properly. If you are running ext3 on Linux, you are all set. I guess we should talk a little about hard links before we go too much further.
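
Before we do, here is a quick way to check a partition you are not sure about. This is only a rough sketch: the supports_hardlinks function and the target variable are names I made up for the example, and /tmp is just a placeholder for a directory on your backup partition.

```shell
# Hypothetical helper: succeeds if a hard link can be created inside directory $1.
supports_hardlinks() {
    # Create a scratch file in the directory we want to test.
    probe=$(mktemp "$1/.hardlink-probe.XXXXXX" 2>/dev/null) || return 1
    if ln "$probe" "$probe.link" 2>/dev/null; then
        rm -f "$probe" "$probe.link"
        return 0
    fi
    rm -f "$probe"
    return 1
}

# Directory on the backup partition; /tmp is only a placeholder.
target=${1:-/tmp}
if supports_hardlinks "$target"; then
    echo "hard links work in $target"
else
    echo "no hard link support in $target"
fi
```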

What are Hard Links and how will this help me with backup storage use requirements?

Most of us think that a file name is the file itself, but it is just a hard link to where the data is actually located. Most files only have one hard link, but you can have more than one hard link to the same data. Let me show you:

stu@linus:~$ stat Tux.jpg
 File: `Tux.jpg'
 Size: 3176          Blocks: 8          IO Block: 16384  regular file
Device: 15h/21d    Inode: 3178907     Links: 1
Access: (0600/-rw-------)  Uid: (10000/     stu)   Gid: (  513/netusers)
Access: 2009-08-14 12:40:35.000000000 -0700
Modify: 2005-07-06 13:32:38.000000000 -0700
Change: 2007-01-04 16:33:06.000000000 -0800

stu@linus:~$ ln Tux.jpg Tux-2.jpg

stu@linus:~$ stat Tux.jpg
 File: `Tux.jpg'
 Size: 3176          Blocks: 8          IO Block: 16384  regular file
Device: 15h/21d    Inode: 3178907     Links: 2
Access: (0600/-rw-------)  Uid: (10000/     stu)   Gid: (  513/netusers)
Access: 2009-08-14 12:40:35.000000000 -0700
Modify: 2005-07-06 13:32:38.000000000 -0700
Change: 2009-08-20 11:06:19.000000000 -0700

stu@linus:~$ ls -i Tux.jpg; ls -i Tux-2.jpg
3178907 Tux.jpg
3178907 Tux-2.jpg

OK, so what do we have here? As you can see, when I ran stat on Tux.jpg, it showed only one link. I then created a hard link, which gave the same data a second file name, Tux-2.jpg. When I ran stat again, it showed two links. And when I ran ls -i to list the inode numbers of both names, as you can see, they are the same.

What does this mean? Well, it means that I have two file names that point to the same data. If I delete one of the file names, it just reduces the number of links to that data. The data is not deleted until the last link is deleted. That means I can have time-stamped directories or “Snapshots” of the data I’m backing up, but only use slightly more disk space than a single copy of the data.
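
If you want to see that for yourself, here is a small sketch you can run safely. The file names and the scratch directory are made up for the demo, and it assumes GNU stat (for the -c option).

```shell
# Work in a throwaway directory so nothing real is touched.
demo=$(mktemp -d)
cd "$demo"

echo "hello tux" > original.txt
ln original.txt second-name.txt            # second name for the same inode

links_before=$(stat -c %h second-name.txt) # link count is now 2

rm original.txt                            # removes one name, not the data

links_after=$(stat -c %h second-name.txt)  # link count drops back to 1
contents=$(cat second-name.txt)            # the data is still readable

echo "links before rm: $links_before, after rm: $links_after"
echo "contents: $contents"
```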

All we need to do is create a directory for each snapshot in time by using the -al command switches for cp (-a for archive mode, -l to make hard links instead of copying the file data). Here is an example that pulls this all together:

backup:~# mkdir -p /home/snoopy-var/current

backup:~# rsync -av --delete --numeric-ids -e ssh \
> root@snoopy:/var/ /home/snoopy-var/current/

backup:~# cp -al /home/snoopy-var/current \
> /home/snoopy-var/2009-08-20-00-00

So, what we have done is create or update the current copy of snoopy’s /var directory, and then create a complete set of hard links to it with the cp -al command, using almost no extra space whatsoever.
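
You might wonder why the next rsync run does not change the old snapshot too, since both names point at the same data. As far as I know, rsync (unless you pass --inplace) writes a changed file under a temporary name and then renames it over the old one, so the file in current gets a new inode and the snapshot keeps the old data. Here is a rough local sketch of that; the mv only stands in for the rsync step, the directory and file names are made up, and it assumes GNU cp and stat.

```shell
demo=$(mktemp -d)
mkdir -p "$demo/current"
echo "version 1" > "$demo/current/motd"

# Snapshot: every file in the copy is a hard link, so no data is duplicated.
cp -al "$demo/current" "$demo/2009-08-20-00-00"

inode_current=$(stat -c %i "$demo/current/motd")
inode_snapshot=$(stat -c %i "$demo/2009-08-20-00-00/motd")
echo "inodes after snapshot: $inode_current and $inode_snapshot"

# Stand-in for the next rsync run: write a new file, then rename it into place.
echo "version 2" > "$demo/current/.motd.tmp"
mv "$demo/current/.motd.tmp" "$demo/current/motd"

current_text=$(cat "$demo/current/motd")
snapshot_text=$(cat "$demo/2009-08-20-00-00/motd")
echo "current:  $current_text"
echo "snapshot: $snapshot_text"
```

The snapshot still holds the old version, and only the changed file takes up new space.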

Now all you need to do is script this and you are set. Or you can use my script here.
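
If you would rather roll your own, here is a minimal sketch of what such a script could look like. It is not the script linked above. The variable and function names (SOURCE, DEST, KEEP, sync_current, take_snapshot, prune_snapshots) are made up for illustration, and the pruning step assumes the snapshot directories use the date-style names from the example so that they sort oldest first.

```shell
#!/bin/sh
# Hypothetical settings; adjust for your own hosts and paths.
SOURCE="root@snoopy:/var/"
DEST="/home/snoopy-var"
KEEP=7    # number of snapshots to keep

# Pull the latest data into $DEST/current (same rsync options as above).
sync_current() {
    mkdir -p "$DEST/current"
    rsync -av --delete --numeric-ids -e ssh "$SOURCE" "$DEST/current/"
}

# Hard-link the current tree into a new time-stamped snapshot directory.
take_snapshot() {
    cp -al "$DEST/current" "$DEST/$(date +%Y-%m-%d-%H-%M)"
}

# Remove the oldest snapshots until only $KEEP are left.
prune_snapshots() {
    # The glob expands in sorted order, so $1 is always the oldest snapshot.
    set -- "$DEST"/[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-*
    [ -e "$1" ] || return 0              # no snapshots yet
    while [ "$#" -gt "$KEEP" ]; do
        rm -rf -- "$1"
        shift
    done
}

# A nightly cron job would then run something like:
# sync_current && take_snapshot && prune_snapshots
```

With KEEP=7 and one run per night, that gives you a week of dailies. Note that take_snapshot assumes it runs at most once per minute, because if a directory with the same name already exists, cp will nest the copy inside it.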

As with all open source solutions, your mileage may vary… But I’ve had great success with this method of backup.

— Stu
