Virtual Mattox


Using Replication and Deduplication Effectively To Improve Virtual Backups

My previous post explained how to minimize the time and risk associated with using a WAN to transfer backup files to a secondary location. This installment builds on that advice by showing how replication and deduplication can improve backup performance even further. Our customers routinely replicate over WANs and reduce their bandwidth requirements by a factor of 20 to 40 by using deduplication.

Why Replicate a Backup File?
To recap, our recommended process for backing up to an off-site location is to back up VM images locally, then to replicate the backup files over the network. This approach enables faster backups, shorter backup windows and better RTOs than backing up over the WAN connection. Replication and deduplication can enhance these benefits, although a common configuration mistake can make deduplication ineffective, as I'll explain.

Replicating a backup file to the secondary location reduces the risk of a failed backup compared to backing up directly over the network. Here's why: When VMs are backed up over the network, the backup software transmits a continuous stream of data to the backup location, where the data is reassembled and the resulting backup file is compressed. If the network connection is lost during the backup, the backup process needs to start again from the beginning -- it can't pick up from the point where the connection was lost.

Breaking Up Is Good to Do
When replication is used, the backup image (which has already been compressed) is sent over the network, which requires less bandwidth and therefore less time to complete. The real advantage of replication, however, is how data transfers are handled. Instead of transmitting a continuous data stream, replication solutions break each backup archive into a series of smaller data sets and compute a checksum for each one. The smaller data sets are then sent over the network and reassembled on the other side, where the checksums are used to confirm that the transmission was complete and accurate. A 10 GB backup archive might be split into roughly 10,000 data sets of 1 MB each, each verified with its own checksum. If the network connection is lost during the process, data sets that were already received, verified and reassembled at the secondary site will not need to be retransmitted, which is a major advantage over WAN-based backup.
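The transfer pattern described above can be sketched in a few lines of Python. This is only an illustration, not how any particular replication product works: the `Receiver` class, the `replicate` function, the `fail_at` parameter and the choice of SHA-256 as the checksum are all assumptions made for the example, and the "network" is just a function call.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # 1 MB pieces, as in the example in the text


def split_into_chunks(data, size=CHUNK_SIZE):
    """Return a list of (index, payload, checksum) tuples for the archive."""
    chunks = []
    for offset in range(0, len(data), size):
        payload = data[offset:offset + size]
        checksum = hashlib.sha256(payload).hexdigest()
        chunks.append((offset // size, payload, checksum))
    return chunks


class Receiver:
    """Hypothetical secondary site: stores only pieces that verify."""

    def __init__(self):
        self.chunks = {}

    def accept(self, index, payload, checksum):
        # Recompute the checksum on arrival and reject anything that differs.
        if hashlib.sha256(payload).hexdigest() != checksum:
            return False
        self.chunks[index] = payload
        return True

    def reassemble(self):
        return b"".join(self.chunks[i] for i in sorted(self.chunks))


def replicate(chunks, receiver, fail_at=None):
    """Send every piece the receiver lacks; return how many were sent.

    fail_at simulates the WAN link dropping just before that piece.
    """
    sent = 0
    for index, payload, checksum in chunks:
        if index in receiver.chunks:
            continue  # already received and verified, so skip it
        if fail_at is not None and index == fail_at:
            raise ConnectionError("simulated WAN outage")
        if receiver.accept(index, payload, checksum):
            sent += 1
    return sent


if __name__ == "__main__":
    archive = bytes(range(256)) * 4096 * 5  # 5 MB stand-in for a backup file
    pieces = split_into_chunks(archive)
    site = Receiver()
    try:
        replicate(pieces, site, fail_at=3)
    except ConnectionError:
        print("link dropped with", len(site.chunks), "pieces stored")
    print("pieces sent on retry:", replicate(pieces, site))
    print("archive intact:", site.reassemble() == archive)
```

On the retry, only the pieces that never arrived are sent. That is the behavior that separates replication from a streamed backup, which would have to start over.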

Don't Get Duped with Deduplication Misconfigurations
Data deduplication solutions can be very effective and can be used with either backup or replication. Not all backup solutions or storage systems support deduplication, but deduplication can be used anyway. However, there are a few configuration issues you need to be aware of to make sure deduplication will be effective. Where you place deduplication in your architecture depends on your systems.

When the backup software supports deduplication but the storage system doesn't, run deduplication in the backup software. The deduped file can then be presented for storage. The storage system is unaware that deduplication took place, so its lack of deduplication support is not a problem.

If deduplication is done by an appliance, it doesn't matter whether the backup software supports deduplication, as long as you can turn off compression in the backup software. Data can be deduped before it is presented for backup.

If deduplication is applied to backup files, it is essential to disable data compression and encryption in the backup software. Backup software often compresses and encrypts data by default, and doing so before deduplication will severely limit deduplication effectiveness. That is because deduplication works by searching block-by-block for recurring patterns of data. When duplicate bit patterns are detected, only a single instance is captured in the deduped file, which is why a deduped file is smaller than the original. Encrypting or compressing data changes its bit pattern, which makes it nearly impossible for deduplication solutions to find and omit duplicate data. You also need to know whether your backup software modifies its backup archives at each backup cycle, because this can lower your dedupe or backup performance.
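A small experiment shows why compressing first gets in the way of duplicate detection. The sketch below is a toy model under stated assumptions: it uses fixed 4 KB blocks (real deduplication products choose their own block sizes and may use variable-length blocks), synthetic data generated from hashes, and `zlib` as a stand-in for whatever compression the backup software applies. The helper names `make_block` and `dedupe_ratio` are invented for the example.

```python
import hashlib
import zlib

BLOCK = 4096  # assumed fixed dedupe block size; real products vary


def make_block(seed):
    """Build a deterministic, moderately compressible 4 KB block of hex text."""
    return b"".join(
        hashlib.sha256(f"{seed}-{j}".encode()).hexdigest().encode()
        for j in range(BLOCK // 64)
    )


def dedupe_ratio(streams):
    """Total block count divided by unique block count across all streams."""
    total = 0
    unique = set()
    for stream in streams:
        for offset in range(0, len(stream), BLOCK):
            total += 1
            unique.add(hashlib.sha256(stream[offset:offset + BLOCK]).digest())
    return total / len(unique)


# Two synthetic "nightly backups" that are identical except for the first block.
night1 = b"".join(make_block(f"blk{i}") for i in range(64))
night2 = make_block("changed") + night1[BLOCK:]

raw_ratio = dedupe_ratio([night1, night2])
compressed_ratio = dedupe_ratio([zlib.compress(night1), zlib.compress(night2)])

print(f"dedupe ratio, uncompressed input: {raw_ratio:.2f}")
print(f"dedupe ratio, compressed input:   {compressed_ratio:.2f}")
```

With uncompressed input, 63 of the 64 blocks in the second backup match blocks already seen, so the ratio is close to 2. After compression, a change in a single block alters the bytes that follow it in the compressed stream, so the block-level matches largely disappear. The toy only measures how many duplicates the deduplication step can find; it says nothing about the total space used by a real product.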

Although it seems counterintuitive, you'll save more space by turning off compression and using deduplication alone than you will by using these techniques together. As I noted earlier, our customers routinely make their backup images 20 to 40 times smaller by applying deduplication, even with compression turned off in the backup software.

Replicating a backup image from the production site to a secondary site -- instead of backing up directly to the secondary site over a WAN connection -- enables faster backups and restores, and minimizes the inconvenience if the network connection goes out during the transfer process.

Posted by Jason Mattox on 06/11/2010 at 12:49 PM

