Virtual Mattox

Problems With WAN-Based Backups -- And How to Avoid Them

When the time comes to move to a new place, you have a lot of options. On one hand, you can stuff the trunk of your car and make dozens of trips back and forth (and hope no more than 20 percent of your things get broken in the process). On the other, you can hire professional movers, then sit back and let them handle everything.

In the same way, you have multiple options for moving VM backup archives from your primary data center to a storage location, and these options also present a wide range of convenience and risk associated with the move.

In my experience helping customers troubleshoot their backup operations, I've learned that one of the most common methods for transferring backup files to a DR or archive site -- using a WAN to back up to the secondary location in real time -- is also one of the most problematic. This method inevitably adds both time and risk to the backup process. The delays and risks associated with backing up over a WAN also apply to recovery, which means WAN-based operations cannot meet the same recovery time objectives (RTO) as other backup and recovery techniques.

However, WAN-based backup and recovery operations are a fact of life. My point is not "don't back up over a WAN"; what I want to emphasize is how the WAN should be used for this function. In a nutshell, use the WAN to transfer completed backup files to DR facilities and to retrieve them when you need them. Don't execute the backup and recovery operations themselves over the WAN, because of bandwidth and other limitations.

When customers want to transfer VM backup images off site, our recommended best practice is to back up locally, then use the WAN to replicate the completed backup file. This approach minimizes backup times and supports superior RTO levels. Here are the four primary challenges with executing backup and recovery over the WAN, and why local execution is a better alternative.

Problem #1: Smaller Pipes = Larger Backup Windows
Image-based backup works by taking a snapshot of the VM, writing the snapshot data to create the backup, then compressing (and possibly transferring) the backup file. The sooner the snapshot is processed, the faster the backup can be completed. In WAN-based backup, snapshot data is written to the backup location in real time over the WAN link. Therefore, the time required to complete the backup depends on the available bandwidth. While WAN bandwidth continues to improve, it will never match local area network speeds, so WAN-based backup will always take longer than local backup. In keeping with my moving analogy, don't use the trunk of your car when you need a moving van.
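To put rough numbers on the bandwidth point, here is a minimal back-of-the-envelope sketch. The 500 GB data set, the 1,000 Mbps LAN, the 45 Mbps WAN link and the 80 percent usable-bandwidth figure are all assumptions chosen for illustration, not measurements from any customer environment.

```python
def transfer_hours(data_gb, link_mbps, efficiency=0.8):
    """Hours to move data_gb gigabytes over a link rated at link_mbps megabits/s.

    efficiency is an assumed fraction of the rated speed that is actually usable.
    """
    megabits = data_gb * 8 * 1000          # decimal units: 1 GB = 8,000 megabits
    seconds = megabits / (link_mbps * efficiency)
    return seconds / 3600


lan_hours = transfer_hours(500, 1000)      # hypothetical gigabit LAN
wan_hours = transfer_hours(500, 45)        # hypothetical 45 Mbps WAN link
print(f"LAN: {lan_hours:.1f} h, WAN: {wan_hours:.1f} h")
```

Under those assumptions the same data takes about 1.4 hours locally and about 31 hours over the WAN. Compression and change tracking would shrink both figures, but the ratio between them is set by the pipe.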

Problem #2: Snapshot Overexposure
During the backup process, the snapshot remains open until the process is complete. To make the backup as fast as possible, you want to close the snapshot as soon as possible. This is especially important when backup windows are tight. The extra time required to complete snapshots over an open WAN connection can make the difference between making or missing the backup window, especially for large VM populations.

When backing up over the WAN, it is critically important to keep the service console optimized to ensure computing resources are available (see my previous post, “5 Best Practices for Improving Image-based Backup Performance”), which adds another item to the system administrator's to-do list. Customers often can complete WAN-based backups when their virtual environments are new, but run into problems later after VM populations have grown.
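The growth problem can be sketched the same way. The function below is a simplified, hypothetical model: it assumes VMs are backed up one after another, that every VM holds the same amount of data, and that 80 percent of the rated link speed is usable. None of these figures come from a real deployment.

```python
def vms_that_fit(window_hours, avg_vm_gb, link_mbps, efficiency=0.8):
    """How many equally sized VMs can be backed up back-to-back within the window."""
    seconds_per_vm = avg_vm_gb * 8 * 1000 / (link_mbps * efficiency)
    return int(window_hours * 3600 // seconds_per_vm)


# Hypothetical 8-hour window and 50 GB per VM
print(vms_that_fit(8, 50, 1000))   # gigabit LAN
print(vms_that_fit(8, 50, 45))     # 45 Mbps WAN link
```

With these inputs the local target has room for dozens of VMs, while the WAN target runs out of window after the first couple. An environment that fits comfortably on day one can outgrow the link quickly.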

Problem #3: Increased Risk of Failure
Most virtual backup solutions are designed to execute locally. They write a continuous stream of data to the backup target. If the data stream is interrupted -- for example, because of a dropped network connection -- the writing process does not pick up again where it was interrupted, but instead starts over from the beginning. At LAN speeds, redoing the backup may not cause a problem and often goes unnoticed. However, rewriting failed backups over a WAN can easily result in unacceptable performance and missed backup windows.
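To see why restarts hurt so much more on a slow link, consider a simple model of my own choosing (it is an assumption, not how any particular product behaves): connection drops arrive at random with a fixed average interval, every drop forces the job to start over from zero, and reconnecting costs no time. Under that model the expected time to finish a job of length T with drop rate λ is (e^(λT) - 1) / λ.

```python
import math


def expected_hours_with_restarts(job_hours, mean_hours_between_drops):
    """Expected wall-clock hours to finish a job that restarts from zero on every drop.

    Assumes drops follow an exponential (memoryless) model; this is an
    illustrative assumption, not a measured failure pattern.
    """
    rate = 1.0 / mean_hours_between_drops
    return math.expm1(rate * job_hours) / rate


# Hypothetical link that drops once every 48 hours on average
print(expected_hours_with_restarts(1, 48))    # 1-hour job at LAN speed
print(expected_hours_with_restarts(30, 48))   # 30-hour job at WAN speed
```

In this model the one-hour local job is barely affected, while the 30-hour WAN job is expected to take roughly 42 hours. The penalty grows exponentially with job length, which is why a restart that goes unnoticed on the LAN can blow the backup window on the WAN.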

Problem #4: Repeat 1-3 for Restores
All of the problems and limitations identified above for WAN-based backup also apply to the restore process. If VM backup archives need to be restored over the WAN (as opposed to recovering the encapsulated files over the WAN and restoring them locally), recovery times and success rates will be hampered by bandwidth limitations and the risk of restarts. For these reasons, local execution is the recommended practice for organizations with aggressive RTOs or any time there is a benefit to backing up and restoring systems quickly.
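Moving a completed backup file is a much easier job to make restartable than a live backup or restore. The sketch below shows the idea in generic terms: it continues a copy from wherever the previous attempt stopped and then verifies a checksum. It assumes the far-side file is reachable as an ordinary path (for example, a mounted share) and that the SHA-256 of the archive was recorded when the backup was written. The function names and chunk size are hypothetical, and real backup products ship their own replication mechanisms.

```python
import hashlib
import os

CHUNK = 1024 * 1024  # 1 MiB per read; an arbitrary choice for this sketch


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(CHUNK), b""):
            digest.update(block)
    return digest.hexdigest()


def resume_copy(src, dst, expected_sha256):
    """Copy src to dst, continuing from whatever dst already holds.

    Returns the number of bytes that were already present and were not resent.
    Raises IOError if the finished copy does not match expected_sha256.
    """
    already_done = os.path.getsize(dst) if os.path.exists(dst) else 0
    with open(src, "rb") as fin, open(dst, "ab") as fout:
        fin.seek(already_done)               # skip what the last attempt delivered
        for block in iter(lambda: fin.read(CHUNK), b""):
            fout.write(block)
    if sha256_of(dst) != expected_sha256:
        raise IOError("checksum mismatch after copy")
    return already_done
```

If the link drops halfway through, calling resume_copy again with the same arguments picks up at the byte where the previous attempt stopped. Once the checksum matches, the restore itself runs against the local copy at local speeds.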

My next post will cover techniques and advantages for backing up locally and replicating to a secondary location, including tips for how to effectively integrate data deduplication into the process.

Posted by Jason Mattox on 06/07/2010 at 12:49 PM

