How Data Deduplication Works

Data deduplication is a technology that lets you fit more backups on the same physical media, retain backups for longer periods, and speed up data recovery. Deduplication analyzes the data streams sent for backup, looking for duplicate "chunks." Only unique chunks are saved to disk; duplicates are tracked in special index files.

In Arcserve Backup, deduplication is an in-line process that occurs at the backup server, within a single session. To identify redundancy between the backup jobs performed on the root directories of two different computers, use global deduplication.
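
The core idea can be sketched in a few lines of Python. This is a minimal illustration, not Arcserve's implementation: the fixed chunk size, the SHA-256 hash, and the in-memory store are assumptions made for the example. A single shared store stands in for a global deduplication index, so redundancy between the streams of two different computers is detected in one place.

import hashlib

# Illustrative chunk size; the chunking Arcserve actually uses is not described here.
CHUNK_SIZE = 64 * 1024

def dedupe_stream(stream: bytes, store: dict) -> list:
    """Split a stream into chunks, keep only chunks not already in the store,
    and return the ordered list of chunk hashes needed to rebuild the stream."""
    references = []
    for i in range(0, len(stream), CHUNK_SIZE):
        chunk = stream[i:i + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:        # unique chunk: keep a single copy
            store[digest] = chunk
        references.append(digest)      # duplicate or not, record the reference
    return references

# One shared store: duplicates across the backup streams of two different
# computers are found in the same index, as in global deduplication.
store = {}
refs_a = dedupe_stream(b"ABCDEFGH" * 32 * 1024, store)   # stream from computer A
refs_b = dedupe_stream(b"ABCDEFGH" * 64 * 1024, store)   # stream from computer B
print(len(refs_a) + len(refs_b), "chunk references,", len(store), "unique chunks stored")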

During the first backup, the incoming data stream is divided into chunks, the hashing algorithm generates an identifier for each chunk, only unique chunks are written to disk, and the index files record every chunk so the stream can be rebuilt later.

As the diagram below shows, the disk space needed to back up the same data stream is smaller for a deduplication backup job than for a regular backup job.

Illustration: Deduplication saves only unique data chunks to disk.

With deduplication, three files are created for every backup session: two index files that track the chunks and where they are stored, and a data file that holds the unique chunks themselves.

The two index files together consume only a small percentage of the total data store, so the size of the drive that holds them is less critical than its speed. Consider a solid-state disk or a similar device with excellent seek times for this purpose.
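
The per-session layout can be sketched as follows. The file names (session.data, session.hash, session.ref), the JSON index format, the chunk size, and the hash are all hypothetical; the sketch only illustrates the roles described above: one data file holding unique chunks and two small index files tracking them.

import hashlib, json, os

CHUNK_SIZE = 64 * 1024   # illustrative chunk size, not Arcserve's actual value

def first_backup(stream: bytes, session_dir: str) -> None:
    """Write one deduplicated session as three files: a data file holding the
    unique chunks plus two small index files (hash index and reference list)."""
    os.makedirs(session_dir, exist_ok=True)
    hash_index = {}    # chunk hash -> (offset, length) inside the data file
    references = []    # ordered hashes needed to rebuild the original stream

    with open(os.path.join(session_dir, "session.data"), "wb") as data_file:
        for i in range(0, len(stream), CHUNK_SIZE):
            chunk = stream[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in hash_index:              # unique chunk: store it once
                hash_index[digest] = (data_file.tell(), len(chunk))
                data_file.write(chunk)
            references.append(digest)                 # always record the reference

    # The index files stay small relative to the data file, which is why the
    # drive that holds them needs speed more than capacity.
    with open(os.path.join(session_dir, "session.hash"), "w") as f:
        json.dump(hash_index, f)
    with open(os.path.join(session_dir, "session.ref"), "w") as f:
        json.dump(references, f)

first_backup(b"ABCD" * 100_000, "session_0001")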

During subsequent backups, new chunks are hashed and compared against the index files; chunks that already exist on disk are not written again, and only references to them are recorded.

Note: Use Optimization for better throughput and lower CPU usage. With Optimization enabled, Arcserve Backup scans file attributes, looking for changes at the file header level. If no changes were made, the hashing algorithm is not executed on those files and the files are not copied to disk. The hashing algorithm runs only on files changed since the last backup. To enable Optimization, select the Allow optimization in Deduplication Backups option on the Deduplication Group Configuration screen. Optimization is supported on Windows volumes only. It is not supported for stream-based backups, such as SQL VDI, Exchange DB level, Oracle, and VMware Image level backups.
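
A rough sketch of that attribute check, assuming file size and modification time are the attributes compared (the source only says changes are detected at the file header level):

import os

def files_to_hash(paths, last_backup_attrs):
    """Return only the files whose attributes changed since the last backup.
    The hashing algorithm never runs on the skipped files, and nothing is
    copied to disk for them."""
    changed = []
    for path in paths:
        st = os.stat(path)
        attrs = (st.st_size, st.st_mtime)        # assumed attribute comparison
        if last_backup_attrs.get(path) != attrs:
            changed.append(path)
        last_backup_attrs[path] = attrs          # remember for the next run
    return changed

# Demo: back up the same file twice; the second pass finds nothing to hash.
with open("example.txt", "w") as f:
    f.write("hello")
attrs = {}
print(files_to_hash(["example.txt"], attrs))   # ['example.txt']
print(files_to_hash(["example.txt"], attrs))   # []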

When you restore deduplicated data, Arcserve Backup refers to the index files first to identify and then to locate each chunk of data needed to reassemble the original data stream.
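
A restore sketch that reuses the hypothetical session layout from the earlier first_backup example: the reference file supplies the chunk order, the hash index locates each chunk in the data file, and the chunks are concatenated back into the original stream.

import json, os

def restore(session_dir: str) -> bytes:
    """Rebuild the original stream: the reference file gives the chunk order,
    the hash index locates each chunk, and the data file supplies the bytes."""
    with open(os.path.join(session_dir, "session.hash")) as f:
        hash_index = json.load(f)
    with open(os.path.join(session_dir, "session.ref")) as f:
        references = json.load(f)

    pieces = []
    with open(os.path.join(session_dir, "session.data"), "rb") as data_file:
        for digest in references:                 # identify each needed chunk...
            offset, length = hash_index[digest]   # ...then find it in the data file
            data_file.seek(offset)
            pieces.append(data_file.read(length))
    return b"".join(pieces)

# Assumes the session written by the earlier first_backup sketch exists on disk.
assert restore("session_0001") == b"ABCD" * 100_000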