
Parallelizing Backups

Several articles in the “GWDG News” deal with one main question about backup:

Identifying changed files takes too much time!

The following text gives an analysis and some approaches to solve it.

Backup of large file systems using ISP/TSM

As the topic “Looking for suggestions to deal with large backups not completing in 24-hours” was recently discussed on the “ADSM-L” mailing list, we decided to update and translate a text published in the GWDG News in November 2016. The growth of data mentioned two years ago is still ongoing, so file admins and backup operators have to face the challenge of completing backups within the specified time window and thus complying with the promised protection against manipulation and data loss. In this article, some approaches to speed up backups using ISP/TSM are discussed. They are explained briefly, with the focus on the opportunities and limitations of each. The second part of the article develops an approach that starts with the basic idea of parallelizing the backup and extends it into different variants of a script, including reporting, error handling and statistics.

Current situation

The amount of data is growing; different analyses show a growth rate of about 20% per year (1,2). In addition to the challenges of storing this data sensibly and efficiently, one aspect often falls out of focus:

How can this growing data be backed up?

The nominal performance of tape systems is growing faster than the data itself (1,2), but even this view is unfortunately incomplete, since the backup process itself often represents the bottleneck. “IBM Spectrum Protect (ISP)” (formerly known as “WDSF/VM”, “DFDSM”, “ADSM” or “Tivoli Storage Manager (TSM)”) has long pursued the approach of backing up only the files changed since the last run instead of periodically doing full dumps. As there are no planned full dumps, this approach is called “incremental forever”.

The advantage is obvious:
Especially for large file systems (say > 10 TB), the amount of data changed daily is relatively small, so that even with many versions of older data (the GWDG standard backup policy allows up to 350 versions in 90 days) the additionally needed space remains small compared to the space needed for the data being backed up (an evaluation of the GWDG ISP servers shows between 15% and 84% on top of the active data, whereby the 84% is an outlier; the average value is 39%). If, however, a full backup is done periodically, the backup capacity must be several times greater than the backed-up data itself. For mixed approaches, e.g. the “Grandfather-Father-Son” principle, the necessary backup capacity is also larger than with “incremental forever” due to the full dumps.

However, “incremental forever” does not solve an essential problem of any incremental backup, namely the question of which data must be backed up at all. ISP identifies the data to be backed up (“backup candidates”) by comparing all directories and files on the computer to be backed up with those from the last backup and remembering the changed files. This process usually runs at a speed of 1 to 2 million objects (files and folders) per hour.

Searching through a 100 TB file system with around 100 million objects therefore takes between 50 and 100 hours; a daily backup of such a file system is not possible with the common approaches.

In addition, there is the problem that within this long search time a considerable amount of data will be changed or even deleted. As a result, ISP throws many error messages (“ANS4037E Object '<NAME>' changed during processing. Object skipped” or “ANS4005E Error processing '<NAME>': file not found”).

How to solve this problem?

Non-working solutions

One possible solution might be the “ostrich method” (head in the sand):
just adapt the service description of the file servers by no longer guaranteeing daily backups but (initially) only backups every two days. As the amount of data grows, the backup frequency has to be adjusted again and again; at about 150 million objects, only a monthly backup would be left :-(

At this point, the second “zero solution” should be considered:
giving up the backup of the corresponding file systems completely, rather than lulling the users into a (data) security that no longer exists.

Since searching for “backup candidates” is the problem with incremental backups, one could consider biting the bullet and doing full backups, as tapes are relatively inexpensive compared to disk storage. Unfortunately, in our experience, this is no solution either:
For a full backup of 100 TB, theoretically only about 24 hours are required over a 10GE connection. Actual operational experience, however, shows that only about 2 to 4 TB per day are effectively saved, so that each full backup takes about 25 to 50 days, i.e. approximately the same time as an “incremental” backup.

Acceleration with ISP on-board tools

IBM offers several on-board tools to speed up the backup process:

Simplified identification exclusively via the change date

Usually the ISP client compares numerous metadata to select the objects for the next backup: besides the date of the last modification, these include file size, checksum and access rights / ACLs. With the option -INCRbydate, in the interactive call dsmc i and/or in the client schedule, this check can be reduced to comparing the modification date of an object with the date of the last backup, which speeds it up considerably.
However, the option also has some problems:
Especially if no snapshots are used or if a backup fails, files that are modified or created while the backup is running will be skipped during the next run with -incrbydate if they have not been modified again. IBM therefore strongly recommends running a normal “incremental” regularly (4). Similar problems can occur if client and server have different system times.
Another important point: deleted files are not recognized and remain in the backup, and files that enter the system with an old modification date, e.g. through a software installation, are not backed up at all!
In summary, the -INCRbydate option is only suitable for daily backups in combination with a normal backup at the weekend, and only if the normal backup takes just slightly longer than 24 hours.
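As a sketch (the path /data is a placeholder; incremental and -incrbydate are standard dsmc syntax), such a combination could look like this:

	# daily, fast run: compare only the modification date
	dsmc incremental /data -incrbydate

	# weekend run: normal incremental to catch everything -incrbydate misses
	dsmc incremental /data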

Turning off ACLs and checksums

Processing ACLs (and checking them beforehand) and creating checksums slow down the identification process; both can be influenced by several client options. However, it should be carefully considered whether the relatively small speed gain sufficiently outweighs the loss of information.
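For Unix/Linux clients, a sketch of the corresponding dsm.sys entries could look as follows (the exact option set should be checked against the client documentation of the version in use):

	* do not back up ACLs at all
	SKIPACL              YES
	* skip the checksum and size comparison of ACL data
	SKIPACLUPDATECHECK   YES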

Parallelization of backups for multiple file spaces

If the data to be backed up is spread over several partitions (file spaces), the backup can be distributed to parallel streams using the RESOURCEUTILIZATION option (in contrast to the IBM documentation, significantly more than 10 are possible; more than 100 streams have been reported in practice). This makes better use of the bandwidth and considerably reduces the search time through parallelization. Since this also generates additional sessions on the ISP server side, the MAXSESSIONS server setting may have to be increased. The approach only works when actually backing up multiple file spaces. As a workaround, a single file space can be split into seemingly multiple file spaces with the VIRTUALMOUNTpoint option, after which this approach works as well (see also the excursus on virtual mount points for Windows clients).
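A minimal sketch of the corresponding client options (server name, address and paths are placeholders):

	* dsm.sys stanza on a Linux client (illustrative values)
	SErvername           ISPSERVER1
	   TCPServeraddress     isp.example.org
	   RESOURCEUTILIZATION  10
	* split one large file system into several file spaces
	   VIRTUALMOUNTPOINT    /bigfs/institute01
	   VIRTUALMOUNTPOINT    /bigfs/institute02

On the server side, the MAXSESSIONS value in dsmserv.opt may need to be raised accordingly.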

Explicit backup of changed files only

If information is available about which files have changed since the last backup and which files have been deleted since then, ISP can back up exactly these files. Instead of an “incremental backup”, a “selective backup” with the explicit specification of these files is then possible.
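As an illustrative sketch (the list files are placeholders), such calls could look like this:

	dsmc selective -filelist=/tmp/changed_files.txt

or, for the files deleted since the last backup,

	dsmc expire -filelist=/tmp/deleted_files.txt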

The basic principle of “selective backup” is also used in the following approach and by the “file systems that support fast backup”, but it requires two explicit lists: one of the modified and one of the deleted files.

JournalBasedBackup / Filepath Daemon

IBM has been offering the JournalBasedBackup (JBB) method since TSM 5.
The JBB daemon (or filepath daemon) monitors the file system to be backed up and collects information on new, modified and deleted files. During the backup, the TSM/ISP client uses this information in the same way as for a selective backup. The effort for identifying the backup candidates is eliminated and the backup is reduced to the transfer of the new or changed data.

Tests done by the GWDG with a Linux file server with about 150 TB of capacity distributed over 22 file spaces were not successful: the resource requirements of the JBB were extensive, but the time saving, especially due to the regular re-indexing, was rather limited. In other constellations, the JBB may bring clear advantages.

There is also an important limitation:
Journal Based Backup only works with local file systems; CIFS/NFS shares and cluster file systems are not supported.

Hint:
Optimizations for data transmission can be found in the Performance Tuning Guide (V7.1.6).

Hybrid approach with snapshots

Numerous file systems and most filers offer the possibility to create snapshots. A hybrid approach can be implemented by combining snapshots and ISP backup:

ISP backups are done as often as possible, e.g. weekly; in between, snapshots are taken.

In addition to the considerable expansion of the backup time window, there is usually the positive side effect that the end users can access the snapshots directly, which relieves the admins of numerous restore requests. If the backup itself is also based on a snapshot, the problem of open files is solved as well (error message “ANE4987E Error processing '<NAME>': the object is in use by another process”).

A prerequisite for this approach is, of course, that the file systems support snapshots – and in sufficient quantities.
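A minimal sketch of such a snapshot-based backup on a Linux client with LVM (volume group, snapshot size and mount point are assumptions; the -snapshotroot option associates the snapshot content with the original file space):

	# create and mount a read-only snapshot of the data volume
	lvcreate --snapshot --size 50G --name data_snap /dev/vg0/data
	mount -o ro /dev/vg0/data_snap /mnt/data_snap

	# back up from the snapshot, storing the files under the original path /data
	dsmc incremental /data -snapshotroot=/mnt/data_snap

	# clean up
	umount /mnt/data_snap
	lvremove -f /dev/vg0/data_snap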

File systems that support fast backup

Some file systems / filers support a fast backup using ISP by identifying the necessary backup candidates and making them available to the ISP client. (This list is only a selection):

IBM Spectrum Scale (ISS, formerly GPFS)

IBM's cluster file system naturally supports backup with ISP and even ships its own script, mmbackup. It not only uses the file system's information about the backup candidates, but can also parallelize the data transfer over several (ISP) nodes and GPFS servers.
However, mmbackup does not simply run out of the box: the initial creation of the configuration requires a little trial and error, but afterwards mmbackup runs both stably and performantly.
In addition to ISP, IBM Spectrum Scale also offers close integration with HPSS as an HSM system, so the problem can also be reduced by (partially) transferring the data to HPSS, whereby ISP/ISS can also back up very large data volumes in a comparatively short time.
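As a sketch of a typical call (file system path, server stanza name and node class are assumptions):

	# incremental backup of the ISS/GPFS file system, distributed over the nodes
	# of the class "backupNodes" and sent to the ISP server stanza SERVER1
	mmbackup /gpfs/gpfs0 -t incremental --tsm-servers SERVER1 -N backupNodes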

NetApp SnapDiff

NetApp has also been supporting backup of its own NAS filers since TSM 5 in a variety of ways. In addition to NDMP, the SnapDiff function also accelerates the incremental backup. SnapDiff transfers the changes to files and directories between two snapshots to the ISP client. The integration goes so far that the ISP client can even trigger the required snapshots on the filer and after a successful backup can delete the previous one on its own.

Since the SnapDiff function compares only two snapshots, but does not take into account in any way whether the last backup was successful, the same problems arise as when using the -INCRbydate option: errors from the last backup are not compensated and a regular normal incremental backup is strongly recommended. mmbackup, in contrast, takes into account the backup status of all data and is fault-tolerant with regard to the problems mentioned above.

Basically, every cluster or scale-out file system should be able to provide a list of new, modified and deleted files, since this (meta) information is necessary for the consistency of the data (and especially of the caches) on the cluster nodes. In practice, the problems are that this information is not easily accessible and that the manufacturers provide no tools to access it. Quantum has responded to customer demand and is currently examining how this information can be made available for the StorNext file system. DELL/EMC also offers scale-out NAS systems with its ISILON line. Version 7 of the operating system, called OneFS, offers the possibility to log changed files, but the resource requirements are so high that the entire system is lastingly impaired. OneFS 8 is expected to bring improvements here.

Two (simple) ideas for all file systems

For all users who do not run IBM Spectrum Scale (for which mmbackup is the best solution!) and for whom neither full backups nor NDMP are an option, the question remains: what to do now?

As previously mentioned, identifying backup candidates takes most of the time during ISP backup. This process examines the entire file tree of the file system to be backed up – sequentially in a single thread. The solution is to turn this one process into several parallel processes.

Users can usually be divided into groups (e.g. working groups or institutes). Especially in academic environments, this classification can also be found in file systems, since there is often a folder level with faculties or institutes for easier access control, and below this level are the user and workgroup directories.

Variant 1

Parallel backup is possible by setting up a separate node for each faculty or institute instead of a single ISP node for the entire file system and performing the backup “faculty by faculty” / “institute by institute”. Instead of a single process, several processes then search the file system in parallel (file servers can handle even several hundred parallel processes) and the search times should drop significantly. In practice, this approach reveals at least two problems:

Under Unix, the nodes can be separated relatively elegantly using “VIRTUALMOUNTs”; for Windows, you either have to create exclude.dir rules for each node, which is both complex and error-prone, or work with a trick (see the excursus “VIRTUALMOUNTPOINTS for Windows”).

Variant 2

Often, however, the users on the file systems are not organized in groups, but all directories lie flat next to each other at the entry level. Creating a separate ISP node for each user directory repeats the second problem mentioned above and, given the number of users, is very time-consuming.
It is therefore easier to distinguish the directories according to a pattern, for example by their first character(s): ^[aA], ^[bB], ..., ^[zZ], ^[0-9] (ISP even supports regular expressions at this point!).
Depending on whether one or two leading characters are used, this yields 27 or 729 ISP nodes, which automatically include all new directories. Unfortunately, the regular expressions (RegEx) only capture the directories that still exist, not the deleted ones. This can be remedied by additionally backing up all directories of the start path without their subdirectories.
Although this variant is often better than the first, it does not meet all expectations.

In summary, there are certainly application scenarios for both approaches, but experience at the GWDG shows that the effort is quite high and that there are always a few power users who need special treatment on top of these two approaches in order to achieve a usable benefit.

One approach for all file systems

Idea and first steps using BASH

Already in the last decade, the (then) Generali Versicherungs-AG was faced with the problem outlined at the beginning, and its backup admin Rudolf Wüst extended the aforementioned approach by a decisive idea. From this he developed a solution that successfully parallelized the “search problem” with up to 2000 threads. Mr. Wüst kindly shared his extension, and the author took it up and developed it further within the scope of his work at the GWDG.
The goal of a practicable solution must be to capture all directories, store them in a single ISP node and still parallelize the search. This can be achieved by executing a script instead of a simple backup call, which in turn starts several parallel threads that back up the individual directories. The core of the script consists of a loop of the following form (example 1).

For each (find all directories in given start path)
{
	run a backup for this specific directory
}

Example 1: pseudo code

Instead of one “incremental backup” of the entire file system, many partial incremental backups are performed, one for each directory.

The deleted directories are recorded with a subsequent backup of the start path without subdirectories; the last specification (-subdir=no) is extremely important, otherwise a normal “incremental backup” of the entire file system is performed.

As BASH source code this looks like example 2:

startpath=<path to start with>
folderlist=<path to a file containing foldernames>

# collect all first-level directories below the start path
find "$startpath" -xdev -mindepth 1 -maxdepth 1 -type d -print > "$folderlist"

# start one partial incremental backup per directory, all in parallel
for i in $(cat "$folderlist")
do
	dsmc i "$i" -subdir=yes &
done

# back up the start path itself without subdirectories (records deleted directories)
dsmc i "$startpath" -subdir=no
rm "$folderlist"

Example 2: source code BASH

During the first tests you will find out that the script in its present form does indeed start as many threads as there are directories. On the one hand, this brings the computer that performs the backup to its knees; on the other hand, the “MaxSessions” limit of the ISP server is probably reached almost immediately and the server refuses further connections.
The remedy is a counter that simply waits once the allowed number of threads is reached. In the BASH, the spawned backup processes carry the process ID of the script itself as their parent process ID, so these threads can be counted separately even if the script is run for several file systems simultaneously.

As BASH code the loop looks like example 3.

ppid=$$;	# process ID of this script, parent of all backup threads
startpath=<path to start with>;
folderlist=<path to a file containing foldernames>;
maxthreads=<max. number of parallel threads>;
find "$startpath" -xdev -mindepth 1 -maxdepth 1 -type d -print > "$folderlist"

while [ -s "$folderlist" ]
do
	# count the dsmc processes whose parent is this script
	nthreads=$(ps axo ppid,cmd | awk -v p="$ppid" '$1 == p && /dsmc/' | wc -l)
	if [ "$nthreads" -lt "$maxthreads" ]
	then
		# get new start path
		folder=$(head -n 1 < "$folderlist");

		# back up the current folder in the background
		dsmc i "$folder/" -subdir=yes -quiet >> "$ppid.log" &

		# remove first line from folderlist
		sed -i '1 d' "$folderlist"
	else
		sleep 5; # wait for another thread to complete
	fi;
done

# back up the start path itself without subdirectories
dsmc i "$startpath" -subdir=no

# wait for all running threads at the end
nthreads=$(ps axo ppid,cmd | awk -v p="$ppid" '$1 == p && /dsmc/' | wc -l)
while [ "$nthreads" -gt 0 ]
do
	>&2 echo "Waiting for $nthreads threads to end"
	sleep 60;
	nthreads=$(ps axo ppid,cmd | awk -v p="$ppid" '$1 == p && /dsmc/' | wc -l)
done
rm "$folderlist";

Example 3: extended BASH code

In this extended form, the essential goals are now achieved, but one cannot be completely satisfied yet.

An excursion to PowerShell

The obvious candidate for a Windows counterpart was PowerShell. However, Unix affinity combined with reservations about PowerShell and, above all, the duplicated effort ended this project after some work, without an executable version ever having been created.

PERL - one solution for all (?) worlds and further development of the simple approach

The most obvious solution was initially overlooked: a programming/scripting language available for all operating systems, neither BASH/MinGW/WSL nor PowerShell/PowerShell Core on Linux, but PERL.

PERL offers numerous functions, also for access to files and directories, that are encapsulated by the respective implementation in such a way that the actual call is independent of the operating system. File system paths can even be specified in both Unix and Windows nomenclature (i.e. with / or \ as directory separator) and, thanks to the File::Spec->canonpath function, are converted to the correct format. To a large extent, the source code therefore does not need to be adapted individually for each platform. Exceptions are the paths to the binaries, i.e. /opt/tivoli/tsm/client/ba/bin/dsmc or C:\Program Files\Tivoli\TSM\baclient\dsmc.exe, and (currently) the readout of the directory tree, which is done with find (Linux) or Robocopy.exe (Windows).
Another reason for PERL is that it allows the use of threads in a simple way and also makes it easy to ensure that only a certain number of (sub) threads run at the same time, so that further threads are only started after previous threads have completed, and all this independent of the operating system!

The procedure outlined for the BASH is thus reduced to three essential steps in PERL:

  1. create a new subthread with the fork() function,
  2. branch the source code into two paths, the main script and the code for the subthread:
    in the main script only the number of started threads is incremented, while the subthread performs the partial incremental backup,
  3. check whether the desired number of threads has been reached; if so, wait for a thread to terminate before starting a new one (see the sketch below).

In detail, the source code is of course somewhat more complex and also takes into account, for example, that starting a subthread may not have been successful.
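As a minimal sketch of these three steps (not the actual GWDG script; the directory list, the thread limit and the dsmc call are assumptions):

#!/usr/bin/perl
use strict;
use warnings;

my $maxthreads = 10;            # assumed limit, adjust as needed
my @folders    = @ARGV;         # directories to back up, passed as arguments
my $running    = 0;

foreach my $folder (@folders) {
    # step 3: if the limit is reached, wait for one child to finish first
    if ($running >= $maxthreads) {
        wait();
        $running--;
    }

    # step 1: create a new subthread
    my $cpid = fork();
    die "fork() failed: $!" unless defined $cpid;

    # step 2: branch into main script and subthread
    if ($cpid) {
        $running++;             # main script only counts the started thread
    }
    else {
        # the subthread performs the partial incremental backup
        exec('dsmc', 'i', "$folder/", '-subdir=yes', '-quiet');
        exit 1;                 # only reached if exec() fails
    }
}

# collect all remaining children
while (wait() != -1) { }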

Further development: diving deeper into the directory tree and starting parallel threads based on multiple directory levels

The tests with parallelization directly below the base path showed exactly those effects that were already mentioned for the parallelization via institutes: individual directories are (usually) larger than all the other directories at the same level combined, so the speed gain is considerably lower than expected or desired. A better balance can only be achieved with additional directories; these can be found by searching further levels below the start path in addition to the first, highest directory level, and then backing up via all of these directories. The first problem is that these directories are nested, i.e. a partial backup of a directory from a higher level would also include subdirectories that are already backed up in other parallel threads. In this script, the problem was solved by backing up all directories above the configured “dive depth” with the option -SUbdir=No, i.e. only the contents of these directories including the names of their subdirectories, but not the subdirectories' contents. In a second step, the directories at the lowest configured level are backed up together with their subdirectories (option -SUbdir=Yes). Since backups without subdirectories are usually much faster, the directories with subdirectories are started first and those without second.
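To illustrate (hypothetical paths, dive depth of 2):

	# level 1, above the dive depth: contents only, no recursion
	dsmc i /bigfs/inst01/ -subdir=no

	# level 2, the dive depth itself: full recursion into the subtree
	dsmc i /bigfs/inst01/user42/ -subdir=yes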

Evaluation of the individual runs

Not only for profiling (see below) but also to create a summary of the backup, each subthread writes its output to a separate file whose name contains its own ID in addition to the process ID of the script. Thus, even if the script is aborted, the output can be clearly assigned.

Although the overall evaluation can only take place at the end of the backup, a sufficiently deep dive into the directory tree quickly leads in practice to several thousand or even hundreds of thousands of small output files and thus to considerable problems. Therefore, each subthread appends the content of its output file to a central log file after its backup has completed, and this log file is evaluated at the end of the script. When writing, the information whether subdirectories were processed is also stored, and for profiling the runtime is already converted into seconds and saved. The return value of the backup call is added as well.

In the current implementation (December 2018), the final evaluation sums up

If necessary, the values are converted to a common unit (bytes, seconds).

In addition, the number of

are counted in each case and as a total.

From the sum of the elapsed times and the runtime of the loop over the directories (wall clock time), the script calculates a parallel speedup, which shows how much faster the parallelized run is compared to the sum of the individual times.
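For illustration (made-up numbers): if the partial backups take 20 hours in total, but the loop over the directories finishes after 2 hours of wall clock time, the parallel speedup is 20 h / 2 h = 10.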

Performance optimization using profiling

The runtime of the parallel backup is essentially determined by the runtimes of the individual backup runs. Without a detailed measurement (but based on a comparison of the script's runtime with the backup call commented out), it is assumed that the runtime of the PERL statements themselves is negligible in comparison. The aim of the optimization is the “correct” order of the directories, so that

  1. the large, long-running ones run in parallel as far as possible,
  2. the large ones are started first, because a remaining imbalance then only affects the shorter runtimes of the smaller directories.

Since the runtime of the backups cannot be estimated in advance, the optimization is based on the last backup (and assumes no dramatic changes, which could only be predicted by complex and therefore time-consuming analyses). As described above, the subthreads write their runtime in seconds to the central log file at the end of their backup, so that the evaluation produces a list of all directories with their respective runtimes. This list is sorted by descending runtime and written to a profile file.

The next time the script is called, it first creates a list of all directories to be backed up. In the next step (this part does not yet work for Windows and has therefore been factored out again), the backup script compares this list with the entries from the profiling file.

Directories that are in the profiling file but no longer exist in the directory list are ignored. New entries in the directory list for which there is no runtime in the profiling file are assigned a very long runtime (10^10 seconds, i.e. more than 316 years) and are therefore ranked first. If there is no profiling file, the directories are processed in the order in which they appear in the list from the directory tree.
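A minimal sketch of this ordering step in PERL (the profile file name and its “directory;seconds” format are assumptions, not the format used by the GWDG script):

use strict;
use warnings;

# read the runtimes of the previous run, if a profile file exists
my %runtime;
if (open(my $prof, '<', 'backup_profile.txt')) {
    while (my $line = <$prof>) {
        chomp $line;
        my ($dir, $seconds) = split /;/, $line;
        $runtime{$dir} = $seconds;
    }
    close $prof;
}

# directories found for the current run, one per line on STDIN
my @dirs = <STDIN>;
chomp @dirs;

my @ordered;
if (%runtime) {
    # sort by descending runtime; unknown directories get 10^10 s and come first
    @ordered = sort { ($runtime{$b} // 1e10) <=> ($runtime{$a} // 1e10) } @dirs;
}
else {
    # no profile yet: keep the order from the directory tree
    @ordered = @dirs;
}

print "$_\n" for @ordered;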

At the end of the evaluation, the profiling list is overwritten.

Open Issues / Outlook

There are still some questions left, for example about transferring the summary to the server log.

For a good solution, error handling should be added to make the script fault-tolerant to certain situations.

It is also possible to split the work steps “identify directories” and “partial incremental backup”, so that for very large file systems the list of directories to be processed is refilled after the backup window has expired, while only one or a few threads continue running; however, increasing the dive depth is probably the better approach.

One problem that cannot be solved is that “partial incremental backups” do not change the “Last Backup” attributes of the nodes or file spaces, and of course this is not done within the scope of the outlined script either. Writing directly to the DB2 of the ISP server should be avoided, as this affects IBM's warranty: IBM expressly prohibits direct access to the ISP DB2 outside of corresponding instructions given by support.

In addition, how do you speed up the restore?

The previously mentioned approaches using ISP on-board means and the outlined parallelization approach only work for the backup. If many files are to be restored from the backup, this is very easy with the approaches that use several nodes for one file space, since a separate restore must run for each node anyway and these processes run in parallel. For the parallel-threads approach, an adaptation of the restore based on a file list is easily possible: instead of a “folder list”, a file list is used for the restore.
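As a sketch (list files and target path are placeholders), such a parallel restore could then be started like this:

	dsmc restore -filelist=/tmp/restore_part01.txt /restore/target/ &
	dsmc restore -filelist=/tmp/restore_part02.txt /restore/target/ &
	wait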

However, it should be noted that in an environment with a tape library as a storage backend, the number of drives usually limits the performance of the restore. Furthermore, ISP usually organizes the restore (without the -disablenqr=yes option) so that the tape mounts are optimized. If a file list is processed in parallel by numerous parallel threads, the server cannot optimize the tape accesses. However, if a disk-based FILE or container pool is used, the parallel restore over numerous threads is faster. If the data is stored on two servers via server replication, the restore can also be distributed over both servers and thus additionally accelerated.

Unfortunately, experience shows that “full restores” also involve enormous effort when parallelizing and can only be accelerated unsatisfactorily.

Availability / Access to source code / Alternatives

It can be assumed that neither the author of the original idea nor the GWDG can claim to be the only one to have had and implemented the idea outlined. Rather, many TSM/ISP users may have faced the same problem and found similar solutions.

A commercial implementation that follows a similar parallelization approach can be found in the product “MAGS” by General Storage. In addition to binding support, “MAGS” offers regular further development and uses several NAS nodes for the parallelization with ISILON scale-out systems. A more detailed product analysis is not given here; the individual benefit must be determined by each user.

The script mentioned in this article is freely available in the GWDG's GitLab under the Apache 2.0 license. The scripts may be used and modified without restrictions. We look forward to receiving your feedback and suggestions.

Transferability to other backup solutions

The approaches presented address the problem of file identification and can therefore be applied to all other tasks where a file list is to be created. If you replace the call of the ISP CLI with another CLI call, you can also find, in parallel, all files filtered by any attribute that find supports via appropriate parameters. You can also add another loop that performs arbitrary operations on all entries of a complete file list. This also allows other backup solutions that can process a directory or file list to be optimized in the same way.

Acknowledgement

The author thanks Gerd Becker (Empalis GmbH), Wolfgang Hitzler (IBM) and Manuel Panea (MPCDF) for proofreading the original article and making suggestions for changes and improvements.

Special thanks to Mr. Rudolf Wüst (Generali Shared Services S.c.a.r.I.) for his generosity in sharing his ideas.

Excursus

Workaround for VIRTUALMOUNTPOINTS for Windows clients

For UNIX, Linux, and MacOS it is possible to configure individual directories as virtual drives in TSM/ISP. This simplifies the configuration of the backup, since the virtual drive can be specified directly as backup source instead of specifying the actual drive and excluding all directories that are not to be backed up using exclude rules.

Unfortunately, there is no comparable function for Windows. This also eliminates the possibility of parallelizing the backup via different virtual drives.

However, if only individual directories are to be backed up, but not the remaining root directory in parallel, numerous exclusion rules usually have to be created in the form of exclude.dir statements in the dsm.opt. This is highly error-prone; in addition, directories newly created in the root directory of a drive are not automatically excluded but are included in the backup. The following workaround simplifies the configuration and parallelization of the backup under Windows:

  • create an advanced share for each directory you want to back up.
  • by adding a $ to the share name, the Windows SMB service also does not list it on the network map (“hidden share”).
  • access to this share is only required by the local admins of the backup node
  • the paths to be backed up can be accessed via the loopback device:
    DOMAIN \\127.0.0.1\<Share1>
    DOMAIN \\127.0.0.1\<Share2>

From the point of view of the TSM/ISP client, the shares are independent network shares and can be backed up in parallel!
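As a sketch (share and directory names are examples), the setup could look like this:

	rem create a hidden share for the directory to be backed up
	net share Institute01$=D:\Institute01 /GRANT:Administrators,FULL

	rem entry in the dsm.opt of the corresponding node
	DOMAIN \\127.0.0.1\Institute01$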

Threads in PERL

PERL offers its own thread module and thus a much more elegant method than the complex solutions for the BASH or PowerShell:
Using the fork() function, the PERL interpreter creates a second thread that starts at exactly this point in the script. This thread processes all of the following statements in the same way as the original script. It therefore makes sense to use an IF statement to branch into the different tasks. The return value of the fork() routine provides the basis for this: if the value is not defined, no thread could be created; if the value is true, a new thread exists and we are in the parent routine in which this IF was executed. In the child thread, the value is also defined, but false. The query could therefore look as follows:

my $cpid = fork();
if (! defined $cpid)
{	# forking failed!
       exit 1; # abort the script
}
if ($cpid)
{	# parent process
	# … further commands
}
else
{	# child process
	# … further commands
}


Collecting / waiting for the started threads is also much easier: The wait() function waits for a child thread to end, so they can all be waited for with a simple loop:

while (wait() != -1 ) ;