Exercise is good for your health

To sell backup systems, vendors like to claim that the world is divided in two: those who have lost data and those who will. That forgets another divide, between those who take care of their health and those who expose themselves to problems...

Probably because we are always called in after the disaster, we hear the same complaint far too often:

If only I had checked that it worked!

A victim

You probably can't imagine the anger and frustration we encounter when customers have spent a sizeable budget on a magic box that is supposed to protect them, only to find that an error somewhere made the solution ineffective and that they have lost everything after all...

So, for those who have not yet had the (bad) luck of going through this grief, here are a few short stories to scare yourselves with. And since we are not going to leave you anxious, we also offer a way to regain confidence.

© vlanka @ pixabay

If we consider a network infrastructure as a set of "connected things", this is a matter of maintenance or technical inspection. It is the same as with cars: manufacturers encourage the former (otherwise the warranty is lost) and some states require the latter (in France, without this safety check, you’ll pay a 135€ fine and the car may be immobilized).

Personally, because of our little demiurge mindset, we prefer to see our infrastructure as a living being that we have created, that lives and that evolves. So, to us, the problem is a lack of training and exercise.

As with health in general, we have to move from the “it would be good to do it” stage to the “we do it” stage, and therefore plan those workouts and exercises. Not only does it cost very little, but the benefits are real.

Little stories to scare yourself

As always, these stories are drawn from our own experience and, as professional secrecy requires, we have anonymized and adapted them out of respect for the people and companies involved (and their reputation).

Ransomware pierces defenses

Sylvain is a system administrator and has been in charge of backing up his company's data for several years when his management offers him a promotion to the new position of CISO (security management). Before officially taking up the post, he is relieved of his current duties (responsibility for the backups passes to a colleague) and he begins a year-long work-study training program.

While he is learning his new role, his company suddenly falls victim to ransomware... All company data is encrypted and, until the computer system is rebuilt and the data restored, employees will have to make do and work the old-fashioned way. Sylvain must put his training on hold to save what can still be saved and rebuild what has been destroyed.

Unfortunately, when he finally finishes installing a new file server, he realizes that the backups he was counting on have not been made since he handed them over to his colleague... His successor, already very busy with his own tasks, had not considered them a priority and had "postponed" them. Six months of production went up in smoke.

After rebuilding the computer system, Sylvain was fired for gross misconduct. That won't bring back the data or pay any ransom, but it saves face: it was the CISO's fault, not the company's, which was merely the unfortunate victim of circumstances.

Cascading problems

Charlène is a nationally renowned architect, to the point of having set up her own office and hired other architects to handle the many projects entrusted to her. As she does not consider herself competent in IT and does not have the budget to hire a permanent administrator, she called on a specialized company to manage, among other things, her file server (with two disks in RAID1, mirroring each other) and its two backup boxes (including one at her home).

For several years, everything went well: Charlène paid for the maintenance, and the company set up all her machines, took care of a move to new offices and, when one of the server disks failed, replaced it quickly.

Until the new disk failed too and Charlène discovered that, despite her best efforts, she would not be able to recover her data...

Of course, the technician who carried out these operations has since left the outsourcing company, and his manager is just as surprised to discover the minefield he left behind, which has finally exploded in his client's face.

The case is now in the hands of lawyers and IT experts (with insurers as guest participants). In a few years, a judge will be able to determine who is responsible and the amount of damages owed by one party to the other. In the meantime, that won't bring back three years of lost files.

Do exercises

We could have told you more stories of the same kind. Each time, you would have noticed that, after setting up a backup system, valiant heroes tend to leave it unattended in its corner. Businesses then consider themselves safe from problems thanks to this foolproof system (after all, that's what the white hats promised them).

Plan

In all of our examples, the damage could have been avoided if someone had taken the trouble to check that everything was working as expected. But, as is often the case, this task is considered secondary and is postponed ad vitam aeternam...

And we can understand it. Caught in the uninterrupted flow of tasks to do, we don't see how to free up our time. And since we consider these exercises daunting, our brains find plenty of other things to do instead and eventually forget about them.

© audeCD @ pixabay

If you have trouble making this a habit and forcing yourself to do these exercises, the most effective approach is still to plan them formally. Whether in your calendar or your ticket manager, it is easy to create recurring tasks (e.g. with Kanboard, which we use, but also with GLPI, Nextcloud or Thunderbird).

You can of course adapt the frequency to how busy your infrastructure is. The more things move, the more often you have to check that everything still works.

Put differently, for data backups, the date of the last successful exercise marks the most recent data you can be sure of recovering after a disaster. So don't put it off too long.
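One inexpensive check that can be part of such an exercise is measuring how old the newest backup really is, which is exactly what nobody did in Sylvain's story. The following is only a minimal sketch, assuming backups land as ordinary files under a single directory; the function names, the `/srv/backups` path and the 24-hour threshold are hypothetical and should be adapted to your own setup.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_hours(backup_dir: Path, now: Optional[float] = None) -> Optional[float]:
    """Age in hours of the most recently modified file, or None if no file exists."""
    now = time.time() if now is None else now
    if not backup_dir.is_dir():
        return None
    mtimes = [p.stat().st_mtime for p in backup_dir.rglob("*") if p.is_file()]
    if not mtimes:
        return None
    return (now - max(mtimes)) / 3600.0


def backup_is_fresh(backup_dir: Path, max_age_hours: float, now: Optional[float] = None) -> bool:
    """True only if at least one backup file exists and is recent enough."""
    age = newest_backup_age_hours(backup_dir, now)
    return age is not None and age <= max_age_hours


if __name__ == "__main__":
    # Hypothetical location and threshold, for illustration only.
    directory = Path("/srv/backups")
    if backup_is_fresh(directory, max_age_hours=24):
        print("Newest backup is recent enough.")
    else:
        print("No recent backup found: investigate before you need it.")
```

A recent file only shows that something was written; it says nothing about whether the data can be restored, which is why the exercise itself remains necessary.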

Proceed

During an exercise, the goal is to simulate a problem to check that the protective mechanisms (automatic or manual) are effective. Here are some examples:

And since we are mainly talking about redundancy, also check human redundancy: if an administrator has implemented a solution, the exercise must be performed by someone else.

Even if the administrator is present to deal with any problems, the exercise should be conducted as if he were absent.

Hence the value of writing formal procedures and updating them at each exercise. If the administrator is unavailable, this document will help anyone resolve these issues. Everybody wins: employees gain skills and the company gains resilience.
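To make the idea of an exercise more concrete, here is a minimal sketch of one possible check: after restoring a backup into a scratch directory, compare it file by file with a reference copy. It assumes both trees are reachable as local directories; the function names are hypothetical, and the restore itself is done beforehand by whatever backup tool you use.

```python
import hashlib
from pathlib import Path
from typing import List


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so that large files do not fill the memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def compare_trees(reference: Path, restored: Path) -> List[str]:
    """List the reference files that are missing or different in the restored tree."""
    problems = []
    for src in sorted(p for p in reference.rglob("*") if p.is_file()):
        rel = src.relative_to(reference)
        dst = restored / rel
        if not dst.is_file():
            problems.append(f"missing: {rel.as_posix()}")
        elif sha256_of(src) != sha256_of(dst):
            problems.append(f"differs: {rel.as_posix()}")
    return problems
```

An empty list means every reference file was found intact in the restored tree. Files that changed after the backup was taken will legitimately show up as differences, so in practice the reference is better a small set of known test files than the whole live server.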

Evolve

Ideally, each exercise goes smoothly: the mechanisms work as expected, the procedure is adequate and everything goes well.

In reality, these exercises almost always point to a problem somewhere. And that is precisely their value. Once you have run into a problem, you can apply these two management rules from GTD (Getting Things Done):

Some corrections will be urgent ("the exercise broke everything") and therefore made immediately. Others will be less urgent ("the protection is not as effective as expected") and can be planned for later.

In any case, at the end of each exercise, you gain a better vision of the resilience of your infrastructure and the opportunity to improve it even more.

Once it has become routine, this continuous improvement is like Zen.

At first, I found it painful. Then, little by little, I got a taste for it. Now I'm a running addict.

A long-distance runner

© 12019 @ pixabay

At the arsouyes

To organize, synchronize and avoid forgetting important tasks, we use Kanboard and create tickets for whatever we need or want to do.

Among all these tasks, we have created a recurring "PRA Test" task (PRA is the French acronym for a disaster recovery plan), which we perform once a month and which includes, among others, the following subtasks:

Over time, these exercises have allowed us to detect and correct a number of problems...

To get an idea of the "cost" of these exercises, we time each of the subtasks (Kanboard takes care of this automatically when we check the box). Last month, it cost us 0.44 hours (or 26 minutes, including 15 spent waiting for duplicate to list the remote files).

And now?

Over a year, our monthly exercises cost us less than 2 days (of 7 hours each), or about 1% of a full-time position (218 working days). I'll let you do the math in euros or dollars.

It’s not much, especially when compared to the benefits in terms of experience gained and reduced consequences in the event of failure. After a disaster, it would take well over two days to rebuild what can be rebuilt, and to mourn the rest.
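For readers who want to redo the calculation with their own figures, here is a small sketch of the arithmetic. The constants are the ones quoted in this article; extrapolating a single month's measurement (0.44 hours) to twelve identical months is our simplifying assumption, and the function names are purely illustrative.

```python
HOURS_PER_EXERCISE = 0.44      # measured last month (about 26 minutes)
EXERCISES_PER_YEAR = 12        # one exercise per month
HOURS_PER_DAY = 7              # length of a working day
WORKING_DAYS_PER_YEAR = 218    # a full-time year


def yearly_cost_in_days(hours_per_exercise: float,
                        exercises_per_year: int = EXERCISES_PER_YEAR,
                        hours_per_day: float = HOURS_PER_DAY) -> float:
    """Working days spent on exercises over one year."""
    return hours_per_exercise * exercises_per_year / hours_per_day


def share_of_full_time(days: float,
                       working_days_per_year: int = WORKING_DAYS_PER_YEAR) -> float:
    """Fraction of a full-time year represented by a number of working days."""
    return days / working_days_per_year


if __name__ == "__main__":
    days = yearly_cost_in_days(HOURS_PER_EXERCISE)
    print(f"Extrapolated cost: {days:.2f} days per year "
          f"({share_of_full_time(days):.2%} of a full-time year)")
    print(f"Upper bound of 2 days: {share_of_full_time(2):.2%} of a full-time year")
```

With these figures, the extrapolation gives roughly 0.75 days a year, comfortably below the 2-day bound, which itself represents about 0.9% of 218 working days.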

JillWellington @ pixabay

In exchange for these few hours, we are much more relaxed about our ability to resist and survive a big blackout.