One of WellData’s channel partners called late on a Thursday afternoon with a problem one of their clients was having with a database. From the description, it sounded as though a simple access control had gone awry.
The first problem was gaining access, which, as always, proved difficult. Quite correctly, their network was well secured against outside access, so it took some time and negotiation to get onto it, only to be thwarted again at the databases themselves. Eventually access to those was granted too.
During this period more users were hitting similar problems and progressively being locked out of the system. As one can imagine, tempers were starting to fray and news of the problem was escalating upward through the company. Each stage of escalation demanded a status report, which WellData furnished, explaining the problem as we then understood it, its solution, and how far we, and their own IT department, had got in implementing that solution.
Once we were into the databases it became abundantly clear that this was not a minor problem: the entire database had locked, and no access was permitted to any user, privileged users included. This was a major systems failure.
The first action was to determine why the system had locked. This was achieved by getting into the database at the very lowest level, where corruption was found on a number of data files. The affected files were identified and a plan of action developed. One of them held only indexes and could therefore be discarded and rebuilt from scratch after the recovery; the remainder held data and would have to be recovered. The only recovery possible was to restore from backups and perform a point-in-time recovery: restore the main baseline backup, apply any differential backups, and then apply the transaction logs from that point up to the latest valid transaction.
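For the curious, the sequence looks roughly like this. The article never names the database engine, so the sketch below assumes SQL Server driven from Python via pyodbc; the database name, file paths and STOPAT timestamp are all invented for illustration.

```python
# Point-in-time recovery sketch. SQL Server is assumed (the article
# never names the engine); database name, paths and the STOPAT time
# are invented.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbserver;"
    "Trusted_Connection=yes;",
    autocommit=True)  # RESTORE cannot run inside a transaction
cur = conn.cursor()

def run(sql):
    # RESTORE reports progress as messages rather than rows; draining
    # the result sets ensures each step finishes before the next begins.
    cur.execute(sql)
    while cur.nextset():
        pass

# 1. Restore the baseline (full) backup, leaving the database in a
#    restoring state so further backups can be layered on top.
run("RESTORE DATABASE Sales FROM DISK = 'D:\\bak\\sales_full.bak' "
    "WITH NORECOVERY, REPLACE")

# 2. Apply the most recent differential taken after that full backup.
run("RESTORE DATABASE Sales FROM DISK = 'D:\\bak\\sales_diff.bak' "
    "WITH NORECOVERY")

# 3. Apply each transaction log in sequence...
run("RESTORE LOG Sales FROM DISK = 'D:\\bak\\sales_log_001.trn' "
    "WITH NORECOVERY")

# 4. ...stopping at the last known-good moment on the final log and
#    bringing the database online.
run("RESTORE LOG Sales FROM DISK = 'D:\\bak\\sales_log_002.trn' "
    "WITH STOPAT = '2014-03-06T17:45:00', RECOVERY")
```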
This led to the next problem: this was not one of our clients, so we knew nothing about their backup regime, where the backups were stored, or how long they would take to retrieve. Our DBAs were put in touch with their server support staff to determine where and how we could get these backups; they in turn informed us that the backups were kept locally and could be made available immediately. Perfect. Except that when we received them they turned out to be simple server snapshots; that is to say, no effort had been made to put the database into a consistent state, what DBAs call ‘hot backup mode’, before the snapshot was taken. This meant there was a very good chance the backups would not be usable.
At this point we informed the client that this was going to be an all-nighter and that we would need their IT staff available to provide various pieces of information, and the backups, as required!
The first act was to back up what we presently had, corrupt or not, to ensure we could not make matters worse. As anticipated, the first baseline backups proved to be inconsistent, and we had to work our way back in time until a baseline that could be recovered was found. This turned out to be one taken over a Bank Holiday weekend, when we assume the database was quiet, all logs had been fully flushed to disk, and the previous day’s background jobs had completed. This was lucky; there is no reason why backups taken this way should be recoverable at all.
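A minimal sketch of that walk back in time, under the same SQL Server assumption and treating each candidate as a native backup file for simplicity (the real snapshots would first have had to be presented to the server); every file name here is hypothetical.

```python
# Search for a recoverable baseline: attempt a trial restore of each
# candidate, newest first, until one succeeds. SQL Server assumed;
# all file names are hypothetical.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbserver;"
    "Trusted_Connection=yes;",
    autocommit=True)
cur = conn.cursor()

candidates = [          # newest first; the Bank Holiday one is last
    r"D:\bak\sales_full_wed.bak",
    r"D:\bak\sales_full_tue.bak",
    r"D:\bak\sales_full_bankholiday.bak",
]

baseline = None
for bak in candidates:
    try:
        cur.execute(f"RESTORE DATABASE Sales FROM DISK = '{bak}' "
                    "WITH NORECOVERY, REPLACE")
        while cur.nextset():
            pass
        baseline = bak
        break           # first baseline that restores cleanly wins
    except pyodbc.Error as exc:
        print(f"{bak}: restore failed ({exc})")

if baseline is None:
    raise SystemExit("no recoverable baseline found")
print("recovering forward from", baseline)
```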
We now had the uphill task of applying the differential backups and a huge number of transaction logs. Overnight a pattern formed: the client’s IT department found and fetched transaction log backups, whilst WellData applied and validated them.
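The overnight loop might be sketched like this, again assuming SQL Server: each newly fetched log is checked against the chain with RESTORE HEADERONLY before being applied. The drop directory and the file-naming scheme are invented.

```python
# Sketch of the overnight fetch/apply loop, assuming SQL Server.
# Logs land in a drop directory; the naming scheme (with an embedded
# sequence number so a lexical sort gives chain order) is invented.
import glob
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=dbserver;"
    "Trusted_Connection=yes;",
    autocommit=True)
cur = conn.cursor()

prev_last_lsn = None
for log in sorted(glob.glob(r"D:\logs\sales_log_*.trn")):
    # Validate before applying: an unbroken log chain means each file
    # starts at the LSN where the previous one ended.
    cur.execute(f"RESTORE HEADERONLY FROM DISK = '{log}'")
    header = cur.fetchone()
    if prev_last_lsn is not None and header.FirstLSN != prev_last_lsn:
        raise RuntimeError(f"log chain broken at {log}")
    prev_last_lsn = header.LastLSN

    cur.execute(f"RESTORE LOG Sales FROM DISK = '{log}' WITH NORECOVERY")
    while cur.nextset():
        pass

# Once the last valid log is on, bring the database back online.
cur.execute("RESTORE DATABASE Sales WITH RECOVERY")
while cur.nextset():
    pass
```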
The job was finally completed at 7:30 the next morning!
Unfortunately that was not the end: although the system had been tested and was working, both WellData’s DBAs and the client’s IT department had to wait until the users arrived in the morning to be sure there were no final glitches. And there was one: we had forgotten to run the overnight job!
The overnight job was run, with all the server power we could muster, and the users were allowed back on. This time there were no glitches.
WellData now supports this client’s databases; all failures in the support structure have been corrected and regular backup tests are undertaken to ensure the next equipment failure does not lead to another panicky all-nighter.