Sunday, July 27, 2008

Integrated Backup Plan

So... last night a couple of our application servers crashed. It was apparently due to some newly released code, although I'm not yet sure. Here is the summarized version of what transpired:

1. Sysadmin got paged that the application servers were not responding.

2. Sysadmin called the DBA team to check the database connections... We said everything looked okay on our end.

3. Sysadmin called the developers next and they found out that the new code caused the system to go haywire.

4. When the developers tried to revert to the old code, it turned out that some of them had not checked in the older version, which left a couple of database privileges missing. Some needed tables were missing as well.

5. From then on we were really in trouble, and suffice it to say it was a very, very long night.

The lesson: it may be worth testing an integrated backup plan once in a while. We always think of doing database backups, but the database is just one component of the whole system. Anyway, that is what I was thinking about during the incident, along with the number of hours I lost because of it.
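One small piece of such a test could be an automated check that the tables the application needs are still present after a rollback or restore. The sketch below is only an illustration: it uses SQLite as a stand-in because it ships with Python, and the function name and table names are hypothetical. On a real database server you would query that system's own catalog views, and you would check privileges as well, which SQLite does not have.

```python
import sqlite3


def missing_tables(conn, required):
    """Return the names in `required` that do not exist as tables, sorted."""
    # sqlite_master is SQLite's catalog; other databases expose their own views.
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'"
    ).fetchall()
    present = {row[0] for row in rows}
    return sorted(set(required) - present)


if __name__ == "__main__":
    # Hypothetical example: one table survived the rollback, one did not.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    print(missing_tables(conn, ["orders", "customers"]))
```

Running a check like this as part of every rollback drill would flag a missing table before the application servers are pointed back at the database.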


Mark Robson said...

Perhaps your developers' QA procedure and your release process need some work.

At the very least, your sysadmin team should be aware of (if not involved in) the release of any new build of any code.

Moreover, rollback plans should be documented AND tested.

Although it's difficult to know if a new release will cause problems in a production system (regardless of the amount of testing done), there are a number of relatively simple measures which can be taken to try to minimise risk:

- Allow new features to be operationally disabled at runtime - thus removing the need for a rollback
- Do a staged deploy - deploy to some server(s) first if possible, and watch their behaviour (if a subset of servers get broken, it should have less impact, right?)
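As a rough illustration of both measures, here is a minimal sketch of a runtime feature switch that can also limit a feature to a percentage of servers. Everything in it is an assumption for the example: the class name, the flag name and the server IDs are made up, and the flags live in memory, whereas a real setup would read them from a config file or database so that operators can change them without a release.

```python
import hashlib


class FeatureFlags:
    """Hypothetical in-memory flag store with a per-flag rollout percentage."""

    def __init__(self):
        self._flags = {}

    def set(self, name, enabled, percent=100):
        # percent is the share of servers (0..100) that should see the feature.
        self._flags[name] = (enabled, percent)

    def is_enabled(self, name, server_id):
        # Unknown flags default to off, so new code stays dark until switched on.
        enabled, percent = self._flags.get(name, (False, 0))
        if not enabled:
            return False
        # Hash the flag name and server ID into a stable bucket from 0 to 99,
        # so a given server gets the same answer on every call.
        digest = hashlib.sha256(f"{name}:{server_id}".encode()).hexdigest()
        return int(digest, 16) % 100 < percent


if __name__ == "__main__":
    flags = FeatureFlags()
    flags.set("new_report", True, percent=25)  # staged: roughly a quarter of servers
    for server in ["app01", "app02", "app03", "app04"]:
        print(server, flags.is_enabled("new_report", server))
    flags.set("new_report", False)  # operational kill switch, no rollback needed
```

The application code would wrap the new behaviour in a check such as `if flags.is_enabled("new_report", server_id):` and fall back to the old path otherwise.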

