I had the misfortune to be on the third week of my job as the CTO of a major physician practice when our only data center was wiped out. Our fire suppression vendor had brought in a trainee who accidentally hit the button without disabling the live system. The system itself was designed for warehouses, not data centers, and released a fine powder that mixed with the water vapor in the room to form caustic cement on our equipment. I opened up one server and every bit of copper had turned green. To my horror, the only copy of our disaster recovery procedure was stored on a file server in the data center, and the previous night's tapes had not been taken offsite.
I held a quick training session for everyone from temp help desk to IT director on how to clean servers. I ordered a bulk next-day shipment of servers and a replacement SAN, explained our situation to our vendors and asked for help, and called our sister institutions and did the same. I arranged for a professional disaster recovery cleaning service to come in and clean the room, and talked to executive management, giving them the real-world recovery scenario and timelines so they could plan for their groups.
Four major events expedited our recovery. Dell (they really came through for us) sent us pretty much every spare part they had in their Texas depots and a couple of techs to help with the recovery, even though our service agreements didn't cover it. A sister institution was able to loan us a core switch. We were able to use Acronis to take images of servers as we brought them online and restore them to different (clean) hardware (and then use that server for parts on the next). The IT department, from top to bottom, really pulled together and worked for three days with little sleep to get things online.
We were able to bring all revenue-cycle systems online within 48 hours and were completely back online within 72 hours.
Lessons learned:
-Know your data center. Everything: power consumption, heat load, air conditioning, fire suppression systems, UPS, wet/dry pipes, condition of the roof. Don't just let facilities, your architect, or your engineer tell you what you need. Check and make sure these components really meet your needs.
-Disaster recovery procedures should be stored and kept up to date in multiple safe locations.
-Backups and the procedures around them (such as taking tapes offsite) need to be audited to ensure they are restorable and performed consistently.
-Most equipment service agreements (even platinum) do not cover acts of God. Insurance does, but you will not get an immediate payout. Make sure you have enough capital sitting around to make purchases in a disaster scenario.
-Core equipment (big iron) is generally not available for retail purchase on a next-day basis. Your service agreement doesn't cover acts of God, so you won't be able to get it from the depot. If you absolutely cannot survive without a piece of equipment, buy two and store one offsite.
-Getting an exact duplicate (down to the component level) of a commodity server is generally not possible. Invest in a product (like Acronis) that can restore a backup to dissimilar hardware.
-Make sure your IT disaster recovery plan is mirrored by an organizational disaster recovery plan. The business should have documented communication, employee placement and documentation methods for unplanned downtime. They should also have a procedure for getting up to date when IT systems become available.
-If you receive spare parts out of the kindness of a vendor's heart, document which ones you use and store the rest in a way that lets you return what you don't use.
-Know the financial impact of downtime on your organization. It was very easy to justify building (and successfully testing) our DR hot site for our critical systems when we did our post-mortem and realized we had lost slightly over half a million dollars in revenue.
-Keep your head. In a disaster it sometimes makes sense to take risks; just make sure they are extremely well-calculated ones.
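The backup-audit lesson above is the kind of thing that is easy to automate. As a minimal sketch (not what we actually ran, and assuming your backups land as files alongside a manifest of SHA-256 checksums), a script like this can flag missing or silently corrupted backup files on a schedule; only a periodic full test restore proves they are truly restorable:

```python
import hashlib
from pathlib import Path


def sha256(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()


def audit_backups(backup_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Compare each backup file against its recorded checksum.

    `manifest` maps file names to expected SHA-256 digests.
    Returns a list of problems: missing files or checksum mismatches.
    An empty list means the files are intact on disk -- it does NOT
    replace doing an actual test restore.
    """
    problems = []
    for name, expected in manifest.items():
        path = backup_dir / name
        if not path.exists():
            problems.append(f"MISSING: {name}")
        elif sha256(path) != expected:
            problems.append(f"CORRUPT: {name}")
    return problems
```

Running something like this nightly, and alerting when the returned list is non-empty, catches the "tapes never made it offsite" class of failure before a disaster does.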