Missing terabyte

27/04/2011

We are back up and running after a five-hour outage, but we are on crutches for now.

All user and stock images are sitting on a drive that AWS is trying to make available to us again. In other words, we don’t have them right now.

How it affects you

Most of the user and printer images uploaded before the outage are missing. Their thumbnails and small previews are still there, but the originals are on the missing drive.

If someone selects an image with a missing original, it will come up blank in the preview and in the order file. There should be no errors generated.

This is confusing to users and is likely to result in a drop in orders.
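For integrators who want to picture the blank-preview behaviour described above, here is a minimal sketch in Python. It is not our production code: the names `PREVIEWS`, `ORIGINALS`, `BLANK`, `load_original` and `render_for_order` are hypothetical, and the in-memory dictionaries only stand in for the real storage.

```python
from typing import Dict, Optional

# Hypothetical in-memory stand-ins for the two storage locations.
PREVIEWS: Dict[str, bytes] = {"img-1": b"small-preview-bytes"}  # thumbnails survived
ORIGINALS: Dict[str, bytes] = {}  # originals sit on the missing drive

BLANK = b""  # assumed placeholder used when the original is unavailable


def load_original(image_id: str) -> Optional[bytes]:
    """Return the original image bytes, or None if the original is unavailable."""
    return ORIGINALS.get(image_id)


def render_for_order(image_id: str) -> bytes:
    """Return the original if present; otherwise a blank image, without raising."""
    original = load_original(image_id)
    if original is None:
        # No error is raised: the caller simply receives a blank image.
        return BLANK
    return original
```

In this sketch an image can still be listed and selected because its preview exists, yet the order file receives a blank, which is why the result looks confusing rather than broken.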

What to do about it

Nothing at this stage. We are waiting for AWS to give us our data back. They only started working on it an hour ago. There must be a long queue of similar requests. :(

Why did it happen?

AWS started having issues with storage on Thursday. We noticed that some customer websites went down. A few hours later we started seeing the same problems on our servers. We held it together through the outage, losing instances and disks but still keeping 100% of the data available and the sites running.

We tried to back up the data during the outage, but it simply didn’t work. AWS systems were too overloaded for this. We decided to play nice and didn’t persist.

Today AWS officially declared the outage over. The next moment, the last surviving disk, holding 1TB of user images, started getting errors. Thirty minutes later we lost some more critical data and had to stop everything to maintain data consistency.

How it was supposed to work

It just wasn’t supposed to fail all at once. Amazon has a motto “Everything fails”. They should add “ at once”. This is what happened.

Our data is replicated across multiple drives on independent servers running in different zones. If one goes down, the data is still available on another server. This strategy failed miserably in the current event, with everything down at the same time.
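The idea behind the replication, and the way it breaks, can be sketched in a few lines of Python. This is an illustration under assumptions, not our actual setup: the `Replica` class, the zone names and `read_from_any` are hypothetical.

```python
from typing import Dict, List, Optional


class Replica:
    """Hypothetical copy of the image store living in one zone."""

    def __init__(self, zone: str, data: Dict[str, bytes], online: bool = True):
        self.zone = zone
        self.data = data
        self.online = online

    def read(self, key: str) -> bytes:
        if not self.online:
            raise ConnectionError(f"replica in {self.zone} is unavailable")
        return self.data[key]


def read_from_any(replicas: List[Replica], key: str) -> Optional[bytes]:
    """Try each replica in turn; return None only if every replica fails."""
    for replica in replicas:
        try:
            return replica.read(key)
        except (ConnectionError, KeyError):
            continue  # this copy is down or incomplete, try the next one
    return None
```

With one replica offline, `read_from_any` still returns the data from another zone. With every replica offline at once it returns `None`, which is the case the design never planned for.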

Conclusion

Jumping from AWS to, say, Rackspace is not going to make much difference. They may have the same extensive failure. Anyone can. We will be looking at replicating the data and keeping hot backup systems on different platforms: if AWS is down we run off Rackspace, and vice versa. It is a major undertaking and a very costly one. Let the dust settle first.

We will update you on the situation with user and printer images in the coming hours.
