Reliability lessons

27/04/2011

This outage was bad. Bad on Amazon’s part, but also bad on ours. In the end it’s our fault: assume nothing, trust no one. We trusted AWS’s promise to keep zones separate and the core services running. Lesson learned. Here are a few ideas we want to implement in the near future.

Missing terabyte

Option 1

There is still hope we’ll get our missing terabyte of data back and can quickly restore all missing images. Unfortunately Amazon is less than responsive at the moment, but they have promised to look into it. That would be the best outcome.

Option 2

We may need to ask you to re-upload your images, but this would be done as a bulk upload via FTP. You would simply upload everything you think may be relevant. Our software would crunch the images and match them against the database, so there is no need for you to reassign them manually. Still, an unnecessary headache for you.
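To give a rough idea of what "match them against the database" could look like, here is a minimal Python sketch. It is an illustration only, not the actual matching software: the record fields (`id`, `filename`, `md5`) and the function name `match_uploads` are assumptions made for this example. It tries a content checksum first and falls back to the file name only when exactly one uploaded file carries that name.

```python
import hashlib
from collections import defaultdict
from pathlib import PurePosixPath


def match_uploads(missing_records, uploaded_files):
    """Match re-uploaded files to database records of missing images.

    missing_records: list of dicts with 'id', 'filename' and optionally 'md5'
                     (hypothetical schema, assumed for this sketch).
    uploaded_files:  dict mapping the uploaded path to the file's bytes.
    Returns (matches, unmatched_ids).
    """
    by_md5 = {}
    by_name = defaultdict(list)
    for path, content in uploaded_files.items():
        # Index every upload by checksum and by lower-cased base name.
        by_md5.setdefault(hashlib.md5(content).hexdigest(), path)
        by_name[PurePosixPath(path).name.lower()].append(path)

    matches, unmatched = {}, []
    for rec in missing_records:
        # A checksum match survives renaming, so it is tried first.
        path = by_md5.get(rec.get("md5"))
        if path is None:
            candidates = by_name.get(rec["filename"].lower(), [])
            # Accept a name match only when it is unambiguous.
            path = candidates[0] if len(candidates) == 1 else None
        if path is None:
            unmatched.append(rec["id"])
        else:
            matches[rec["id"]] = path
    return matches, unmatched
```

Records left in the unmatched list are the ones that would still show up as missing.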

We will remove missing images from the database after some time, or sooner at your request, so that they stop appearing altogether.

There will also be a special web page to see what images are missing.

Data storage reliability

Backups

We don’t believe in them. A backup is only as good as its last restore. There is always data loss with backups anyway.

Replication

All printer data will be mirrored to a different type of storage (EBS/S3) within Amazon at no additional cost to customers. It will effectively double our costs, though. A hit we are willing to take.
We will also offer the option of replicating all data to a different provider, such as Rackspace, at additional cost.
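As a rough illustration of the mirroring idea, the Python sketch below writes every object to two independent stores and can report where they have drifted apart. It assumes nothing about the real setup: the class name `MirroredStore` is made up for this example, and plain dictionaries stand in for the two storage back ends.

```python
import hashlib


def _digest(data):
    """Return a SHA-256 hex digest, or None when the object is absent."""
    return None if data is None else hashlib.sha256(data).hexdigest()


class MirroredStore:
    """Write to two stores; read from the mirror if the primary lost the key.

    'primary' and 'mirror' are dict-like stand-ins for two different
    storage types (hypothetical, for illustration only).
    """

    def __init__(self, primary, mirror):
        self.primary = primary
        self.mirror = mirror

    def put(self, key, data):
        # Every write goes to both stores.
        self.primary[key] = data
        self.mirror[key] = data

    def get(self, key):
        try:
            return self.primary[key]
        except KeyError:
            # Fall back to the mirror when the primary copy is gone.
            return self.mirror[key]

    def diverged_keys(self):
        """List keys that are missing or different in one of the stores."""
        keys = set(self.primary) | set(self.mirror)
        return sorted(
            k for k in keys
            if _digest(self.primary.get(k)) != _digest(self.mirror.get(k))
        )
```

Running a check like `diverged_keys()` regularly is what separates a mirror from a backup nobody has tried to restore.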

Hot mirrors

We will maintain a ready-to-go mirror of the software with a different provider, probably Rackspace, in case AWS fails again.

Moving so much data between two different systems at short notice, when everyone is probably trying to do the same, is not going to be feasible. Only those paying for off-site replication will be able to continue with minimal interruption. All the others will be able to continue, but the downtime will be longer, with some data loss.

No-w2p ordering

Our Magento web-to-print extension will allow a fall-back option of taking orders without a preview in case the system is down.
All the fields will be populated and the images uploaded, but the artwork will need to be prepared manually.

This is already in the development plan.
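The fall-back logic can be pictured with a short sketch. It is written in Python purely for illustration (the real extension runs inside Magento), and every name in it, including `PreviewUnavailable`, `place_order` and the `artwork` flag, is hypothetical.

```python
class PreviewUnavailable(Exception):
    """Raised by the preview service when it cannot be reached (assumed)."""


def place_order(order, render_preview):
    """Accept an order even when the preview service is down.

    order:          dict of form fields and uploaded image references.
    render_preview: callable that returns a preview or raises
                    PreviewUnavailable.
    """
    try:
        preview = render_preview(order)
    except PreviewUnavailable:
        # Keep the fields and images, flag the artwork for manual work.
        return {**order, "preview": None, "artwork": "manual"}
    return {**order, "preview": preview, "artwork": "automatic"}
```

The customer’s input is kept either way; only the artwork preparation changes.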

Faulty orders

We will waive fees for any faulty orders. Email us early in May with the list. We will not run the billing cycle until well into May.

Thank you for your tolerance and support

We had quite a few phone calls and emails. Everyone is frustrated, but no one freaked out or yelled at us. We’d better go start working on the solution again.
