Our servers have been running red-hot for a few days now due to an increase in traffic. Everyone’s catching up on the time lost during the festive break. That would be all good if we could cope with it. Turns out our very scalable architecture has a little weak link in the form of file replication and backups. We made big changes to it over the New Year, with a few hiccups and stumbles, but it’s all settled now.
The problem arose when we started scaling up the number of servers. The replication system simply started choking on cross-machine traffic and all the negotiation over which file goes where. So adding more servers doesn’t improve performance.
This is completely unacceptable and we are working on a fix, which may take a day or two to roll out. There will be brief outages, a few minutes long, as servers are brought down for updates.
It’s a big FAIL for us, no excuses here.