The Fubra Blog

Storage Improvements

Posted Monday 3rd May 2010 by Mark Sutton

It’s been a very busy bank holiday weekend at Fubra following an intermittent backplane failure on one of our storage arrays. With the help of our storage vendor, Coraid, we’ve managed to stabilise the failing unit as a temporary measure while we move all of the storage to our other array.

This is taking some time, however, as the failing array holds hundreds of gigabytes of production storage. We hope to complete the migration later today.

Platform Improvements

On Tuesday we expect to receive a replacement for the failed unit, at which point we will make some fundamental changes to our storage model so that any future failure of this type can be recovered from much more quickly. We will do this by replicating all data to a warm standby array that can be switched in immediately if an array goes bad.

Although there is still more planning to do, we expect that this measure will massively reduce the impact of any future storage array issues.
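
To illustrate the warm standby idea, here is a minimal Python sketch. It is not our actual implementation: the Array and ReplicatedStore names are hypothetical, the arrays are modelled as in-memory dictionaries, and replication is assumed to be synchronous purely for simplicity. It shows why recovery becomes fast: the standby already holds a copy of every block, so failing over is a role swap rather than a bulk copy.

```python
from dataclasses import dataclass, field


@dataclass
class Array:
    """Hypothetical stand-in for a storage array: a name plus its blocks."""
    name: str
    healthy: bool = True
    blocks: dict = field(default_factory=dict)


class ReplicatedStore:
    """Sends every write to the active array and copies it to the standby."""

    def __init__(self, active: Array, standby: Array):
        self.active = active
        self.standby = standby

    def write(self, block_id: int, data: bytes) -> None:
        if not self.active.healthy:
            self.fail_over()
        self.active.blocks[block_id] = data
        # Assumption: replication is synchronous. A real system might
        # replicate asynchronously and accept a small lag.
        if self.standby.healthy:
            self.standby.blocks[block_id] = data

    def read(self, block_id: int) -> bytes:
        if not self.active.healthy:
            self.fail_over()
        return self.active.blocks[block_id]

    def fail_over(self) -> None:
        """Promote the standby. No data is copied at this point."""
        if not self.standby.healthy:
            raise RuntimeError("no healthy array available")
        self.active, self.standby = self.standby, self.active
```

In this sketch, marking the active array as unhealthy and then calling read() returns the data from the promoted standby without any migration step, which is the behaviour we are aiming for.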

In addition to improving our storage array architecture, we are also making other improvements.

The most fundamental change is that we are finally eliminating all of our Lustre-backed storage and replacing all customer volumes with raw volumes hosted directly on the storage array itself. This has been planned for some time, and will reduce the complexity of our storage, leaving far fewer points of failure. The difference will be noticed by all customers previously hosted on our Lustre filesystems.

As well as improving stability, this will improve filesystem performance, as there will be fewer servers and network round-trips involved in the stack.

Improving Communication

Another area we will concentrate on is communication. We recognise that we need to do better at giving realistic resolution timescales, and over the coming days we will put in place an improved system that lets us communicate more effectively and respond faster during events such as this.

On this occasion, due to the scale of the outage, we had difficulty keeping everyone informed in a timely manner as we scrambled to get things fixed. In the first hours of a response it can be very difficult to estimate timescales, as it can take some time to get to the root cause. This is even more difficult when the issue lies in storage.

As the week unfolds we will make further progress announcements. In the meantime, if you experience any problems at all, please contact us via the usual channels and we will rectify the problem as quickly as possible.

I’d like to offer our sincere apologies to all customers affected by these recent issues, and give our assurance that the measures we are taking will dramatically improve our hosting platform for the future.
