Keeping the systems running the Second Life infrastructure operating smoothly is no mean feat. Our monitoring infrastructure keeps an eye on our machines every second, and a team of people work around the clock to ensure that Second Life runs smoothly. We do our best to replace failing systems proactively and invisibly to Residents. Unfortunately, sometimes unexpected problems arise.
In late July, a hardware failure took down four of our latest-generation of simulator hosts. Initially, this was attributed to be a random failure, and the machine was sent off to our vendor for repair. In early October, a second failure took down another four machines. Two weeks later, another failure on another four hosts.
Each host lives inside a chassis along with three other hosts. These four hosts all share a common backplane that provides the hosts with power, networking and storage. The failures were traced to an overheating and subsequent failure of a component on these backplanes.
After exhaustive investigation with our vendor, the root cause of the failures turned out to be a hardware defect in a backplane component. We arranged an on-site visit by our vendor to locate, identify, and replace the affected backplanes. Members of our operations team have been working this week with our vendor in our datacentre to inspect every potentially affected system and replace the defective component to prevent any more failures.
The region restarts that some of you have experienced this week were an unfortunate side-effect of this critical maintenance work. We have done our best to keep these restarts to a minimum as we understand just how disruptive a region restart can be. The affected machines have been repaired, and returned to service and we are confident that no more failures of this type will occur in the future. Thank you all for your patience and understanding as we have proceeded through the extended maintenance window this week.