Grid Outage

by Linden on ‎04-29-2010 10:03 PM

It would seem inevitable that the minute I blog about a change in my focus from stability to performance, we would have a major grid outage.  As I'm sure many of you experienced, the grid was unavailable from just before 9pm PT until about 4am PT this morning.  The cause of the outage was the loss of our Phoenix, AZ data center.

This data center contained not only simulator hosts but also many key central services, including inventory databases and the databases that login processes depend on.  The reason we lost the Phoenix data center was a power outage which affected the floor where the Second Life servers are located.  The root cause of the power outage and the circumstances around it were such that we were impacted despite having triple redundant power systems in place.


I have spoken to our data center provider in Phoenix and let them, as well as their entire staff, know how much impact this outage had on our Residents.  At this moment, we continue to experience inventory database issues that may still be causing login problems for some Residents.  I sincerely apologize for the inconvenience this has caused Residents, merchants, content creators, and everyone who should expect the grid to be stable, reliable, and always up.

You can be assured that we are testing all servers, network elements, power supplies, and application processes to make sure that we have the grid up and in a stable condition as quickly as possible.  I will also continue to work to drive better engineering into the grid, so that any failure (even one this catastrophic) will not result in such an extended time to bring systems reliably back online.  We continue to make progress on this front but, as I have seen firsthand over the last 18 hours, we still have a long way to go.

Comments
by Honored Resident Cameron Milasevic on ‎04-29-2010 10:16 PM

Thank you for posting the explanation, that is interesting that there was such a large failure despite planning for an n+2 scenario.  Makes me wonder about our banks, stock exchanges, military defenses, and other critical server based systems!

by Honored Resident Drew Dwi on ‎04-29-2010 11:07 PM

Was it the datacenter that lost power? All the redundancy in the world is useless if your upstream isn't protected....

by Honored Resident Techwolf Lupindo on ‎04-29-2010 11:10 PM

The real question is how the UPS _AND_ backup generators failed. The only other time I can recall a failure of this type that wasn't due to a large act of God was the ev1 explosion and fire caused by a buried primary line shorting out. The UPS and backup generators kicked in, but were later shut off due to the fire department's lack of halon.

by Honored Resident Spencer Ordinary on ‎04-29-2010 11:26 PM

hehe that was ironic timing FJ.

/me survives the "Great Crash of April 2010" to rez another day

by Honored Resident Villain Baroque on ‎04-29-2010 11:28 PM

Thanks for the info. I am glad it was just an outage, since I feared it was a first sign of increasing instability of the grid.

And concerning the triple redundant systems, I would like to quote the German poet Goethe: "Grau ist alle Theorie und grün des Lebens goldner Baum." ("All theory is dull; life's golden tree is green.")

I am glad Second Life's Golden Tree survived this outage.

by Honored Resident Soy Nakamori on ‎04-30-2010 12:03 AM

How much of this downtime was the actual datacenter power outage?

My experience with datacenters (UK/NL/DE) is that I have never experienced one that was more than 30-45 minutes at most and that was because the generators didn't automatically kick in after the battery backup. This was a big deal too causing all the other datacenters of that company to test their systems for days after that event.

At least that's my own daily D/C experience for the last 10 or more years.

If the datacenter does not have emergency plans for a power outage that's more than 1 hour then I guess it's not meant for serious SLA. Not that 1 hour is acceptable but at least you can blame it on the engineers on duty that need to walk, call and take action after the alarm rings and they stop downloading movies or talking on Facebook ;-)

Soy

by Honored Resident Nekololi Woodget on ‎04-30-2010 12:31 AM

It's a testament to SL's current stability that this outage is such a big issue.  I remember when outages like this were rather common.  Now they are (almost) unheard of.  FJ and the other Lindens have done a great job making improvements over the past few years.

by Honored Resident spinster Voom on ‎04-30-2010 12:31 AM

Wouldn't it be better to host the website (including blogs and forums) somewhere different from the grid servers?

The grid status page was up but beyond this, there was no way for LL and residents to communicate with each other. A major outage like this is a time when blogs and forums can really come into their own, not only for people to share up-to-the-minute information, but as a place to socialise and pass the time on SL-related discussion until people can get inworld once more. It could even have been used as an opportunity for a greater Linden presence on the forums for a few hours.

ETA: and, of course, people could have continued spending on XStreet, picking out their Linden Homes, browsing the land auctions ...

by Helper on ‎04-30-2010 01:09 AM

It's a testament to SL's current stability that this outage is such a big issue

QFT Nekololi.  And no more Wednesday maintenance, well done on the progress LL.

Ironically enough I was away for the incident and happened to be chatting to an old friend who told a similar story of triple-failure at a datacentre for UBS years ago - a bank, just as Cameron is concerned about.  That my friend resigned from his facilities-management position a couple of days later was, he assured me, purely coincidental :-)

by Honored Resident Mitzy Shino on ‎04-30-2010 01:13 AM

Thanks FJ (and team),

I figure you guys will be hard at it, so this short note on what happened is appreciated heaps.

As they say, crap happens sometimes.

by Recognized Member Vick Forcella on ‎04-30-2010 01:18 AM

Inspiring

Once had a job where the data center was in Miami. Bad place to have a data center, with each and e-v-e-ry hurricane the whole company broke down. Took about ten of 'em before management realised it could be a good idea to move it...

by Honored Resident Net Antwerp on ‎04-30-2010 01:59 AM
I have spoken to our datacenter provider in Phoenix and let them know, as well as their entire staff, how much impact this outage had

Oh, Please. The Datacenter should treat all its customers equally. I'm assuming there's been some sort of intimidation (or perhaps a blank threat or two) with the datacenter, and that's wrong.

Just goes to show how Linden Research is starting to demand that everyone bow down to Linden. That's never going to happen.

everyone who [] expect the grid to be  always up

100% uptime isn't guaranteed in the Terms of Service (ToS). Now that you have mentioned 'always up', why don't Linden Research amend their ToS to include the 'always-on' clause?

Then again, Second Life has become a dull, boring and severely slipping 'business' platform where people are busy making pocket money rather than a 'world' containing the individual's imaginations.

by Honored Resident Rails Bailey on ‎04-30-2010 05:00 AM

Yet another reason why companies in the know, that used to outsource services, brought them back into their control.

One wonders if Linden Lab will charge full tiers to land owners for the time that the grid was down ?

by Advisor Qie Niangao on ‎04-30-2010 05:22 AM
Thank you for posting the explanation, that is interesting that there was such a large failure despite planning for an n+2 scenario.  Makes me wonder about our banks, stock exchanges, military defenses, and other critical server based systems!

Mission critical systems for such domains (also including telecom) are always site-redundant, too, in case of events such as natural disasters or planes crashing into datacenters.

It's an expensive kind of redundancy to achieve, though, and especially challenging to get right for near-real-time systems.  I'd somehow assumed that the central SL services (especially login) were site-redundant, but I can see how hard it would be to quickly fail-over very high-volume data stores such as Assets and Inventory unless the network were massively over-provisioned.

by Member Lynda Klossovsky on ‎04-30-2010 06:29 AM

/me slaps her head and yells "Doh!"

Hey thanks for getting the grid back up so quick...

Anyone remember the big grid wide crash  we had about 12-18 months ago?

In Australia we didn't have SL for 3 days!

by New Resident Marc Southmoor on ‎04-30-2010 06:40 AM

Just been logged out of SL. Now when attempting to get back I have the following

"The inventory system is currently unavailable"

Are we seeing a return of yesterday's mega outage?

Anyone else having the same problems?

by New Resident Kurt Mortensen on ‎04-30-2010 06:42 AM

Yup, I can't login either.

by Member Tarius Auxarmes on ‎04-30-2010 06:43 AM

Wow, failure of triple redundant power systems? That is not a small thing. This is obviously either one of those really small chance events, or sabotage. Either way, I think a redesign of that power system may be in order. Of course, LL doesn't handle such things.

As for the downtime, everyone should feel lucky it didn't last that long.

by Honored Resident Ian Pahute on ‎04-30-2010 06:47 AM

Thank you for this info. I work full time in Second Life and yesterday's outage caused me huge problems (and I will lose the weekend playing catch up). So this post is appreciated.

by Member Lynda Klossovsky on ‎04-30-2010 06:53 AM

Hmmm.. You Are Right!..

I was in SL not so long ago and cannot log in either..

OH no I just realized my alt was sulking and saying something

about "just you f**%#*ing wait lynda"..

Maybe .well ?

by Resident Cisop Sixpence on ‎04-30-2010 06:53 AM

I'm getting the "The  inventory system is currently unavailable" message as well.

I suspect that this is due to the maintenance activity that is documented on the Status Grid.  Says it will last 2 hours and 30 minutes, so I hope in a few I'll be able to log back in.

by Helper on ‎04-30-2010 06:55 AM

Kudos to FJ Linden for a concise and factual statement of the situation.  As of my last login (around 0430 SL) things were still not back to "normal"...I had to login twice, and even when I was successful, the login process took a long time to complete.  I hope things are fully resolved by now, or if not, then soon.

SL has data centers in several locations, and is planning more (an overseas one).  It would seem to me, a semi-techie, that a service that this many people depend on (some of them for their livelihoods!) should have more redundancy.  Can not the "key components" that went down with the Phoenix center be duplicated at the other locations, as Qie suggests?

by Honored Resident Divine Tokyoska on ‎04-30-2010 07:06 AM

YES I remember the disastrous outage a while back........it was dreadful.....

I was forced to go out there....that bright ball in the sky burned my eyes..tried to change my environmental settings but I couldn't find the Advanced Sky option.......had to actually DRIVE places instead of poofing there, wtf is up with that?!......worst weekend of my life.........

This one wasn't so bad ^^

by Member Jack Abraham on ‎04-30-2010 07:08 AM

Net Antwerp said:

Oh, Please. The Datacenter should treat all its customers equally. I'm assuming there's been some sort of intimidation (or perhaps a blank threat or two) with the datacenter, and that's wrong.

Oh, please.  All customers at the datacenter should be equally upset and making that known.  And a threat to find a datacenter provider who can manage to keep the lights on is entirely appropriate.  I've been on all sides of this kind of event and you better believe I'd be demanding to know what went wrong in Phoenix and what steps will be taken to properly test the failover in the future and, if I were the customer, invoking the SLA provisions of my contract.

FJ, thanks for keeping us informed.  Information is the anti-fear.

Sidebar Has All The Power!


by Member Jack Abraham on ‎04-30-2010 07:14 AM

Tarius Auxarmes said:

This is obviously either one of those really small chance events, or sabotage.

Or incompetence, or sheer organizational failure.  I recall from a prior job a guy who was responsible for testing a similarly critical system.  Roughly 2 times out of 3, it didn't fail over.  He reported it.  But no one had responsibility for maintaining it, so it went unfixed.  Thankfully, we never needed it while I was there.

That company's gone now.

(I didn't have the expertise to fix it, and was told by my bosses to stay out of it.)

Sidebar Is Infinitely Redundant!


by Member Deltango Vale on ‎04-30-2010 07:17 AM

Many thanks for the update. Much appreciated.

by Honored Resident Katarina Malthus on ‎04-30-2010 07:21 AM
I have spoken to our datacenter provider in Phoenix and let them know, as well as their entire staff, how much impact this outage had to our Residents.

You'll pardon me if I find it far more likely that the conversation went more along the lines of, 'your downtime cost us a lot of money, and is likely to get us sued, in which case, we will sue you. What do you plan to do to rectify this situation?' Which, is perfectly reasonable.

by Honored Resident Katarina Malthus on ‎04-30-2010 07:24 AM

Absolutely agreed. On a related note, I've been with Level 3 Communications for 10 years as my company's hosting provider and there has been 0 down time related to their systems. Food for thought.

by Honored Resident Zack84 Burton on ‎04-30-2010 08:38 AM

Well it's 24 hours later (at least for me).  The system was "repaired" enough for me to log in yesterday.  This was only for about 2 hours before SL logged me back out again and wouldn't let me return due to a problem with the "Inventory System" which is understandable due to the outage.  4 hours later and I still couldn't log in, however I was happy to see from the Grid Status page that Registrations had been re-opened.

The priority was to allow brand-new residents who had never been to SL before and bots to access the grid before a portion of the regular resident base?

6 hours later I was finally able to log back in.  Everyone I was working with had given up and left or had gone to sleep so it was a particularly unproductive experience.  I logged out, hopeful that the problem was behind me.

WRONG!  Either due to this "system maintenance" that was planned but then overran its timetable or for another reason that you've not publicly mentioned.  Regardless, 24hrs (or more) later and I still can't log in, thanks to the elusive "Inventory System".

by New Resident Vyper Juventa on ‎04-30-2010 09:32 AM

Although we are all suffering from this major outage, with our inventories continuing to disappear, we remain a part of SL. We still get involved with everything within our world of SL, and LL cannot imagine how grateful we are for their ongoing efforts to bring the grid back online for us. For that I am grateful.

by Recognized Resident rsd58 Congrejo on ‎04-30-2010 09:36 AM

Well, I was expecting a full blog post, so that's why I was so disappointed. So I have a suggestion: a system to prevent grid outages, like a backup power system connected to solar panels, or a secondary data center with a backup system there. That means the grid would have a double protection system; if one data center fails, the grid will remain online.

by Member Maggie Darwin on ‎04-30-2010 10:17 AM

I notice the fog machine came on with a reference to "the root cause was such that...".

What *was* the root cause, besides "such that"?

What was the single point-of-failure in a triple-redundant system? Just so I avoid it when designing data centers in the future.

by Member Elite Runner on ‎04-30-2010 10:36 AM

I can login and teleport from point A to point B without any issues. Yesterday was a different story.

I agree with Rsd58's idea and others. The data centres need an emergency plan in case of power outages, with back-up power like generators, wind turbines and solar panels, and a secondary data centre at another location.

EDIT: And if you were not able to log in over the last couple of hours, this was due to the server maintenance that affected logins, as posted on the Second Life Grid status reports; it was completed around 9:00am SLT (http://status.secondlifegrid.net/2010/04/29/post992/). About an hour after the maintenance, I can log in normally.

by Honored Resident Crap Mariner on ‎04-30-2010 11:07 AM

Rails -

According to the SLA in Billing Policy item 5, the magic number is 24 hours. Clock starts ticking when the service is 100% unavailable.

I'd interpret that as grid down, logins disabled.

FJ's post suggests a downtime of 7 hours (9p - 4a). And once the grid is back up, people can log on... even if in a crippled, no-inventory, no-transactions state.

When the grid's up, the clock's no longer ticking because the raw service is available.

However, you can make the argument that someone who still doesn't have access to their inventory after 24 hours really can't use the system.

I'd be curious to know if anyone actually makes that case, but I doubt it'll happen.

-ls/cm

by Member Linda Brynner on ‎04-30-2010 11:11 AM

What seriously concerns me is that this has happened almost every year since I arrived in SL in Dec. 2006. Despite constant statements from LL, I am still highly surprised that this kind of critical redundancy still isn't organized.

LL can state as much as they like, but I understand that much is out of their hands, since that is the nature of cloud computing overall. However, I would have thought that the chance of a critical outage like this would be better covered these days, but on that point not much has happened since 2006. A bit sad really.

Too much is still out of LL's hands... unfortunate.

by Helper on ‎04-30-2010 11:30 AM

Power outage makes me feel much better than if SL was being hacked. Yesterday's event should be reason enough to make an independent website for this blog. Thankfully the grid status website remained running. A better way needs to be found to help communicate any future issues to residents during a crisis. Who is in charge of SL's presence at Twitter? I went there and could not find any info. I looked at several popular Linden profiles and didn't see any info. How much time is between 4 am and 10:03 pm? When did this blog return to normal service? I am sure plenty of Lindens were very busy but plenty were not. Sometimes it is events just like this which best measure who contributes what. There is no I in team but there is an I in win. 

by Honored Resident Serendipity Seraph on ‎04-30-2010 12:12 PM

The mentioned central crucial assets apparently in  one location is a very problematic system design from a reliability POV.  Many IT departments guarantee that major web applications and systems are resilient even if a single datacenter disappears from the face of the earth.  They may hobble a bit but they don't go down.  Certainly not for seven hours with more than a little breakage and shakiness even now.  LL has some work to do to build a truly dependable system that approaches anywhere near current IT state of the art.

by Honored Resident Sindy Tsure on ‎04-30-2010 03:55 PM
the grid was unavailable from just before 9pm PT, until about 4am PT this morning 

7 hours, eh?

For over a year, my home in SL has probably averaged being 'unavailable' that much every week. Every single week.

Don't believe it, FJ? I'd be happy to TP you.. Feel free to let me know and I'll ask the concierge to restart it so you can see how painful SL is at busy places every single day, even on a freshly-booted sim.

/me expects the same answer she usually gets from her service provider: none.

by Member TriloByte Zanzibar on ‎04-30-2010 05:22 PM

Thanks for the update, but I have to voice my concerns anyway.  You may want to consider what you consider to be a "triple-redundant" power system.  While I'm sure there's some technology behind it, the solution you and your team had in place was no more effective than a college kid plugging a surge protector into a surge protector into a surge protector....

Thankfully the outage occurred on a weekday during off-hours.  Hopefully everything's fully restored and performing better than ever as we head into the weekend.

by Honored Resident Marcus Perry on ‎04-30-2010 07:10 PM

You know, just a little hint here: it would have helped IMMENSELY if at some early point in this, someone would have posted the most basic of information on the grid status update page. Just a line like : "There was a major power outage that affected our system" instead of reposting the same, useless bit of information over and over again.

by Member Toysoldier Thor on ‎04-30-2010 07:50 PM

Grid Outage

Posted by FJ Linden on Apr 30, 2010 12:03:50 AM

This data center not only contained simulator hosts, but also many key central services, including inventory databases and databases that login processes are dependent on. The reason we lost the Phoenix datacenter was a power outage which affected the floor where the Second Life servers are located.  The root cause of the power outage and the circumstances around it were such that we were impacted even despite having triple redundant power systems in place.

Thanks for the explanation FJ.

Sorry, but as much as I appreciate that this specific "root cause" to the serious outage LL experienced should not have happened because of all you mentioned with regard to the power being triple redundant for that data center, what you explained is that the LL Systems Architects / Engineers should be taken to the mat for a poor design not properly mitigating risk from an unlikely outage of some of your most critical servers.

As a Systems Architect who has developed enterprise IT systems deployments for many industries - including the finance industry like banks, I will tell you that someone at LL dropped the ball in either not performing a proper D.R. Risk Assessment/Mitigation assessment to properly assess how critical these servers are to your entire operation, OR, the Systems Architects did not develop a solution to protect these critical servers from the risk of their loss.

System and even Data Center environmentals redundancies do not protect against "Disasters".  So what you are telling me is that LL's most critical servers are not Disaster Recovery compliant?  In other words, in the unlikely event that your Phoenix DC were to be decommissioned for a lengthy period of time or indefinitely (i.e. a terrorist attack, tornado, earthquake, etc.), you are telling me your most important servers are not replicated nor operating at your other DC?  If these servers are as critical as they clearly have proven this week to be, they should be operating HOT in your other DC in a failover mode so that even a brief outage of the primary system in Phoenix would allow LL to cut over to your backup server in your other DC.

So although I sympathize with you for encountering a very unlikely event that hit you in your Phoenix facility, I have to be critical and say that it sounds like someone in LL needs to question your Architects how they missed a pretty basic solution component in their architecture - DR and Geographic Redundancy of your most critical servers.

I know I would be fired RL if I would have missed this in my solution architectures, but I guess most of the industries (including finance) I have provided solutions for have higher levels of mandatory uptimes and service protections than the entertainment / gaming industry - right??  I don't think so.

To the person that was questioning the banking systems vulnerabilities to this type of outage - believe me - they would not have had this exposure.

This is not a matter of criticizing your DC hoster.... this is a matter of criticizing your Systems Architects / Designers.  They are the ones that screwed up.

I hope LL has learned another valuable lesson and will now take action to protect THEIR most critical systems.

by Member Toysoldier Thor on ‎04-30-2010 07:52 PM

I totally agree.  Why is this outage being explained here in this obscure blog forum yet not even a word to the MASSES that are instructed to watch the main GRID STATUS Boards.  Marcus is completely right.

You know, just a little hint here: it would have helped IMMENSELY if at some early point in this, someone would have posted the most basic of information on the grid status update page. Just a line like : "There was a major power outage that affected our system" instead of reposting the same, useless bit of information over and over again.

by Honored Resident Dorientje Woller on ‎04-30-2010 08:35 PM

Wonderful, how the naysayers and criticasters are jumping on the bandwagon of negativity. Recently, speaking of last year, a competitor of LL called *censored* experienced a similar problem when the Vancouver power grid suffered a total loss due to a fire in the underground power station. With this little difference: that other 3D world acted as if nothing had happened. To this day, for each major failure they cause, no apology whatsoever passes their lips.

Yet we, as users, expect 24/7 uptime of the services from LL, while, even in the year 2010, we are forgetting one major point: in the end it's merely technology, which can break in a split second. Still, what surprises us is that a failure or power outage of one server can create such havoc. Have Linden Lab's engineers never been taught to back up the database to a system that is located elsewhere? Even the most ordinary PC user knows that this can, and one day will, save the day and a lot of frustration.

We can only hope that LL will direct their stability policy to the correct path.

by Advisor Qie Niangao on ‎05-01-2010 03:20 AM

I also mumbled about site redundancy in an earlier post, but to be fair we don't know that there wasn't a warm standby installation of all critical services at other locations.  One could imagine that the power outage was brief and once power was restored there was no reason to fail-over to the standby units.

I'm making all this up as I go along, mind you, but it could be that Operations had no experience with bringing this much stuff back to life all at the same time and encountered unexpected stability problems with central or network services as they tried to get so many region hosts back online at the same time.  It's usually impossible to test this at scale, and it can be very difficult to anticipate and engineer for all the demand applications will generate in exceptional conditions en masse.

This would be a learning opportunity for Operations.

It would be nice if all the shared services were distributed in normal operation, or hot standby with automatic failover--and either is possible, but at some expense in engineering, operations, hardware, and network capacity.  For an application like SecondLife, it's a business decision whether some slower disaster recovery approach is adequate--if it's a real disaster from which one is recovering.

by Member Maggie Darwin on ‎05-01-2010 05:27 AM
The reason we lost the Phoenix datacenter was a power outage which affected the floor where the Second Life servers are located.  The root cause of the power outage and the circumstances around it were such that we were impacted even despite having triple redundant power systems in place.

The heart of this blog post is right here.

FJ tells us he had "triple-redundant power systems" but that the "root cause"-- some single point-of-failure -- "was such that we were impacted". He doesn't disclose the root cause.

Neither FJ nor Wallace nor anybody else I've talked to is willing to say what that "root cause" was that clobbered them anyway, other than to gesture vaguely towards the colo operator and say "it's not for us to comment on".

Perhaps Linden Research as a client of the colo is subject to terms of service that prevent them from disclosing what happened here. Wouldn't *that* be ironic?

Personally, I hate fingerpointing at a supplier. I think Linden Research owes its customers a better explanation.

by Contributor Dilbert Dilweg on ‎05-01-2010 07:43 AM

Here is a video from the CEO of Codero Datacenter explaining their issues and shortcomings with the generators

http://www.datacenterknowledge.com/archives/2010/03/16/codero-addresses-lengthy-power-outage/

Totally out of Linden Lab's hands. The data center totally dropped the ball

by Honored Resident Celty Westwick on ‎05-01-2010 09:05 AM

Wrong Dilbert, if you look at the date on that article it was March 16th! In other words they had problems in March with the same issue, and obviously failed to fix it, since this was at the end of April. It is certainly not out of the hands of Second Life since they:

1. Design the degree of redundancy for the system.

2. Select the out-sourced provider for co-location servers, hopefully after vetting them thoroughly.

3. Have the responsibility to monitor these service vendors for quality and performance.

by Member Toysoldier Thor on ‎05-01-2010 10:20 AM
May 1, 2010 11:05 AM Celty Westwick  says in response to Dilbert Dilweg: 

Wrong Dilbert, if you look at the date on that article it was March 16th! In other words they had problems in March with the same issue, and obviously failed to fix it, since this was at the end of April. It is certainly not out of the hands of Second Life since they:

1. Design the degree of redundancy for the system.

2. Select the out-sourced provider for co-location servers, hopefully after vetting them thoroughly.

3. Have the responsibility to monitor these service vendors for quality and performance.

Dilbert might be wrong but he showed that there was already clear evidence that this DC used by LL had a risk of power issues.  It should have raised flags within the LL IT Team to ask internal questions

  1. What happens if this DC has a power failure again - how would it impact us?
  2. Are there any critical servers/services/devices in this DC (or any DC they operate out of) that would be at risk of an outage with a difficult recovery or failover if we were to lose this system for even a brief period of time - much less a longer period of time?
  3. Have we identified these systems and what measures must we take to ensure they will not be at risk - ie. they have a HOT or WARM failover.
  4. Could data in these systems be corrupted if the outage was sudden?  If so is this data recoverable IF it is critical information (i.e. financial transactions or customer inventory or whatever).

Seems LL even had a blessed shot across the bow in the past that didn't affect them regarding this DC, and it seems their IT group did not take the warning seriously to make sure they were protected if it happened to them.

Yes Celty, no matter what the technical issues are at this DC, LL should not be blaming the DC outsourcer.  The blame for the size and magnitude of this LL outage 100% rests at the feet of LL's Systems Design Team that allowed this DC to be a single point of failure in their systems and/or the Operations Team that did not properly test and/or execute on the failover/recovery that might have been in place.  Without knowing details of the LL IT operation, I can't say.

But I can say with 100% accuracy... the blame for the extent and magnitude of the outage is LL's, not the DC Hoster.

More scary is that LL didn't learn their lesson when there were warning signs... so I suspect they will not learn from this lesson, which brought down most of their systems and which they are still cleaning up after.  They will look at this as "it wasn't our fault - just bad luck and let's hope this doesn't happen again".

by Recognized Member Vick Forcella on ‎05-01-2010 12:30 PM

Not all data centers are created equal.

More than a statement, it's a fact.

by Member Snickers Snook on ‎05-01-2010 09:09 PM

My data center is chewy on the inside; crunchy and coated with chocolate outside.