Incident Report for Feb 7 Downtime

At 15:55NZT today we experienced a sustained outage of all Path of Exile services for 1 hour and 4 minutes, until 16:59NZT. While downtime can happen from time to time, extended downtime of this nature is not acceptable.

By 15:56NZT our server admin Thomas was alerted to the problem and began attempting to diagnose it. At that time all Path of Exile servers, including hot spares, in our server host's Dallas data centre were unreachable. This data centre is where Path of Exile's core infrastructure is hosted. We notified our server host about the problem.

At 16:19NZT we were notified that there had been a power event in one of the server rooms of that data centre. All servers and network infrastructure in that server room had lost power.

At 16:20NZT power was restored and equipment began to power up again.

At 16:51NZT all of the network infrastructure was back up and our servers were reachable again. Unfortunately, several of our servers did not come back up, and one of those was a primary database server. At this time we decided to initiate failover of this server to the hot spare.

By 16:57NZT we had prepared the necessary config changes to move to the hot spare. At this time we began starting the realm again.

At 16:59NZT the realm was back up and functioning again normally.
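
For readers who want to check the arithmetic, the intervals in the timeline above can be worked out with a short sketch like the one below. The times are taken from this post; the event labels and the helper name `minutes_between` are just shorthand chosen for illustration.

```python
from datetime import datetime

# Times (NZT) as stated in the timeline above.
# The labels are illustrative shorthand, not official terms.
EVENTS = {
    "outage begins": "15:55",
    "host reports power event": "16:19",
    "power restored": "16:20",
    "network back, failover decided": "16:51",
    "config changes ready": "16:57",
    "realm back up": "16:59",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM times on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


total = minutes_between(EVENTS["outage begins"], EVENTS["realm back up"])
waiting_on_network = minutes_between(
    EVENTS["power restored"], EVENTS["network back, failover decided"]
)
failover = minutes_between(
    EVENTS["network back, failover decided"], EVENTS["realm back up"]
)

print(f"total outage: {total} min")
print(f"power restored to network back: {waiting_on_network} min")
print(f"failover decision to realm up: {failover} min")
```

Most of the outage was spent waiting for power and network infrastructure to return; the failover itself, from decision to the realm being back up, accounts for the last few minutes.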

While this downtime was caused by a power incident outside our control, our ability to recover from such events is our responsibility. For an incident like this, we should require at most 5 minutes of downtime while we move to redundant infrastructure.

This incident is our fault because we did not take sufficient steps to isolate our redundant infrastructure. While we had hot spares, we didn't take care to ensure that those hot spares were actually powered by separate power systems.
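
As a rough illustration of the kind of check that would have caught this, here is a minimal sketch that compares a primary server and its hot spare and lists the failure domains they share. Everything in it is hypothetical: the `Server` fields, the server names, and the room and power feed labels are invented for the example and do not describe our actual inventory or tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Server:
    # All fields are hypothetical labels for illustration only.
    name: str
    data_centre: str
    room: str
    power_feed: str


def shared_failure_domains(primary: Server, spare: Server) -> list[str]:
    """Return the failure domains that a primary and its spare have in common.

    Room and power feed labels are only compared when both servers are in
    the same data centre, since those labels are local to a data centre.
    """
    shared = []
    if primary.data_centre == spare.data_centre:
        shared.append("data_centre")
        if primary.room == spare.room:
            shared.append("room")
        if primary.power_feed == spare.power_feed:
            shared.append("power_feed")
    return shared


# A pair placed like the one in this incident: same room, same power.
db_primary = Server("db-primary", "dallas", "room-a", "feed-1")
db_spare_same_room = Server("db-spare", "dallas", "room-a", "feed-1")

# A pair placed in a different location (hypothetical).
db_spare_elsewhere = Server("db-spare-2", "other-site", "room-a", "feed-1")

print(shared_failure_domains(db_primary, db_spare_same_room))
print(shared_failure_domains(db_primary, db_spare_elsewhere))
```

An empty list means a single power event in one room cannot take out both the primary and its spare; any non-empty result is a pairing worth reviewing.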

In addition, once our failover sequence began, it took longer than it should have to actually make the switch.

Over the next few days we will be taking steps to move redundant infrastructure to different locations so that it is completely isolated from a single point of failure. We will also be working on our failover procedures so that they can be performed faster and with fewer steps.

I'm personally very sorry. As the Technical Director of Grinding Gear Games it is my responsibility to ensure that our infrastructure is sufficiently redundant so that when disasters happen, we have the minimum possible disturbance to our users. This is not the level of service that you should expect from our company, and I can assure you that we will be making changes so that an incident like this will not happen again.
Path of Exile - Lead Programmer
Dayum yo fucking blizzard doesn't even do this.

Jonathan It was two hours it wasn't a big deal xD fuck 8 hours wouldn't be a big deal you guys LITERALLY have no down time 99% of the time that's good enough for me lol
Dys an sohm
Rohs an kyn
Sahl djahs afah
Mah morn narr
Last edited by Coconutdoggy on Feb 6, 2014 23:59:29
An event out of their control, yet the technical director himself personally apologises.

This is why you get my money GGG.
"Minions of your minions are your minion's minions, not your minions." - Mark
"
ciknay wrote:
An event out of their control, yet the technical director himself personally apologises.

This is why you get my money GGG.
Always a class act. Appreciate you taking the time to write down this post :)
I wish an hour outage at my company was responded to in such a manner!
IGN: Kulde
"
ciknay wrote:
An event out of their control, yet the technical director himself personally apologises.

This is why you get my money GGG.


Yep
This is why you have, and will continue to get my money.

Thank You.
"
ciknay wrote:
This is why you get my money GGG.
+1 mate. Stuff happens.
I love pie.
