These forums have been archived and are now read-only.

The new forums are live and can be found at https://forums.eveonline.com/

Dev Blog: Behind the scenes of a long EVE Online downtime!

Aeon Amadii
#41 - 2015-08-08 00:37:27 UTC
Thank you for writing this!

As someone just starting school for Computer Science, this was very exciting and enlightening. :D

(This character is the Eve version of Aeon Amadi since there is no cross-forum support)

Member of CPM 2

Cor'len
Doomheim
#42 - 2015-08-08 00:44:24 UTC
Vincent Athena wrote:
CCP Masterplan, I get that. Thanks for the reply. But why not keep going and do the startup? Did you have some "one way" DB changes with this update that would have taken extra effort to revert or restore? So much effort that, at any given time, it looked better to just keep trying to fix the issue rather than get de-railed trying to roll back?

Also, it looked like you found the temporary fix by experimenting on TQ, something you would not have been able to do if you had done the roll back.


I expect it was also a case of "We can't reproduce this reliably on our test servers, so we have to debug it in production". I think it was somewhat unclear whether the DB was modified in a way which would've prevented the rollback, and as Masterplan said, a DB restore takes time - I seem to recall a figure of multiple hours.


CCP: Thanks for fixing it, for the skillpoints, and also for the well-written report on the pizza. <3
Dunkov
Center for Advanced Studies
Gallente Federation
#43 - 2015-08-08 01:19:06 UTC
As a Splunk SME at my workplace, I'm very happy, proud and a bit enthralled that CCP uses my favorite big data tool! Huzzah CCP Splunk Ninjas!
MeagerMiner
#44 - 2015-08-08 01:23:50 UTC


Hell, even I could follow along. Good job!

Thanks for your continued dedication to EVE .........
Nevyn Auscent
Broke Sauce
#45 - 2015-08-08 01:34:50 UTC
10/10 Dev blog, would read again.
Great explanation of what happened and how you go about such procedures.
Jonathan Yatolila
BKRFLD
#46 - 2015-08-08 04:04:28 UTC
Cor'len wrote:
Vincent Athena wrote:
CCP Masterplan, I get that. Thanks for the reply. But why not keep going and do the startup? Did you have some "one way" DB changes with this update that would have taken extra effort to revert or restore? So much effort that, at any given time, it looked better to just keep trying to fix the issue rather than get de-railed trying to roll back?

Also, it looked like you found the temporary fix by experimenting on TQ, something you would not have been able to do if you had done the roll back.


I expect it was also a case of "We can't reproduce this reliably on our test servers, so we have to debug it in production". I think it was somewhat unclear whether the DB was modified in a way which would've prevented the rollback, and as Masterplan said, a DB restore takes time - I seem to recall a figure of multiple hours.


CCP: Thanks for fixing it, for the skillpoints, and also for the well-written report on the pizza. <3



As others have said - great job on the fix, and an even better huzzah on the report!!! From your write-up, the only way to fix it was to leave the system down and troubleshoot it on the "live" system, since it was working on the test servers and such. Kudos to all of you.
Beta Maoye
#47 - 2015-08-08 04:11:30 UTC
Feels like reading Sherlock Holmes. Nice job.
Raiz Nhell
State War Academy
Caldari State
#48 - 2015-08-08 05:02:20 UTC
Best Dev Blog ever :)

Great explanation... Great solution :)

Situations like that are a developer's worst nightmare... but also the biggest rush... working under the pump, brainstorming, fiddling and then the Eureka!!! moment...

Then the inevitable "So whose code was it?" discussion :)

There is no such thing as a fair fight...

If you're fighting fair you have automatically put yourself at a disadvantage.

Jenni Concarnadine
SYNDIC Unlimited
#49 - 2015-08-08 09:20:35 UTC
Thank you very much for this.

It offered a clean account of what must have been chaos and much gnashing of teeth at the time.

While I wouldn't give back my SP, this is worth as much.
Richard TheLordOfDance
Operation Fishbowl Inc.
#50 - 2015-08-08 09:42:36 UTC
Is it weird that this blog made me want to sit down and do some coding?

Extremely well written, almost like a short detective story complete with a cliffhanger at the end! :D
Flay Nardieu
#51 - 2015-08-08 12:27:51 UTC
The candor about the incident and the insight into how it was handled were much appreciated.
Snape Dieboldmotor
Minotaur Congress
#52 - 2015-08-08 13:34:19 UTC
Great read. THANKS
Hel O'Ween
Men On A Mission
#53 - 2015-08-08 13:34:44 UTC
Raiz Nhell wrote:
Best Dev Blog ever :)


I wouldn't say best (technical) blog ever; e.g. I remember a dev blog about TQ's hardware, which was also a very interesting read. But this one's definitely one of the most interesting blogs.

Thx for the write-up, CCP. Now looking forward to the resolution blog, once the investigation has revealed the culprit.

tl;dr

+1, would read again. :)

EVEWalletAware - an offline wallet manager.

Ezio di Firenze
Original Sinners
Pandemic Legion
#54 - 2015-08-08 14:39:07 UTC
+1 great article. It also piqued my interest: the wiki says that the database servers run on Microsoft Windows Server and SQL Server. What do the Sol layer servers run on? Is that also Windows, or Linux, or maybe a custom CCP OS?
What do you guys use for that grid computing orchestration? It sounds really awesome how you do that!
Stanislav Kolomnitcki
The Scope
Gallente Federation
#55 - 2015-08-08 14:53:14 UTC
Maybe you use the "print" function for debugging in "campaign_logger" and don't have a default "stdout"? =)
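
The guess in this post can be illustrated with a minimal Python 3 sketch. Everything here is hypothetical: the helper names, the logger name, and the use of a closed stream to stand in for an unusable stdout. Nothing is taken from CCP's code, and the dev blog does not confirm that this was the cause. It only shows why a stray debug print can raise in an environment where the output stream is not writable, while a call through the logging module does not touch stdout at all.

```python
import io
import logging


def debug_with_print(stream):
    """Hypothetical debug helper that writes straight to a stream via print()."""
    print("campaign tick", file=stream)


def debug_with_logging(logger):
    """Same message routed through the logging module instead of print()."""
    logger.debug("campaign tick")


# Assumption: a closed stream stands in for a process without a usable stdout.
unusable_stdout = io.StringIO()
unusable_stdout.close()

try:
    debug_with_print(unusable_stdout)
    print_failed = False
except ValueError:
    # Writing to a closed file-like object raises ValueError.
    print_failed = True

# The logger has only a NullHandler, so it never writes to any stream.
logger = logging.getLogger("campaign_logger_sketch")
logger.addHandler(logging.NullHandler())
logger.propagate = False
debug_with_logging(logger)
```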
Jessica Danikov
Network Danikov
#56 - 2015-08-08 15:38:22 UTC
Logs can be a pain. Without them, you can have an issue that occurs only on your production servers, with no clear indication of how to reproduce it; and if the logging isn't sufficient, sometimes all you can do is make speculative changes to the logging in the hope that the cause is better indicated the next release cycle around.

Worse still, logging can be a performance bottleneck: anything from multiple loggers writing to the same file, which normally requires some degree of synchronization, to loggers using reflection to give you nice, informative output at the cost of taking 100x longer per call. So in production you usually want to minimize logging to stop everything being so slow, at the cost of never knowing what's wrong (the logs show nothing!).

+1 for the article and another +1 for making more technical articles more of a habit (even if it means locking Devs up).
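
The trade-off described in post #56 can be sketched with Python's standard logging module. This is only an illustration under stated assumptions, not how TQ logs: the logger name, the `expensive_snapshot` helper and the in-memory sink are all made up for the example. It shows two common mitigations: handing records to a single writer thread through a queue, so callers do not contend on the file, and guarding costly message construction behind a level check.

```python
import io
import logging
import logging.handlers
import queue


def expensive_snapshot(calls):
    """Hypothetical costly introspection; records each call so we can count them."""
    calls.append(1)
    return "node state: ok"


# In-memory sink standing in for a shared log file.
stream = io.StringIO()
sink = logging.StreamHandler(stream)
sink.setFormatter(logging.Formatter("%(levelname)s %(message)s"))

# One listener thread owns the sink; loggers only enqueue records.
log_queue = queue.Queue(-1)
listener = logging.handlers.QueueListener(log_queue, sink)
listener.start()

logger = logging.getLogger("queued_logging_sketch")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(logging.handlers.QueueHandler(log_queue))

calls = []
# Guard: the expensive work is skipped entirely when DEBUG is disabled.
if logger.isEnabledFor(logging.DEBUG):
    logger.debug("%s", expensive_snapshot(calls))
logger.info("startup step %d finished", 3)

# stop() waits for the listener thread to drain the queue.
listener.stop()
output = stream.getvalue()
```

The cost of the guard is exactly the problem the post names: with DEBUG off, the snapshot is never taken, so it is also never in the logs when something goes wrong.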
CCP DeNormalized
C C P
C C P Alliance
#57 - 2015-08-08 17:26:16 UTC
Cor'len wrote:


I expect it was also a case of "We can't reproduce this reliably on our test servers, so we have to debug it in production". I think it was somewhat unclear whether the DB was modified in a way which would've prevented the rollback, and as Masterplan said, a DB restore takes time - I seem to recall a figure of multiple hours.


CCP: Thanks for fixing it, for the skillpoints, and also for the well-written report on the pizza. <3



We take full backups prior to each DT, so it would have been a full backup restored in no-recovery mode plus a few transaction log backups to bring us to just past DT.

Roughly 3 hours for the 3 TB+ restore.

Funnily enough, we can restore faster in our test env. due to having a massive pool of SAS disks (hundreds) on the SAN vs. the small pool of SSD disks that the TQ DB uses :)

CCP DeNormalized - Database Administrator
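
As a rough worked example of the figures in post #57, the sustained rate implied by "roughly 3 hours for the 3 TB+ restore" can be estimated as below. The function name is hypothetical, decimal units are assumed (1 TB = 10^12 bytes), and since the post says "3 TB+" and "roughly", the result is only a ballpark lower bound that ignores the transaction log replay.

```python
def implied_throughput_mb_per_s(size_tb, hours):
    """Average rate needed to move size_tb terabytes in the given number of hours."""
    total_bytes = size_tb * 10**12  # assumption: decimal terabytes
    seconds = hours * 3600
    return total_bytes / seconds / 10**6  # megabytes per second


# Figures from the post: about 3 TB in about 3 hours.
rate = implied_throughput_mb_per_s(3, 3)
print(round(rate, 1))  # prints 277.8
```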

Dradis Aulmais
Center for Advanced Studies
Gallente Federation
#58 - 2015-08-08 17:38:38 UTC  |  Edited by: Dradis Aulmais
Sounds like Ghost in the machine.

TQ is a very unique system. 12 years old, reborn several times. Code here, code there, its own little ecosystem. It's like the ultimate Capsuleer.

Dradis Aulmais, Federal Attorney Number 54896

Free The Scope Three

Soldarius
Dreddit
Test Alliance Please Ignore
#59 - 2015-08-08 17:47:40 UTC
"IBM"

Found your problem. /sarcasm I used to work at IBM. Won't repeat.

lel, seriously though. Very interesting write-up.

http://youtu.be/YVkUvmDQ3HY

KenFlorian
Jednota Inc
#60 - 2015-08-08 17:57:15 UTC  |  Edited by: KenFlorian
Marc Callan wrote:
Illuminating. But worryingly, I got the distinct impression that CCP figured out what was causing the problem but not why - unless the underlying cause of the logging issues has since been determined?



As a former software developer/IT guy, I can say this happens more often than most of us choose to publicly acknowledge. Hats off to CCP for telling us what happened as best they could sort it out. They, more than anybody, would like a perfectly coherent explanation... some of the time it's impossible.