EVE General Discussion

 
 

Fixing Node Crashes

Zeee Orlenard
Center for Advanced Studies
Gallente Federation
#1 - 2013-12-08 14:19:38 UTC
This post proposes one possible fix for mitigating the effect of node crashes on large battles: changing to a soft-restart mechanism, which would also counter certain fleet doctrines.

Fix for node crashes... Let's start off with the assumption that you can't prevent node crashes.
Preconditions:

  • Nodes queue up commands and attempt to execute them and may be overloaded.
  • Nodes do crash, causing these commands to disappear or not appear to be executed.
  • Emergency warp allows people to avoid logging back in after the node crashes, in order to save their ship (e.g. not another Titan save, please).


This being GD... how would the community react to the following subtle change to how node crashes are handled? (A toy sketch of the flow follows the list.)

  • When the node crashes, the underlying client stays connected to the EVE network.
  • The EVE proxy transparently reasserts the latest read-only snapshot of the universe, keeping chat running but not allowing any commands to be run.
  • The client is notified that the node is restarting, and all commands currently in the queue are flushed visually on the client so that modules do not appear to be running during the soft-restart.
  • The underlying node restarts and synchronizes without causing individual clients to crash.
  • The game resumes with a countdown timer that allows all parties to rejoin the fight at the same time, perhaps with a 1 to 5 minute broadcast in case anyone went AFK.
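
To make that concrete, here's a toy Python sketch of the flow. Every name in it (ToyNode, soft_restart, the state dict) is invented for illustration - this is armchair engineering, not a claim about how CCP's server is actually structured.

```python
import json

RESUME_COUNTDOWN = 120  # seconds; somewhere in the proposed 1-5 minute window


class ToyNode:
    """Stand-in for a solar-system simulation node."""

    def __init__(self):
        # The "important data": per-ship state, kept deliberately simple here.
        self.state = {"ships": {"Zeee": {"hp": 100, "locks": ["Titan-1"]}}}

    def snapshot(self):
        # Serialize out of Python-native objects, crash-dump style.
        return json.dumps(self.state)

    def restore(self, blob):
        self.state = json.loads(blob)


def soft_restart(node, notify):
    """The proposed flow: snapshot, freeze, restart, restore, fair countdown."""
    blob = node.snapshot()     # 1. capture state before the node goes down
    notify("Node restarting. Chat stays up; commands are frozen.")
    node.__init__()            # 2. stand-in for a real in-place process restart
    node.restore(blob)         # 3. reassert the pre-crash state
    notify("Fight resumes in %d seconds." % RESUME_COUNTDOWN)  # 4. fair resume


soft_restart(ToyNode(), print)
```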


You may ask yourself: if the node crashes, we might lose data about the state of the node at the moment of the crash, and it wouldn't be fair when the node came back up. I'd posit that since the servers are presumably using a RAM-SAN (and if they aren't, they should be), a portion of the node state can be written out like a crash dump before the node goes down, with priority over all other crash/exception handling mechanisms. This is just software, after all, although it may cost CCP 320 to 700 hours to implement properly if they need to spin someone up on how to do this.
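
Even stock Python exposes hooks in roughly this direction. A minimal sketch, with a made-up state dict and dump file name; a real implementation would write to the RAM-SAN rather than a local file, and would also have to catch native crashes below the VM:

```python
import json
import signal
import sys

# Made-up stand-in for the node state worth saving (player/ship data).
NODE_STATE = {"system": "XYZ-1", "ships_in_space": 2000}

DUMP_PATH = "node_crash_dump.json"  # in production: a path on the RAM-SAN


def dump_state(reason):
    # Keep the handler minimal: the less it does, the better its odds of
    # surviving whatever just killed the node.
    with open(DUMP_PATH, "w") as f:
        json.dump({"reason": reason, "state": NODE_STATE}, f)


def crash_hook(exc_type, exc, tb):
    dump_state(repr(exc))
    sys.__excepthook__(exc_type, exc, tb)  # then fall through as normal


sys.excepthook = crash_hook  # fires on any uncaught Python exception
signal.signal(signal.SIGTERM, lambda signum, frame: dump_state("SIGTERM"))
```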

The underlying change here is that the client should never crash due to load on the node, and a player should never be put in the emergency warp state or otherwise be allowed to save their ship if it was in battle during the log-off. Making aggression timers act this way and persist is a trivial database modification, even if it's a bit more complicated for the back-end code.
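
To illustrate the timer persistence, a toy sqlite sketch - the table and column names are invented, and CCP's actual database layer is obviously nothing like this:

```python
import sqlite3
import time

db = sqlite3.connect("timers.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS aggression "
    "(char_id INTEGER PRIMARY KEY, expires_at REAL)"
)


def set_aggression(char_id, duration=900):
    # Written through to disk, so a node crash can't clear the timer.
    db.execute(
        "INSERT OR REPLACE INTO aggression VALUES (?, ?)",
        (char_id, time.time() + duration),
    )
    db.commit()


def is_aggressed(char_id):
    row = db.execute(
        "SELECT expires_at FROM aggression WHERE char_id = ?", (char_id,)
    ).fetchone()
    return bool(row) and row[0] > time.time()


set_aggression(12345)
print(is_aggressed(12345))  # True, even if the node restarts in between
```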

Node recovery is then transparent without users crashing out or being disconnected, followed by many swift Titan kills.

Posting on GD for optimal tear extraction from certain players whose Titan-save doctrine depends on crashing the node.
Troll Maximus
Republic Military School
Minmatar Republic
#2 - 2013-12-08 14:24:17 UTC
'dis gonna be good.
Lukas Rox
Aideron Technologies
#3 - 2013-12-08 14:31:04 UTC
While I would not propose virtualization for the EVE cluster, using some of the virtualization mechanisms (such as transferring a running VM between physical hosts) would probably be a solution for this. It would also allow moving a fleet battle from one physical server to a reinforced one - save the state of the node on the "low capacity" host, copy it to a high capacity host and load the node there, with appropriate feedback for players. While it could mean quite a bit of coding and testing, it *seems* to be technically possible.

Proud developer of LMeve: Industry Contribution and Mass Production Tracker: https://github.com/roxlukas/lmeve | Blogging about EVE on http://pozniak.pl/wp/

Lors Dornick
Kallisti Industries
#4 - 2013-12-08 16:15:16 UTC
Lukas Rox wrote:
While I would not propose virtualization for the EVE cluster, using some of the virtualization mechanisms (such as transferring a running VM between physical hosts) would probably be a solution for this. It would also allow moving a fleet battle from one physical server to a reinforced one - save the state of the node on the "low capacity" host, copy it to a high capacity host and load the node there, with appropriate feedback for players. While it could mean quite a bit of coding and testing, it *seems* to be technically possible.

While this seems technically possible, the question is whether it's worth the coding effort and the overhead.

The main issue remains that there's a limit to how big a battle can be supported on current hardware.

Adding the ability to move a currently active node to a different server would add some flexibility and creature comfort, but it wouldn't help in the largest and most common scenario (a maxed-out reinforced node).

Virtualisation adds a lot of freedom, but it doesn't come for free.

CCP Greyscale: As to starbases, we agree it's pretty terrible, but we don't want to delay the entire release just for this one factor.

ElQuirko
University of Caille
Gallente Federation
#5 - 2013-12-08 17:04:58 UTC
Reserving this post for reasons of the future

Dodixie > Hek

Grandma Squirel
#6 - 2013-12-08 17:22:39 UTC
It would be very hard to get things reset to the exact state prior to the node crash. Do I lose all my locks? (Disadvantage to the side cap chaining; it allows the enemy to snap primary FCs that would have been locked and/or pre-repped by logi.) Everyone's tanks not running? (Last node crash, my tank obviously turned off, but my entire tank decided to offline itself too.) Bubbles all expired? (Jump out caps.) Doomsday timers ticked down? What happens to drones that were out - do we start with them reconnected, or do we have to wait to reconnect once the battle 'resumes'? Not to mention the opportunity it presents for out-of-system reinforcements to move and get ready.

Lastly, what happens if you develop a crash loop? At a certain point, it makes sense to say that if what was happening caused a problem bad enough to crash the node, maybe restarting with the same conditions that just caused the crash isn't such a good idea.
Zeee Orlenard
Center for Advanced Studies
Gallente Federation
#7 - 2013-12-08 17:38:03 UTC  |  Edited by: Zeee Orlenard
You've got some valid points. I think it's workable though, as I'll explain - we're armchair engineering at this point anyway.

Grandma Squirel wrote:
It would be very hard to get things reset to the exact state prior to the node crash.


This depends entirely on how the soft-reset is implemented. Doing this from within the Python VM without making structural changes to how the server code is set up - sure, that'd be difficult, but not impossible. I'm not saying you serialize out all currently running threads to your crash dump, but rather the important data:

  • Player and ship stat data (health, ammo, coordinates, target lock list, etc...)


This data already exists in memory on the server, so the key question is whether you leave it in Python's native format or convert it into a format that can easily be serialized out to memory. Not even disk, necessarily; the Python VM can be restarted in place without the process being torn down, if you want it to work that way. There's no technical reason for a crash or exception of any kind to actually kill the process running the VM, if the appropriate handlers are registered with the underlying OS.
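
As a toy example of what "a format that can easily be serialized" might look like - the field layout here is entirely made up:

```python
import struct

# Pack each ship's state into a fixed-width binary record that can be
# dumped as-is when the node dies: ship_id, hp, x, y, z.
SHIP_RECORD = struct.Struct("<q f f f f")  # 8 + 4*4 = 24 bytes per ship


def snapshot_ships(ships):
    """ships: iterable of (ship_id, hp, x, y, z) tuples."""
    return b"".join(SHIP_RECORD.pack(*s) for s in ships)


def restore_ships(blob):
    return list(SHIP_RECORD.iter_unpack(blob))


blob = snapshot_ships([(1001, 5200.0, 1.0, 2.0, 3.0)])
print(restore_ships(blob))  # [(1001, 5200.0, 1.0, 2.0, 3.0)]
```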

Grandma Squirel wrote:
Do I lose all my locks? (Disadvantage to the side cap chaining; it allows the enemy to snap primary FCs that would have been locked and/or pre-repped by logi.) Everyone's tanks not running? (Last node crash, my tank obviously turned off, but my entire tank decided to offline itself too.) Bubbles all expired? (Jump out caps.) Doomsday timers ticked down? What happens to drones that were out - do we start with them reconnected, or do we have to wait to reconnect once the battle 'resumes'? Not to mention the opportunity it presents for out-of-system reinforcements to move and get ready.


These are just a bunch of requirements (good ones, too). I don't see any reason why all of them couldn't be implemented - it's just a bunch of state data as far as the system is concerned. Serializing this out once a tick is within reason; your overall data rate to do this is measured in kilobytes per second (back-of-envelope math below). Once you detach it from the Python VM, keeping this state data intact would be pretty trivial. It's the processing on this data that's CPU intensive, as part of the simulation.
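
Back-of-envelope math, using the 24-byte record from my earlier sketch and a made-up fleet size:

```python
BYTES_PER_SHIP = 24     # one packed record from the earlier struct sketch
SHIPS = 4000            # a very large fleet fight, for illustration
TICKS_PER_SECOND = 1    # EVE's simulation runs on one-second ticks

print(BYTES_PER_SHIP * SHIPS * TICKS_PER_SECOND / 1024.0, "KiB/s")  # ~93.75
```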

Grandma Squirel wrote:

Lastly, what happens if you develop a crash loop? At a certain point, it makes sense to say that if what was happening caused a problem bad enough to crash the node, maybe restarting with the same conditions that just caused the crash isn't such a good idea.


Put some limits in the software so that if it repeatedly crashes you revert to the current behavior, followed by an automatic submission of the crash-dump info to your internal bug tracking software (see the sketch below). It'd suck if you got into a crash loop, but it's more of an edge case. I'd be interested in CCP writing up a dev blog on the nature of the node crashes they see most frequently.
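
Something like this sliding-window guard, say (the limits and names are invented):

```python
import time

MAX_SOFT_RESTARTS = 3   # arbitrary cutoff before giving up
WINDOW_SECONDS = 600    # crashes inside this window count as a loop

_recent_crashes = []


def should_soft_restart():
    """Return False (i.e. fall back to today's hard crash) on a crash loop."""
    now = time.time()
    _recent_crashes.append(now)
    # Keep only the crashes inside the sliding window.
    _recent_crashes[:] = [t for t in _recent_crashes if now - t < WINDOW_SECONDS]
    if len(_recent_crashes) > MAX_SOFT_RESTARTS:
        submit_crash_dump()  # hypothetical hook into internal bug tracking
        return False
    return True


def submit_crash_dump():
    print("crash loop detected; dump filed with the bug tracker (pretend)")


for _ in range(5):
    print(should_soft_restart())  # True, True, True, False, False
```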
Lors Dornick
Kallisti Industries
#8 - 2013-12-08 18:09:41 UTC
Grandma Squirel wrote:
It would be very hard to get things reset to the exact state prior to the node crash. Do I lose all my locks? (Disadvantage to the side cap chaining; it allows the enemy to snap primary FCs that would have been locked and/or pre-repped by logi.) Everyone's tanks not running? (Last node crash, my tank obviously turned off, but my entire tank decided to offline itself too.) Bubbles all expired? (Jump out caps.) Doomsday timers ticked down? What happens to drones that were out - do we start with them reconnected, or do we have to wait to reconnect once the battle 'resumes'? Not to mention the opportunity it presents for out-of-system reinforcements to move and get ready.

Lastly, what happens if you develop a crash loop? At a certain point, it makes sense to say that if what was happening caused a problem bad enough to crash the node, maybe restarting with the same conditions that just caused the crash isn't such a good idea.

These are all valid questions.

But did you step back and ponder what the actual problem might be and how to solve it?

Being able to move an overloaded system from one node to another, live, would only solve a minor part of the problem.

CCP Greyscale: As to starbases, we agree it's pretty terrible, but we don't want to delay the entire release just for this one factor.