Dev Blog: Building a Balanced Universe

Haseo Antares
Production N Destruction INC.
F O R M I C I D A E
#161 - 2013-12-05 03:41:37 UTC
Magic, got it.

We currently have the world's greatest linguists and scientists trying to decode what you just said.

Dersen Lowery
The Scope
#162 - 2013-12-05 04:44:58 UTC  |  Edited by: Dersen Lowery
Sentient Blade wrote:
The underlying VM has no idea at all it's been moved.


And since it's taken a nontrivial amount of time to move relative to the 1HZ physics engine, meaning that the odds are very good that your half a second will cross a tick boundary, that means that every move must be followed by a resync with adjacent systems to get everyone back on the same page, right? If one node is off by a server tick, how do you handle that?
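
As a rough check on those odds, here is a minimal sketch. It assumes the pause starts at a uniformly random point inside a tick, which is an assumption and not something stated in the thread; `crossing_probability` is a hypothetical helper name.

```python
def crossing_probability(pause_s: float, tick_s: float = 1.0) -> float:
    """Chance that a pause of pause_s seconds spans at least one tick boundary.

    Assumes the pause begins at a uniformly random offset inside a tick.
    """
    if pause_s < 0 or tick_s <= 0:
        raise ValueError("pause must be >= 0 and tick must be > 0")
    # A pause starting at offset u crosses a boundary when u + pause >= tick.
    return min(1.0, pause_s / tick_s)


if __name__ == "__main__":
    for pause in (0.5, 0.7, 2.0):
        print(f"{pause:.1f}s pause -> {crossing_probability(pause):.0%}")
```

Under that assumption a half-second move crosses a 1 Hz tick boundary about half the time, and anything of a second or longer always does.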

Proud founder and member of the Belligerent Desirables.

I voted in CSM X!

Abdiel Kavash
Deep Core Mining Inc.
Caldari State
#163 - 2013-12-05 04:52:16 UTC
Dersen Lowery wrote:
Sentient Blade wrote:
The underlying VM has no idea at all it's been moved.


And since it's taken a nontrivial amount of time to move relative to the 1HZ physics engine, meaning that the odds are very good that your half a second will cross a tick boundary, that means that every move must be followed by a resync with adjacent systems to get everyone back on the same page, right? If one node is off by a server tick, how do you handle that?

During TiDi different systems are not running in sync either.

(I'm not saying this as a proof that this will be easy, rather as anecdotal evidence for it.)
Pak Narhoo
Splinter Foundation
#164 - 2013-12-05 04:59:49 UTC
CCP Prism X, is there any relation between the "balanced universe" and the perceived unresponsiveness from this thread?
NinjaTurtle
Pator Tech School
Minmatar Republic
#165 - 2013-12-05 06:01:49 UTC
Great dev blog! Thanks so much for giving us insight into how you balance the clusters, I for one had been wondering what your process was for some time. Can't wait to see the results
Rn Bonnet
Perkone
Caldari State
#166 - 2013-12-05 08:09:20 UTC
Dersen Lowery wrote:
Sentient Blade wrote:
The underlying VM has no idea at all it's been moved.


And since it's taken a nontrivial amount of time to move relative to the 1HZ physics engine, meaning that the odds are very good that your half a second will cross a tick boundary, that means that every move must be followed by a resync with adjacent systems to get everyone back on the same page, right? If one node is off by a server tick, how do you handle that?


Vmotion at least is truly transparent to the underlying VM. You will see a "pause" but incoming network packets etc. are not dropped, just queued while the machine is in motion afaik.
Steve Ronuken
Fuzzwork Enterprises
Vote Steve Ronuken for CSM
#167 - 2013-12-05 11:00:17 UTC
Rn Bonnet wrote:
Dersen Lowery wrote:
Sentient Blade wrote:
The underlying VM has no idea at all it's been moved.


And since it's taken a nontrivial amount of time to move relative to the 1HZ physics engine, meaning that the odds are very good that your half a second will cross a tick boundary, that means that every move must be followed by a resync with adjacent systems to get everyone back on the same page, right? If one node is off by a server tick, how do you handle that?


Vmotion at least is truly transparent to the underlying VM. You will see a "pause" but incoming network packets etc. are not dropped, just queued while the machine is in motion afaik.



Nope.

Set up a continuous ping of a VM, then vmotion it, and you'll see a couple of dropped packets.

Woo! CSM XI!

Fuzzwork Enterprises

Twitter: @fuzzysteve on Twitter

Cerulean Ice
Royal Amarr Reclamation
#168 - 2013-12-05 15:25:39 UTC
I noticed a typo in the 3rd to last image, detailing how the x/y split works to better facilitate the repeated splitting in half.
http://content.eveonline.com/www/newssystem/media/65499/1/wholePowerOfTwoSolution.jpg
In the blue text for the 1st split, 85/64 is not 75.3%. 64/85 is, however. ^^
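
The arithmetic is quick to confirm. A minimal sketch using only the two numbers from the image (`as_percent` is just an illustrative helper):

```python
def as_percent(numerator: int, denominator: int) -> str:
    """Format a ratio as a percentage with one decimal place."""
    return f"{numerator / denominator:.1%}"


if __name__ == "__main__":
    print(as_percent(64, 85))  # prints 75.3%
    print(as_percent(85, 64))  # prints 132.8%
```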
Cygnet Lythanea
World Welfare Works Association
#169 - 2013-12-05 16:43:24 UTC
It's nice to see work done on high sec, even if it took the servers burning up before CCP would admit that highsec exists... LOL
Mioelnir
Brutor Tribe
Minmatar Republic
#170 - 2013-12-05 21:03:44 UTC  |  Edited by: Mioelnir
About the 2 seconds: that's straight from the vendor. So while, in practice, it may not take more than half a second, you still need to design your cluster to be able to handle a 2 second move. Better yet, a 4 second move. If every client disconnects because a move took .7 instead of .5 seconds, you gained nothing.

And as for every solar system on its own VM: yes, that is rather easy to maintain - from the POV of the virtual infrastructure. But it means x30 more connections on the internal end of the session servers. It also means x30 more SQL sessions which probably can't be scaled down by x30. It also means a larger memory footprint for the entire server (x30 more OS instances) and decreased cache efficiency. That's why I called it a workaround.

vMotion works nicely for applications where you can add redundancy via IP failover. For protocols with standing connections and high degree of time synchronization - let's just say it gets complicated fast.

Abdiel Kavash wrote:
Dersen Lowery wrote:
Sentient Blade wrote:
The underlying VM has no idea at all it's been moved.


And since it's taken a nontrivial amount of time to move relative to the 1HZ physics engine, meaning that the odds are very good that your half a second will cross a tick boundary, that means that every move must be followed by a resync with adjacent systems to get everyone back on the same page, right? If one node is off by a server tick, how do you handle that?

During TiDi different systems are not running in sync either.

(I'm not saying this as a proof that this will be easy, rather as anecdotal evidence for it.)

The tick between different systems runs differently. It probably always has. While all TQ nodes will run with similar latencies against the same NTP to keep the cluster internal clocks sync'ed, I doubt CCP sync'ed the server tick. Unless they use a wallclock second to initialize the first tick after starting the process - which actually they might have, thinking about it.

But this is not really that important inside the cluster. There, only the wallclock really has to be sync'ed, so that timestamps mean the same thing to everyone involved. That can be handled; NTP solved that problem decades ago.

The move is much more likely to desync the tick-count between server and client dogma simulation. The clients would be some seconds ahead of the server.
Here the server could:
- skip forward to the clients, discarding input for the skipped ticks
- skip forward, (try to) apply the entire input queue to the next processed tick
- instruct all clients to roll back to its tick, discarding input
- signal the clients a higher TiDi level than the server actually runs at until it has caught up again
In any case, the server would have to be notified by the infrastructure that it has been moved, since the eve clients are untrusted terminals and the server can not trust them even if every connected client agrees that the server-tick is off by the same offset.
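
The four options above can be sketched as a small decision function. This is only an illustration of the list, not how TQ works; `Server`, `Resync` and `resync` are hypothetical names, and the gradual catch-up of the fourth option is not simulated.

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class Resync(Enum):
    SKIP_DISCARD = auto()      # jump to the client tick, drop queued input
    SKIP_APPLY = auto()        # jump to the client tick, apply queued input
    ROLLBACK_CLIENTS = auto()  # clients return to the server tick, drop input
    FAKE_TIDI = auto()         # report extra TiDi until the server catches up


@dataclass
class Server:
    tick: int
    input_queue: list = field(default_factory=list)
    applied: list = field(default_factory=list)


def resync(server: Server, client_tick: int, strategy: Resync) -> tuple:
    """Return (server_tick, client_tick) right after the chosen strategy."""
    if strategy is Resync.SKIP_DISCARD:
        server.input_queue.clear()
        server.tick = client_tick
    elif strategy is Resync.SKIP_APPLY:
        # Everything queued during the move lands on the next processed tick.
        server.applied.extend(server.input_queue)
        server.input_queue.clear()
        server.tick = client_tick
    elif strategy is Resync.ROLLBACK_CLIENTS:
        server.input_queue.clear()
        client_tick = server.tick
    # FAKE_TIDI: ticks and input are left alone; clients are slowed so the
    # server can close the gap over time.
    return server.tick, client_tick
```

Note that the function takes `client_tick` as a trusted input, which is exactly what the post says the server cannot do; in practice that offset would have to come from the infrastructure.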

Btw, I think it's awesome that we as players sit here talking about TQ's cluster architecture.

[Edit]
The most elegant solution would actually be to have the infrastructure send an "intent to move" message to the server. The server could then set TiDi to 100%, completely freezing the universe (similar to how it works at downtime), trigger some "Sol-Node is being moved" message on the client, signal "ready to move" back to the infrastructure. After the move, the infrastructure would send a "move complete" message, and the server would lift the 100% TiDi and continue the game.
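
That handshake could look roughly like the sketch below. The class and method names are hypothetical, the message strings are the ones from the post, and "100% TiDi" is used in the post's sense of a full freeze.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    FROZEN = "frozen"


class SolNode:
    """Hypothetical solar-system node reacting to infrastructure messages."""

    def __init__(self):
        self.state = State.RUNNING
        self.tidi_percent = 0
        self.outbox = []  # (recipient, message) pairs the node has sent

    def on_intent_to_move(self):
        # Freeze first, then tell everyone, then acknowledge the move.
        self.tidi_percent = 100
        self.state = State.FROZEN
        self.outbox.append(("clients", "Sol-Node is being moved"))
        self.outbox.append(("infrastructure", "ready to move"))

    def on_move_complete(self):
        if self.state is not State.FROZEN:
            raise RuntimeError("move complete received without a freeze")
        self.tidi_percent = 0
        self.state = State.RUNNING

    def tick(self, current_tick: int) -> int:
        """Advance the simulation only while the node is running."""
        if self.state is State.RUNNING:
            return current_tick + 1
        return current_tick
```

Because no tick is processed between "ready to move" and "move complete", server and clients would resume from the same tick count no matter how long the move took.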
Diomedes Calypso
Aetolian Armada
#171 - 2013-12-06 06:11:34 UTC
These sorts of blog posts make me love the game even though I tend to think of a python as a snake in the amazon or a pet snake around someone's neck at a park in Berkeley California.

Respect for the intelligence and knowledge of the users.

Treating us like adults.

I love that the company has so firmly decided (lol yes, since the 1000$ pants debacle) not to assume that people who don't really grasp more than the broad strokes will be put off by "too much detail".

Yes I do understand the clusters and understand deviations and balancing etc but get lost or glazed eyed a bit deeper. I love that I'm told more than I want to know on some topics but can suck in the details on topics I'm interested in (start talking the velocity of money and I get real interested)

And .. heck.. I can always start researching terms I don't understand and enjoy the whole thing and be more knowledgeable about computers from playing the game !

.

Blue Harrier
#172 - 2013-12-06 15:40:30 UTC
Can I just pop in and say having read all 9 pages of this thread I wish more threads were like this on the forums.

Constructive talking among a diverse group of some very and some not so very knowledgeable members, no one having tantrums, throwing teddies out of prams, nothing but reasoned arguments.

Some putting forward what if’s, others debating and showing why this would not be possible but leaving room for further debate in case they missed something.

Must be the spirit of Christmas or something, well done to all.

"You wait - time passes, Thorin sits down and starts singing about gold." from The Hobbit on ZX Spectrum 1982.

Katrina Bekers
A Blessed Bean
Pandemic Horde
#173 - 2013-12-06 17:13:17 UTC
Steve Ronuken wrote:
Nope.

Set up a continuous ping of a VM, then vmotion it, and you'll see a couple of dropped packets.


Ping is connectionless and has a timeout of 3 seconds.

A TCP connection is - duh! - connection based, and usually the timeout is at 30 seconds.

Perfect? No.

But a dropped ping doesn't necessarily mean a dropped connection.
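
A toy model of that difference, assuming packets that arrive during the pause are simply lost. The 1 second probe interval, the 1 second first retry and the doubling backoff are assumed values for illustration; the 30 second limit is the figure from this post.

```python
def lost_pings(pause_s: float, interval_s: float = 1.0) -> int:
    """Count probes sent while the VM is paused; each one is simply lost."""
    count, t = 0, 0.0
    while t < pause_s:
        count += 1
        t += interval_s
    return count


def tcp_survives(pause_s: float, first_retry_s: float = 1.0,
                 give_up_s: float = 30.0) -> bool:
    """True if a retransmission lands after the pause but before give_up_s.

    Retries back off exponentially: first_retry_s, then double each time.
    """
    t, wait = 0.0, first_retry_s
    while t < pause_s:  # this attempt fell inside the pause and was lost
        t += wait
        wait *= 2
        if t > give_up_s:
            return False
    return True


if __name__ == "__main__":
    for pause in (0.5, 2.0, 60.0):
        print(pause, lost_pings(pause), tcp_survives(pause))
```

In this model a 2 second pause loses two pings while the TCP connection recovers on a retry, so both readings of the ping test can be true at once.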

<< THE RABBLE BRIGADE >>

Steve Ronuken
Fuzzwork Enterprises
Vote Steve Ronuken for CSM
#174 - 2013-12-06 17:50:33 UTC
Katrina Bekers wrote:
Steve Ronuken wrote:
Nope.

Set up a continuous ping of a VM, then vmotion it, and you'll see a couple of dropped packets.


Ping is connectionless and has a timeout of 3 seconds.

A TCP connection is - duh! - connection based, and usually the timeout is at 30 seconds.

Perfect? No.

But a dropped ping doesn't necessarily mean a dropped connection.



It does mean dropped packets though. /That/ is what I was saying.

Woo! CSM XI!

Fuzzwork Enterprises

Twitter: @fuzzysteve on Twitter

Rain6637
GoonWaffe
Goonswarm Federation
#175 - 2013-12-06 21:34:29 UTC
wormhole mass accumulation needs to be looked at, specifically: how it relates to traffic control. traffic control prevented a wormhole jump, giving me a "you will be cleared to jump within the next X seconds," but also counted my ship's mass against the remainder on the hole, subsequently shutting it down while I stared at a traffic control timer. if quiet systems = sisi-esque dropped jump attempts, the least consideration you could also make is preventing dropped jumps from contributing to wormhole mass limits.
Rain6636
GoonWaffe
Goonswarm Federation
#176 - 2013-12-07 00:43:40 UTC  |  Edited by: Rain6636
I've submitted a bug report, referencing the dev blog, and outlining the scenario in which traffic control will reject a wormhole jump while the ship's mass is still counted toward the hole's mass limit (as if the jump was successfully made). I can't find a bug report number to list here.

tell me if i'm wrong, and if traffic control does not affect mass limit totals under any node/load condition.
Jessica Danikov
Network Danikov
#177 - 2013-12-07 13:20:31 UTC
Andy Koraka wrote:
Maybe I'm misunderstanding something, but as far as I can tell this will only have a negative effect on the quality of game play in regards to already painful fleet combat.

Frankly I don't remember the last time I was in a full fleet and there wasn't heavy Ti-Di. Every time a solitary 250 man fleet jumps a gate the system spikes to 10% tidi for 30-45 seconds. Even if every fleet fight was on an individual reinforced node (reinforced nodes are the exception, not the rule) the issue of gate Tidi is going to be exponentially worse under the new regional scheme since every individual fleet in the area traveling to (or from) the combat system is going to be sequentially triggering gate lag on the same node. It's going to be a particularly painful change given the recent quality of life hits to the majority of fleet ships, there's nothing fun or engaging about staring at a warp tunnel for 10 minutes per system the entire trip home.

As far as the metagame is concerned, even without a published node map it's going to be exploited. For example in a defensive Sov war, if most of a region is on the same node it's not going to be hard to find a linked system by trial and error and dock/undock repeatedly to cascade the entire node (most of a region in the current scheme) into a sustained 10% tidi to discourage siege fleets from grinding structures.

Yes the old system wasn't perfect, but the guy ratting in an empty system halfway across EvE could have just moved over to a different system and continued ratting. Maybe this is the right solution for Empire where loads are usually steady from day to day but it's the wrong approach in Nullsec.


The changes made haven't done much to change this problem significantly: both systems create large areas of connected systems that are all on a single node, the new one just ignores constellation boundaries and balances the (predicted) load across nodes better, while also ensuring all solar systems on a node are fairly local to each other. At worst, it may make the contiguous spaces a little larger.

The static mapper could do a lot more for this issue by striping nodes if the difference between intra-node and inter-node jumps really is significant (especially when scaled up) and the efforts to do so should be fairly minimal. If not, the Brain in a Box is going to be the next big advance in that area.
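
To make "striping" concrete, here is a minimal sketch on a made-up chain of eight systems and two nodes. The layout and the names are hypothetical; it only counts how many gates stay on one node under each mapping and says nothing about which kind of jump is actually cheaper.

```python
def count_jumps(gates, node_of):
    """Split gate connections into (intra_node, inter_node) counts."""
    intra = sum(1 for a, b in gates if node_of[a] == node_of[b])
    return intra, len(gates) - intra


systems = list(range(8))
gates = [(i, i + 1) for i in range(7)]  # a simple chain of systems

contiguous = {s: s // 4 for s in systems}  # systems 0-3 on node 0, 4-7 on node 1
striped = {s: s % 2 for s in systems}      # neighbours alternate between nodes

if __name__ == "__main__":
    print(count_jumps(gates, contiguous))  # prints (6, 1)
    print(count_jumps(gates, striped))     # prints (0, 7)
```

Striping turns nearly every gate into an inter-node jump and spreads a travelling fleet's load across nodes, which is the trade-off being argued over here.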
Rain6636
GoonWaffe
Goonswarm Federation
#178 - 2013-12-07 20:10:36 UTC
still waiting for confirmation that failed wormhole jumps with traffic control messages count against the wormhole mass, but will be looked into. (meanwhile there will be support tickets, handled by uninformed customer service staff)
Alex Logan
OK Researches And Inventions
#179 - 2013-12-07 23:06:04 UTC
I don't think we should trust CCP Prism X.

I don't think libras are serious and trustworthy.

Sorry but I won't read your stuff.
James Amril-Kesh
Viziam
Amarr Empire
#180 - 2013-12-07 23:22:00 UTC
Christ, these changes are awful.
"We noticed that inter-node jumps are less expensive than intra-node jumps"
And then proceeds to put adjacent systems on the same node to increase intra-node jumps.

Enjoying the rain today? ;)