Dev Blog: Building a Balanced Universe

CCP Prism X
C C P
C C P Alliance
#141 - 2013-12-04 16:41:52 UTC
Sentient Blade wrote:
I've mentioned it elsewhere, but why are these machines not virtualised (or are they?) surely something like vMotion would be able to move high-use systems onto dedicated hardware without the need to pause anything.


Can I say "Because if it was that easy we'd have done it already" and leave it at that? I'd rather not try to elaborate on that very complex subject because I don't know everything about everything and I'd rather not accidentally lie to you.
Vincent Athena
Photosynth
#142 - 2013-12-04 16:43:51 UTC
I've heard the words "Brain in a Box" quite a bit and seen vague descriptions of it having to do with preparing session change data on a separate node. But is there a full description somewhere? What it does, how much load it will remove, and so on?

Know a Frozen fan? Check this out

Frozen fanfiction

CCP Prism X
C C P
C C P Alliance
#143 - 2013-12-04 16:49:14 UTC  |  Edited by: CCP Prism X
Vincent Athena wrote:
I've heard the words "Brain in a Box" quite a bit and seen vague descriptions of it having to do with preparing session change data on a separate node. But is there a full description somewhere? What it does, how much load it will remove, and so on?


It's meant to offload work currently being done by solar system nodes onto a different node, as well as to reduce the total amount of work that has to be done over and over again at certain times. How much it will offload is wholly dependent on the end result and how much of the current code is refactored into C.

I'm not sure if there are external sources with information for you. If I were at the office I could look something up for you or ask people. But I'm not. Perhaps I'll remember tomorrow! ;)
Mioelnir
Brutor Tribe
Minmatar Republic
#144 - 2013-12-04 16:58:57 UTC  |  Edited by: Mioelnir
Sentient Blade wrote:
I've mentioned it elsewhere, but why are these machines not virtualised (or are they?) surely something like vMotion would be able to move high-use systems onto dedicated hardware without the need to pause anything.

You really think a company, running the largest gaming cluster for over 10 years now would not have already bought a stock solution if it worked for them? Cute.

Not quite as cute as the requests to tweak this resource distribution to behave optimally for random load spikes, but still cute.

The load distribution was wacky even for the average/standard case; normal operation. This is what this devblog is about. It's a good fix.

Tracking the movement of large player groups to make better predictions ("yesterday 2000 EU players moved to within 10ly of a system with an EU timer today") could refine the method. I'd assume that process would have a lot of false positives, though, unless it has a way to be fed political metadata. And accounting for cyno jumps, which give far fewer data points for the same path, will be horrible.

That said, collecting player populations for the distributions could provide some good insights. Less in the area of preallocating additional resources based on specific predictions, and more in giving fewer resources to a cluster of systems than it would receive by the traditional CPU metric, because the 500 people that lived there moved on - freeing underutilized resources back into the pool earlier.

@ Prism X:
Over what timeframe are the cpu metrics used for premapping collected? Could a single escalated fight in a usually empty system not skew the metrics, forcing the system to become reinforced the next day? Are outlier datapoints stripped?
Mioelnir
Brutor Tribe
Minmatar Republic
#145 - 2013-12-04 17:05:54 UTC
Vincent Athena wrote:
I've heard the words "Brain in a Box" quite a bit and seen vague descriptions of it having to do with preparing session change data on a separate node. But is there a full description somewhere? What it does, how much load it will remove, and so on?

I don't think there is much more information out there than the bits Veritas leaves all over the place (Fanfest talks, forums etc). But it boils down to solar system nodes currently rebuilding and applying the skill modifiers of each pilot all the time / too often / badly distributed.
So he wants to build a service that does that centrally for all solar system nodes, so a node can just request the ready-to-use in-memory data structure of a player's fully modified ship stats and start computing with it.

My understanding of the bits and pieces I picked up. Slippery when wet. Yadda yadda.
CCP Prism X
C C P
C C P Alliance
#146 - 2013-12-04 17:08:18 UTC
Mioelnir wrote:
@ Prism X:
Over what timeframe are the cpu metrics used for premapping collected? Could a single escalated fight in a usually empty system not skew the metrics, forcing the system to become reinforced the next day? Are outlier datapoints stripped?


It's really not as sophisticated as it should be. Essentially we do know the load from hardware metrics but have to split that load between the many different systems running on the node. To do that we store the time it takes one simulation loop (IIRC, I am not in the office and don't feel like VPNing) to finish for a given system. Using the ratios between that we can estimate the % load that belongs to that system and we use that to evolve the value.

Outliers are factored in here. I'm not sure it would be a good idea to exclude them, as they can be indicative of staging systems. Some outliers are ignored, however: any system that's been moved to the Incursion load balancing group will be ignored while it is there (again IIRC). So a single escalated fight will skew the system for a bit, but it will then start regressing again.
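
A minimal sketch of the ratio idea described above, not CCP's actual code. It assumes the node's total CPU load is known from hardware metrics and that each system has a stored simulation-loop time. The function names, the sample system names, and the exponential smoothing with factor `alpha` are all assumptions made for illustration; the post only says the value is "evolved" and that a spike regresses over time.

```python
def estimate_system_loads(node_cpu_load, loop_times):
    """Split a node's measured CPU load across its solar systems
    in proportion to each system's simulation-loop time."""
    total = sum(loop_times.values())
    if total == 0:
        return {name: 0.0 for name in loop_times}
    return {name: node_cpu_load * t / total for name, t in loop_times.items()}


def evolve(previous, sample, alpha=0.25):
    """Blend a new sample into the stored value.
    Exponential smoothing is an assumed stand-in for the real update rule."""
    return (1 - alpha) * previous + alpha * sample


# Hypothetical node at 80% CPU hosting three systems (loop times in ms).
shares = estimate_system_loads(80.0, {"system_a": 60, "system_b": 30, "system_c": 10})

# A one-off spike pulls the stored value up, then later quiet samples pull it back.
stored = 10.0
after_fight = evolve(stored, 50.0)
next_day = evolve(after_fight, 10.0)
```

With this kind of update, a single escalated fight raises the stored load for a while, and each quiet sample afterwards moves it back toward the usual level.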

This code has pretty much not been touched. I did a minor change to it when we first started noticing empire going all whack. It used to assign load by number of jumps and docks in a system but that's in no way related to the load an empire system will sustain. It might have worked for nullsec, but nullsec runs either cool or burning hot so trying to find the perfect balance there is an exercise in futility. We need fleet fight prediction tools for that and the ones we currently have are riddled with false positives and have been turned off.

But yeah, this is the first step of many we need to do just for the static balancer. Changing the load evolution will probably be the next one. BiB comes first tho.
Rammix
TheMurk
#147 - 2013-12-04 17:21:27 UTC
CCP Dolan wrote:
Check out some math and numbers that make me feel dumb, and make the server feel great with CCP Prism X's new Dev Blog.

You should use this at the in-system level. I mean, split each solar system into several pieces, each assigned to a different node. This could very significantly decrease max TiDi levels for huge battles (like 2k people in local).
For more clarity, this would mean that if 2 huge fighting groups in the same system were in different parts of the system (e.g. planet 1 and planet 7) they would be on different nodes and each node would have much lower level of TiDi.
I think you should seriously think about something like this, because currently TiDi kills the most of fun in epic battles.

OpenSUSE Leap 42.1, wine >1.9

Covert cyno in highsec: https://forums.eveonline.com/default.aspx?g=posts&t=296129&find=unread

Mioelnir
Brutor Tribe
Minmatar Republic
#148 - 2013-12-04 17:27:42 UTC
Rammix wrote:
... because currently TiDi kills the most of fun in epic battles.

No. TiDi resuscitates fleet fight fun, keeping it alive but not entirely healthy.

What we had before ritualistically pillaged and dismembered fleet fight fun.
Lors Dornick
Kallisti Industries
#149 - 2013-12-04 17:57:04 UTC  |  Edited by: Lors Dornick
CCP Prism X wrote:
Vincent Athena wrote:
I've heard the words "Brain in a Box" quite a bit and seen vague descriptions of it having to do with preparing session change data on a separate node. But is there a full description somewhere? What it does, how much load it will remove, and so on?


It's meant to offload work currently being done by solar system nodes onto a different node, as well as to reduce the total amount of work that has to be done over and over again at certain times. How much it will offload is wholly dependent on the end result and how much of the current code is refactored into C.

I'm not sure if there are external sources with information for you. If I were at the office I could look something up for you or ask people. But I'm not. Perhaps I'll remember tomorrow! ;)


What "Brain in a Box" actually means, how it will be implemented and what it will mean to Eve is most likely boxed into CCP Veritas's brain. ;)

CCP Greyscale: As to starbases, we agree it's pretty terrible, but we don't want to delay the entire release just for this one factor.

Rammix
TheMurk
#150 - 2013-12-04 18:12:43 UTC
Mioelnir wrote:
Rammix wrote:
... because currently TiDi kills the most of fun in epic battles.

No. TiDi resuscitates fleet fight fun, keeping it alive but not entirely healthy.

What we had before ritualistically pillaged and dismembered fleet fight fun.

That's why I said "the most of".
They should've seriously thought about further splitting the basic pieces - from systems down to slices of systems - a long time ago. I don't know why they still keep to the system-based approach in 2013.
Currently the perfect way would be area-based in-system load distribution between nodes: let's say there are 3 areas in the system split between 3 nodes; if a new grid is created "geographically" in the 1st area, then players who warp into that area land on the 1st node. This would bring inter-node travel inside one system, somewhat similar to a gate jump, but I think it could be animated smoothly with some short delay in warp. Such a thing could be done to all systems, or maybe for a start just to the reinforced ones.
note: I'm not a programmer, just theorizing from what I "know" about load balance between nodes. If I'm totally wrong please explain that. :)

OpenSUSE Leap 42.1, wine >1.9

Covert cyno in highsec: https://forums.eveonline.com/default.aspx?g=posts&t=296129&find=unread

Andy Koraka
State War Academy
Caldari State
#151 - 2013-12-04 18:26:15 UTC  |  Edited by: Andy Koraka
Maybe I'm misunderstanding something, but as far as I can tell this will only have a negative effect on the quality of game play in regards to already painful fleet combat.

Frankly I don't remember the last time I was in a full fleet and there wasn't heavy TiDi. Every time a solitary 250-man fleet jumps a gate the system spikes to 10% TiDi for 30-45 seconds. Even if every fleet fight were on an individual reinforced node (reinforced nodes are the exception, not the rule), the issue of gate TiDi is going to be exponentially worse under the new regional scheme, since every individual fleet in the area traveling to (or from) the combat system is going to be sequentially triggering gate lag on the same node. It's going to be a particularly painful change given the recent quality-of-life hits to the majority of fleet ships; there's nothing fun or engaging about staring at a warp tunnel for 10 minutes per system the entire trip home.

As far as the metagame is concerned, even without a published node map it's going to be exploited. For example in a defensive Sov war, if most of a region is on the same node it's not going to be hard to find a linked system by trial and error and dock/undock repeatedly to cascade the entire node (most of a region in the current scheme) into a sustained 10% TiDi to discourage siege fleets from grinding structures.

Yes the old system wasn't perfect, but the guy ratting in an empty system halfway across EvE could have just moved over to a different system and continued ratting. Maybe this is the right solution for Empire where loads are usually steady from day to day but it's the wrong approach in Nullsec.
Melek D'Ivri
Illuminated Overwatch Group
#152 - 2013-12-04 19:04:49 UTC
CCP Dolan wrote:
Check out some math and numbers that make me feel dumb, and make the server feel great with CCP Prism X's new Dev Blog.


Pretty sure that explanation is as simple and easily understood as it gets, folks!
Joshua Blue
Brutor Tribe
Minmatar Republic
#153 - 2013-12-04 19:53:06 UTC
Best blog ever!
CCP Phantom
C C P
C C P Alliance
#154 - 2013-12-04 20:41:48 UTC
Gilbaron wrote:
is there any kind of support for university papers ? i might actually be interested (not on a technical level, but for markets or politics)

Yes, there is! It depends a bit on the type of research etc., but yes, in theory we can support academic research and have done so in the past. If you have serious interest and some specifics already in place, please contact the Community team (just send a support ticket with some details).

CCP Phantom - Senior Community Developer

Tasha Saisima
Doomheim
#155 - 2013-12-04 21:10:51 UTC
The busiest system needs to share a node with the least busy system so fewer people are affected
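
One way to read that suggestion, as a rough sketch only: sort systems by load and pair from both ends of the list. The function name and the sample loads are made up for illustration, and this is not how the dev blog's balancer is described as working.

```python
def pair_busiest_with_quietest(loads):
    """Group systems two per node: busiest with quietest, second busiest
    with second quietest, and so on. An odd system out gets its own node."""
    ordered = sorted(loads, key=loads.get, reverse=True)
    groups = []
    lo, hi = 0, len(ordered) - 1
    while lo < hi:
        groups.append([ordered[lo], ordered[hi]])
        lo += 1
        hi -= 1
    if lo == hi:
        groups.append([ordered[lo]])
    return groups


# Hypothetical load figures per system.
groups = pair_busiest_with_quietest({"A": 90, "B": 5, "C": 40, "D": 20, "E": 1})
```

Pairing like this evens out the expected load per node, but everyone in the quiet half of a pair still shares the pain when the busy half spikes.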
Dersen Lowery
The Scope
#156 - 2013-12-04 21:35:07 UTC  |  Edited by: Dersen Lowery
Vincent Athena wrote:
I've heard the words "Brain in a Box" quite a bit and seen vague descriptions of it having to do with preparing session change data on a separate node. But is there a full description somewhere? What it does, how much load it will remove, and so on?


Based on what I remember from CCP Veritas talking about it:

The basic problem is that every time you do a session change, the new node has to query the database for all the information about you: skills, implants, clone, yadda yadda yadda. When 500 people undock, or jump a gate, that's 500 relatively large database queries at once, with the node twiddling its thumbs until the results come back (because it can't guess how your skills impact the particular fit of the particular ship you're flying, etc.). Boom, TiDi.

The "information about you" is the "brain." The "box" is a portable data structure--a cache, really, stored on a dedicated server--and so a handle to where your brain is in which box can be handed from one node to another when you change sessions. Database queries are then decoupled from session changes, and they can be done as needed, asynchronously, by a process running on a server which is not also hosting a star system. Suddenly, fleet undocks, docks, jumps, etc., no longer spike TiDi.
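
To make the handle idea concrete, here is a toy sketch under heavy assumptions. `BrainBox`, `handle_for`, `fetch` and the fake database function are all hypothetical names, the handle is just the character ID, and everything runs synchronously in one process, whereas the description above is of an asynchronous service on its own server.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BrainBox:
    """Toy stand-in for the dedicated cache server."""
    load_from_db: Callable[[int], dict]  # hypothetical slow database query
    _brains: Dict[int, dict] = field(default_factory=dict)
    db_queries: int = 0

    def handle_for(self, character_id: int) -> int:
        """Make sure the brain is cached and return a handle (here, the ID)."""
        if character_id not in self._brains:
            self._brains[character_id] = self.load_from_db(character_id)
            self.db_queries += 1
        return character_id

    def fetch(self, handle: int) -> dict:
        """Resolve a handle to the cached brain without touching the database."""
        return self._brains[handle]


def fake_db(character_id):
    # Stand-in for the skills/implants/clone queries.
    return {"character_id": character_id, "skills": {"Gunnery": 5}}


box = BrainBox(load_from_db=fake_db)
handle = box.handle_for(42)        # first use: one database query
brain_on_new_node = box.fetch(handle)  # session change: only the handle moves
box.handle_for(42)                 # later session changes reuse the cache
```

The point of the sketch is only that the expensive query happens once, and what travels between nodes on a session change is a small handle rather than a fresh set of database round trips.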

Also from memory, the next initiative after that decouples the notification system from the physics engine so that it can run on its own node. Then the physics engine only has to figure out what happened, and it can asynchronously call another process on another core (or node) to tell everyone on grid what happened. That will reduce the level of sustained TiDi caused by a major fleet fight by as much as a guesstimated 40%. If that initiative has a cute name, I haven't heard it yet.
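
A small sketch of that decoupling, assuming a plain producer/consumer queue. The thread here stands in for the separate core or node, and the event strings and function name are invented for the example; nothing in it reflects the real server code.

```python
import queue
import threading


def notifier(events, delivered):
    """Drain events and 'send' them; stands in for a separate notification process."""
    while True:
        event = events.get()
        if event is None:  # sentinel: shut down
            break
        delivered.append(f"notify everyone on grid: {event}")


events = queue.Queue()
delivered = []
worker = threading.Thread(target=notifier, args=(events, delivered))
worker.start()

# The 'physics' loop only works out what happened and enqueues it.
for tick in range(3):
    events.put(f"tick {tick}: ship A hit ship B")

events.put(None)
worker.join()
```

The physics side never waits for notifications to be delivered; it only pays the cost of putting an item on the queue.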

Proud founder and member of the Belligerent Desirables.

I voted in CSM X!

Sentient Blade
Crisis Atmosphere
Coalition of the Unfortunate
#157 - 2013-12-04 21:44:21 UTC
Mioelnir wrote:
Sentient Blade wrote:
I've mentioned it elsewhere, but why are these machines not virtualised (or are they?) surely something like vMotion would be able to move high-use systems onto dedicated hardware without the need to pause anything.

You really think a company, running the largest gaming cluster for over 10 years now would not have already bought a stock solution if it worked for them? Cute.


"Stock Solution"?

I'm not sure you appreciate the complexity it would take to roll out such a massive deploy and use live migration all while pumping the entire local network through virtual switches.

Easy it isn't...
Abdiel Kavash
Deep Core Mining Inc.
Caldari State
#158 - 2013-12-04 21:47:07 UTC
Andy Koraka wrote:
As far as the metagame is concerned, even without a published node map it's going to be exploited. For example in a defensive Sov war, if most of a region is on the same node it's not going to be hard to find a linked system by trial and error and dock/undock repeatedly to cascade the entire node (most of a region in the current scheme) into a sustained 10% TiDi to discourage siege fleets from grinding structures.

I wouldn't be too afraid of that. Keep in mind that intentionally putting extra load on the servers is a serious EULA violation. And since CCP are already closely monitoring system load, your docking shenanigans will show up as a big flare. And as soon as some dev looks closer and sees that there is no actual fighting associated with the extra load, you're in trouble.

People have been given warnings and bans in the past for all sorts of exploits trying to force a node to break.
Mioelnir
Brutor Tribe
Minmatar Republic
#159 - 2013-12-04 22:54:19 UTC
Sentient Blade wrote:
Mioelnir wrote:
Sentient Blade wrote:
I've mentioned it elsewhere, but why are these machines not virtualised (or are they?) surely something like vMotion would be able to move high-use systems onto dedicated hardware without the need to pause anything.

You really think a company, running the largest gaming cluster for over 10 years now would not have already bought a stock solution if it worked for them? Cute.


"Stock Solution"?

I'm not sure you appreciate the complexity it would take to roll out such a massive deploy and use live migration all while pumping the entire local network through virtual switches.

Easy it isn't...

Environment integration and live rollout of a technology have nothing to do with it. First that technology has to solve your particular problem. And yes, VMware ESX / vMotion "live migration" is pretty much a stock solution as far as virtualization goes.

TQ "nodes" are not machines, they are processes. With each process serving multiple solar systems. For that reason, you can't move a high-use solar system by moving a virtual OS around.

As far as workarounds go, one could run one virtual server with a single process running a single solar system for every system, sure. With all the increased overhead that brings along with it. And I am equally sure some Dev at CCP evaluated that already. And the fact that they did not adopt it (or something similar) means it did not work for them.

And if all that is sorted out, the EVE server code runs at 1 Hz. Freezing a node for 2 seconds to copy it over to somewhere else means 2 missed server cycles that the clients have to resynchronize with again.
Which means a major redesign and rewrite of the network code on the client and server side.
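
The arithmetic behind that, as a trivial sketch. The function name is made up, and it only counts whole tick periods covered by a pause; a shorter pause can still delay a tick if it happens to straddle a tick boundary.

```python
import math


def missed_ticks(pause_seconds, tick_hz=1.0):
    """Whole simulation ticks that fall inside a pause of the given length."""
    return math.floor(pause_seconds * tick_hz)


two_second_freeze = missed_ticks(2.0)   # the 2-second freeze at 1 Hz
half_second_freeze = missed_ticks(0.5)  # a hypothetical shorter pause
```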

For those kinds of development resources, you need a really strong business case. And while dynamically reinforcing nodes is a nice target, we push those into TiDi as well. All the time. Which means it's essentially a band-aid. Not something you get a man-year of development effort approved for.
Sentient Blade
Crisis Atmosphere
Coalition of the Unfortunate
#160 - 2013-12-05 01:18:58 UTC
Mioelnir wrote:
Sentient Blade wrote:
Mioelnir wrote:
Sentient Blade wrote:
I've mentioned it elsewhere, but why are these machines not virtualised (or are they?) surely something like vMotion would be able to move high-use systems onto dedicated hardware without the need to pause anything.

You really think a company, running the largest gaming cluster for over 10 years now would not have already bought a stock solution if it worked for them? Cute.


"Stock Solution"?

I'm not sure you appreciate the complexity it would take to roll out such a massive deploy and use live migration all while pumping the entire local network through virtual switches.

Easy it isn't...

Environment integration and live rollout of a technology have nothing to do with it. First that technology has to solve your particular problem. And yes, VMware ESX / vMotion "live migration" is pretty much a stock solution as far as virtualization goes.

TQ "nodes" are not machines, they are processes. With each process serving multiple solar systems. For that reason, you can't move a high-use solar system by moving a virtual OS around.

As far as workarounds go, one could run one virtual server with a single process running a single solar system for every system, sure. With all the increased overhead that brings along with it. And I am equally sure some Dev at CCP evaluated that already. And the fact that they did not adopt it (or something similar) means it did not work for them.

And if all that is sorted out, the EVE server code runs at 1 Hz. Freezing a node for 2 seconds to copy it over to somewhere else means 2 missed server cycles that the clients have to resynchronize with again.
Which means a major redesign and rewrite of the network code on the client and server side.

For those kinds of development resources, you need a really strong business case. And while dynamically reinforcing nodes is a nice target, we push those into TiDi as well. All the time. Which means it's essentially a band-aid. Not something you get a man-year of development effort approved for.


I'll take these in turn...

#1 Yes it's a stock solution, but its method of deciding when to migrate guests between hardware isn't. You'd need some kind of real-time reporting from the solar system servers, collating, and then deciding what gets put on what hardware.

#2 You could go for this approach of 1 system per VM. In a virtualized system this would actually be rather easy to maintain and would provide the best real world gains for many-threads few-intensive workloads.

#3 If you have to freeze something for 2 seconds your migration code isn't working right. You should be able to migrate an entire VM over with maybe half a second's pause, if that. Not that you actually miss the cycles in the first place; their data just stays in the queue. The underlying VM has no idea at all it's been moved.
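
For point #1, the "report, collate, decide" step could look something like this rough sketch. The function name, the 0.8 threshold and the sample reports are all assumptions for illustration; a real scheduler would also have to weigh migration cost and available hosts.

```python
from statistics import mean


def plan_migrations(reports, threshold=0.8):
    """Collate per-system load samples and list the systems whose average
    load exceeds the threshold, busiest first."""
    averages = {system: mean(samples) for system, samples in reports.items() if samples}
    hot = [system for system, avg in averages.items() if avg > threshold]
    return sorted(hot, key=averages.get, reverse=True)


# Hypothetical load samples (fraction of one core) reported by each system's VM.
to_move = plan_migrations({
    "A": [0.9, 0.95],
    "B": [0.2, 0.3],
    "C": [0.85, 0.8],
})
```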