Player Features and Ideas Discussion

A low-investment method to stop prolonged downtime in future

Illiander Aideron

The Scope

Gallente Federation

Likes received: 4

#1 - 2015-07-15 19:54:07 UTC

I'm going to lay out my assumptions, and a proposed solution to the obvious problem that CCP have no "roll-back" method in place for their live server.

This is not a solution to the current downtime, as it requires the server to be in a known good state to start with; it is merely a means to stop this happening again.

Assumptions:
As the data backend is a Microsoft SQL Server, I'm going to assume that you are running the servers on Windows, for whatever historical reasons you might have.
From that assumption, I think it's fair to assume that you're running on stock Intel/AMD processors, rather than anything fancy.

Solution:
Run the server in a VM.
VMs these days run at native speed on Intel/AMD processors if the host and guest OSs match, so there wouldn't be any performance degradation.

Why:
Hard Drive Snapshots

Procedure:
At downtime, after the server is down and before the system restart: take a snapshot of the drive state, deleting the previous day's save of the machine state to save space. (This is *not* a database save, as the database shouldn't be on the same machine as the server, so this snapshot should be quite a small amount of data.)
Then run any updates.
Then restart the system.

If the updates broke things, restore the end-of-day snapshot, then restart the system without the update applied. Then go back to testing. This will fix *any* issue that the server machine might be having, at the cost of removing the patch just added.
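The procedure above can be sketched as a small simulation. This is only an illustration under stated assumptions: `ServerVM`, `downtime`, `apply_update` and `health_check` are hypothetical names, the "disk" is a plain dictionary, and no real hypervisor API is involved.

```python
import copy


class ServerVM:
    """Toy stand-in for the server VM (hypothetical, not a real hypervisor API)."""

    def __init__(self, disk):
        self.disk = dict(disk)   # current drive state
        self.snapshot = None     # only the most recent snapshot is kept

    def take_snapshot(self):
        # Overwriting the old snapshot is the "delete the previous day's save" step.
        self.snapshot = copy.deepcopy(self.disk)

    def restore_snapshot(self):
        if self.snapshot is None:
            raise RuntimeError("no snapshot to restore")
        self.disk = copy.deepcopy(self.snapshot)


def downtime(vm, apply_update, health_check):
    """Snapshot, apply the update, check it; roll back if the update broke things."""
    vm.take_snapshot()
    try:
        apply_update(vm.disk)
        healthy = health_check(vm.disk)
    except Exception:
        healthy = False
    if not healthy:
        vm.restore_snapshot()    # back to the end-of-day state, patch removed
    return healthy


def bad_patch(disk):
    # Hypothetical update that leaves the machine in a broken state.
    disk["build"] = 101
    disk["corrupt"] = True


vm = ServerVM({"build": 100})
ok = downtime(vm, bad_patch, lambda d: not d.get("corrupt", False))
```

Note that this only covers the state of the server machine itself; anything stored elsewhere (the database, the clients) is outside what the snapshot can restore.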

It also has the added advantage that applying an update can be as simple as moving the test-server disk image over to the live server, provided you use clever URL settings for the databases: have the database servers for live and testing use the same URL to identify them to the servers, and be clever about where the local network sends those URLs from the live and test machines. Having the live and test setups on different local networks, each with its own local DNS, will do this simply, but there are ways to accomplish the same thing if you *need* the test and live servers to be on the same internal network for some reason. Or, if you can't/won't use clever URLs for the database servers, you copy the image over and update a single URL on the new live server.
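The "same URL, different network" idea can be illustrated with a toy resolver. Everything here is an assumption made up for the example: the network names, the hostname `db.internal.example` and the addresses are hypothetical, and a real setup would use actual DNS servers rather than a dictionary.

```python
# Each local network has its own DNS view of the same database hostname.
DNS_ZONES = {
    "live-net": {"db.internal.example": "10.0.1.10"},
    "test-net": {"db.internal.example": "10.0.2.10"},
}

DB_HOST = "db.internal.example"  # identical setting baked into both disk images


def resolve(network, hostname):
    """Look up a hostname using the DNS view of the given local network."""
    return DNS_ZONES[network][hostname]


# The same disk image, booted on either network, reaches a different database.
live_db = resolve("live-net", DB_HOST)
test_db = resolve("test-net", DB_HOST)
```

Because the image itself never changes, promoting a tested image to live needs no configuration edit; only the network it boots on decides which database it talks to.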

None of this should be complicated, difficult or interesting for any competent sysadmin to implement.

If any of my assumptions are wrong, please do let me know.

Rawketsled

Generic Corp Name

Likes received: 532

#2 - 2015-07-15 21:30:35 UTC | Edited by: Rawketsled

Then you have to issue a roll-back patch to the clients too.

Ellendras Silver

CrashCat Corporation

Likes received: 218

#3 - 2015-07-15 21:47:22 UTC

I think it is hard to make a solution on an IT issue with so many assumptions.

[u]Carpe noctem[/u]


Illiander Aideron

The Scope

Gallente Federation

Likes received: 4

#4 - 2015-07-15 21:47:31 UTC

That should be utterly trivial.

There should be backup copies of every client version released. And there *certainly* should be copies of the latest few.

Issuing a "new" client that's a copy of an old client really shouldn't be difficult.

Rawketsled

Generic Corp Name

Likes received: 532

#5 - 2015-07-15 22:04:23 UTC

Illiander Aideron wrote:

That should be utterly trivial.

There should be backup copies of every client version released. And there *certainly* should be copies of the latest few.

Issuing a "new" client that's a copy of an old client really shouldn't be difficult.

An efficient patcher is incremental, so it can't just push patch [current_version minus one] out again, or it'd just turn everyone's installs to gobbledygook.
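A minimal sketch of why an incremental patch cannot simply be replayed, and what a roll-back patch would have to be instead. This is not how CCP's patcher works (nothing in the thread says how it does); the data layout and the names `apply_patch` and `make_patch` are hypothetical.

```python
def apply_patch(install, patch):
    """Apply an incremental patch; it is only valid on the version it was built against."""
    if install["version"] != patch["from_version"]:
        raise ValueError(
            f"patch expects version {patch['from_version']}, "
            f"install is at {install['version']}"
        )
    files = dict(install["files"])
    files.update(patch["changed"])       # overwrite or add changed files
    for name in patch["removed"]:
        files.pop(name, None)            # drop files the target version lacks
    return {"version": patch["to_version"], "files": files}


def make_patch(src, dst):
    """Diff two installs into an incremental patch that takes src to dst."""
    changed = {n: c for n, c in dst["files"].items() if src["files"].get(n) != c}
    removed = [n for n in src["files"] if n not in dst["files"]]
    return {
        "from_version": src["version"],
        "to_version": dst["version"],
        "changed": changed,
        "removed": removed,
    }


v1 = {"version": 1, "files": {"client.bin": "A", "data.pak": "X"}}
v2 = {"version": 2, "files": {"client.bin": "B", "data.pak": "X", "new.pak": "N"}}

patch_1_to_2 = make_patch(v1, v2)
rollback_2_to_1 = make_patch(v2, v1)     # a *new* patch, built in the other direction

# Re-pushing the old patch to clients already on version 2 is rejected.
try:
    apply_patch(v2, patch_1_to_2)
    replay_ok = True
except ValueError:
    replay_ok = False

# The purpose-built roll-back patch restores the old file set.
restored = apply_patch(v2, rollback_2_to_1)
```

In this toy the patcher refuses a mismatched patch; a patcher without that check would apply the changes blindly, which is the "gobbledygook" case.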

Ellendras Silver

CrashCat Corporation

Likes received: 218

#6 - 2015-07-15 22:11:53 UTC

Illiander Aideron wrote:

You are either in over your head here OR you should be working for CCP; my money is on the first (no offence).

But you can't offer real solutions without knowing the infrastructure. I am no n00b with computers, but I ain't an expert either; still, I am pretty sure that what I say is correct. If not, feel free to correct me without assumptions.

[u]Carpe noctem[/u]

afkalt

Republic Military School

Minmatar Republic

Likes received: 2,510

#7 - 2015-07-15 22:26:00 UTC

Rawketsled wrote:

Then you have to issue a roll-back patch to the clients too.

Them and the app servers. It's far from just the DB.

That's after you find the fault, which may be logical or conditional. You also need to contend with the possibility of a partial, failed or incomplete backout.

Then you have to do diagnostics to find out what is broken. You ideally need to replicate this elsewhere to be 100% sure; no one likes a "fix" that either fixes nothing or makes things worse.

This work can take hours. Sometimes you don't find the bug here, but you get a fix in, and the root cause work goes on for days, with vendors involved, etc.

Let's say, though, that at some point a decision is made to move from fix-on-fail/move-forward to backing out. If you back out, you're looking at ~3 TB of database alone. A restore of that size isn't fast, even using snapshot technology; that time is not measured in minutes (again, for us, using BCVs/array replication/flash and tens of terabytes; fark only knows how long Windows will take to do that ****).
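A back-of-the-envelope version of that point, for the raw data copy only. The throughput figures below are illustrative assumptions, not measurements of any real system, and `restore_hours` is a hypothetical helper.

```python
def restore_hours(size_tb, throughput_mb_per_s):
    """Hours needed to copy size_tb terabytes at a sustained throughput in MB/s."""
    size_mb = size_tb * 1_000_000        # decimal units: 1 TB = 1,000,000 MB
    return size_mb / throughput_mb_per_s / 3600


# ~3 TB of database at a few assumed sustained rates.
for rate in (100, 500, 1000):
    print(f"{rate} MB/s -> {restore_hours(3, rate):.1f} h")
# prints:
# 100 MB/s -> 8.3 h
# 500 MB/s -> 1.7 h
# 1000 MB/s -> 0.8 h
```

This ignores log replay, verification and testing, all of which come on top of the copy itself.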

Then you need to roll it forward (I'm Oracle, we can do that; I assume whatever DB they have has similar functions) to a point in time before the failure point, but for all you know, it was something planted YESTERDAY that's hit a timer/logical corruption! Then you need to roll app servers and clients back, then you need to test. Then you need to be triple sure. Then you need to push the client fix.

The main takeaway here is that "easy, just use a snapshot" is like saying "easy, just land the plane gracefully on the water if the engines fail".

Also, prod databases on a VM make me a sad panda.

James Baboli

Warp to Pharmacy

Likes received: 1,038

#8 - 2015-07-15 22:26:07 UTC

So, a tangential comment on a programming board was that they had implemented multi-core support for the multi-threaded Python that EVE runs on about 6 months ago, and then there was talk of a "massive, real world, high-uptime" implementation to follow. Bets that it was EVE that the programming board was talking about, and that the bit that broke is related to multi-core support for the Python, thus making it:
A: a deeply technical issue
B: Deep wizardry if it broke at the compiler level
C: breaking any backups from before this was implemented
D: making most of this WMG about what went wrong uninformed and useless, even if someone is correct

Talking more,

Flying crazier,

And drinking more

Making battleships worth the warp

Mariya Oktyabrskaya

Doomheim

Likes received: 6

#9 - 2015-07-15 22:55:46 UTC

The forums get so wacky over extended downtimes

Zan Shiro

Doomheim

Likes received: 900

#10 - 2015-07-15 23:18:50 UTC

Please do tell, OP, how a VM is better than a dedicated physical box.

Keep in mind the following:

You can build a physical box that matches or exceeds the specs of an ESX host (I will assume a VMware ESX system, though it could be a Windows-based hypervisor).

In ESX you have to limit the resources given to the VM. Ideally you want something left unused as a buffer. At 75% max utilization you should be getting leery about load.

A VM does not offset redundancy costs. That failover ESX host (or hosts) has to match and/or exceed the "primary", the primary being the one you prefer to host VMs from. I can move my VMs to ESX hosts at will... we just like to see them on preferred hosts when we vSphere in. So it's actually a few monster ESX host boxes.

Snapshots are not backups... they are more convenient than, say, tape, but they have not replaced tape/drive-based dedicated backup solutions. Nice to have, yes... but tape (or, say, disk-based stuff like EMC has been trying to pitch to us for a while now... wish we had the money for it tbh) is still king for backup.

If a snapshot is used, then depending on how stale it is, you are now replaying transaction logs to get the database up to date. With SQL restores this can be the time killer. I can get a backup off even tape decently fast. It's the replays that can be fun.
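The replay step can be sketched as follows. This is a toy model, not how SQL Server or Oracle implement it: the snapshot is a dictionary, the log is a timestamp-ordered list of `(timestamp, key, value)` writes, and `restore_to_point_in_time` is a hypothetical name.

```python
def restore_to_point_in_time(snapshot, log, stop_before):
    """Start from a snapshot and replay later log entries up to a cut-off time.

    Assumes `log` is sorted by timestamp (an assumption of this sketch).
    """
    state = dict(snapshot["state"])
    for ts, key, value in log:
        if ts <= snapshot["taken_at"]:
            continue                 # already contained in the snapshot
        if ts >= stop_before:
            break                    # stop just before the suspected failure point
        state[key] = value
    return state


snapshot = {"taken_at": 10, "state": {"isk": 100}}
log = [
    (5, "isk", 90),                  # older than the snapshot, skipped
    (12, "isk", 150),
    (15, "ship", "Rifter"),
    (20, "isk", -1),                 # the bad write we want to stop before
]

restored = restore_to_point_in_time(snapshot, log, stop_before=20)
```

The staler the snapshot, the more log entries sit between `taken_at` and the cut-off, which is why the replay rather than the copy can end up dominating the restore time.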

Nafensoriel

Brutor Tribe

Minmatar Republic

Likes received: 269

#11 - 2015-07-16 03:55:17 UTC

Amusing to see people assume Tranquility is a simple server and not a custom-built supercomputer, designed and maintained exclusively for the purpose of running proprietary custom software such as EVE.

It's also hilariously amusing to think the insanely massive raw volume of data that is Tranquility would be "oh so easy" to mirror every single day, 100%, without knowing jack and sh** about exactly how Tranquility is set up.

Oh yes... and this completely ignores how you automatically assumed this custom proprietary software running on custom proprietary hardware will magically work perfectly with a virtual machine for :reasons:.

CiCiP Sux2

Viziam

Amarr Empire

Likes received: 2

#12 - 2015-07-16 04:50:25 UTC

Nafensoriel wrote:

Somewhere in past reading, CCP hinted/mentioned that each system runs on a particular server, but not all systems are on one server. This would suggest a cluster of servers, with each server managing its own world of EVE and its own workload.

If so, that could/would mean that each user's connection terminates at a front-end server, creating a multi-tiered architecture model. As the user goes from one system to another, the front-end client connection is redirected (handed off) to the other server. I'm going to presume that EVE is a multi-tiered model rather than a monolithic model.

With that said, it would mean that the multiple front-end servers could actually be behind some load balancers (e.g. F5 BIG-IP or the like). Linux does have a built-in daemon that could be enabled, so anything goes, depending on the application's requirements.

People throw the word virtualization around like it's so simple. It's true that the most advanced hypervisor is VMware, but you need different versions to take advantage of the HA and/or the intelligent modules. Snapshots give you the ability to roll back within 30 seconds, far faster than reloading data from spinning disk or tape. So there are some great benefits.

What has to be considered is the underlying infrastructure. A hypervisor is specifically designed for servers, ideally with as few vCPUs allocated as possible, so 1-2 vCPUs is best. In this case less is more! Why? Because the hypervisor has to wait until the allocated CPUs are idle before a query can be scheduled onto them. So while apps on a physical server can have up to 24 procs, this is not ideal in the VM world, and 90% of the time the sysadmin will drop the proc count to less than 8 to get optimal performance; I've even seen it go below 4. Again, it's environmental, and a great deal of experimentation is usually required. Cardinal rule number one is never, ever allocate to one VM the total number of physical CPUs on the hardware. Mainstream servers are at 24 or 36 CPUs today on Intel/AMD chipsets.

While people are guessing what's going on, we really don't have a friggen clue, as what I've just mentioned is not even scratching the tip of the iceberg (no pun intended for Iceland).

I'm not even going to add to this the software code and the gazillion customisations that can be had. The mind boggles at what the RCA (root cause analysis) of what we just witnessed in the last 16 hours or so could or could not turn up.

Barrogh Habalu

Imperial Shipment

Amarr Empire

Likes received: 1,144

#13 - 2015-07-16 05:55:49 UTC

Meanwhile, I'm not sure if rollback is what CCP would be willing to do in such a situation in the future anyway.

Future of T3 cruisers - multi-tool they aspired to be instead of sledgehammer they have become


Rawketsled

Generic Corp Name

Likes received: 532

#14 - 2015-07-16 07:15:01 UTC

What happens if the rollback procedures fail?

Tiranius Avetus

Itsukame-Zainou Hyperspatial Inquiries Ltd.

Arataka Research Consortium

Likes received: 15

#15 - 2015-07-16 12:48:46 UTC

Rawketsled wrote:

What happens if the rollback procedures fail?

You'll get extended downtime

Donnachadh

United Allegiance of Undesirables

Likes received: 1,313

#16 - 2015-07-16 13:28:00 UTC | Edited by: Donnachadh

Overall this is the same crap as was posted here
https://forums.eveonline.com/default.aspx?g=posts&t=434766&find=unread

It all comes down to a bunch of people who have no clue how CCP has any of this set up or what software/hardware etc. CCP uses, yet they are some kind of computer gods, or they read something somewhere on the internet (see link above) and they think it will be better.

As in the topic linked, how about we leave the ideas on solving problems to those who actually know how all of this is set up and the people that work with it every day.

Cidanel Afuran

Grant Village

Likes received: 723

#17 - 2015-07-16 16:07:37 UTC | Edited by: Cidanel Afuran

Illiander Aideron wrote:

Remarkable how you understand the unique intricacies of CCP's server environment without ever seeing any detail on how it is set up or run.

Amazing, really.

And if you think rolling back updates is as simple as a single snapshot, you have obviously never worked in application development.

Sierra Payne

Federal Navy Academy

Gallente Federation

Likes received: 15

#18 - 2015-07-16 16:08:59 UTC

*facepalms at OP*

How could you make assumptions without having internal knowledge?

Maldiro Selkurk

Radiation Sickness

Likes received: 602

#19 - 2015-07-17 00:52:44 UTC | Edited by: Maldiro Selkurk

Don't get me wrong, I've practically coined the mantra that CCP only hires devs that got straight D- grades in game development, but even I would credit them with properly and efficiently handling downtime.

Yawn, I'm right as usual. The predictability kinda gets boring really.

HiddenPorpoise

Jarlhettur's Drop

United Federation of Conifers

Likes received: 411

#20 - 2015-07-17 02:49:58 UTC

Maldiro Selkurk wrote:

Don't get me wrong, I've practically coined the mantra that CCP only hires devs that got straight D- grades in game development, but even I would credit them with properly and efficiently handling downtime.

The devs are D-, the server techs are people that don't see the code anymore.

The idea doesn't work. Tranc is a room full of supercomputers that change configuration every night; running a VM on that would be madness, and I don't know how it would help.
