These forums have been archived and are now read-only.

The new forums are live and can be found at https://forums.eveonline.com/

EVE General Discussion

 
 

EVE API woes - What happened Thursday?

CCP Red Button
C C P
C C P Alliance
#1 - 2011-10-07 22:14:42 UTC  |  Edited by: CCP Red Button
Hello everyone,

Those of you who use one of the many applications and websites that rely on the EVE API probably noticed that this Thursday they were not working particularly well, or at all, for most of the day. In fact the EVE API was largely non-functional for about 14 hours, and I would like to shed a little light on what happened and what we are doing to prevent future occurrences.

The popularity of the EVE API has grown tremendously over the past few years, with an ever larger number of websites and applications making use of the information that is available via the API. This is a really good thing and opens endless possibilities for interacting with and presenting in-game information on a variety of portals. But as the popularity has grown exponentially, so has the load on the service as a whole, and to be honest we have at times struggled to keep up with this growth from both a hardware and a software perspective. In between spurts of hardware upgrades and software improvements there have been periods where usage has grown in leaps and bounds, leaving the API struggling to keep up and vulnerable to sudden peaks and shifts in usage.

This is more or less at the core of what happened Thursday. One of the larger, more popular applications that uses the API is EVEMon. It is used by tens of thousands of players (including a lot of CCP devs) to view skills and attributes, monitor skill training and wallet balance, etc. In version 1.5.0 of EVEMon, released last Friday, the behavior of the application changed slightly so that it would start aggressively retrying if it did not get a properly formatted XML error response back from CCP in the case of an API error.

At the start of extended downtime on Thursday the API was not shut down gracefully, so applications were presented with a default HTTP error page instead of a properly formatted XML one. This caused several thousand EVEMon clients all over the internet to go into overdrive and bombard the server with anywhere from 2 to 50-60 requests per client per second. Once the API was back online after downtime it never stood a chance of recovering from the massive flood: the servers could not keep up, filled up to their max connection limit and keeled over despite our best efforts to increase throughput and capacity. Once the servers reached their connection limit (which took only a few seconds) they no longer responded with a properly formatted XML response but with a generic one, which only served to sustain the request flood.

This was made much worse by the fact that the API was already very fragile and vulnerable to such an overload condition, as we had not taken care to plan for such an event, predictable as it may seem when I reflect back on this. The API only had enough capacity to handle around 5x normal average load, whereas this was on the order of 20x or more, and we had no built-in means to throttle connection attempts.
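To illustrate the failure mode described above, here is a minimal sketch of how a client could back off, rather than retry immediately, when the response is not the XML it expects. This is not EVEMon's actual code. The function names, the delay constants and the assumption that a valid API response has an `eveapi` root element are all hypothetical choices made for the example.

```python
import random
import xml.etree.ElementTree as ET

BASE_DELAY = 1.0         # seconds; hypothetical starting delay
MAX_DELAY = 900.0        # seconds; hypothetical cap on any single wait
EXPECTED_ROOT = "eveapi" # assumed root element of a valid API response


def looks_like_api_xml(body: str) -> bool:
    """Return True if the body parses as XML with the expected root tag."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        return False
    return root.tag == EXPECTED_ROOT


def retry_ceiling(attempt: int) -> float:
    """Upper bound on the wait before retry number `attempt` (0-based)."""
    return min(MAX_DELAY, BASE_DELAY * (2 ** attempt))


def next_delay(body: str, attempt: int, rng: random.Random) -> float:
    """Pick how long to wait before sending the next request.

    A body that is not API XML (for example a default HTTP error page) is
    treated as a sign that the server is in trouble, so the client waits a
    random time below an exponentially growing ceiling.
    """
    if looks_like_api_xml(body):
        return 0.0  # normal path; a real client would honour cache timers here
    return rng.uniform(0.0, retry_ceiling(attempt))
```

The random jitter matters as much as the exponential growth: without it, thousands of clients that failed at the same moment would all retry at the same moment too.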

Once we realized what was happening, and after some attempts at restoring service on our own, we reached out to the EVEMon devs, who responded ultra quickly (massive props due) and released an updated version less than two hours after we contacted them. A fix was ready at around 19:30 GMT and soon after started to propagate. We saw the load decrease until, shortly after midnight, full service was restored and things were back to normal.

Needless to say, there are a number of obvious improvements that need to be made to the API and to the supporting webserver infrastructure as a result. Top of the list is the ability to throttle inbound connections and to selectively allow certain types of requests. We are also planning to greatly increase the capacity of the API servers and to be ready to throw in additional resources at short notice as needed. We encountered some issues troubleshooting what exactly was causing this, so additional debugging tools and the ability to accurately trace through the application have been added to the shortlist of things to improve.
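As an illustration of what throttling inbound requests can look like, here is a minimal token-bucket sketch. It is only one common technique and is not a description of what CCP actually deployed. The class name, the per-client dictionary and the rate and capacity values are hypothetical.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allow bursts of up to `capacity` requests, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity  # start with a full bucket
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        """Return True and spend one token if a request may proceed."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# Hypothetical per-client limiter: one bucket per client identifier.
buckets: Dict[str, TokenBucket] = {}


def allow_request(client_id: str, rate: float = 1.0, capacity: float = 5.0) -> bool:
    """Look up (or create) the client's bucket and ask it for a token."""
    bucket = buckets.setdefault(client_id, TokenBucket(rate, capacity))
    return bucket.allow()
```

With limits like these, a single client sending 50 requests per second would have almost all of them rejected cheaply, before they reach the application behind the limiter.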

In closing, I would like to make it absolutely clear that even though it may have been the EVEMon application that caused the increased traffic, it could ultimately have been any application; we were, plain and simple, unprepared and ill-equipped to handle this properly. That is definitely a lesson learned, and we will be putting a lot of effort over the next days and weeks into making the API more robust, as it is a critical part of our game environment and will only become more important as we evolve.

Yours,
Red Button
The Apostle
Doomheim
#2 - 2011-10-07 22:21:14 UTC
oh Cwibba, you wascal.

Take an aspirin. If pain persists consult your local priest. WTB: An Austrian kangaroo!

Chribba
Otherworld Enterprises
Otherworld Empire
#3 - 2011-10-07 22:22:20 UTC
PURPLE BUTTON! LIKE A BOSS!

Thanks for the update :)

/c

★★★ Secure 3rd party service ★★★

Visit my in-game channel 'Holy Veldspar'

Twitter @ChribbaVeldspar

okst666
Federal Navy Academy
Gallente Federation
#4 - 2011-10-07 22:31:24 UTC  |  Edited by: okst666
You can never plan for all eventualities. Once you start worrying about such issues and planning countermeasures and stuff, you open Pandora's box and will never ever finish implementing what you wanted in the first place, as the monster always grows another pair of heads once you cut one off.

1. If it crashes under certain conditions, change the conditions or fix the issue... rinse and repeat for every other incident that occurs.

2. After 2 years of fixing, rewrite and optimize. goto 1

[X] < Nail here for new monitor

Squizz Caphinator
The Wormhole Police
#5 - 2011-10-07 22:46:42 UTC
Thanks for the write up! I'm glad it wasn't me that caused the problem Lol

Various projects I enjoy putting my free time into:

https://zkillboard.com | https://evewho.com

Desmont McCallock
#6 - 2011-10-08 05:47:46 UTC  |  Edited by: Desmont McCallock
First off, I would like to express my apologies to the EVE Online community, CCP, 3rd party app devs and whoever else was affected, for any inconvenience caused.

We never intended to break the API, but an oversight in the code of version 1.5.0 caused the behavior mentioned in CCP Red Button's post.
The only good thing is that the API DDoS was caused from inside the EVE Online community, so CCP had a way to get in contact with us and we were able to respond in no time. I can't even imagine the consequences and impact of a similar event coming from a source outside the EVE Online community. Indeed, this calls for reinforcements to fortify the services provided by CCP.
For our part, we realized that we have to review and improve the way we do our code testing, so as to prevent EVEMon from behaving unpredictably.

Secondly, I would like to make clear that Chribba is not related to EVEMon in any way (EVEMon is published under BattleClinic and Chribba's services are published under OMG Labs), although you can keep blaming him (joke).

On behalf of the EVEMon Dev Team,
Desmont McCallock (a.k.a. Jimi)
The Apostle
Doomheim
#7 - 2011-10-08 06:01:17 UTC
Desmont McCallock wrote:

Secondly, I would like to make clear that Chribba is not related to EVEMon in any way (EVEMon is published under BattleClinic and Chribba's services are published under OMG Labs), although you can keep blaming him (joke).

On behalf of the EVEMon Dev Team,
Desmont McCallock (a.k.a. Jimi)


Oh dang and bwast. Another popuwar myth evapowates.

Sowwy cwibba.

Take an aspirin. If pain persists consult your local priest. WTB: An Austrian kangaroo!