New dev blog: Changes to Toolkit Exported Data

James Bryant
Deep Core Mining Inc.
Caldari State
#21 - 2012-05-03 19:18:31 UTC
CCP Solomon wrote:
As some have correctly noted, the reason for the split format delivery is due to an internal process change in how we manage static data, both during authoring and at run-time. This is a gradual migration effort that will see more and more portions of the static data dump delivered as YAML files.


Can you shed any light on what this process change is and why it was undertaken? That might help us select the correct tools to manage a hybrid YAML/SQL environment like what I'm assuming you guys will be doing.
Andrea Griffin
#22 - 2012-05-03 19:26:31 UTC
I'm a bit befuddled over the choice of YAML but hey, if it works for you guys then that's great. I know that a lot of guys where I work are moving a lot of our configs over from a simple key=value format to YAML.

I'm not sure I would want it for the massive data sets that Eve uses, but it's a standard format, it isn't XML (which is good and bad), and it's very readable. So, what'ev. : > I'm just happy that CCP is awesome enough to provide us with the data in the first place.
Ciar Meara
PIE Inc.
Khimi Harar
#23 - 2012-05-03 19:49:47 UTC  |  Edited by: Ciar Meara
.

Packtu'sa
Nabaal Construction and Industrials Corp
Nabaal Syndicate
#24 - 2012-05-03 19:58:22 UTC
CCP Solomon wrote:
As some have correctly noted, the reason for the split format delivery is due to an internal process change in how we manage static data, both during authoring and at run-time.


Yes, but why? In what way is the data changing such that YAML is the most appropriate format, and/or in what way is YAML the most appropriate format for the existing data?

There may well be a very good reason. I'm just interested in hearing it. :)
Matthew
BloodStar Technologies
#25 - 2012-05-03 20:05:12 UTC
CCP Solomon wrote:
As some have correctly noted, the reason for the split format delivery is due to an internal process change in how we manage static data, both during authoring and at run-time. This is a gradual migration effort that will see more and more portions of the static data dump delivered as YAML files.


The blog itself seems to suggest that only the client is currently using the static data in YAML format. Does this mean that the server (fed as it is through a lovely beast of an SQL Server), still has this data in a database format?

If so, what is the logic behind not providing all the data in both formats?

Or is the plan that all static data will exist only as YAML throughout both client and server?

My concern with this move is that while, as you point out, there are plenty of YAML readers for common programming languages, support for it in more off-the-shelf usage scenarios (ranging from an SQL Express install on someone's desktop, right down to someone knocking together their own home-brew spreadsheet) is rather less complete. Unfortunately, the latter group of data-dump users are unlikely to really even consider themselves 3rd party developers, so the first we'll probably hear from them is the wave of moans when the first of the really key data tables transitions over (the rest of invTypes, for example).

I guess I'd just be happier at accepting the higher barrier to entry that this creates if there was a bit more detail as to the advantages you expect in moving the data to YAML-only. Right now it looks like a shift of format of what is essentially perfectly happy, tabular, relational data, without any obvious benefit.

Chribba wrote:

I'd probably be looking at converting it all back to db since that's what I prefer myself though.


If CCP are really going to push the data as YAML-only, then I can see this being a very popular service, particularly with myself!
CCP Redundancy
C C P
C C P Alliance
#26 - 2012-05-03 20:20:02 UTC
I figure I'll just answer some questions in an incomprehensible techy way.

As an organization, CCP has decided that we benefit from developers being able to work in branches (in a Perforce sense), and working in branches eventually means that you need to be able to change your data with your code and not affect other people. Binary formats and DBs can be perfect for the run-time data requirements that you have, but they're sucky from the point of view of understanding and merging data as a frail human doing an integration. So we want our data in files, and we want to be able to merge them.

So why not CSV? Yes, it would seem to be more appropriate to some of the data we've shown off so far, but at some point you also have to realize that the reason your data is tabular is because you've been storing it in a tabular storage medium. It's also really sucky to deal with foreign key relations between text files. As we progressively migrate more data, and deal with things like moons being the children of planets, we can decide to represent that in the structure of our data.

And why not JSON? Well, JSON is spiffy and all, but we already use YAML as an institution (if you've ever looked at our .red files for our ship assets etc), and it can deal with some things much more nicely than JSON (we use YAML reference support for some things). JSON somewhat suffers for having been built for a language that hasn't had a standard concept of a map until ECMAScript 6 (stringified attribute names on objects just don't count, and this issue carries over into the proposed JSON schema validation standard).

We can output our static data as JSON, it's just not what we want to work with. We asked about this at the fanfest, but I think the detail was probably missed in the noise of the orbital bombardment :D

So we wanted to use YAML, and a no-sql-ish document-like data setup, but this isn't really appropriate for the runtime. In fact, we don't use it for the runtime... we use a structured binary format that's built from the data (think MessagePack) and attempts to minimize memory overhead and disk seek operations. In some cases, we might even use SQLite (woo, standard Python library!) as appropriate for the use-case.

The end result of what we're trying to achieve is better runtime memory overhead performance, with an easier time for our developers to add / remove and change data formats in a human-understandable base format that can be versioned with changes to the code that are associated with it and isolated in branches. Some of the data will get built straight back into relational tables in MS-SQL, but some of it will just sit in a structured format that means that all related data to a particular "thing" is just accessible right there without subsequent join operations or lookups being required. We get to be able to sync our source control and get our data as it was at that point.

An important thing for us is to just try this and see if it can work for us, which is why we're starting so small. Beyond the changes to the data are much bigger changes to our tools and methods of working, not so much based on what the format is, but more on the issue of the data being local, isolated and in files rather than a central authoring DB.

I would suggest that the best long-term solution *might* be to look at NoSQL type databases, of which there are a number of free and very good options, or you can choose to try and maintain scripts that process our data into relational structures.
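For anyone weighing the "scripts that process our data into relational structures" route, here is a minimal Python sketch of what that looks like for the moons-as-children-of-planets case mentioned above. The planet and moon IDs, names, and field names are made up for illustration and are not taken from the actual data dump; the point is only that once the nesting is gone, the foreign key has to be carried by hand.

```python
import csv
import io

# Hypothetical nested data: moons stored as children of their planet.
# IDs, names and field names are invented for this example.
planets = [
    {"planetID": 1, "name": "Planet I", "moons": [
        {"moonID": 10, "name": "Planet I - Moon 1"},
        {"moonID": 11, "name": "Planet I - Moon 2"},
    ]},
    {"planetID": 2, "name": "Planet II", "moons": []},
]


def flatten(planets):
    """Split nested planets into two flat row lists linked by planetID."""
    planet_rows, moon_rows = [], []
    for planet in planets:
        planet_rows.append({"planetID": planet["planetID"],
                            "name": planet["name"]})
        for moon in planet["moons"]:
            # The parent link becomes an explicit foreign-key column.
            moon_rows.append({"moonID": moon["moonID"],
                              "planetID": planet["planetID"],
                              "name": moon["name"]})
    return planet_rows, moon_rows


def to_csv(rows, fieldnames):
    """Render a list of row dicts as CSV text."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


planet_rows, moon_rows = flatten(planets)
moons_csv = to_csv(moon_rows, ["moonID", "planetID", "name"])
print(moons_csv)
```

The same two row lists could just as well be inserted into relational tables; CSV is used here only because it keeps the sketch inside the standard library.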
Dalmont Delantee
Gecko Corp
#27 - 2012-05-03 20:32:06 UTC
Wow, that is seriously nerdgasm speak...I understood about 1 out of 10000 words but still made me shiver :P

James Bryant
Deep Core Mining Inc.
Caldari State
#28 - 2012-05-03 20:39:32 UTC
Thanks CCP Redundancy,

We too are dealing with the difficulties of maintaining some kind of versioning system with DDL and static data; it is really a problem that doesn't have a good solution yet.

I can understand going to files that are easily parseable by CVS/Git/Whatever. It also allows people to check out the files and make their own changes locally without affecting the test database, and without having to load a fresh database copy into their local test environment every time they need something.

VS2010 Premium actually has some good SQL versioning capabilities when used with Team Foundation Server, but I have to say that I'm intrigued by CCP's approach here. Kinda the best of both worlds, in a certain sense (for you guys, a bit less so for us). A bit unwieldy, for multiple join type queries, but that stuff can be handled in code instead of in the database, I suppose.
CCP Nobody
C C P
C C P Alliance
#29 - 2012-05-03 21:02:47 UTC
Matthew wrote:

The blog itself seems to suggest that only the client is currently using the static data in YAML format. Does this mean that the server (fed as it is through a lovely beast of an SQL Server), still has this data in a database format?


No, we will also change the server's static data to YAML because, as CCP Redundancy said, we are moving away from the central authoring DB solution.

The gain in this is that the client will not contain data that it doesn't need and neither will the server, which can't be bad :D
Packtu'sa
Nabaal Construction and Industrials Corp
Nabaal Syndicate
#30 - 2012-05-03 21:28:05 UTC  |  Edited by: Packtu'sa
CCP Redundancy wrote:
...


Cheers. To clarify, does CCP have any plans to add structures which can't easily be represented in a database? (I'm having difficulty imagining what these might be, but YAML can do a lot.)

If there are any performance issues with YAML in third-party applications, I'm sure someone over at the Technology Lab will come up with a more useful package. (Something similar to the binary format that CCP Redundancy mentioned?)

[EDIT]

CCP Redundancy wrote:
I figure I'll just answer some questions in an incomprehensible techy way.

This, please, more of this! I've recently come back to EVE after playing some other in-development games, and it's refreshing to once again chat with devs who respect the player base and are themselves respectable.
Alx Warlord
The Scope
Gallente Federation
#31 - 2012-05-03 21:46:04 UTC
Yammy database :P !!! uhmmnnn tasty!!! Lol


* oh it is not yammy it is yaml... D :
CCP Redundancy
C C P
C C P Alliance
#32 - 2012-05-03 21:51:34 UTC
Packtu'sa wrote:
CCP Redundancy wrote:
...


Cheers. To clarify, does CCP have any plans to add structures which can't easily be represented in a database? (I'm having difficulty imagining what these might be, but YAML can do a lot.)

If there are any performance issues with YAML in third-party applications, I'm sure someone over at the Technology Lab will come up with a more useful package. (Something similar to the binary format that CCP Redundancy mentioned?)

[EDIT]

CCP Redundancy wrote:
I figure I'll just answer some questions in an incomprehensible techy way.

This, please, more of this! I've recently come back to EVE after playing some other in-development games, and it's refreshing to once again chat with devs who respect the player base and are themselves respectable.


I don't recommend YAML for anything where you worry about performance. Check out MongoDB (BSON) or MessagePack as a starting point (also NoSQL in general is an interesting thing to play with if you've been all-relational, but I won't pretend that it's a good solution to everything). If you need to use YAML, make sure you're using a native parser at least (PyYAML + LibYAML, for example).

We'll be sticking to lists and dicts and nested objects (pretty much JSON), and mainly focusing on working out how to convert our existing datasets (that are already in the DB) to this sort of thing without screwing things up for everyone at CCP. Python is very handy at working with this sort of data, so I personally recommend that for transforming it to whatever format you prefer.

This sort of structure: { 1: ['a', 'cat'], 2:['two','dogs'] } is a pain in the ass in a DB... do-able, but I don't want to insist that people build a relational version unless they need to.
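To show why that structure is "do-able" but awkward, here is a rough sketch that pushes it through SQLite using Python's standard `sqlite3` module. The table and column names (`items`, `list_id`, `pos`, `word`) are invented for the example. Each list element becomes its own row, and a position column is needed to keep the order.

```python
import sqlite3

# The structure from the post: integer keys mapping to ordered lists.
data = {1: ['a', 'cat'], 2: ['two', 'dogs']}

conn = sqlite3.connect(":memory:")
# Hypothetical layout: one row per list element, with its position stored
# so the original ordering can be rebuilt later.
conn.execute(
    "CREATE TABLE items ("
    "list_id INTEGER, pos INTEGER, word TEXT, "
    "PRIMARY KEY (list_id, pos))"
)
rows = [(list_id, pos, word)
        for list_id, words in data.items()
        for pos, word in enumerate(words)]
conn.executemany("INSERT INTO items VALUES (?, ?, ?)", rows)

# Getting the nested form back takes an ordered scan plus regrouping.
rebuilt = {}
for list_id, word in conn.execute(
        "SELECT list_id, word FROM items ORDER BY list_id, pos"):
    rebuilt.setdefault(list_id, []).append(word)

print(rebuilt)
```

Two entries turn into four rows and a regrouping loop, which is the overhead being described. Whether that is worth it depends on whether you actually need SQL queries over the data.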
CCP Redundancy
C C P
C C P Alliance
#33 - 2012-05-03 21:57:07 UTC
James Bryant wrote:
VS2010 Premium actually has some good SQL versioning capabilities when used with Team Foundation Server, but I have to say that I'm intrigued by CCP's approach here. Kinda the best of both worlds, in a certain sense (for you guys, a bit less so for us). A bit unwieldy, for multiple join type queries, but that stuff can be handled in code instead of in the database, I suppose.


We evaluated that technology, but determined that it probably wasn't going to fit our needs.

There are a few ways to handle multiple joins in a NoSQL-y way: you can separate the data out into another document collection and do the lookup (like MongoDB document links) and you can also duplicate and pre-embed the data. That sounds wasteful, but if you're sensible about it, it's nowhere near as bad as the memory overhead that Python has (~10MB of data in terms of pure integer/float etc memory can easily blow up to 90MB, which starts to add up if you pickle and unpickle large data structures). We use schemas to omit type and attribute name information (like all of those "graphicID" strings in the raw data), which can be a big factor in more permissive structured data representations. [Side note - this is a big reason why we have typically seen a rise in the memory of the character selection screen each expansion as we add more/new data ]
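The schema idea can be sketched with Python's standard `struct` module. This is not CCP's format, just an illustration under assumed field names and types: the schema is declared once, so the packed data carries no attribute names at all, while the JSON form repeats "graphicID" and friends in every record. It compares serialized sizes only, not Python's in-memory overhead.

```python
import json
import struct

# Hypothetical records; field names echo the "graphicID" example from the
# post, the values are made up.
records = [{"typeID": i, "graphicID": 1000 + i, "radius": float(i)}
           for i in range(1000)]

# Permissive form: every record repeats its attribute names.
as_json = json.dumps(records).encode("utf-8")

# Schema form: field order and types are declared once, outside the data.
# "<" = little-endian, no padding; "i" = 4-byte int; "d" = 8-byte float.
SCHEMA = struct.Struct("<iid")
FIELDS = ("typeID", "graphicID", "radius")

packed = b"".join(SCHEMA.pack(*(rec[f] for f in FIELDS)) for rec in records)


def read_record(buf, index):
    """Decode record number `index` using the shared schema."""
    values = SCHEMA.unpack_from(buf, index * SCHEMA.size)
    return dict(zip(FIELDS, values))


print(len(as_json), len(packed))
```

Because every record has the same fixed size, `read_record` can also jump straight to any record by index without decoding the ones before it.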

Keep in mind that this stuff is heavily built towards static data that's immutable at runtime (do you know how difficult it is to find a key-value storage system library that's built for that particular requirement?). We can build all sorts of indices however we want - planets could be embedded inside of a solar system document, but we could still make efficient indices for looking up planets by ID within that. We can also load the data from disk in a cache-friendly manner if needed.

So in general, when dealing with static data, pre-bake your joins - funnily, we tend to already do this in performance critical databases by denormalizing data (only denormalized relational databases can't do that for lists or parent-child relations).

At least, that's the theory...
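As a rough illustration of pre-baked joins with a secondary index, here is a small Python sketch. The solar system and planet documents are invented for the example; the assumption is only that the data is static, so an index built once at load time never goes stale.

```python
# Hypothetical solar system documents with their planets embedded.
systems = [
    {"solarSystemID": 100, "name": "System A", "planets": [
        {"planetID": 1, "name": "A I"},
        {"planetID": 2, "name": "A II"},
    ]},
    {"solarSystemID": 200, "name": "System B", "planets": [
        {"planetID": 3, "name": "B I"},
    ]},
]

# Built once at load time. This is safe because the data is immutable at
# runtime, so the index never needs to be updated.
planet_index = {
    planet["planetID"]: (system, planet)
    for system in systems
    for planet in system["planets"]
}


def find_planet(planet_id):
    """Return (system, planet) for a planet ID without scanning or joining."""
    return planet_index[planet_id]


system, planet = find_planet(3)
print(planet["name"], "is in", system["name"])
```

The index holds references to the embedded documents rather than copies, so looking a planet up by ID also hands back its parent system with no further lookup.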
CCP Nobody
C C P
C C P Alliance
#34 - 2012-05-03 22:09:27 UTC
..*slow clap*...
Zaotome
The Scope
Gallente Federation
#35 - 2012-05-03 22:58:23 UTC
slow clap? clap! clapclapclap! :D
Packtu'sa
Nabaal Construction and Industrials Corp
Nabaal Syndicate
#36 - 2012-05-04 00:59:09 UTC
CCP Redundancy wrote:
...

Alright, you've convinced me. A unit of Spirits to you! :D
James Bryant
Deep Core Mining Inc.
Caldari State
#37 - 2012-05-04 01:37:22 UTC  |  Edited by: James Bryant
CCP Redundancy wrote:
There are a few ways to handle multiple joins in a NoSQL-y way: you can separate the data out into another document collection and do the lookup (like MongoDB document links) and you can also duplicate and pre-embed the data.

The problem is, in YAML, that's the slowest part, and with a binary format like MessagePack, how the heck are you getting at a specific piece of data you want without unpacking the whole thing? If something as large as invTypes needs to be parsed (or unpacked), that's an awfully large piece of memory (and slow code). For the PHP and other web folks who have to load data for every page, that gets pretty nasty. I suppose that's probably just not the right tool for that particular job, but I'm just trying to flesh out the options.

I can see MongoDB or another No-SQL style as perhaps the weapon of choice for the web guys for that reason if the data ever gets to the point of being outside the realm of what can be handled by a traditional relational format.

I haven't read completely through the MessagePack docs (which are pretty bare), but I'm not seeing random access capability. It is entirely possible I'm completely missing something though.
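One generic way to get random access out of a packed file is to keep a side index of byte offsets, so a single record can be read without unpacking everything else. The sketch below is an assumption about how that could be done in Python, not MessagePack's API and not CCP's actual format; each record is encoded as JSON purely to stay inside the standard library, and the helper names are hypothetical.

```python
import io
import json


def write_indexed(records, out):
    """Write each record as one blob; return {key: (offset, length)}."""
    index = {}
    for key, record in records.items():
        blob = json.dumps(record).encode("utf-8")
        index[key] = (out.tell(), len(blob))
        out.write(blob)
    return index


def read_one(src, index, key):
    """Seek straight to one record and decode only that blob."""
    offset, length = index[key]
    src.seek(offset)
    return json.loads(src.read(length).decode("utf-8"))


# Hypothetical stand-in for a large table such as invTypes.
types = {i: {"typeID": i, "typeName": "Type %d" % i} for i in range(10000)}

data_file = io.BytesIO()  # a real file opened in binary mode works the same
offsets = write_indexed(types, data_file)

one = read_one(data_file, offsets, 4242)
print(one)
```

The index itself still has to be loaded, but it is far smaller than the data, and a per-page web request would then decode one record instead of the whole table.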

There's an additional problem I see for the 3rd party folks, and that's for non-dynamically-typed languages. The YAML (or JSON, or BSON) can have any number of arbitrary data structures. I suppose, like for Java or C#, you could maybe just use a Hashmap.

Quote:
So in general, when dealing with static data, pre-bake your joins - funnily, we tend to already do this in performance critical databases by denormalizing data (only denormalized relational databases can't do that for lists or parent-child relations).

True, and I tend to do the same, trying to denormalize when it makes performance sense, such as adding some information from invTypes to various other API pulls, like assets and the wallet transactions, to avoid having to join them every time.
SkillQueueMonitor
Pator Tech School
Minmatar Republic
#38 - 2012-05-04 02:13:05 UTC
Bout time. That denormalized table inside SQL made my soul hurt.

AND

I never have to install MSSQL ever again.
Lairel Dallocort
Hot Lobster
#39 - 2012-05-04 03:00:20 UTC
As a Linux user who has no access to an MSSQL server, this makes me super happy!
Jinli mei
Dreddit
Test Alliance Please Ignore
#40 - 2012-05-04 05:31:45 UTC
James Bryant wrote:
For the PHP and other web folks who have to load data for every page, that gets pretty nasty. I suppose that's probably just not the right tool for that particular job, but I'm just trying to flesh out the options.


With web-based stuff you can cache it either using a nosql approach like mongo, or something sane people use like memcached. If you think about it hard enough, you realize that most data you're pulling from CCP should likely be in a cached state rather than pinging the database for it or parsing it anyway.
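The caching point can be sketched in Python with `functools.lru_cache`. This is an in-process stand-in, not memcached or Mongo, and `load_type_from_store` is a hypothetical placeholder for whatever slow path you have (a database query or a file parse). Since the static data does not change between dumps, no invalidation logic is needed.

```python
from functools import lru_cache

load_count = 0  # counts how often the slow path actually runs


def load_type_from_store(type_id):
    """Hypothetical slow lookup standing in for a DB query or file parse."""
    global load_count
    load_count += 1
    return {"typeID": type_id, "typeName": "Type %d" % type_id}


@lru_cache(maxsize=4096)
def get_type(type_id):
    """Serve repeat lookups from memory; only misses hit the slow path."""
    return load_type_from_store(type_id)


for _ in range(3):
    get_type(34)
get_type(35)
print(load_count)
```

In a per-request environment like PHP the process does not live long enough for this to help, which is where a shared cache such as memcached comes in; the lookup pattern stays the same.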