By Frogboy Posted May 18, 2009 7:30:13 PM
I’ll be writing a lot more on this particular issue in the coming weeks as I’ve had more time to review internal reports.
For those of you just joining us let me bring you up to speed.
Our story so far…
Demigod, a high-profile, AAA action-strategy-role-playing game, was released on April 14th. Well, it was supposed to be released on April 14th, but it actually went on sale at Gamestop stores early due to a…miscommunication between their corporate HQ and their brick-and-mortar outlets. This wouldn’t normally have been that big of a deal, except it happened over Easter weekend and the release servers for the game weren’t up yet. Moreover, it also led to a “warez” version appearing (there’s no copy protection on the game, so the warez version meant someone bravely zipping it up and putting it on a torrent), resulting in over 100,000 people using it – at once – before we were even back from Easter break. Suffice it to say, it wasn’t a pretty picture.
For the first few days, we struggled to migrate people to a different set of servers that only legitimate users had access to. This took about 48 hours. But during this brief window, the game was basically unplayable because you couldn’t even get online – at all. Not surprisingly, we got whacked with some pretty negative first-week reviews.
But our woes weren’t over yet. It became pretty clear that the NAT servers (the servers that negotiate the connection between player A and player B) couldn’t handle the number of users on the game, resulting in a horrible online experience. As other people have pointed out, this sort of thing isn’t unique to Demigod (plenty of other games have had rough online launches), but the big difference is that those other games had a lot more single-player content, whereas Demigod relies more on its multiplayer experience than most games, so it was a much bigger problem.
Like most games, Demigod uses a lot of licensed code. Demigod’s awesome 3D models are powered partly by Granny 3D. The videos in the game are powered by Bink. The sound is powered by Fmod. And the network connectivity was powered by Raknet. These are all very good libraries and used by major publishers.
But Demigod’s network requirements are somewhat unusual and demanding. First, Demigod is peer-to-peer and not client-server. Everyone connects to everyone. Second, the number of people playing is unusual. Yes, some people do play 4 on 4 games of Supreme Commander or Company of Heroes, but typically they’re 1 on 1 or 2 on 2. The more connections, the more complex.
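To put rough numbers on "the more connections, the more complex": when everyone connects to everyone, the number of links grows with the square of the player count, while a single-host setup grows linearly. The little Python sketch below is just arithmetic for illustration. The function names are made up for this example and are not part of Demigod or any library.

```python
def mesh_connections(players):
    """Links needed when every player connects directly to every other player."""
    return players * (players - 1) // 2


def client_server_connections(players):
    """Links needed when every player connects only to a single host
    (the host is counted as one of the players)."""
    return players - 1


# Match sizes mentioned in the post.
for label, players in [("1 on 1", 2), ("2 on 2", 4), ("4 on 4", 8)]:
    print(label, "mesh:", mesh_connections(players),
          "client-server:", client_server_connections(players))
```

So a 4 on 4 game needs 28 peer-to-peer links where a 1 on 1 needs only one, and every one of those links has to be negotiated successfully before the game can start.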
The result was that it was a nightmare to get games going online.
The problems
Demigod’s connectivity problems have basically boiled down to one bad design decision and one architectural limitation. The bad design decision was made in December of 2008 when it was decided to have the network library hand off sockets to Demigod proper. In most games, the connection between players is handled purely by one source. For instance, in Supreme Commander, GPGNet handled the entire connection.
So in Demigod, on launch day, Alice would host a game. Tom would be connected to Alice by the network library and then that socket would be handed to Demigod. Then, Alice and Tom would each open a new socket to listen for more players to join in. As a result, a user might end up using a half dozen ports and sockets, which some routers didn’t like. It also made it incredibly complex to connect people and put a lot of strain on the servers that had to manage all those connections.
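Here is a toy model of that hand-off pattern, to show how the port count creeps up. It is a simplification based only on the description above: the port numbers, the function name, and every player name other than Alice and Tom are hypothetical, and it says nothing about how the actual network library was implemented.

```python
def simulate_handoff(joining_peers, first_port=6000):
    """Model one player's sockets under the hand-off scheme.

    Each time a peer connects on the current listening socket, that socket
    is handed to the game and a fresh socket is opened to listen for the
    next peer. Port numbers are made up for illustration.
    """
    handed_off = []            # (peer, port) pairs now owned by the game
    listening_port = first_port
    for peer in joining_peers:
        handed_off.append((peer, listening_port))  # connection handed to the game
        listening_port += 1                        # new socket opened to listen
    return handed_off, listening_port


# Alice's view of a six-player game: five peers join her one after another.
peers = ["Tom", "Peer3", "Peer4", "Peer5", "Peer6"]
handed_off, listening_port = simulate_handoff(peers)
total_ports = len(handed_off) + 1  # handed-off sockets plus the one still listening
print("ports in use on Alice's machine:", total_ports)
```

Under these assumptions, a six-player game leaves each player with six ports open, which lines up with the "half dozen" figure. A design where one component owns a single socket for all peers would keep that number constant no matter how many people join.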
Now, the architectural limitation came from the way the network library’s database handled things. We still don’t have a clear idea on why it was so limited but this was the overwhelming problem that only got resolved late last week. Here’s how it works:
Alice hosts a game. In doing so, she sends a message to the NAT server (as well as our servers). Tom wants to join so Tom clicks join and it tells the NAT server to begin connecting them. But, it turned out that a relatively small number of people online at once would quickly result in a huge delay in messages being sent back and forth. For instance, when Tom clicks join it sends a message to the server to tell it to start connecting Tom and Alice. But Alice might not get that message for 30 or 40 seconds. That means, for that entire time, Tom and Alice are “attempting to connect” but haven’t even really started because Alice hasn’t even gotten the message. As more people tried to join the game, that delay could get worse and worse. If someone left the game, it could take that amount of time for the server to realize that player had left (meanwhile it was trying to connect them).
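To get a feel for how a delay like that can build up, here is a generic backlog model in Python. To be clear, this is not a diagnosis of the network library (as noted above, we still don't know why it was so limited). It only illustrates the general effect: once messages arrive even slightly faster than a server can relay them, the wait for each new message keeps growing. The rates and the function name are invented for the example.

```python
def delivery_delays(arrivals_per_sec, capacity_per_sec, seconds):
    """Return, for each second, how long a newly arrived message would wait
    behind the backlog before the server gets around to relaying it."""
    backlog = 0
    delays = []
    for _ in range(seconds):
        # Messages come in, the server relays as many as it can.
        backlog = max(0, backlog + arrivals_per_sec - capacity_per_sec)
        delays.append(backlog / capacity_per_sec)
    return delays


# Hypothetical numbers: the server can relay 100 messages per second.
quiet = delivery_delays(arrivals_per_sec=80, capacity_per_sec=100, seconds=150)
busy = delivery_delays(arrivals_per_sec=120, capacity_per_sec=100, seconds=150)
print("delay under light load:", quiet[-1], "seconds")
print("delay under heavy load:", busy[-1], "seconds")
```

With these made-up rates, the lightly loaded server never falls behind, while the overloaded one is 30 seconds behind after two and a half minutes and still getting worse. That is the kind of situation where Tom sits at "attempting to connect" while Alice hasn't even been told he wants to join.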