Dev Diary: Synchronization Woes!
V1.12 Sync Test: http://www.ironcladgames.com/sins/Sins112SyncTest.zip
Hello all,
The goal this past weekend was to seek and destroy the elusive desync problem. It's disappointing that it's showing up again with higher frequency, no doubt due to the increase in the number of people joining Sins multiplayer after the 1.1 release. Unfortunately, we didn't get consistent desyncs during the lengthy beta so we were never able to nail it down. I also really want it out of the way before Entrenchment is released. Its multiplayer component is a lot of fun so I don't want it blemished by sync issues.
For the longest time I was sure it was some mod-related problem. We hadn't personally seen a desync in over 6 months; however, when the reports started coming in again after 1.1 went live I decided to dedicate the entire weekend to doing nothing but tracking it down. All of Friday, Saturday and most of Sunday I played with a lot of people who were all as dedicated as I was to eradicating the beast. There were theories to test, combinations to try, players to track down (particularly those who reportedly could produce desyncs consistently), IRC debates, logs to submit and analyze and much more.
Not one problem occurred all of Friday and Saturday but then Sunday afternoon I got word through the grapevine that a mysterious player named "Krunk" was currently the desync king. So a bunch of Sins fans on ICO hunted him down for me and we set up a game. Sure enough - no more than a few seconds into the game - he desynced. I couldn't believe it when the red desync text materialized and proceeded to burn my eyes. Well, I guess there really is a desync problem.
Sh!t.
What is a Sync Bug?
In the world of developing RTS games, there is nothing more painful than a sync bug. They consume vast amounts of time to track down and much of the engine design is focused around methodology to prevent them. Just because I think it’s interesting, I'm going to explain what they are and how they are typically caused.
Unlike MMOs, FPSs or many other multiplayer games, in an RTS there is no master server that has the final say in the current state of the game. You may notice in something like World of Warcraft that your avatar suddenly gets his position corrected. What is happening there is your local simulation of the WoW universe has the character moving a certain way, but then the master server says that is incorrect and tells him to reposition himself. You were out of sync with the boss and the boss set you in your place. For an RTS game it's far too impractical (in many different respects) to have a master server checking over everybody so we rely on determinism to make sure everyone stays in sync. Determinism is about making sure everything happens in the exact same way. If you can guarantee everything happens in the same way, you don't need a master server telling everyone what the results are and you don't have to send much information between the players' computers.
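To make the idea concrete, here is a minimal sketch in Python (not Sins code; `run_simulation`, the seed and the tick count are made up for illustration). Two "machines" run the same simulation from the same starting conditions and end up in the same place without ever exchanging their state:

```python
import random

def run_simulation(seed, ticks):
    """Hypothetical toy simulation: a ship drifts by pseudo-random steps.

    Integer math only, so the result depends on nothing but the seed
    and the number of ticks.
    """
    rng = random.Random(seed)  # same seed gives the same sequence everywhere
    x = y = 0
    for _ in range(ticks):
        x += rng.randint(-3, 3)
        y += rng.randint(-3, 3)
    return x, y

# Each machine runs the whole simulation locally; nothing is sent between them.
my_machine = run_simulation(seed=2008, ticks=500)
your_machine = run_simulation(seed=2008, ticks=500)

print(my_machine == your_machine)  # True
```

The only things that would ever need to cross the network in a setup like this are the shared starting conditions and the players' inputs.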
Here is a simple example from Sins: if an AI player on my machine randomly decides to attack player X, then I need to trust that the AI on your machine will also randomly decide to attack Player X. There is no communication between our two computers; the synchronization is implicit in the math, logic, and structure. If something goes wrong with this, the AI on your machine may decide to attack someone else. From this point onward our universes diverge and we see completely different results, or in the worst case our games crash. Sometimes the divergence starts so small and grows so slowly that we don't notice for a very long time, if at all. Regardless, any form of divergence is a desync, or a sync bug.
Sync Bug Causes
So what can cause the divergence? Why are my ships in a different position than yours? Why did the AI make a different decision on my machine than on yours? Here are a few examples straight from Sins:
1. First, we might be using different CPUs. Different architectures can generate slightly different results, particularly between brands (say Intel vs AMD). This usually comes from the Floating Point Unit, so one of the first things we do is make sure we synchronize the FPUs on everyone's machines using a special command. You may remember a sync bug shortly after we released the ability to load up mods much earlier in the year. This was caused by the mod setup codepath bypassing the FPU Control call that the regular setup of the game used. From that point onward, there is a small chance your mathematical calculations are going to give slightly different results than mine, which usually shows up first as a miscalculation in the orientation or position of a ship since there is a lot of floating point math going on there (particularly with the matrix multiplies required for rotation).
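The effect of mismatched floating point settings can be imitated in Python. This is only a sketch under stated assumptions: it does not touch the real FPU control word, and `to_f32`, `rotate` and `spin` are made-up names. It rounds every intermediate result to 32 bits on one "machine" and leaves it at 64 bits on the other, then applies the same small rotation many times:

```python
import math
import struct

def to_f32(value):
    """Round a 64-bit Python float to the nearest 32-bit float."""
    return struct.unpack("<f", struct.pack("<f", value))[0]

def rotate(x, y, angle, round_fn):
    """Rotate the point (x, y) by angle, rounding each result with round_fn."""
    c, s = math.cos(angle), math.sin(angle)
    return round_fn(x * c - y * s), round_fn(x * s + y * c)

def spin(steps, round_fn):
    """Apply the same small rotation 'steps' times to a ship heading."""
    x, y = 1.0, 0.0
    for _ in range(steps):
        x, y = rotate(x, y, 0.01, round_fn)
    return x, y

full_precision = spin(1000, lambda v: v)  # one machine keeps 64-bit results
reduced_precision = spin(1000, to_f32)    # the other rounds to 32 bits

print(full_precision == reduced_precision)                # False
print(abs(full_precision[0] - reduced_precision[0]) < 1e-3)  # True: tiny, but not zero
```

The difference is far too small to see on screen, but a deterministic simulation only stays in sync if the results are bit-for-bit identical, which is why every machine has to run with the same floating point settings.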
2. Next, we might call non-deterministic operations in a deterministic code block. An RTS engine is really broken up into two separate parts: the Simulation and the Presentation. The Simulation is what is actually happening (AI, physics, gameplay, etc.) and is the part that has to be deterministic and in sync. The Presentation (rendering, particle systems, etc.) is what you see and it doesn't have to be in sync or be deterministic. Typically, the Presentation is a custom interpretation of the simulation (e.g. it looks at the simulation and decides how best to show you that information based on how powerful your computer is). For example, the simulation says that one of your ships blew up. The Presentation then realizes you have a very powerful graphics card so it decides to render a ton of particle effects to make the explosion look pretty. On an older graphics card it may decide to show a boring white blob grow and shrink. Now to make the nice pretty explosion the Presentation may make many calls to a random number generator to spew various fireball images in random directions while the crappy white explosion didn't make any calls to the random number generator. It's important to note here that random number generators aren't really random, they just spit out numbers that look to humans to be random but really they are numbers that follow a very predictable pattern and order. The generator "remembers" where it was the last time it was called so that each successive call doesn't generate the same starting pattern over and over again. But what happens if the AI decides to use that same random number generator to randomly decide which player to attack? Because my pretty explosion made many calls to it and your crappy explosion didn't make any, our random number generators will generate different results because they were left at different positions in the sequence of numbers. Your random numbers are behind mine in the sequence.
In order to solve this problem we have to use two separate random number generators - the Deterministic Generator and the Non-Deterministic Generator. Everything in the Presentation calls the ND-Generator and everything in the Simulation calls the D-Generator. One of the early sync bugs in Sins was caused by the autocast code on the Novalith cannon calling the ND-Generator when trying to decide which enemy planet to fire at. This of course would give completely different results on every machine. It took a long time to find this one because (a) it takes a long time to tech up to the Novalith and (b) not many people use the Novalith's autocast.
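The two-generator split can be sketched in Python as follows. This is an illustration, not the Sins implementation: the `Machine` class, its method names and the numbers are all hypothetical. With `shared_generator=True` the Presentation draws from the Simulation's generator (the bug); with `False` it gets its own.

```python
import random

class Machine:
    """Hypothetical model of one player's computer."""

    def __init__(self, seed, fancy_graphics, shared_generator):
        self.sim_rng = random.Random(seed)  # D-Generator: same seed on every machine
        # ND-Generator: unseeded, allowed to differ per machine.
        # In the buggy variant the Presentation reuses the D-Generator instead.
        self.fx_rng = self.sim_rng if shared_generator else random.Random()
        self.fancy_graphics = fancy_graphics

    def render_explosion(self):
        """Presentation: only a powerful machine draws random fireballs."""
        if self.fancy_graphics:
            for _ in range(50):
                self.fx_rng.random()

    def ai_pick_targets(self):
        """Simulation: the AI makes five random decisions."""
        return [self.sim_rng.randrange(1_000_000) for _ in range(5)]

def run(shared_generator):
    fancy = Machine(seed=42, fancy_graphics=True, shared_generator=shared_generator)
    plain = Machine(seed=42, fancy_graphics=False, shared_generator=shared_generator)
    for machine in (fancy, plain):
        machine.render_explosion()
    return fancy.ai_pick_targets(), plain.ai_pick_targets()

buggy = run(shared_generator=True)
fixed = run(shared_generator=False)
print(buggy[0] == buggy[1])  # False: the pretty explosion advanced the shared generator
print(fixed[0] == fixed[1])  # True: the Simulation's sequence was never touched
```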
3. Finally, desyncs can be caused by bad state initialization. One of the key ideas of determinism is that given the same initial conditions, a series of operations on the state of the system will generate the same result on any machine. Naturally, if the initial conditions are different you are screwed from the get go. When programming RTS games it's very important that when you create various objects (ships, buildings, etc.) they always have the same state from the start of the game. To be honest we rarely release code to the public that has bad state initialization because this type of desync is pretty easy to detect as soon as the game starts (as opposed to the other kinds, which take a long time to occur) and our testers don't have to play for hours upon hours to see if one exists. However, there are some special cases where bad state initialization sneaks in and is very difficult, if not near impossible, to detect. In a sense, bad state initialization is both the easiest and most difficult type of desync to find. It turns out the sync bug I was tracking on the weekend was one such bug, has existed since last spring, and until Sunday afternoon I swore it didn't exist. Here is what caused it and why it was so elusive:
The Failing Market:
The market system has two special state variables called "stateStartTime" and "stateEndTime" that control the time interval of various market states (e.g. Metal Boom, Crystal Crash, etc.). These values were not initialized properly. As I said above this is typically caught pretty quickly but this case falls under the very improbable. If two players start their first multiplayer game, both their market state variables will be incorrectly initialized in the same way, so even though they are incorrect, at least they are in sync. Every time they end a game those two values are left in whatever state they were in for the start of the next game. But even then, those two players can continue playing all day long with each other without any sync problems. So suppose they decide to play against someone else. That new player's market values are not screwed up in the same way theirs are. Normally, in this case they would go out of sync right away and we wouldn't have much of a problem. Easy find, easy fix. Nope, not in this case. How is this possible? It’s clear that everyone's market values are completely different but they stay in sync? As I said to myself on Sunday, "wtf!!!???!!!"
The reason they stay in sync is because the market simulation code that uses those particular state values is very rarely executed. It takes a very particular set of conditions to cause the market to enter the state that will use these values to determine the evolution of the market.
You need the following conditions to get this sync bug to occur:
1. One player who already played a game of Sins.
2. His original game must have entered one of a few, very rare market states.
3. This player must play someone who he hasn't already played a game with.
4. He must not have restarted Sins.
5. Their new game must also enter the same, very rare market state.
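The conditions above can be reproduced in a toy model. This Python sketch is purely illustrative: the real market code is not shown in this post, so the `Market` and `Client` classes, the prices and the "boom" rule are all invented. The only things taken from the description are the two time variables, the fact that they survive from one game to the next, and the fact that they are only read in a rare market state.

```python
BOOM_LENGTH = 10
TICKS_PER_GAME = 100

class Market:
    def __init__(self):
        self.price = 100
        self.in_boom = False
        self.state_start_time = 0
        self.state_end_time = 0

class Client:
    """One running copy of the game; its Market object outlives each match."""

    def __init__(self, fixed):
        self.market = Market()
        self.fixed = fixed

    def play(self, boom_at=None):
        m = self.market
        m.price = 100
        m.in_boom = False
        if self.fixed:  # the fix: reset the two time values at game start
            m.state_start_time = 0
            m.state_end_time = 0
        for tick in range(TICKS_PER_GAME):
            if tick == boom_at:  # the rare market state is entered
                m.in_boom = True
                if m.state_end_time == 0:  # assumes "nothing scheduled yet"
                    m.state_start_time = tick
                    m.state_end_time = tick + BOOM_LENGTH
            if m.in_boom:
                if tick < m.state_end_time:
                    m.price += 5
                else:
                    m.in_boom = False
        return m.price

def scenario(fixed, second_boom_at):
    veteran, newcomer = Client(fixed), Client(fixed)
    veteran.play(boom_at=50)  # earlier game that entered the rare state
    return veteran.play(boom_at=second_boom_at), newcomer.play(boom_at=second_boom_at)

print(scenario(fixed=False, second_boom_at=None))  # (100, 100): stale values never read
print(scenario(fixed=False, second_boom_at=20))    # (300, 150): desync
print(scenario(fixed=True, second_boom_at=20))     # (150, 150): in sync again
```

The middle case is the elusive one: the veteran's values have been wrong since the earlier game, but nothing diverges until the new game also enters the rare state.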
So on Sunday, I finally got to play against two players (Krunk and ZanZ) that met these rare conditions. After desyncing with them, they sent me their sync logs and I was able to compare them against my own to determine that their market state variables differed and using that information I could trace back what caused the divergence. The process of detecting a divergence and tracing its causes is also a very interesting topic so maybe I'll do a write up on that if there is some interest.
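As a rough idea of what comparing sync logs can look like, here is a small Python sketch. The log format is an assumption (one dictionary of simulation values per frame), and `checksum` and `first_divergence` are hypothetical helpers, not the actual Sins tooling:

```python
import zlib

def checksum(state):
    """Checksum one frame's state in a fixed key order."""
    text = ";".join(f"{key}={state[key]!r}" for key in sorted(state))
    return zlib.crc32(text.encode("utf-8"))

def first_divergence(log_a, log_b):
    """Return (frame, differing fields) for the first mismatching frame, or None."""
    for frame, (a, b) in enumerate(zip(log_a, log_b)):
        if checksum(a) != checksum(b):
            fields = sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))
            return frame, fields
    return None

# Made-up logs from two machines: they agree on frame 0 and split on frame 1.
mine = [{"metal": 400, "stateEndTime": 0}, {"metal": 410, "stateEndTime": 30}]
theirs = [{"metal": 400, "stateEndTime": 0}, {"metal": 410, "stateEndTime": 60}]

print(first_divergence(mine, theirs))  # (1, ['stateEndTime'])
```

Knowing the first frame and the first field that differ is what makes it possible to trace back to the code that wrote the value.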
Before we officially release 1.12 I'd like to have the fix tested with a lot more people. You can grab a special 1.12 build at http://www.ironcladgames.com/sins/Sins112SyncTest.rar or http://www.ironcladgames.com/sins/Sins112SyncTest.zip if you want to give it a shot. It also fixes the buff stacking issue. Just extract the exe into your Sins install folder and run it. You will only be able to play against people also using this build so don't overwrite your 1.11 exe if you want to jump back and forth between versions. Also keep in mind that this new exe will need to be allowed through your firewall.
It's also possible there is another sync bug out there but I doubt it very much given that all the logs on the weekend pointed to the same cause. But just in case, I won't be making any "monkey's uncle" claims like I did before.
Blair
Special thanks to everyone who participated in tracking this down, especially:
(in no particular order)
Annatar
Raknor
HowDidYouDoThat (HowThe?)
Krunk
ZanZ
CoolJets
Cykur
Sting
and of course SpaceFish for his sync snapshot code.
My guess is that one or more memory registers are not cleared after ending a game. Thus joining a new game with wrong data in a certain register will cause an error (since the players' data differs). That's why Blair wrote to close and restart the client after each game.
...but, i am only a dumb user, so don't nail me on that
I got logs from Soduka, but not from the other two listed here. Is Soduka perhaps one of them?
Sounds fair enough... it's pretty easy to spot someone who didn't restart his game either: loading goes very fast if you already played a game, probably because most stuff is already in this register....
BTW whity, how did that replay turn out? I looked at it again, even restarted once, but somehow it still doesn't show him building the massed capital ships, even though I'm very sure I watched that replay after that game and he did exactly the same thing. Maybe it's something he did outside of the game, something the replay couldn't record? (That still doesn't explain why it showed after watching the replay for the first time, though.)
No...well at least I don't think. It turns out that Shalom posted right above me, so you might want to PM him. As for JohnJames, I have no idea. I mostly just know people from their online screenname.
Hi Blair, I have a question. Since the above conditions are very specific, how is it that a single player (Krunk) sees this on a regular basis? Does this mean he doesn't shut down his Sins often? Sounds unlikely to me.
So, since you did not mention it in your post (which I find excellent reading - thanks!), I have to ask what you do with out-of-order or randomly dropped network sync issues. For example, assume a user has a poor net connection (note: poor net connection, not completely dropped connection). Actions from other players (or his own) are always sent/received out of order or missed entirely. Wouldn't that also result in a desync? I assume that you guys already take care of such cases (re-transmits, serializing actions, ACKs, etc etc), but what if there is an extended delay/loss in the transmission of such "actions" between MPs - would that result in a desync? And how do you get around it, while other players' simulations cannot be "paused" to wait for the one poor-net-connect guy to "catch up"..
PS. I do believe that you should be able to see such net-related problems from your log files, but again, since you didn't mention it...
Sorry, follow up to my post.. Let me give you an example scenario about what I'm talking about using your deterministic RNG.. Assume very simple 1 x 1 MP..
Init: P1 and P2 RNG seed=1
Time 0: P1 execute action 1.1, RNG seed now at 1
Time 1: P1 -> send sync action 1.1 to P2, waiting for ACK from P2
Time 2: P2 execute action 2.1, RNG seed now at 1
Time 3: P2 -> send sync action 2.1 to P1, waiting for ACK from P1
Time 4: P2 receive action 1.1 from P1, send ACK to P1
Time 5: P2 execute action 1.1, but RNG seed now at 2
Time 6: P1 receive action 2.1 from P2, send ACK to P2
Time 7: P1 execute action 2.1, but RNG seed now at 2
Time 8: P2 receive ACK from P1
Time 9: P1 receive ACK from P2
Although it may seem that both P1 and P2 are in sync, isn't it true that action 1.1 was executed with RNG=1 on P1, but P2 had RNG=2 when it executed 1.1? Same with action 2.1 but reversed. I don't know if this sort of scenario applies to Sins, so I'm just asking..
EDIT: when I say "RNG seed", I really mean the position in the RNG sequence. I assume the real RNG seed is identical across P1 and P2, and yes, as Blair said, RNGs are not "really" random, they follow a mathematically known sequence..
It means Krunk has a different sync bug or he's using 1.11 or he's playing with people who are using 1.11. I didn't go into the full list of desync situations in my write up, I just highlighted three cases. Believe me, there is much more to sync bugs than what I outlined. With regard to the network issues, I'd have to do another full dev diary to answer just the tip of the iceberg on that topic.
Just the tip?
LOL!!
Ah damn.. I thought with Krunk, 1.12 is actually pretty close! Oh well.. I don't envy you guys at IC office right now.. I've had my share of terrible bugs which are hard to find (while customers are banging on the door) - the "need 10,000 test runs over 50 months to find the cause" kind of bug.. Best of luck to you guys and if I could, I'd buy you all a keg to keep your spirits up!
And keep the productivity down. Unless they code better while drunk. A guy I went to Uni with thought he did!!!
You don't want to do that if you can avoid it. Sending position, orientation, status, etc information for every single unit, planet, weapon trail etc in the game every frame will overwhelm your network connections in short order, especially in a large game. Much better to send the list of orders. Plus, this allows replays.
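A minimal sketch of the "send the orders, not the state" idea, in Python. The order format and the `apply_order` and `simulate` functions are made up for illustration; the point is only that a short list of orders is enough to rebuild the full state, which is also why a replay can be stored as an order list.

```python
def apply_order(state, order):
    """Apply one order to the game state (a dict of ship name -> position)."""
    kind, ship, value = order
    if kind == "spawn":
        state[ship] = value
    elif kind == "move":
        x, y = state[ship]
        dx, dy = value
        state[ship] = (x + dx, y + dy)

def simulate(orders):
    """Rebuild the whole game state from nothing but the order list."""
    state = {}
    for order in orders:
        apply_order(state, order)
    return state

# This short list is all that has to cross the network or be saved to disk.
orders = [
    ("spawn", "frigate", (0, 0)),
    ("move", "frigate", (3, 4)),
    ("spawn", "cruiser", (10, 10)),
    ("move", "frigate", (-1, 2)),
]

live_game = simulate(orders)
replay = simulate(list(orders))  # same orders played back later

print(live_game)            # {'frigate': (2, 6), 'cruiser': (10, 10)}
print(live_game == replay)  # True
```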
The other big advantage to parallel simulation is that it avoids cheating by the server-host. If I'm the server, and I hack my code to say that my ships have triple the normal hitpoints, all the clients would accept it. But in a peer-to-peer parallel simulation, if I change my ship's health, I'll just get a desync as soon as my ship survives on my machine but dies on the other player's machine.
I have been developing a game for the past few years so I understand the level of dedication and effort of all who have worked on and continue to work on this project. Keep up the good work! By the way, updated to the 1.12 beta patch today after talking to my friend while playing 1.11 and getting repeated de-syncs. Worked well for us!
Auido
Peer-to-peer parallel simulation does not stop the cheating. Yes, you are correct that you could not effectively modify the hitpoints. However, with no dedicated server someone could build with no restrictions on materials because there is no authority monitoring a player's materials. It would be possible for someone to flood the game with ships and it would not be rejected by any other client, because the client puts in the game whatever it is told to put in the game.
OK, I may be wrong, but I think that's also not possible in Sins' case. Consider if we have a MP game, and my copy has been hacked to give me 10x more metal, crystal, etc.. A desync would occur pretty quickly if not right away since my starting values on each of the other players' machines would not match. Remember, in this simulation, though I may be the "remote" guy, the other client PCs will also have to simulate my stats, etc etc to a certain degree. At some point, these differences will screw up the AI, the simulation, etc etc.
Heck, let me just say that one cannot produce good code when one is depressed. How much spirits is required to reach the zone varies from programmer to programmer, ha ha ha!
I'll throw in for a keg
Everything is duplicated across each machine -- including resource count. So when I build a ship using non-existent materials, the other game is going to reject it -- and desync the instant I give a command to that non-existent ship.
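Here is a small Python sketch of that argument. It is only an illustration under stated assumptions: the `Peer` class, the ship cost and the player names are invented. Every machine keeps its own copy of every player's resources and checks each build order against that copy.

```python
SHIP_COST = 300

class Peer:
    """Hypothetical model of one machine's copy of the whole game."""

    def __init__(self, metal_by_player):
        self.metal = dict(metal_by_player)
        self.ships = {player: 0 for player in metal_by_player}

    def apply_build_order(self, player):
        """Build a ship only if this machine's copy says the player can pay."""
        if self.metal[player] >= SHIP_COST:
            self.metal[player] -= SHIP_COST
            self.ships[player] += 1
            return True
        return False

    def snapshot(self):
        return (tuple(sorted(self.metal.items())), tuple(sorted(self.ships.items())))

honest = Peer({"alice": 400, "bob": 400})   # what bob's machine believes
hacked = Peer({"alice": 4000, "bob": 400})  # alice's machine, edited to give her metal

# Alice issues three build orders; both machines process the same orders.
for _ in range(3):
    for peer in (honest, hacked):
        peer.apply_build_order("alice")

print(honest.ships["alice"], hacked.ships["alice"])  # 1 3
print(honest.snapshot() == hacked.snapshot())        # False: the games have desynced
```

The extra ships exist only on the hacked machine, so the cheat shows up as a desync rather than as an advantage in the shared game.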
That is a good point
So, how goes the 1.12 patch? I'm guessing that you want to track down and eliminate all of the desyncs before releasing it? Will there be an updated 1.12 sync test patch?
I was wondering how the progress was going myself. Can't wait for the official release, as that will make playing multiplayer games easier, since it is hard enough to find good games in 1.11, but I can't ever seem to find anyone with 1.12 on.
Interesting. I'd like to give this a go, but I'm not sure how to achieve the conditions that might bring about the divergence. I can't tell if my market values are already incorrect. Are these logged in the sync log as well?
I can confirm it happened exactly like he said above:
This is on LAN, btw; we played a game together, he closed the game out, I left it on, we came back for another and I noticed my metal was high, made a comment to him, and then we realized: desync.
My metal was up around 700-800 and his at a normal 400ish
Blair, what is your email so I can email you some checksums of 1.12 desyncing?
I'm no Fraser, but I've got this:
https://forums.sinsofasolarempire.com/332352
Hope that helps.
I have had that LAN desync several times as well. Gets real annoying having to backtrack through auto-saves to find where you were still synced.
P.S. I use the same game on both comps, i.e. I copied Sins off my laptop onto my desktop. Could that be what is giving me the problem?
Hi, I'm having severe desync issues as well with v1.12. We are playing Multiplayer-LAN-Games with my copy of sins, all players got the same version installed (patched by copying my version over theirs, as Impulse would require them to buy the game) and experience desyncs every single time we play, especially in combat situations.
Could it be caused by this rather inelegant patching method?
And if so - what way of distributing my patched copy to my friends for LAN games would work?
All things aside, they would consider buying their own copy if we were sure it helped this issue... heck, I basically bought Sins just for multiplayer games, which are ruined by desyncs.
Best regards, John