Tracking down dsync errors in a deterministic game simulation

The last few weeks I tried to fix dsync errors in Mazebert TD. So what's a dsync anyways? Mazebert TD is a deterministic simulation. This means a savegame contains all player actions and if the simulation executes those actions the outcome should always be exactly the same. And by exactly I mean every bit must be the same, every turn. When you load a game, the simulation starts from scratch and then applies all actions you did so far. That's why long games take longer to load. It also enables multiplayer games, since only the player actions need to be sent over the network. And finally, it allows to automatically replay and validate games that you submitted to the Mazebert server. This was the original idea: If someone is using CheatEngine or similar stuff to tinker with the game, the server detects a dsync and rejects the submitted game. But a dsync error also happens when the simulation returns different results for the same save game on different devices. As I'm writing these lines, this is what happened to about 22 of 14,715 submitted games so far. So it happened in ~0.15% of all games. When it happened it was very late-game, mostly hours into the game have passed. And players who reported this were certainly not cheating. I felt terribly sorry for them, putting so much effort into a single run, just to be rejected by the server! So I went on to fix it. Well, shit. There I was, having no idea what was causing the problems. I was super careful about dsyncs when I implemented the simulation. I have a unit test in place to ensure that all classes use the strictfp modifier, that guarantees same floating point results on all platforms. I was using only StrictMath. All random code ran through my own carefully seeded RandomPlugin. On my MacBook, on my PC, on my server, on my iPhone and on my two Android devices I always got the same results even in games that took hours or days. I was completly clueless what was going wrong. You might ask yourself, how the simulation even knew there was a dsync happening. For every simulated turn the simulation calculates a hash of the entire game state and stores it in the save game. Kind of like a blockchain, but with a much simpler and faster hash. Each turn the simulation can then compare its own hash with the one in the save game. And if it does not match, we get the dreaded dsync:
if (replayFrame.hash != expected) {
   throw new DsyncException("Oh shit, we have a dsync during loading.", lastValidTurnNumber);
}
First thing I did was to allow partially valid game submissions. This way players who put hours into a game were at least rewarded all hours until the dsync happened. I went on and stored the last valid turn number of those games in the database on my server. But this was just a workaround, not a fix. Players would still lose a significant amount of hours they put into those games. If I could just debug those games! What I needed was such a device and a debugger. But I didn't have access to such a device. So I shipped a kind of dsync-debugger with the app. When the server responds with a partially valid game, the result screen would first apologize to the player and briefly explain what kind of error just happened. And it would offer a new button, to help with debugging that dsync:
A button to help with debugging the dsync
Once the player presses that button, the game is simulated on the player's device one more time. And as soon as the simulation reaches the last valid turn - the one server and client still agreed upon - the hash implementation is swapped to a debug hash implementation:
public strictfp class DebugHash extends Hash {
    private final LogPlugin logPlugin = Sim.context().logPlugin;

    @Override
    public void add(boolean value) {
        super.add(value);
        log("boolean=" + value);
    }

    @Override
    public void add(int value) {
        super.add(value);
        log("int=" + value);
    }

    @Override
    public void add(long value) {
        super.add(value);
        log("long=" + value);
    }

    @Override
    public void add(float value) {
        super.add(value);
        log("float=" + value + ", bin=" + Float.floatToIntBits(value));
    }

    @Override
    public void add(double value) {
        super.add(value);
        log("double=" + value + ", bin=" + Double.doubleToLongBits(value));
    }

    @Override
    public void add(Enum value) {
        super.add(value);
        log("enum=" + value);
    }

    @Override
    public void add(UUID value) {
        super.add(value);
        log("uuid=" + value);
    }

    @Override
    public void add(Hashable value) {
        log("Begin hash of " + value);
        super.add(value);
        log("End hash of " + value);
    }

    private void log(String message) {
        logPlugin.log(get() + ": " + message);
    }
}
Now, all hash calculations are logged until the dsync exception is thrown! Afterwards the log is sent to my server. Then I would download the save game to my MacBook and simulate the game one more time with the same debug option activated. And finally I ended up with a diff of hash log lines to compare. Here's one of the diffs I've been checking recently: A dsync diff In this diff, the problem was the max health calculation of creeps. The dsync happened when hashing the hitpoints of a creep. I then tracked it down to this method:
public static double getTotalCreepHitpoints(int round, Difficulty difficulty) {
    double x = round - 1;

    // Add endgame hitpoints if we are there yet!
    double endgameHitpoints = 0.0;
    if (x >= 140.0) {
        double endgameX = x - 140;
        endgameHitpoints = difficulty.endGameFactor * endgameX * endgameX * endgameX * endgameX;
    }

    return StrictMath.round(endgameHitpoints + difficulty.midGameFactor * x * x * x * x + difficulty.earlyGameFactor * x * x + getLinearCreepHitpoints(round));
}
It took me quite a moment to find out what the problem was. StrictMath.round returns a long which is then casted back into a double. Apparently the cast from long to double is not really covered by strictfp on all devices. Now the fix was pretty straight forward:
double hp = endgameHitpoints + difficulty.midGameFactor * x * x * x * x + difficulty.earlyGameFactor * x * x + getLinearCreepHitpoints(round);
if (version >= Sim.v15) {
    return hp;
} else {
    return StrictMath.round(hp);
}
If you are wondering about the if condition, that one is required so that old games can still be validated by the server and that new clients can play old replay files without running into a dsync. Another dsync I debugged this way was caused by a treasure goblin... This was my code to calculate the gold for a round (since a treasure goblin is alone he will get all gold for himself):
public static int getGoldForRound(int round) {
   int gold = (int) StrictMath.round(StrictMath.pow(1.0125, round) * 50.0);
   return gold > 1000 ? 1000 : gold;
}
Now, in higher rounds this caused an integer overflow. The gold becomes negative! Which by itself is already a bug, but as in the hitpoints dsync one device did this conversion slightly different:
Client: int=-1738923812
Server: int=-1738923813
And this is what caused the dsync in the end. The fixed code is now this:
public static int getGoldForRound(int round, int version) {
   double rawGold = StrictMath.pow(1.0125, round) * 50.0;
   int gold = (int) StrictMath.round(rawGold);

   if (version >= Sim.v16) {
      if (rawGold > 1000.0) {
         return 1000;
      }
   } else {
      if (gold > 1000) {
         return 1000;
      }
   }
   return gold;
}
I really hope from now on there won't be any more dsyncs for me to debug! A big sorry to all the players who experienced dsync errors with their games. I feel with you, this must have been very frustrating. And a big thank you for being so patient with me and staying so calm despite losing hours of game time. I'm so glad this is such a nice and helpful community! Happy building!