Developers information desyncs - FAForever/fa GitHub Wiki

Desyncs

In Supreme Commander, every computer runs the same simulation. The simulation must stay in sync on all of them for the game to continue. This is not a trivial task: as an example, if you work with floating-point numbers a simulation can diverge across systems if it runs long enough. Luckily, the implementation used in Supreme Commander is robust, and in general it is difficult to desync a game. Yet it can happen that you end up writing code that desyncs. We'll discuss some patterns and functions that seem fine on the surface but can cause desyncs if used incorrectly.

The scope of this topic is slightly expanded to also cover causes of desyncs outside of programming.

Non-programming related issues that can desync

Out of sync source files

By far the most likely cause when players desync is that their source files do not match. Before you start debugging, it is critical that the people who reported the desync confirm they are all using the same source files.

Programming patterns that can desync

Lua statements

Table keys

The documentation on Lua tables mentions that every value except nil can be used as a table key. Internally, a table works with hashes: every key that is passed in is turned into a hash. This process is called hashing, and it is what a hash table is built on. How a value is converted to a hash can be seen, somewhat cryptically, in the source code of Lua:

static Node *mainposition (const Table *t, const TValue *key) {
  switch (ttype(key)) {
    case LUA_TNUMINT:                               // integers: by value
      return hashint(t, ivalue(key));
    case LUA_TNUMFLT:                               // floats or doubles: by value
      return hashmod(t, l_hashfloat(fltvalue(key)));
    case LUA_TSHRSTR:                               // short strings: by value
      return hashstr(t, tsvalue(key));
    case LUA_TLNGSTR:                               // long strings: by value
      return hashpow2(t, luaS_hashlongstr(tsvalue(key)));
    case LUA_TBOOLEAN:                              // booleans: by value
      return hashboolean(t, bvalue(key));
    case LUA_TLIGHTUSERDATA:                        // (as an example) categories: by memory reference
      return hashpointer(t, pvalue(key));
    case LUA_TLCF:                                  // C functions: by memory reference
      return hashpointer(t, fvalue(key));
    default:                                        // tables and Lua functions: by memory reference
      lua_assert(!ttisdeadkey(key));
      return hashpointer(t, gcvalue(key));
  }
}

How a value is converted is relevant to our discussion. As a general rule for hashing, the same value always turns into the same hash. A value such as a number or a string is the same across players, so it hashes the same for everyone. A memory reference, however, is not the same across players. When you use a memory reference as a key, the order of iteration can therefore differ between players, and that can cause desyncs if you apply logic to only the first few elements.
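As a minimal sketch of the idea, the snippet below keys a table by a stable value and picks the "first" element by sorting the keys, so the choice never depends on the internal hash order. It is written for a standalone Lua interpreter; `SortedKeys` and the `id` field are hypothetical names used for illustration only and are not part of the game's code.

```lua
-- Hypothetical helper: returns the keys of `registry` in sorted order, so the
-- caller does not depend on the internal hash order of the table.
function SortedKeys(registry)
    local keys = {}
    for key in pairs(registry) do
        table.insert(keys, key)
    end
    table.sort(keys)
    return keys
end

-- Stand-ins for entities; `id` is an assumed stable identifier.
local a, b, c = { id = 30 }, { id = 10 }, { id = 20 }

-- Keyed by value: the keys are numbers that are equal on every machine.
local byId = { [a.id] = a, [b.id] = b, [c.id] = c }

-- The first element is chosen by value, not by memory address.
local first = byId[SortedKeys(byId)[1]]
print(first.id) -- prints 10
```

Sorting costs extra time on every call, so this is only worth it where the order of processing actually matters.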

We ran into this issue when we were implementing the automated fabricator behavior. You can find all the changes in #3813. The behavior works with a registration process, as we can see here:

AddEnabledEnergyExcessEntity = function (self, entity)
    self.EnergyExcessUnitsEnabled[entity] = true  -- <-- by memory reference, not the same for all players!
    self.EnergyExcessUnitsDisabled[entity] = nil  -- <-- by memory reference, not the same for all players!
end,

During the registration process we use the entity itself (a table) as a key value. As we now know, a memory reference is used when hashing and because of that the order of iteration is different between players. This matters because we're manipulating the fabricators one by one, as we can see here:

ToggleEnergyExcessUnitsThread = function (self)
    while true do 

        local energyStoredRatio = self:GetEconomyStoredRatio('ENERGY')
        local energyTrend = 10 * self:GetEconomyTrend('ENERGY')

        -- low on storage and insufficient energy income, disable fabricators
        if energyStoredRatio < 0.4 and energyTrend < 0 then 

            -- while we have fabricators to disable
            for fabricator, _ in self.EnergyExcessUnitsEnabled do          -- <-- iteration, order is different across players because of hashed memory reference
                if fabricator and not fabricator:BeenDestroyed() then 

                    -- disable fabricator
                    fabricator:OnProductionPaused()                        -- <-- not applied to the same unit for all players, desync!

                    -- keep track of it
                    self.EnergyExcessUnitsDisabled[fabricator] = true
                    self.EnergyExcessUnitsEnabled[fabricator] = nil

                    break
                end
            end
        end

        -- (...)

        CoroutineYield(1)
    end
end,

With a different order, different fabricators were disabled (or enabled) for each player. As a result the game desyncs the moment a player has more than two fabricators. We adjusted this in #3838: instead of using the table itself as a key we use EntityId as a key, a unique value that is assigned to all entities.
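The idea behind that adjustment can be sketched roughly as follows. This is not the code from #3838: `MockEntity`, `AddEnabled`, `DisableOne` and the `Paused` field are hypothetical, and the sketch assumes that each entity exposes its identifier as an `EntityId` field. It is written for a standalone Lua interpreter, hence the explicit `pairs`.

```lua
-- Mock entity for illustration only; in the game, entities come from the engine.
function MockEntity(id)
    return { EntityId = id, Paused = false }
end

-- Register an entity under its identifier instead of under the table itself.
function AddEnabled(registry, entity)
    registry.Enabled[entity.EntityId] = entity
    registry.Disabled[entity.EntityId] = nil
end

-- Disable a single entity and return its identifier (nil if none is enabled).
-- The keys are plain values, so they hash the same way on every machine.
function DisableOne(registry)
    for id, entity in pairs(registry.Enabled) do
        entity.Paused = true
        registry.Disabled[id] = entity
        registry.Enabled[id] = nil
        return id
    end
    return nil
end

local registry = { Enabled = {}, Disabled = {} }
AddEnabled(registry, MockEntity("1"))
AddEnabled(registry, MockEntity("2"))
local disabledId = DisableOne(registry)
```

The entity itself is stored as the value, so the rest of the logic can still reach it without using it as a key.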

Game functions that can desync

Not all functions available in the simulation are safe to use. In particular incorrect use of engine functions can be a cause for desyncs.

GetFocusArmy

The function returns a number that represents the army that the local player is playing. This value differs between players. That immediately sounds like trouble, and we ran into it when we implemented the recall behavior. You can find all the changes in #4203. The behavior works with a voting process:

local function ArmyVoteRecall(army, vote, lastVote)

    local focus = GetFocusArmy()                      -- <- we retrieve the focus army
    if not IsAlly(focus, army) then
        return false                                  -- <- early exit, using the focus army
    end

    -- (...)

    if lastVote then
        for index, ally in ArmyBrains do
            if army ~= index and not ally:IsDefeated() and IsAlly(army, index) then
                local thread = ally.recallVotingThread
                if thread then
                    coroutine.resume(thread)          -- <- continue a thread, but only for those that are allied to the focus army
                    break
                end
            end
        end
    end

    return true
end

The function runs for every player, but only those that are allied should take part in the vote. Therefore an early exit was introduced. Yet at the end of the function there was a block of code that was relevant for all players: processing the results of the vote! Because the early exit depends on the focus army, that block only ran for some players, and as a consequence the game desynced when recalling. This was fixed before merging the pull request by running that block of code before the early exit.
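The ordering can be sketched as follows: work that changes simulation state runs first and identically for everyone, and anything that depends on the focus army comes afterwards and leaves that state alone. This is an illustration under assumptions, not the code from #4203. `ProcessVote`, `NewState` and the `state` table are hypothetical, and `focus` is passed in as a parameter to stand in for the result of GetFocusArmy().

```lua
-- Hypothetical vote handler. `focus` stands in for the value of GetFocusArmy().
function ProcessVote(state, focus, army, lastVote)
    -- 1. Simulation-relevant work: must be identical for every player.
    state.votes = state.votes + 1
    if lastVote then
        state.resolved = true
    end

    -- 2. Player-specific work: may differ per player, must not change `state`.
    if not state.allies[focus][army] then
        return false
    end
    return true
end

-- Hypothetical setup: armies 1 and 2 are allied, army 3 is on its own.
function NewState()
    return {
        votes = 0,
        resolved = false,
        allies = { [1] = { [1] = true, [2] = true }, [3] = { [3] = true } },
    }
end

-- The same vote as seen from two different focus armies.
local seenByOne, seenByThree = NewState(), NewState()
local shownToOne = ProcessVote(seenByOne, 1, 2, true)
local shownToThree = ProcessVote(seenByThree, 3, 2, true)
```

The return values differ per player, while `votes` and `resolved` end up the same on both sides.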

GetSystemTimeSecondsOnlyForProfileUse

The function returns a number that represents, with very high precision, how long the simulation has been running. As the name implies, it is meant for profiling and benchmarking. As an example, all benchmarks available in the repository make use of this function. The return value depends on how fast your computer is, and that sounds like trouble. The function was used in an attempt to improve the performance of M27AI, made by Maudlin. You can find all the changes in #20 (maudlin27/M27AI).

oPathingUnit[M27UnitInfo.refiPathingCheckTime] = (oPathingUnit[M27UnitInfo.refiPathingCheckTime] or 0) + (GetSystemTimeSecondsOnlyForProfileUse() - iCurSystemTime)
aiBrain[M27UnitInfo.refiPathingCheckTime] = (aiBrain[M27UnitInfo.refiPathingCheckTime] or 0) + (GetSystemTimeSecondsOnlyForProfileUse() - iCurSystemTime)
aiBrain[M27UnitInfo.refiPathingCheckCount] = (aiBrain[M27UnitInfo.refiPathingCheckCount] or 0) + 1
if (GetSystemTimeSecondsOnlyForProfileUse() - iCurSystemTime) > 0.3 then bDebugMessages = true end --Retain for audit trail - to show significant pathing related freezes we have had
if bDebugMessages == true then LOG(sFunctionRef..': GameTime='..GetGameTimeSeconds()..'; bHaveChangedPathing='..tostring(bHaveChangedPathing)..'; Time taken='..(GetSystemTimeSecondsOnlyForProfileUse() - iCurSystemTime)..'; Brain count='..aiBrain[M27UnitInfo.refiPathingCheckCount]..'; Brain total time='..aiBrain[M27UnitInfo.refiPathingCheckTime]) end

As a result, the value was different for all players. Because that value is stored and used in conditions, the logic that ran was different for each player as well. As a consequence, the game desynced!
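A safer pattern, sketched below, is to let a measured time flow only into log output and never into a value that the simulation stores or branches on. The sketch is written for a standalone Lua interpreter: `os.clock` stands in for GetSystemTimeSecondsOnlyForProfileUse, `print` stands in for LOG, and `ProfiledCall` is a hypothetical helper.

```lua
-- Hypothetical helper: runs `fn(argument)`, reports how long it took and
-- returns the result of `fn` unchanged.
function ProfiledCall(label, fn, argument)
    local start = os.clock()            -- stand-in for the engine's profiling timer
    local result = fn(argument)
    local elapsed = os.clock() - start

    -- The elapsed time is only written to the log. It is not returned, not
    -- stored on any game object and not used in a condition.
    print(string.format("%s took %.6f seconds", label, elapsed))

    return result
end

local doubled = ProfiledCall("double", function(x) return x * 2 end, 21)
```

Because the caller only ever receives the result of `fn`, the timing cannot leak into the logic that follows.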