SQLAlchemy, Race Conditions, And PostgreSQL Advisory Locks

Race conditions and game state consistency are a hard problem in the concurrent world of FastAPI. Can SQL locks be an easy solution?

SQLAlchemy, Race Conditions, And PostgreSQL Advisory Locks
Photo by John Salvino / Unsplash

Race conditions and game state consistency are a hard problem in the concurrent world of FastAPI. Can SQL locks be an easy solution? Let's start by looking at the problem statement:

What is a Race Condition?

In short, a race condition is when multiple things occurring at once leads to unexpected outcomes. Wikipedia has a page for race conditions that explains it in technical detail:

Race condition - Wikipedia

In terms of the game, Blasphemess? There's a specific problem in terms of gameplay caused by race conditions.

Let's say two players decide to attack the same NPC at the same time. Their attacks both happen to inflict a status effect on hit. Let's also say this status effect is a singleton status effect, i.e. that there can only be one application of the status at a time on the same NPC.

When the players attack the NPC, the game server reads the database (DB) table for any existing instances of the status effect. There are none yet, so both logic blocks say "it's clear to add the status effect."

Then both attacks, still executing in parallel, try to add the status effect. Without any protections in place, the NPC ends up with two of the same effect, which should not be possible according to the desired game rules.

This is a problem in many, many ways. Actions can generally be duplicated when they're not supposed to be able to, i.e. a character may be killed twice, or they can be healed and killed at the same time so that they have health whilst dead.

If we tweak the logic to wait for a lock before reading the DB table, that could help immensely. It slows down execution, but improves consistency – which is a tradeoff players would prefer, if they even noticed.

The Deadlock Danger Zone

Wait a second, some locks are dangerous! They can also be an antipattern for concurrent design.

In trying to stop race conditions with locks, we may introduce yet another problem: deadlocks.

A deadlock is where attempts to acquire a resource get stuck waiting for a lock that will never release. This can lead to programming exceptions such as timeouts, or interruptions when the deadlock is detected.

Deadlocks can be avoided with careful coding and meticulous testing, but that's tedious and poor practice. We want to design around ease of development, and people make mistakes all the time regardless of our defences.

Still, we need to use some type of locking to solve the race condition risk. Let's look at what we can do!

Potential Approaches

We have several levels of locking to consider: row-level, table-level, and advisory lock level. There's also "optimistic locking." There are other types of locks, but they're less relevant here.

Row-level locking is used transparently by the ORM I've chosen, SQLAlchemy, in certain cases. It's granular, meaning that it doesn't interfere with transactions on other rows in the same table. It's pretty close to the ideal.

Table level locking, on the other hand, is a heavy-handed approach. I could lock the characterstatus and npcstatus tables before getting and inserting statuses. However, that would risk slowing down combat significantly, as every request for combat must wait for other requests to go through – any combat anywhere in the server. It's not granular.

Optimistic locking is where you don't take a lock, but instead validate a transaction after trying to make it. Normally this is done with a version identifier, or a timestamp, or something similar that indicates the transaction in progress.

Finally, there's the approach I'm taking currently: advisory locks.

Advisory locks are application-defined locks, stored in the database. That means that I can lock without caring about any specific underlying database object being referenced. I'm using transaction level advisory locks, specifically.

With my approach, I'm getting the kind of granular locking that you'd expect from a row-level lock, without needing to reference actual rows of the DB.

What this means concretely is best shown with code:

Code Snippets for an Advisory Lock to Solve Race Conditions

In the current incarnation of FastAPI and SQLAlchemy in my stack, I've got these snippets:

db/advisory_lock.py

import hashlib

from sqlalchemy.orm import Session
from sqlalchemy.sql import expression
from sqlalchemy.sql.expression import func


def acquire_advisory_lock(session: Session, key_str: str) -> None:
    """
    Acquire an advisory lock based on the hash of key_str.

    Note: can be acquired repeatedly within one session safely.
    """
    key_bytes: bytes = key_str.encode("utf-8")
    m = hashlib.sha256()
    m.update(key_bytes)
    # pg_try_advisory_xact_lock is limited to an 8-byte signed integer
    key = int.from_bytes(m.digest()[:8], byteorder="big", signed=True)

    # get a lock on the db with the key
    session.execute(expression.select([func.pg_advisory_xact_lock(key)]))

game/logic/entity.py

def _status_effect_advisory_lock_key(current_entity) -> str:
    return f"status-{'pc' if current_entity.is_pc else 'npc'}-{current_entity.id}"


def add_status_effect(
    db: Session,
    *,
    status: "BaseStatus",
    data: None | dict[str, typing.Any],
) -> bool:
    """Add a status effect to the entity."""
    if (
        not status._current_entity.is_dead and not status._current_entity.is_incapacitated
    ) or (
        status.persists_through_death and status._current_entity.is_dead
    ) or (
        status.persists_through_incapacitation and status._current_entity.is_incapacitated and not status._current_entity.is_dead
    ):
        advisory_lock.acquire_advisory_lock(
            db, _status_effect_advisory_lock_key(status._current_entity)
        )
        status.add_status_effect(db=db, data=data)
        return True
    return False

Note that this is only the first iteration of the code, as part of the prototype. There's a lot in flux, but this is functional enough to serve for the prototype.

Notice that the _status_effect_advisory_lock_key is based on the entity type and its ID, e.g. status-npc-123 or status-pc-4 which is then hashed and the first 8 bytes of the hash are taken as the key.

This is what makes the lock so granular. There can be several rows in the status effect table relating to the entity, and by locking it based on the entity, other NPCs or characters won't be blocked. It's like a row lock, except using the advisory lock system.

Mastodon