llama.cpp with --parallel 1 queues requests at the HTTP layer. Two concurrent callers don't parallelize - they starve. The first request takes 30 seconds. The second sits in the queue for 28 of those, times out at the client side, the client retries, the retry sits in the queue again. Cascading timeouts.
I hit this the day I added a second agent to my Hermes setup. yui kept stalling mid-conversation because a cron job had grabbed the LLM for a long report. The fix is 40 lines of Python with fcntl. Let me walk through it.
The constraint
My local inference server is Qwen 3.5 27B hybrid, quantized via TurboQuant4, running on an RTX 3090 with 128K context. It can hold exactly one conversation at a time. I could run two instances and load-balance - except 3090 VRAM is tight and doubling the model would cost me the ability to run anything else on the GPU.
So: single slot. The model does not parallelize. Every caller must wait its turn.
The naive approach is to let llama.cpp's HTTP layer do the queueing. But llama.cpp's queue is fixed-size and FIFO, with no awareness of which callers can wait and which can't. A 30-minute cron report will block a user's live chat indefinitely, and worse, the user's chat will hit client-side timeouts while still queued server-side.
The pattern: OS-level file lock before every run
gateway/llm_lock.py is 40 lines:
"""
Inter-process LLM access serialization.
With a single-slot inference server, only one agent conversation can
productively use the LLM at a time. Without coordination, concurrent
requests queue at the HTTP level and starve each other.
This module provides a file-based lock (fcntl/flock) that all gateway
processes and cron jobs acquire before calling run_conversation().
"""
import fcntl, logging, os, time
from pathlib import Path
LOCK_DIR = Path(os.environ.get("HERMES_HOME", str(Path.home() / ".hermes"))) / "locks"
LOCK_DIR.mkdir(exist_ok=True)
LLM_LOCK_PATH = LOCK_DIR / "llm.lock"
_log = logging.getLogger(__name__)
def acquire_llm(timeout: float = 120.0, caller: str = "unknown"):
"""
Block up to `timeout` seconds to acquire the LLM lock.
Returns an open fd (caller must release_llm(fd)), or None on timeout.
"""
fd = os.open(str(LLM_LOCK_PATH), os.O_RDWR | os.O_CREAT, 0o644)
deadline = time.monotonic() + timeout
while True:
try:
fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
os.write(fd, f"{os.getpid()} {caller}\n".encode())
_log.info("LLM lock acquired by %s (pid=%d)", caller, os.getpid())
return fd
except BlockingIOError:
if time.monotonic() > deadline:
os.close(fd)
_log.info("LLM lock timeout for %s after %.1fs", caller, timeout)
return None
time.sleep(0.5)
def release_llm(fd) -> None:
if fd is None:
return
try:
fcntl.flock(fd, fcntl.LOCK_UN)
finally:
os.close(fd) Two things worth noting:
1. fcntl.flock() with LOCK_NB gives non-blocking semantics, so the caller polls with a short sleep. That sounds crude, but it's the right shape: flock has no timeout parameter, so a blocking call would wait forever, and polling with LOCK_NB is the standard way to get a bounded wait without resorting to signal tricks.
2. The lockfile records pid+caller. cat ~/.hermes/locks/llm.lock tells you who's holding the lock right now. Useful during incidents.
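In my own callers I'd go one step further and wrap the acquire/release pair in a context manager, so an exception between acquire and release can't leak the lock. A minimal self-contained sketch - the `llm_lock` name and `/tmp` path are illustrative, not from the Hermes code:

```python
import contextlib
import fcntl
import os
import time

@contextlib.contextmanager
def llm_lock(path: str, timeout: float = 120.0):
    """Yield True if the exclusive lock was acquired within `timeout`
    seconds, else False; always releases on exit, even on exceptions."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    got = False
    try:
        while not got:
            try:
                fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
                got = True
            except BlockingIOError:
                if time.monotonic() > deadline:
                    break  # timed out; yield False and let the caller decide
                time.sleep(0.1)
        yield got
    finally:
        if got:
            fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```

A caller then writes `with llm_lock(path, timeout=60) as ok:` and branches on `ok` instead of juggling raw fds.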
Integrating into the caller
In Hermes' cron runner, every job wraps its LLM interaction in the lock:
```python
# cron/scheduler.py - simplified
from typing import Optional

def run_job(job: dict) -> tuple[bool, str, str, Optional[str]]:
    _llm_fd = None
    try:
        from gateway.llm_lock import acquire_llm, release_llm
        _llm_wait = float(os.getenv("HERMES_CRON_LLM_WAIT", 60))
        _llm_fd = acquire_llm(timeout=_llm_wait, caller=f"cron:{job['name']}")
        if _llm_fd is None:
            msg = f"LLM busy - skipping cron job {job['name']} this cycle"
            logger.info(msg)
            # Return [SILENT] so the delivery layer doesn't broadcast the skip.
            return False, f"# {job['name']}\n\n{msg}", "[SILENT]", None
        # ... actually run the conversation ...
    finally:
        if _llm_fd is not None:
            release_llm(_llm_fd)
```

The critical decision: cron jobs skip gracefully. They do NOT wait indefinitely. HERMES_CRON_LLM_WAIT is 60 seconds by default. If the LLM is busy with a user chat for more than a minute, the cron cycle just skips and tries again next tick.
Chat callers, by contrast, wait longer (default 120s) because a user is actively looking at the screen expecting a response.
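The chat side looks like this in outline. This is a sketch, not the actual gateway code: `HERMES_CHAT_LLM_WAIT`, `handle_chat`, and the stub `run_conversation` are names I'm inventing here, and the lock helper is re-inlined so the snippet stands alone:

```python
import fcntl
import os
import time

LOCK_PATH = "/tmp/llm.lock"  # stand-in for ~/.hermes/locks/llm.lock

def acquire_llm(timeout: float, caller: str):
    # Same poll-with-LOCK_NB loop as gateway/llm_lock.py, condensed.
    fd = os.open(LOCK_PATH, os.O_RDWR | os.O_CREAT, 0o644)
    deadline = time.monotonic() + timeout
    while True:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
            return fd
        except BlockingIOError:
            if time.monotonic() > deadline:
                os.close(fd)
                return None
            time.sleep(0.5)

def run_conversation(message: str) -> str:
    return f"echo: {message}"  # placeholder for the real inference call

def handle_chat(message: str) -> str:
    # Chat waits up to 120s (vs cron's 60s) because a user is watching;
    # on timeout it surfaces an explicit "busy" reply instead of hanging.
    wait = float(os.getenv("HERMES_CHAT_LLM_WAIT", 120))
    fd = acquire_llm(timeout=wait, caller="chat")
    if fd is None:
        return "Model busy with a long-running job - try again in a minute."
    try:
        return run_conversation(message)
    finally:
        fcntl.flock(fd, fcntl.LOCK_UN)
        os.close(fd)
```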
Why this beats HTTP queueing
Three concrete reasons:
1. A failed flock() surfaces in the caller's logic as a graceful skip. You can decide per-caller what to do: cron skips, chat waits longer, interactive tools maybe fail fast and show an error.
2. HTTP queueing inside llama.cpp is invisible to the caller's timing. The caller thinks it's still waiting for a response; the server has the request on a shelf. Timeouts happen at the wrong layer.
3. You can introspect the lock. /proc/locks + the lockfile contents give you real-time visibility into who's holding the LLM. HTTP-internal queues are opaque.
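That introspection is scriptable, too. A small helper along these lines (name and default path are illustrative) parses the pid+caller line the lock code writes and checks /proc for liveness - Linux-only:

```python
import os
from pathlib import Path

def lock_holder(lock_path: str = os.path.expanduser("~/.hermes/locks/llm.lock")):
    """Return (pid, caller, alive) for the recorded lock holder, or None
    if the lockfile is missing or empty. `alive` checks /proc/<pid>."""
    p = Path(lock_path)
    if not p.exists():
        return None
    lines = p.read_text().strip().splitlines()
    if not lines:
        return None
    pid_s, _, caller = lines[-1].partition(" ")
    pid = int(pid_s)
    return pid, caller, Path(f"/proc/{pid}").exists()
```

Wire that into a health endpoint or a CLI subcommand and "who has the LLM?" becomes a one-liner during incidents.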
Failure modes and how I handle them
- A crashed caller leaves a stale lock. flock is OS-level, so a process death releases the lock automatically. No stale-lock cleanup code needed.
- A caller blocks on a slow tool call after acquiring. I don't want the LLM locked for 10 minutes while an agent's search tool does something weird. Solution: acquire the lock only around run_conversation, not around the whole job. Tools execute outside the lock.
- Container restarts. The lock is on disk under ~/.hermes/locks/llm.lock, not in a tmpfs. Survives restarts. fcntl's exclusive semantics are reset when all holders go away.
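The first point is worth demonstrating, because it's the whole reason to prefer flock over a PID file. In the sketch below (path illustrative), a child process takes the lock and then dies; the parent sees the lock busy while the child is alive and free the moment it's gone:

```python
import fcntl
import os
import subprocess
import sys

LOCK = "/tmp/_stale_demo.lock"  # illustrative path

# Child: grab the lock, announce it, then sleep as if mid-conversation.
child_src = (
    "import fcntl, os, time\n"
    f"fd = os.open({LOCK!r}, os.O_RDWR | os.O_CREAT, 0o644)\n"
    "fcntl.flock(fd, fcntl.LOCK_EX)\n"
    "print('held', flush=True)\n"
    "time.sleep(60)\n"
)
child = subprocess.Popen([sys.executable, "-c", child_src], stdout=subprocess.PIPE)
child.stdout.readline()  # block until the child actually holds the lock

fd = os.open(LOCK, os.O_RDWR | os.O_CREAT, 0o644)
try:
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    busy = False
except BlockingIOError:
    busy = True  # expected: the child is alive and holding it

child.kill()   # simulate a crashed caller - no cleanup code runs
child.wait()
fcntl.flock(fd, fcntl.LOCK_EX)  # succeeds: the OS dropped the dead holder's lock
released = True
fcntl.flock(fd, fcntl.LOCK_UN)
os.close(fd)
```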
The generalizable principle
Any time you have a shared resource that callers might contend over, you want the contention exposed to the callers, not queued invisibly at the server. A failed acquire is information. An infinite wait isn't.
For single-slot inference servers specifically: build the lock first, wire every caller through it, then worry about scaling. Adding a second GPU is easy once you have a lock; retrofitting serialization across a fleet of already-running agents is painful.
Future work
Two things I'd like to add but haven't:
1. Fair queueing. Right now it's first-come-first-served. A priority queue where chat beats cron beats backlog would be more principled.
2. Token budgeting per caller. Cron jobs sometimes run away with a 50K-token synthesis that locks the LLM for 20 minutes. A per-caller token budget enforced at acquire-time would prevent that.
Both are nice-to-haves. The 40-line version has been in production for weeks and hasn't had an incident.