
Event-Driven Spokes vs. Polling: How I Cut Agent Latency 40×

HackerNews' Firebase SSE pushes new IDs in under a second. Our old cron polled every 5 min. Switching saved more than latency.

My old news-curation agent polled HackerNews' Firebase endpoint every five minutes. If a relevant story landed at 10:01, the agent wouldn't notice until 10:05. Average latency: 150 seconds. Worst case: nearly five minutes.

I switched to a spoke-and-listener architecture. The spoke subscribes to HN's Firebase SSE stream. The listener picks up dropped queue files via inotify. Average latency now: 3 seconds. Worst case: maybe 10. That's a 40× improvement - but the real win wasn't latency. It was LLM spend.

The pattern generalizes beyond HackerNews. Any source that publishes events (SSE, WebSocket, webhook, polling-with-cursors-that-look-like-a-stream) belongs in a spoke. Cron does scheduled work; spokes do reactive work.

The pattern

spoke (SSE/stream) --> ~/.hermes/anna-news-queue/*.json
                                  |
                                  v  (inotify IN_CLOSE_WRITE)
                            anna-listener
                                  |
                                  v  (cheap embedding gate)
                            relevance score
                                  |
                                  v  (if above threshold)
                              LLM filter
                                  |
                                  v
                               Matrix

Three distinct processes, one per shape of work:

- spoke: pure I/O. Reads from the source, writes a JSON file per event to a queue directory. Does not evaluate relevance. Does not call the LLM.

- listener: pure dispatch. Watches the queue dir, runs an embedding-similarity check against a relevance anchor (the user's stated interests), drops items below threshold. Calls the LLM only on items that survive the gate.

- Matrix publisher: takes the LLM's output, writes it to the configured room.

Each process is independently restartable. The queue directory is the buffer. If the listener dies, the spoke keeps dropping files and the backlog waits. If the spoke dies, the listener quietly drains to zero.

The spoke

For HackerNews, the source is a Firebase SSE stream of new item IDs:

# anna-spoke-hackernews (simplified)
import json, time, uuid
from pathlib import Path
import httpx

QUEUE = Path.home() / ".hermes" / "anna-news-queue"
QUEUE.mkdir(parents=True, exist_ok=True)

def drop(event: dict) -> None:
    path = QUEUE / f"hn-{uuid.uuid4()}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(event))
    tmp.rename(path)  # atomic - listener only sees fully-written files

# Firebase's REST API only streams when you ask for SSE explicitly;
# without this header the endpoint returns a plain JSON array once.
HEADERS = {"Accept": "text/event-stream"}

with httpx.Client(timeout=None) as c:
    while True:
        try:
            with c.stream(
                "GET",
                "https://hacker-news.firebaseio.com/v0/newstories.json",
                headers=HEADERS,
            ) as r:
                for line in r.iter_lines():  # httpx yields str, not bytes
                    if line.startswith("data: "):
                        drop({"source": "hn", "data": line[6:]})
        except Exception:
            time.sleep(5)  # back off, then reconnect

The atomic rename is important. IN_CLOSE_WRITE fires when a writer closes the file, but watchers in practice (watchfiles included) also see creations and modifications, and a crash mid-write leaves a closed, half-written file behind. Write-then-rename is the one-line discipline that prevents the listener from ever parsing partial JSON: the .json name appears fully formed or not at all.

For X / Twitter, the 'spoke' is a CloakBrowser session that polls the home timeline and diffs against last-seen IDs. Different source shape, same contract: events become JSON files in the queue.

The listener

# anna-listener (simplified)
import asyncio, json
from pathlib import Path
from watchfiles import awatch, Change

QUEUE = Path.home() / ".hermes" / "anna-news-queue"
ANCHOR = "AI agents, self-hosted infrastructure, identity, distributed systems"

async def main():
    # Process anything already in the queue on startup
    for f in sorted(QUEUE.glob("*.json")):
        await process(f)
    # Then watch for new files (deletions are our own unlinks - skip them)
    async for changes in awatch(QUEUE):
        for change, path_str in changes:
            path = Path(path_str)
            if change != Change.deleted and path.suffix == ".json":
                await process(path)

# (embedding_score, llm_filter, matrix_send are defined elsewhere)
async def process(path: Path):
    try:
        event = json.loads(path.read_text())
    except Exception:
        path.unlink(missing_ok=True)  # malformed or already-consumed file
        return

    if await embedding_score(event, ANCHOR) < 0.72:
        path.unlink(missing_ok=True)
        return

    summary = await llm_filter(event, ANCHOR)
    if summary.strip().startswith("[SKIP]"):
        path.unlink(missing_ok=True)
        return

    await matrix_send(summary)
    path.unlink(missing_ok=True)

asyncio.run(main())

The embedding gate is where the savings live. A cheap local embedding call (~5ms, effectively free on a GPU you already run), cosine similarity against the anchor. Most items from a firehose like HN are irrelevant to any given user - 95% of new stories aren't AI-agent-related. The gate drops them before the expensive LLM filter ever runs.
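
A minimal sketch of what the gate's scoring step boils down to. The `embed` callable and the event fields here are placeholders for whatever embedding model and payload shape you actually use (the post's `embedding_score` is async and its signature may differ):

```python
# Cosine similarity between an event and a precomputed anchor embedding.
# `embed` stands in for any local embedding model (sentence-transformers,
# an Ollama embeddings call, etc.) - it just has to return a vector.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def embedding_score(event: dict, anchor_vec: list[float], embed) -> float:
    # Score the event's title (falling back to raw data) against the anchor.
    text = event.get("title") or event.get("data", "")
    return cosine(embed(text), anchor_vec)
```

Embed the anchor once at startup; per event, only the item side changes, which is what keeps the gate at ~5ms.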

The LLM filter is the second layer: the item passed the embedding gate (it's topically adjacent), but is it actually worth surfacing? Is it a rehash of yesterday's article? Is the headline misleading? Does it match the anchor in spirit? The LLM reads the title plus first paragraph and returns a one-liner or [SKIP].
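
A sketch of that contract. The prompt wording and the `complete` callable are illustrative, not from the post - `complete` stands in for whatever async LLM call you already have:

```python
# Second-layer filter: one-line summary, or a reply starting with [SKIP].
FILTER_PROMPT = """\
You curate news for a user interested in: {anchor}

Title: {title}
First paragraph: {para}

If this is genuinely worth the user's attention, reply with a single
one-line summary. If it is a rehash, clickbait, or only superficially
related to the interests above, reply with exactly [SKIP]."""

async def llm_filter(event: dict, anchor: str, complete) -> str:
    prompt = FILTER_PROMPT.format(
        anchor=anchor,
        title=event.get("title", ""),
        para=event.get("first_paragraph", ""),
    )
    return (await complete(prompt)).strip()
```

Keeping the contract to "one line or [SKIP]" makes the listener's dispatch trivial: a prefix check, nothing parsed.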

Why this decomposition matters

Embedding cost per event: ~0 USD, ~5ms local.

LLM cost per event at rate-limited cloud API: $0.0001-0.001 depending on model.

Volume on HN: ~150 new stories/hour. ~3600/day.

If the listener called the LLM on every event without an embedding gate: $0.36-$3.60/day per anchor per source. Multiply by a few sources and a few agents and it's real money.

With the embedding gate filtering 95%: 180 items/day survive. LLM cost drops to $0.018-$0.18/day. An order of magnitude down, sometimes two.
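
The arithmetic above, spelled out with the post's numbers:

```python
# Daily LLM spend, with and without the embedding gate.
events_per_day = 150 * 24                   # ~150 new HN stories/hour
llm_cost_low, llm_cost_high = 1e-4, 1e-3    # $ per LLM call, by model

# No gate: every event hits the LLM.
ungated_low = events_per_day * llm_cost_low     # 0.36 $/day
ungated_high = events_per_day * llm_cost_high   # 3.60 $/day

# Gate drops ~95% before the LLM sees anything.
survival_rate = 0.05
survivors = events_per_day * survival_rate      # 180 items/day
gated_low = survivors * llm_cost_low            # 0.018 $/day
gated_high = survivors * llm_cost_high          # 0.18 $/day
```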

Polling every 5 minutes plus the same gate would give you the same cost but ~50× worse average latency (150 s vs. 3 s). Spokes are strictly better for reactive sources.

When to pick which

Scheduled work - daily summaries, weekly reports, periodic audits - is cron. There are no events, so there's nothing to react to.

Reactive work - news, notifications, webhooks - is spokes. Events arrive on their schedule, not yours.

Hybrid - for instance, a 'digest every hour of surviving items' - is a cron job that reads accumulated results from the listener's output, not a second spoke.

Polling is the wrong answer to almost any reactive-work question. If a source only offers polling (no SSE, no webhook), build the polling inside a spoke anyway - same interface for the listener, just a slower source.

Failure modes worth planning for

Queue backlog during listener restart. Listener processes accumulated files on startup. If the listener is down for 20 minutes during a heavy news cycle, you get a spike of activity on restart. Rate-limit the listener's LLM calls to avoid slamming your API key.

Stuck spoke. If the SSE connection breaks but the spoke's exception handling doesn't fire, you get a silently dead spoke. Add a heartbeat: the spoke writes a timestamp to a heartbeat file every 30s; a systemd watchdog restarts it when the heartbeat goes stale.
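
A sketch of that discipline - paths and intervals here are illustrative, not from the post. The spoke calls beat() from its main loop; a supervisor checks is_stale() and restarts the spoke when the beats stop:

```python
import time
from pathlib import Path

def beat(path: Path) -> None:
    # Overwrite the heartbeat file with the current epoch time.
    path.write_text(str(time.time()))

def is_stale(path: Path, max_age: float = 90.0) -> bool:
    # Three missed 30s beats -> presume the spoke is silently dead.
    try:
        return time.time() - float(path.read_text()) > max_age
    except (FileNotFoundError, ValueError):
        return True  # no heartbeat at all counts as stale
```

If the spoke already runs under systemd, `WatchdogSec=` plus an `sd_notify(WATCHDOG=1)` ping gives you the same behavior without a file.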

Duplicate events. Upstreams occasionally replay. Dedup by content hash at the listener, not at the spoke - the spoke should be dumb.
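
A sketch of listener-side dedup: hash the canonicalized payload, keep a bounded set of recently seen hashes, drop replays. The class and bound are illustrative; bounding the set is what keeps memory flat on a long-running listener:

```python
import hashlib, json
from collections import OrderedDict

class Deduper:
    def __init__(self, max_entries: int = 10_000):
        self._seen: OrderedDict[str, None] = OrderedDict()
        self._max = max_entries

    def is_duplicate(self, event: dict) -> bool:
        # sort_keys makes the hash stable across key orderings.
        key = hashlib.sha256(
            json.dumps(event, sort_keys=True).encode()
        ).hexdigest()
        if key in self._seen:
            return True
        self._seen[key] = None
        if len(self._seen) > self._max:
            self._seen.popitem(last=False)  # evict oldest entry
        return False
```

The listener calls is_duplicate() right after parsing the JSON, before the embedding gate - replays should cost nothing, not even an embedding.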

Extending the pattern

Once the spoke/listener/publisher shape clicks, every new source is 30 minutes of work. GitHub notifications? Subscribe to /notifications, drop to queue. Forgejo CI results? POST webhook to a tiny FastAPI app, drop to queue. RSS firehose? A small Python RSS poller, drop to queue.

All of them feed the same listener with the same gating logic. Adding a source doesn't mean adding a new agent - it means adding one spoke.
