DEV Community: John

"How to Stop AI Agent Skills, Hooks, and Cron Jobs from Silently Conflicting Over Where They Run and What Data They Trust"

John — Wed, 01 Jul 2026 00:00:08 +0000

Originally published on hexisteme notes.

Make every skill, hook, and scheduled job declare four invariants before it ships — Locality (where it can run), Source-of-truth (which facts it owns or borrows), Cross-ref (what depends on it and what it depends on), and Trigger-measurability (whether its trigger is observable at runtime or hidden in external state) — and refuse to hand off any component that leaves one undeclared, because an undeclared assumption is exactly the seam where two components silently disagree.

Two separate runtime leaks surfaced in a single audit session, and both traced back to the same root cause: a component that never declared its assumptions. One read configuration from a file that had stopped being the source of truth (so it always returned a stale default); the other was a scheduled job pointed at a remote sandbox while its prompt referenced local-only paths. The second was caught just before registration; any later and it would have billed compute and produced nothing. Neither was a coding bug. Both were missing declarations.

The failure mode: components that work alone but leak when combined

When you build an AI agent system out of small parts — skills the model loads on demand, hooks that fire on lifecycle events, cron jobs and scheduled routines that run unattended, helper scripts, config profiles — each part usually gets tested in isolation. It works. You move on. The trouble is that "it works" only proves single-shot correctness; it says nothing about whether the part's assumptions agree with the rest of the system.

Every component carries hidden assumptions: where it runs (local machine vs. a remote sandbox), which facts it treats as authoritative, what other components it silently depends on, and what its trigger actually measures. When those assumptions go undeclared, conflicts stay invisible until they surface to the user as a flaky, hard-to-trace symptom — the kind that feels like a vicious cycle because every fix in one place re-opens a gap somewhere else. The deeper problem isn't any single line of code; it's that each component is operating without knowing its own assumptions, so nothing can detect when two of them disagree.

The fix: four invariants every component must declare

The rule is simple and enforceable: every new skill, hook, cron job, script, or routine declares four invariants, or it cannot be handed off. An undeclared invariant means the component itself is incomplete — it is the literal root of conflicts, leaks, and automations that bill compute but return nothing.

Locality — Where can this run? Declare one of local-only, repo-only, or both. A job that runs in a remote sandbox but references local-only paths is broken before it starts.
Source-of-truth — Which facts does it own authoritatively, and when it borrows an external fact, where does that fact actually live? Declare a sources: [...] list of the files/components it reads.
Cross-ref — What depends on this component, and what does it depend on? Declare a consumers: [...] list (who applies or reads it) and a "See also" section (what it relies on).
Trigger-measurability — Can the model observe this trigger in real time within a session, or does it depend on external state (machine powered on, network reachable, a remote service up)? If it's external-state-dependent, say so explicitly.

These four answer the four questions that, left unanswered, become the four ways components leak into each other.

Two real leaks, one missing declaration

Leak 1 — stale source-of-truth. A small script reported which config profile was active by reading the mcpServers key from one settings file. But the authoritative definition of those servers had migrated to a different file. The script was written when the first file was the source of truth, and never updated after the move — so it always returned the default base, and the status it displayed was quietly wrong. Had a Source-of-truth invariant been declared ("this fact lives in file B, not file A"), the mismatch would have been visible the day the truth moved.

Leak 2 — locality mismatch. A scheduled routine was registered to run in a remote sandbox, but its prompt referenced local-only resources (paths that exist only on the developer's machine). This was caught in the final seconds before registration. Any later and the schedule would have fired on the remote, found nothing, and produced billed-but-empty runs on a recurring cadence. A declared Locality invariant ("this runs remote → it may only reference remote-visible resources") would have flagged it immediately.

The two leaks looked unrelated — one a config-reading bug, one a scheduling mistake — but they share a single cause: a component that never declared its assumptions. That they happened in the same session is the tell that this is a structural gap, not a one-off slip.

Where the declarations live, by component type

Keep the declaration cost to roughly one line per invariant — the weight of declaring is far below the cost of verifying a conflict after the fact. Concretely:

# For a reusable decision record / doc with frontmatter:
---
id: NOTE-XXX
locality: local-only        # or repo-only / both
sources: ["path/to/file-that-owns-the-fact"]
consumers: ["who-applies-this", "downstream-automation"]
---

For a skill or command, put the same four in the markdown header or description: minimally a one-liner like (locality: local-only, depends on ~/config) plus a "See also" listing the RDUs/skills/commands it relies on. For a hook defined in JSON (where comments aren't allowed), register the invariants in the hook's companion doc and list the settings file in sources. For a cron job or routine, gate registration behind five questions answered out loud first:

(1) Locality: does it run local or remote, and do its referenced resources exist there?
(2) Sources: what's the source-of-truth for every fact the prompt cites?
(3) Cross-ref: where does the output flow (a user report, a file update, a record evolution)?
(4) Trigger-measurability: does the cron time line up with resource availability (machine on, service reachable)?
(5) Reversibility: how do you turn it off if it's wrong (launchctl unload / disable / delete)?

If even one answer is fuzzy, hold registration until it's clear.
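The five-question gate can be made mechanical. What follows is a minimal sketch, not the original tooling: the `RoutineSpec` record, the `registration_blockers` function, and the list of prefixes treated as local-only are all hypothetical names and heuristics chosen for illustration.

```python
from dataclasses import dataclass, fields

# Hypothetical heuristic: path prefixes that only exist on a developer machine.
LOCAL_ONLY_PREFIXES = ("~/", "/Users/", "/home/")


@dataclass
class RoutineSpec:
    """Answers to the five registration questions for one scheduled routine."""
    locality: str            # "local" or "remote"
    sources: str             # where each cited fact actually lives
    output_flow: str         # where the output goes
    trigger_dependency: str  # external state the schedule relies on
    off_switch: str          # how to disable it if it's wrong
    prompt: str = ""         # the routine's prompt text


def registration_blockers(spec: RoutineSpec) -> list[str]:
    """Return reasons to hold registration; an empty list means go ahead."""
    blockers = []
    # Any unanswered question blocks registration.
    for f in fields(spec):
        if f.name != "prompt" and not getattr(spec, f.name).strip():
            blockers.append(f"unanswered: {f.name}")
    # A remote routine must not reference local-only paths.
    if spec.locality == "remote":
        for token in spec.prompt.split():
            if token.startswith(LOCAL_ONLY_PREFIXES):
                blockers.append(f"remote routine references local path: {token}")
    return blockers
```

A routine like Leak 2 (remote locality, prompt mentioning `~/notes/today.md`, no off switch recorded) would come back with two blockers, so registration is held until both are resolved.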

Backfilling existing components and keeping it honest

You don't have to retrofit everything at once. Prioritize by blast radius. System entry points (the scripts that classify profiles or route config) come first — they're the most dangerous because everything downstream trusts them. Next, hooks that fire on every prompt, since a wrong assumption there taxes every interaction. Then your highest-frequency skills. Everything else can be backfilled slowly on a periodic audit cadence.

Bake the check into that cadence: scan for decision records whose frontmatter is missing locality (backfill candidates), skills whose headers don't declare invariants (backfill candidates), and any routine/cron/hook unchanged for 30+ days (re-verify its invariants, because the world it assumed may have moved underneath it — exactly how Leak 1 happened). Declaring an invariant isn't bureaucracy; it makes each component self-document its assumptions so that the next person to touch the system — including a future version of you, or your own agent — can detect a conflict on sight instead of debugging it in production.
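The backfill scan for decision records can start as a few lines of standard-library Python. This is a sketch under assumptions: the required keys mirror the frontmatter example earlier (`locality`, `sources`, `consumers`), the function names are hypothetical, and the parser only understands flat `key: value` lines rather than full YAML.

```python
REQUIRED_KEYS = ("locality", "sources", "consumers")


def frontmatter_keys(text: str) -> set[str]:
    """Collect top-level keys from a leading '---' ... '---' block."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return set()
    keys = set()
    for line in lines[1:]:
        if line.strip() == "---":
            return keys
        # Skip indented continuation lines and comment lines.
        if ":" in line and not line.startswith((" ", "#")):
            keys.add(line.split(":", 1)[0].strip())
    return set()  # no closing delimiter: treat as having no frontmatter


def missing_invariants(text: str) -> list[str]:
    """List the required keys that the record's frontmatter does not declare."""
    present = frontmatter_keys(text)
    return [key for key in REQUIRED_KEYS if key not in present]
```

Any record for which `missing_invariants` returns a non-empty list is a backfill candidate. If your records carry a trigger field, add it to `REQUIRED_KEYS`.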

Honest limitations

This is a discipline, not a compiler. Nothing mechanically enforces the four invariants unless you build that enforcement — a frontmatter linter, a registration gate, a pre-commit check. Declarations can also drift: a component can declare sources: [file A] truthfully on Monday and become a lie on Friday when the source-of-truth moves, which is precisely why the 30-day re-verification step exists rather than a one-time declaration. The invariants also can't catch a conflict between two components that each declare correctly but make incompatible assumptions about each other's behavior — declaration surfaces the seam, it doesn't reconcile semantics. And there's a real failure mode of over-declaring: if every trivial helper grows a four-field header, people stop reading them and the signal drowns. Keep declarations to one line each, reserve the full five-question gate for unattended automation (where a silent failure is most expensive), and treat the lightweight inline form as enough for in-session skills and commands.

More notes at hexisteme.github.io/notes.

"Do Multiple Personas on One LLM Give Real Diversity, or Do You Need Different Model Families?"

John — Tue, 30 Jun 2026 00:00:05 +0000

Originally published on hexisteme notes.

Multiple personas on a single LLM do not give you real diversity — they are prompt variations of one set of weights, so they share the same blind spots, and the only durable diversity comes from different model families plus external tool verification plus an adversarial round.

In an 8-round self-audit of a persona-only council (all personas on the same model family), the measured ceiling was a track_record of 31% and internal_consistency of 65% — no amount of prompt-trailer tuning, persona-count tuning, or model-tier splitting pushed past it. Two patterns explain why and how to fix it: the Aggregator Bottleneck (a single aggregator model re-homogenizes whatever diversity its sub-agents produced) and a ΔEVD test (measure mean-pairwise-cosine-distance between answers; keep a reframing only if it adds more than +0.15, discard it as theatrical if it adds +0.05 or less).

The short answer: same weights, same blind spots

If you spin up eight "experts" — a skeptic, an optimist, a security reviewer, a contrarian — and they are all the same underlying model behind different system prompts, you have built a debate club whose members were all educated at the same school, from the same textbooks, by the same teachers. They will phrase disagreement differently, but they hallucinate the same nonexistent citations, anchor on the same wrong dates, and miss the same structural weakness in your design. Persona diversity is a presentation-layer trick; it is not an epistemic one.

This matters because a council's whole value proposition is catching the error your first answer missed. If every member shares the same training data, the same RLHF shaping, and the same tokenizer, then the error your base model is prone to is an error all members are prone to. You get a chorus, not a cross-check. The fix is not more personas — it is genuinely different sources of judgment.

The measured ceiling from an 8-round self-audit

This isn't a vibe. Over eight rounds of meta-auditing a persona-only council (every persona running on one model family), the aggregate numbers settled at a track_record of 31% and an internal_consistency ceiling of 65%. The audit deliberately exhausted the obvious levers: evolving the persona system-prompt trailers across five versions, fixing the data pipeline that scored outcomes, and splitting personas across a larger and a smaller model of the same family. None of it moved the ceiling. That is the signature of a limit that lives in the model weights, not in the prompt.

The interpretation is straightforward and a little humbling: when consensus is built from one model's weights, agreement measures conformity, not correctness. A 65% internal-consistency ceiling means the personas couldn't even reliably agree with themselves across runs, and a 31% track record means their confident consensus was usually not the thing that actually held up. Prompt engineering could change the flavor of the answers but not the underlying distribution they were drawn from.

Three axes of real diversity

If persona count is the wrong lever, what are the right ones? Three axes, applied in sequence, each contributing a kind of diversity that prompt variation cannot manufacture:

(A) Cross-model family. Route the same question to models with different training data, different RLHF, and different tokenizers — e.g. one from each of two or three independent providers. Because their failure modes are uncorrelated, a weakness that all of them independently flag is a real weakness; a claim only one of them makes is suspect. This is the load-bearing axis.
(B) Tool verification (one pass). After the models converge, run a single web/citation check against the claims they relied on. Models confidently cite papers that don't exist and dates that are wrong; one external fact-check pass severs the hallucinations before they reach your conclusion. Label anything you can't verify as unverified and drop it from the consensus.
(C) Adversarial perspective (separate round). Once a verified consensus exists, run a distinct round whose only job is to attack it: "assume this consensus is wrong — what single experiment would prove it?" This is not a compromise or a synthesis of the earlier rounds; it is a deliberate falsifier-generation step that produces an experiment + observable + timeline + null-result implication, not more prose.
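The agreement rule in axis (A) can be sketched as a simple tally. This assumes each family's flagged weaknesses have already been normalized to shared labels (matching free-text answers is a separate problem), and the function name is hypothetical.

```python
from collections import Counter


def split_by_agreement(flags_by_family: dict[str, set[str]]):
    """Separate weaknesses flagged by every family from those flagged by one.

    Meaningful for two or more families; with one family every flag is both.
    """
    counts = Counter(w for flags in flags_by_family.values() for w in flags)
    n = len(flags_by_family)
    corroborated = sorted(w for w, c in counts.items() if c == n)
    suspect = sorted(w for w, c in counts.items() if c == 1)
    return corroborated, suspect
```

Weaknesses flagged by some but not all families land in neither list; they are the ones worth sending to the tool-verification pass.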

The output shape is the point. A persona council hands you a paragraph of consensus. A three-axis council hands you a weakest link, a list of verified citations, and a falsifiable prediction — three things you can act on immediately (run the experiment, dig deeper, or discard the idea). In one head-to-head application the actionable ROI was roughly an order of magnitude higher, almost entirely because the deliverable changed kind, not just quality.

The Aggregator Bottleneck

Here is the trap that quietly destroys multi-family setups: you fan out to three different model families, collect three genuinely diverse answers — and then ask one model to summarize them. That aggregator's alignment acts as a funnel. Its RLHF rewards balanced, agreeable, smoothed-over prose, so it averages the dissent away and re-homogenizes exactly the diversity you paid for. You did the expensive cross-family work and then threw the result through a single-model bottleneck at the last step.

This was identified empirically: when the adversarial round was written by the same aggregator model that ran the rest of the pipeline, an outside-family model pointed out — correctly — that the aggregator had funneled the earlier diversity into its own house style. The remedy: do not let one model both diversify and conclude. Have the adversarial round (axis C) written by an external family, or by the outlier voice from round one — not by the aggregator. Preserve dissent as raw branches rather than collapsing it into a synthesized middle; the median of disagreeing experts is frequently the one position none of them would defend.

The ΔEVD test: is your reframing real or theatrical?

A subtler failure is Prompt Framing Lock-in: when all your families converge on the same weakest link in nearly identical words, the cause may not be that they agree — it may be that your single shared prompt framed the problem so narrowly that no model could escape it. The instinct is to add a "reframing layer" (a router that rewrites the prompt, a per-family persona generator, an outcome-first reformulation). But every reframing option merely moves the lock-in to a different layer; you can't assume it helped. So measure it.

The ΔEVD test (Embedding Variance Delta) makes the decision empirical instead of hopeful:

1. Send the raw prompt to N model families → embed the N answers
   with a neutral embedder → compute MPCD0
   (Mean Pairwise Cosine Distance).
2. Apply ONE reframing option → re-run the N families → MPCD1.
3. Decide on ΔMPCD = MPCD1 - MPCD0:
     > +0.15        keep the reframing (real diversity gain)
     +0.05..+0.15   marginal — hold; try another round/reframing
     <= +0.05       discard it as theatrical noise; keep raw prompt
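
The steps above can be sketched in plain Python. This is a minimal illustration that takes the embeddings as given lists of floats (which embedder you use is up to you), assumes no embedding is the zero vector, and uses hypothetical function names. A delta of exactly +0.05 is treated as a discard, matching "+0.05 or less".

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def mpcd(embeddings):
    """Mean pairwise cosine distance over all unordered pairs of answers."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def delta_evd_decision(raw_embeddings, reframed_embeddings):
    """Return (delta, decision) using the +0.15 / +0.05 thresholds."""
    delta = mpcd(reframed_embeddings) - mpcd(raw_embeddings)
    if delta > 0.15:
        return delta, "keep"
    if delta > 0.05:
        return delta, "marginal"
    return delta, "discard"
```

Both runs need the same N families and the same embedder, otherwise the two MPCD values are not comparable.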

The harsh-but-honest conclusion from running this test on itself: most reframings fall at or below the +0.05 floor, which means they were theater. Treat the raw prompt as the most honest baseline and only adopt a reframing that clears the +0.15 bar on actual embedding distance. This is the same discipline as not collapsing dissent into a median — you don't get to claim diversity you can't measure.

Honest limitations and when persona councils are still fine

Three caveats keep this honest. First, cross-family diversity is necessary but not sufficient: the academic consensus (e.g. work on multi-agent debate in 2025) is blunt that a debate cannot exceed the accuracy of its strongest participant — diversity surfaces and weights candidate answers, it does not conjure correctness that none of the participants possessed. If all your models are weak on a topic, a council of them is still weak. Second, naive iterative debate and majority voting can actively entrench an initial error through model conformity; the gains come from careful diversity, argument-quality weighting, and preserving dissent — not from more rounds. Third, the 31% / 65% numbers are from one specific persona-council implementation's self-audit; the ceiling will differ across setups, and the cross-family ROI multiplier is a single observed comparison, not a benchmark — treat it as directional.

And don't over-apply this. The three-axis pattern is for decisions where being wrong does real damage: new system designs, claims you're about to ship, intuitions you suspect touch an unsolved problem, a thesis your single-model review couldn't crack. For a quick gut-check on a low-stakes opinion — a color choice, a naming preference, a sanity skim — a single fast call (persona or otherwise) is entirely adequate and the full pipeline is overkill. The skill is matching the verification depth to the cost of being wrong, not running the heavy machinery on everything.

"Claude Code '400: no low surrogate in string' on every turn: repairing a permanently broken session transcript"

John — Mon, 29 Jun 2026 00:00:08 +0000

Originally published on hexisteme notes.

A Claude Code session that returns API Error: 400 ... not valid JSON: no low surrogate in string on every turn is poisoned by a lone UTF-16 surrogate (a code point in U+D800–U+DFFF) baked into its on-disk transcript, and the fix is to close that session, strip only those lone surrogates from the offending line of the .jsonl file while leaving real emoji untouched, re-serialize that one line, and then claude --resume.

The poison is already on disk and is precisely targetable: a normal emoji is a single Python code point (e.g. U+1F9ED) and can never fall inside the surrogate range U+D800–U+DFFF, so deleting only that range removes the broken half-character with zero collateral damage to valid text. A cheap C-level byte pre-filter (scan for the \ud escape or raw ED A0–BF bytes before doing any per-line json.loads) cut a 174-file transcript scan from 3.4s to 1.1s, making it cheap enough to run automatically on every session start.

Why every turn fails: the surrogate is in the transcript, not the network

Claude Code persists each session to a JSONL transcript on disk (one JSON object per line, under your projects directory). Every turn replays the accumulated history back to the API. So if a single byte sequence in that history is invalid, the API rejects every subsequent request with the same error, at the same byte offset — the session is permanently bricked, and reopening it doesn't help because the bad data is reloaded from the file.

The specific failure is 400 The request body is not valid JSON: no low surrogate in string: line 1 column N (the mirror-image variant is no high surrogate). Non-BMP characters — emoji like 🧭, some extended CJK ideographs — are encoded in UTF-16 as a pair of surrogate code units: a high surrogate (U+D800–U+DBFF) followed by a low surrogate (U+DC00–U+DFFF). When a large tool output is truncated by a length limit and the cut lands exactly between the two halves of a pair, one orphaned half survives. That lone surrogate gets written into the transcript, replayed on every turn, and the API's strict JSON parser refuses it.
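The mechanism is easy to reproduce in Python. The sketch below assumes the truncation counts UTF-16 code units (an assumption about the harness, not something this note verifies) and simulates a cut that keeps only the first half of the pair.

```python
import json
import struct

compass = "\U0001F9ED"                   # the compass emoji, one code point in Python
utf16 = compass.encode("utf-16-le")      # 4 bytes, i.e. two 16-bit code units
high, low = struct.unpack("<2H", utf16)  # high = 0xD83E, low = 0xDDED

# Simulate a cut that lands between the two halves: only the high half survives.
lone = chr(high)


def is_valid_utf8_text(s: str) -> bool:
    """True if the string can be encoded as strict UTF-8."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Python's json module escapes the orphan instead of rejecting it, which is
# how a lone surrogate can end up on disk as a \ud83e escape.
escaped = json.dumps(lone)
```

Here `is_valid_utf8_text(compass)` is true and `is_valid_utf8_text(lone)` is false, while `escaped` is the six-character escape `\ud83e` wrapped in quotes: lenient on the way out, rejected by a strict parser on the way in.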

The triggering pattern is mundane: dumping a big, emoji-heavy blob into the context — a worker log peppered with status emoji, a daily-report job's output, a verbose ingest run — right before a large body that gets truncated mid-pair. Content-heavy projects (anything generating a lot of natural-language or creative text with non-BMP characters) re-hit this per session, not once.

The safe repair: strip only U+D800–U+DFFF, leave real emoji intact

The key insight that makes the fix safe is a property of Python's str: a valid emoji is a single code point (🧭 is U+1F9ED), so it can never land in the surrogate range U+D800–U+DFFF. Anything you find in that range is, by definition, a broken half. So you can delete exactly those code points and every legitimate character, emoji included, keeps its value; only the JSON escaping and whitespace of the re-serialized line may differ. You are not "removing emoji"; you are removing the orphaned halves that should never have been on disk.

The manual version of the fix, when you don't have a script handy:

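import json

def strip(o):
    # recursively drop lone surrogates (U+D800–U+DFFF) from every string
    if isinstance(o, str):
        return "".join(c for c in o if not (0xD800 <= ord(c) <= 0xDFFF))
    if isinstance(o, list):
        return [strip(x) for x in o]
    if isinstance(o, dict):
        # keys are strings too, so clean them as well
        return {strip(k): strip(v) for k, v in o.items()}
    return o

# `line` holds the one offending transcript line, already read as a str
# (open the file with errors="surrogatepass" if it contains raw surrogate bytes):
# parse it, walk every string, drop only lone surrogates, re-serialize compactly
obj = json.loads(line)
fixed = json.dumps(strip(obj), ensure_ascii=False, separators=(",", ":"))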

Write fixed back as that single line, leaving every other line of the transcript exactly as it was. Always copy the file to a .bak first. Re-serializing only the broken line keeps the diff minimal and preserves the rest of the conversation verbatim. After the rewrite, claude --resume and the 400 is gone.
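Wrapped around a whole file, the same repair might look like the sketch below. It is an illustration, not the author's script: `repair_transcript` and `strip_lone_surrogates` are hypothetical names, and it assumes the session is already closed and the file is otherwise valid UTF-8.

```python
import json
import shutil
from pathlib import Path


def strip_lone_surrogates(o):
    """Recursively remove code points in U+D800-U+DFFF from all strings."""
    if isinstance(o, str):
        return "".join(c for c in o if not (0xD800 <= ord(c) <= 0xDFFF))
    if isinstance(o, list):
        return [strip_lone_surrogates(x) for x in o]
    if isinstance(o, dict):
        return {strip_lone_surrogates(k): strip_lone_surrogates(v)
                for k, v in o.items()}
    return o


def repair_transcript(path):
    """Rewrite only the lines that contain lone surrogates; return how many."""
    path = Path(path)
    shutil.copy2(path, path.with_name(path.name + ".bak"))  # backup first
    # surrogatepass lets raw ED A0-BF byte sequences decode instead of raising
    text = path.read_bytes().decode("utf-8", errors="surrogatepass")
    out, repaired = [], 0
    for line in text.split("\n"):
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            out.append(line)  # blank or unparseable: leave verbatim
            continue
        cleaned = strip_lone_surrogates(obj)
        if cleaned != obj:
            line = json.dumps(cleaned, ensure_ascii=False, separators=(",", ":"))
            repaired += 1
        out.append(line)
    path.write_bytes("\n".join(out).encode("utf-8", errors="surrogatepass"))
    return repaired
```

Lines that parse cleanly and contain no lone surrogate are written back exactly as they were read, so the diff stays limited to the broken lines.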

Close the session before you touch the file

This is the step people skip, and it silently undoes the repair. While a session is open, Claude Code holds the transcript and will overwrite your edited file from its in-memory state — your fix vanishes the moment the next turn flushes. You must close the target session first, or verify nothing holds the file:

lsof -- ~/.claude/projects//.jsonl

Empty output means it's safe to edit. A good repair tool checks this for you and refuses (skips) any transcript that is currently held open, so an automated pass can never corrupt a live session. The trade-off: the one session you most want to fix — the one throwing 400s right now — may be the one you have open, so a fully automatic pass can skip exactly that file. That's the case where you fall back to the manual close-then-fix once.

Make it cheap enough to auto-repair on every session start

Because a content-heavy project re-breaks per session, a one-time manual fix isn't durable — you want the repair to run automatically before you ever see the error. The natural place is a SessionStart hook that scans recently-modified transcripts and silently cleans the closed ones. The problem is that per-line json.loads over every transcript in your projects tree is too slow to run on every launch.

The fix is a cheap byte-level pre-filter that runs before any JSON parsing. A lone surrogate only persists to disk in two shapes, and both are detectable by scanning raw bytes:

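The JSON escape \ud... (some serializers emit an unpaired surrogate as a \uXXXX escape. A properly paired escape such as \ud83e\udded is still valid JSON and decodes to a single code point, so a hit here does not prove the file is broken; it only means the file needs the full parse).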
The raw three-byte UTF-8 surrogate encoding ED A0–BF (valid UTF-8 only allows ED followed by 80–9F, so ED + A0–BF is unambiguously a surrogate).
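
A byte-level pre-filter for those two shapes could be sketched as follows. The function name is hypothetical, and the escape pattern is slightly tighter than a bare `\ud` scan: it matches only `\ud8` through `\udf`, since `\ud0` through `\ud7` are ordinary characters. That tightening is this sketch's choice, not something the original script is known to do.

```python
import re

# Escaped form: \ud8xx through \udfxx covers the whole surrogate range.
ESCAPED_SURROGATE = re.compile(rb"\\u[dD][89a-fA-F]")
# Raw form: ED followed by A0-BF never occurs in valid UTF-8.
RAW_SURROGATE = re.compile(rb"\xed[\xa0-\xbf]")


def needs_full_parse(data: bytes) -> bool:
    """Cheap check: False means the bytes cannot contain a lone surrogate."""
    return bool(ESCAPED_SURROGATE.search(data) or RAW_SURROGATE.search(data))
```

The filter is deliberately one-sided. A `True` may still be a false alarm (a valid escaped pair, for instance) and only sends the file on to the per-line parse; a `False` lets the file skip it.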

A file with neither signal is provably clean and skips the expensive parse entirely. In practice this matters: a 174-file scan dropped from 3.4s to 1.1s, cheap enough to run on every session start. Run it scoped to recent days only, quiet unless something was actually fixed, and skipping any open transcript via lsof. A representative one-shot invocation in a hook:

python3 fix-jsonl-surrogates.py --fix-all --recent 3 --quiet

Why you can't prevent it upstream (honest limitation)

This is a repair pattern, not a prevention pattern, and that's a deliberate concession to where the truncation happens. The cut that orphans a surrogate occurs inside the harness's own truncation logic, between the model output and the transcript write. The user-facing hook lifecycle (PreToolUse, PostToolUse, Stop, SessionStart, and so on) fires around lifecycle events, not in the middle of serializing the request body — so there is no interception point that can stop the bad byte from being written in the first place. Making the recovery fast and automatic is the only lever you actually control.

The pre-filter approach also has a real gap: if you immediately --resume the very session that's broken, the auto-repair pass may find the file held open and skip it (correctly, to avoid clobbering live state), so coverage is not 100% — that's the one case needing a manual close-then-fix. And do not lean on /compact as a fix: if a lone surrogate survives into the summary, the 400 persists; if it works, it worked by luck, not by design. The only genuine frequency-reduction is behavioral: avoid dumping huge emoji-saturated blobs (verbose logs, status-emoji-heavy output) into the context wholesale, or ASCII-ify such logs at the source.

"If an LLM Extracts the Inputs, Is Your Deterministic Score Really Deterministic? Stopping Provenance Laundering"

John — Sat, 27 Jun 2026 00:00:06 +0000

Originally published on hexisteme notes.

No — a scoring function that consumes whatever values an LLM hands it is only deterministic in name; the LLM's judgment launders straight through the "deterministic" gate, and you close the hole with three rules (host-verified FACT sourcing, FACT-only scoring, and an asymmetric penalty where bad signals are penalized regardless of provenance while good signals only score when verified) plus multi-round adversarial testing.

The load-bearing trick is an asymmetric-penalty mechanism: an unverified input can only ever lower a score, never raise it. Bad signals are penalized regardless of where they came from (so you can't dodge a penalty by routing the bad news through a weak source), while good signals are credited only when they carry a FACT provenance. We hardened this through three rounds of adversarial review, with the adversarial reviewer's satisfaction score climbing 58 → 71 → 96 as each round peeled back a deeper laundering channel. The decisive case was a candidate that scored 94/ADOPT on 9 web-sourced (unverified) signals plus a single real FACT, which the fixed gate correctly flipped to AVOID.

The trap: "the LLM only extracts, Python decides" is not enough

A common and sensible architecture for trustworthy automation is to split labor: a language model reads messy sources (web pages, docs, API responses) and extracts structured inputs, then a plain deterministic function scores those inputs. The appeal is obvious — the model never gets to invent the verdict, so the verdict is reproducible and auditable. Teams describe this as "the LLM extracts, the code decides," and treat the output as deterministic.

The problem is that determinism of the function says nothing about the integrity of its inputs. If the scoring code consumes any value that is present, then the model's judgment re-enters through the input channel and the "deterministic gate" becomes a laundering channel for exactly the judgment you tried to remove. The function f(x) is perfectly deterministic; it is x that the model controls. Calling the result deterministic is a category error unless you also gate what x is allowed to be.

Concretely, imagine a generic scoring pipeline that rates a candidate (a library to adopt, a vendor to onboard, a data record to act on) on a 0–100 scale and emits ADOPT / HOLD / AVOID. An LLM fills the input fields by reading sources. The day you ship it, the score is reproducible — and also completely steerable by whatever the model decided to write into the fields.

Three laundering paths a deterministic score is exposed to

Even when each input carries a provenance tag — say a 4-level ladder FACT | INFERENCE | ASSUMPTION | UNKNOWN where FACT is supposed to mean "machine-fetched from a primary source" — there are three distinct ways arbitrary or model-generated input still poisons a deterministic output:

(a) Label forgery. The provenance tag is itself just another field the model writes. If the code trusts the tag, the model can stamp FACT on a value it guessed. Nothing was fetched; the label lied.
(b) Score laundering. The scoring function consumes non-FACT values as if they were FACT — "the field is filled, so use it." A floor like "require at least N FACTs" does not stop this, because counting FACTs never inspects which values the score actually consumed. You can satisfy the FACT count with one real fact and still drive the whole score off nine inferences.
(c) Penalty dodging. If a bad signal is only penalized when it arrives as a FACT, an adversary routes the bad news through a weak source (label it INFERENCE) and the penalty silently disappears. This is a perverse incentive baked into the gate: weak sourcing becomes a way to suppress red flags.

The measured failure that made this concrete: a candidate built from 9 web-sourced (unverified) signals plus a single genuine FACT scored 94 → ADOPT. A FACT-count floor passed it. The score was "deterministic" and wrong, because it had laundered nine pieces of model judgment into a confident green light.

The fix: three rules plus one asymmetry

You close all three paths with three rules and a single asymmetry. The rules establish what counts as trustworthy input; the asymmetry guarantees that anything failing those rules can only ever hurt a score, never help it.

1. FACT means machine-fetched AND host-verified — fail closed. Do not trust the source label. Verify that the value's URL host is on a whitelist for the claimed source type, and reject subdomain spoofing. An unregistered host means the value cannot be FACT: demote it and record a gap. This kills path (a).

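from urllib.parse import urlsplit

HOST_WHITELIST = {"official_docs": {"docs.example.com"}}  # placeholder entry

def host_of(url):
    return (urlsplit(url).hostname or "").lower()   # assumed helper: bare hostname

def host_matches(source, url):
    allow = HOST_WHITELIST.get(source)   # canonical hosts per source type
    if not allow:
        return False                     # fail-closed: unknown source type
    h = host_of(url)
    # exact host or a true subdomain; blocks evil-example.com.attacker.net
    return any(h == a or h.endswith("." + a) for a in allow)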

2. The score is driven by FACT only. The scoring function trusts a value only when provenance is FACT; every non-FACT value resolves to a conservative default, never to its raw model-supplied number. Ban "the field is known, so use it." This kills path (b).

def num(ev, default):
    if ev.provenance is not Provenance.FACT or ev.value is None:
        return default, True   # non-FACT -> conservative value, unverified=True
    return float(ev.value), False

3. Asymmetric penalty. A bad signal is penalized regardless of provenance — so you cannot dodge a penalty by laundering the bad news through a weak source. A good signal is credited only when it is FACT (rule 2 already guarantees this). The combined effect is the load-bearing invariant: unverified input can lower a score but never raise it. This kills path (c).

# bad signal: penalize on is_known, provenance-independent
if maintainer_count.is_known and maintainer_count.value <= 1:
    demote("single-maintainer-risk")
# good signal: crediting is FACT-only, enforced by num() above
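
Put together, rules 2 and 3 can be exercised end to end. The sketch below is an illustration only: the `Evidence` shape, the 0-to-1 signal scale, the 30-point penalty, and the verdict cut-offs are all hypothetical and are not the original pipeline's numbers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Provenance(Enum):
    FACT = 1
    INFERENCE = 2
    ASSUMPTION = 3
    UNKNOWN = 4


@dataclass(frozen=True)
class Evidence:
    value: Optional[float]
    provenance: Provenance

    @property
    def is_known(self) -> bool:
        return self.value is not None


def num(ev, default):
    """FACT-only read: anything else resolves to the conservative default."""
    if ev.provenance is not Provenance.FACT or ev.value is None:
        return default, True
    return float(ev.value), False


def gated_score(good_signals, maintainer_count):
    """good_signals: Evidence values in [0, 1], higher is better (assumed scale)."""
    credited = [num(ev, 0.0)[0] for ev in good_signals]
    score = 100.0 * sum(credited) / len(credited)
    # bad signal: penalized whenever it is known, whatever its provenance
    if maintainer_count.is_known and maintainer_count.value <= 1:
        score -= 30.0  # hypothetical penalty size
    return max(score, 0.0)


def verdict(score):
    # hypothetical cut-offs
    return "ADOPT" if score >= 70 else "HOLD" if score >= 40 else "AVOID"
```

With nine INFERENCE signals at 1.0 and a single FACT at 1.0, a gate that trusted every filled field would score 100. Here only the FACT earns credit, so the score is 10 and the verdict is AVOID.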

Why one round of review is not enough: 58 → 71 → 96

These defects do not surface in a single pass. They are layers of the same threat — "this deterministic gate is laundering judgment" — and each adversarial round peels back the next layer. In our hardening of a generic adoption-scoring pipeline, an adversarial reviewer (you can use a second model, a colleague, or a structured red-team checklist) was asked each round to break the determinism claim, and its satisfaction climbed across three rounds: 58 → 71 → 96.

Round 1 (58): caught label forgery — the gate trusted the provenance tag. Fix: host-verified whitelist (rule 1).
Round 2 (71): caught source laundering — values claiming an official-docs origin were accepted without host checks, so the attacker spoofed the source type rather than the label. Fix: fail-closed host matching extended to every source class.
Round 3 (96): caught score laundering — the FACT-count floor passed a score driven by non-FACT values. This is the deepest layer and the one most teams miss: a count of trusted inputs is not the same as a score built only from trusted inputs. Fix: FACT-only consumption (rule 2) plus the asymmetric penalty.

A single round of review would have shipped after fixing label forgery and declared victory at 58, leaving the actual laundering channel wide open. The lesson generalizes: when the threat is "my safety boundary is being bypassed," iterate the adversary until its satisfaction plateaus, because each fix exposes the next assumption.

Pin the defense with an attack matrix regression test

Once you have the three rules, freeze them as tests so a future refactor cannot quietly re-open a laundering path. Encode the attack matrix directly: each row is an adversarial input, each expected outcome is the gate's correct response.

# attack matrix as a regression suite
cases = [
    ("normal FACT, verified host",        EXPECT_PASS),
    ("label forgery (INFERENCE -> FACT)",  EXPECT_DEMOTE_AND_GAP),
    ("source spoof (fake official host)",  EXPECT_REJECT),
    ("score laundering (FACT count met,"
     " score driven by non-FACT values)", EXPECT_NO_CREDIT),
    ("penalty dodge (bad signal as weak"
     " source)",                          EXPECT_PENALTY_APPLIED),
    ("missing key signal",                 EXPECT_DEMOTE),
]

The two tests most worth writing are the ones that catch the subtle paths: (1) a candidate whose FACT count is satisfied but whose numeric inputs are non-FACT must receive conservative defaults, not credit; and (2) a bad signal arriving as INFERENCE must still trigger its penalty. If both pass, your gate has the invariant "unverified input can only lower the score." Re-run the matrix on every change to the gate.

Honest limitation: code cannot close the injection boundary

These three rules close the laundering paths that live inside the scoring pipeline. They do not, and cannot, close the boundary one layer up: the honesty of the session that injects the inputs. Host verification shows that a value's URL points at registry.example.com; it does not prove the model fetched the right page, reported the response faithfully, or didn't omit a contradicting field it saw. A sufficiently capable injecting agent that controls its own tool calls can fetch a real primary source and still cherry-pick or misreport from it.

The correct posture is to separate this residual boundary honestly rather than pretend code closed it. Mark it explicitly as an intrinsic trust boundary in your threat model — the same way you would label "we trust the OS kernel" — and mitigate it with controls that live outside the deterministic function: pinning the exact fetch URL and re-fetching it independently at verification time, diffing the model's reported value against a second independent fetch, sampling outputs for human spot-checks, and logging raw responses so a claimed FACT is auditable after the fact. Claiming a deterministic gate fully neutralizes a dishonest injector is itself a form of laundering — of your assurance to whoever consumes the score.
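
One of those outside controls, diffing the reported value against an independent fetch, could look roughly like this. It is a sketch under assumptions: refetch stands in for whatever independent fetch-and-extract path you have (it is not a real library call), and values are compared as whitespace-trimmed strings.

```python
from typing import Callable, Optional


def verify_reported(
    url: str,
    reported: str,
    refetch: Callable[[str], Optional[str]],  # hypothetical independent fetcher
) -> str:
    """Compare a model-reported value with a second, independent fetch."""
    independent = refetch(url)
    if independent is None:
        return "gap"  # could not re-fetch: record a gap, do not assume agreement
    if independent.strip() != reported.strip():
        return "mismatch"  # the model's report disagrees with the source
    return "confirmed"
```

Anything other than "confirmed" should demote the value. Note that this narrows the trust boundary without closing it: the re-fetch has to run outside the session being checked, or it inherits the same dishonesty it is meant to catch.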

More notes at hexisteme.github.io/notes.

"macOS: nslookup works but curl and Python \"Could not resolve host\" — the mDNSResponder zombie"

John — Fri, 26 Jun 2026 00:00:06 +0000

Originally published on hexisteme notes.

If nslookup resolves a host fine but curl, pip, and Python (requests/httpx) fail with "Could not resolve host," your mDNSResponder daemon has almost certainly entered a non-responsive "zombie" state — and the fix is to restart it (sudo killall -9 mDNSResponder), not to touch your API keys, SDK versions, or code.

This failure is maddening because every signal points the wrong way. nslookup example.com returns a clean IP, so DNS "works." Your network is up. Your code didn't change. Yet curl, pip install, and every Python HTTP call die with "Could not resolve host." People burn hours rotating API keys, downgrading SDKs, and editing config files. None of that is the problem.

The two-DNS-path asymmetry (your fastest diagnosis)

macOS resolves DNS over two independent paths, and the asymmetry is the diagnosis:

Path	Who uses it	Goes through mDNSResponder?
Direct	`nslookup`, `dig`	No — queries DNS servers directly
System resolver	`curl`, Python, `pip`, most apps (`getaddrinfo()`)	Yes — routes to the daemon

The daemon can keep its PID alive while silently refusing to answer. So the direct path (nslookup) succeeds and the system-resolver path (curl) fails on the exact same host. When one path works and the other fails, you are not looking at a network, key, or code problem — you are looking at a sick daemon.

Confirm it in 10 seconds

Use the daemon path directly so you're testing the same route curl uses:

# Uses the mDNSResponder path (same as curl/Python). Empty result = zombie.
dscacheutil -q host -a name example.com

# ...while the direct path still returns a valid IP:
nslookup example.com

If dscacheutil comes back empty but nslookup returns an IP, the daemon is confirmed. One more check proves routing and TLS are fine and isolates the daemon as the sole cause:

curl --resolve example.com:443:<IP-from-nslookup> https://example.com
# Succeeds? Then DNS resolution is the only broken thing.
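
If you would rather script the two-path comparison, here is a minimal Python sketch. Python's socket.getaddrinfo() goes through the system resolver, and nslookup stands in for the direct path. The diagnose function and its return labels are mine, and I am assuming nslookup exits non-zero when a lookup fails, which is worth confirming on your macOS version.

```python
import socket
import subprocess
from typing import Callable


def system_path_resolves(host: str) -> bool:
    """System resolver path: the same getaddrinfo() route curl and Python use."""
    try:
        return bool(socket.getaddrinfo(host, 443))
    except socket.gaierror:
        return False


def direct_path_resolves(host: str) -> bool:
    """Direct path: nslookup queries the DNS servers itself."""
    try:
        proc = subprocess.run(
            ["nslookup", host], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return proc.returncode == 0  # assumption: non-zero exit means lookup failed


def diagnose(
    host: str,
    system: Callable[[str], bool] = system_path_resolves,
    direct: Callable[[str], bool] = direct_path_resolves,
) -> str:
    if system(host):
        return "ok"
    if direct(host):
        return "daemon_suspected"  # direct path works, system path fails
    return "network_or_dns"  # both paths fail: not the zombie signature
```

The two resolvers are passed in as parameters so you can test the decision logic without touching the network.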

Fix it weakest-tool-first

# 1) flush + reload (least disruptive)
sudo dscacheutil -flushcache && sudo killall -HUP mDNSResponder

# 2) kickstart the service
sudo launchctl kickstart -k system/com.apple.mDNSResponder

# 3) universal hammer (if kickstart prints "Could not find service" —
#    the service path differs across macOS versions)
sudo killall -9 mDNSResponder mDNSResponderHelper
#    launchd's KeepAlive immediately restarts it with a fresh PID.

# verify
pgrep -l mDNSResponder && dscacheutil -q host -a name example.com

Why agents and long-lived dev boxes hit this repeatedly

The zombie is usually triggered by connection-pool exhaustion. Many concurrent outbound long-poll connections — several MCP servers, a local inference/LLM server, and a background job all holding sockets open at once — push mDNSResponder into high CPU until it stops answering.

Measured case: 30+ concurrent long-poll connections drove mDNSResponder to 77% CPU and into the non-responsive state. The recovery alone doesn't stop recurrence — you have to find and cut the connection source. Watch for the active connection count climbing (lsof -i -P -n | wc -l in the 80+ range) and mDNSResponder CPU above ~30%.

"Disabled" in config does not mean the process is dead

A subtle trap when hunting the connection source: a server can be flagged off at the code/config level (an ENABLED=False switch) while its OS process keeps running and keeps holding its long-poll connections. The flag stopped new work from being dispatched but never killed the process, so it kept feeding the pile-up. When auditing, check ps/pgrep for the actual process and its elapsed time — not just the config flag.

pgrep -lf <server-name>        # still there?
ps -o pid,etime,%cpu,command -p <pid>
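
To turn that audit into a check you can run across many servers, you need to read the etime column programmatically. ps prints it as [[dd-]hh:]mm:ss. The sketch below parses that format; the disabled_but_running helper and its inputs are hypothetical glue for your own config.

```python
from typing import Optional


def etime_to_seconds(etime: str) -> int:
    """Convert ps etime ([[dd-]hh:]mm:ss) to seconds."""
    days = 0
    rest = etime.strip()
    if "-" in rest:
        day_part, rest = rest.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in rest.split(":")]
    while len(parts) < 3:
        parts.insert(0, 0)  # pad missing hours with zero
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def disabled_but_running(enabled: bool, etime: Optional[str]) -> bool:
    """True when the config flag is off but ps still reports a live process."""
    return (not enabled) and etime is not None
```

An elapsed time measured in days on a server your config calls disabled is a strong hint that the flag never reached the process.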

Prevent recurrence — and the honest limitation

Two ways to prevent recurrence: reduce concurrent long-poll load (trim always-on servers/MCP endpoints, kill stale inference servers, and on a box that runs for many days, restart mDNSResponder on a schedule), or install a watchdog LaunchDaemon that kickstarts the daemon when its CPU crosses a threshold.

Honest caveat: this exact two-path asymmetry is macOS-specific — it's mDNSResponder/getaddrinfo behavior, not Linux's nsswitch/resolv.conf. And a watchdog that calls launchctl kickstart needs root: implement it as a LaunchDaemon running as root with a narrowly scoped script, not a broad sudo NOPASSWD rule. The watchdog is a band-aid — the real fix is capping concurrent connections.

More notes on building & running an AI agent fleet at hexisteme.github.io/notes.

A file-based work-bus for orchestrating a fleet of agent CLIs — coordination without a message broker

John — Wed, 24 Jun 2026 00:00:05 +0000

Originally published on hexisteme notes, part of a series on building and running an AI agent fleet.

To coordinate a fleet of independent AI agent CLIs without a message broker or a heavy framework, use a filesystem work-bus: the orchestrator decomposes a goal into a graph of subtasks, writes a Task file per subtask, and polls for the Result file each worker writes back — every file written atomically. The durable coordination state lives on disk as files: language-agnostic, debuggable with ls, surviving restarts, and self-healing because an absent worker is skipped and logged instead of failing the run.

Say you have several AI agents, each an independent installed CLI — one gathers information, one writes copy, one builds an app scaffold — and you want to run a goal that needs several of them in sequence. The heavyweight answers are an in-process framework (LangGraph, an AutoGPT-style loop) or a message broker (Redis, Kafka, RabbitMQ). Both are more than a single-operator fleet needs: a framework couples your workers into one process and one language, and a broker is infrastructure you now have to run, secure, and monitor.

There's a lighter primitive that fits this shape: a work-bus made of files.

The mechanism

A conductor process owns a shared directory — the bus. To run a goal:

1. Decompose the goal into a directed acyclic graph of subtasks (e.g. gather → narrate → build).
2. For each ready subtask, write a Task file into the bus, tagged with the capability it needs.
3. Poll for the matching Result file, with a short backoff.
4. Absorb each result, validate it, and release the next subtasks in the graph.

The one rule that makes this safe is atomic writes: write each record to a temp path in the same directory and rename it into place. A rename within one filesystem is atomic on POSIX, so a reader sees either the whole file or nothing, never a half-written record. Task and Result are typed records (a small pydantic schema), and the conductor keeps a registry of what's in flight.

# atomic publish — a reader never sees a partial record
def publish(path, record):
    tmp = path.with_suffix(".tmp")
    tmp.write_text(record.model_dump_json())
    tmp.rename(path)          # atomic on POSIX

# the conductor loop
for task in topo_order(dag):
    publish(bus / f"{task.id}.task.json", task)
    result = poll(bus / f"{task.id}.result.json", backoff=...)   # durable: waits for the file
    absorb(result)

This is state, not events — by design

It's fair to ask whether a file work-bus is just an event bus in disguise. It isn't, and the distinction is the reason it works. An event bus is push: producers emit ephemeral events, and anything not listening at that instant misses them. A file work-bus is state: the Task and Result records are durable files that stay until consumed. A worker that starts late, or restarts mid-run, still finds its task waiting. (I argued the same principle for monitoring a fleet in "state is truth, events are rumors"; here it shows up again for coordinating one.)

Why workers can write their own state here but not in monitoring. You build and control these workers, so you can make them read and write the bus. Monitoring is the opposite case: you watch components you don't control, so you pull their state instead. Coordination of owned workers via durable files, monitoring of unowned components via state scans: both lean on durable state over ephemeral events.

Routing by capability, not by name

The conductor doesn't hard-wire "send step 2 to worker X." Each worker advertises capabilities; each subtask declares the capability it needs; the conductor matches them at dispatch time by finding a healthy worker that advertises the required capability. Add or remove a worker and routing adapts — there's no wiring diagram to edit. This is what lets one conductor coordinate a heterogeneous, changing fleet through a single uniform contract.
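
A dispatch-time match can be very small. This is a sketch, not the conductor's actual code: the Worker record, the healthy flag, and the alphabetical tie-break are all assumptions I picked to keep routing deterministic.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Worker:  # hypothetical registry entry
    name: str
    capabilities: FrozenSet[str]
    healthy: bool = True


def route(capability: str, workers: List[Worker]) -> Optional[Worker]:
    """Return a healthy worker advertising the capability, or None."""
    for worker in sorted(workers, key=lambda w: w.name):  # stable tie-break
        if worker.healthy and capability in worker.capabilities:
            return worker
    return None  # nobody can do this right now
```

Returning None instead of raising is deliberate: an unroutable subtask is an outcome for the conductor to record, not an error.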

Graceful degradation: skip the absent worker

The most important behavior for a fleet that's still being built: a missing worker must not fail the run. If a subtask needs a capability no healthy worker currently advertises, the conductor marks that node skipped (a logged worker_absent), continues the rest of the graph, and synthesizes from whatever completed. On day one, when most workers don't exist yet, the conductor still runs end-to-end and produces partial output — and the skip log is a precise to-do list of which capabilities to build next. A gap is reported, not crashed on.
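
Here is roughly what skip-and-continue looks like over a small graph, using the standard library's graphlib for ordering. The handlers map (capability to callable) stands in for real workers, and letting a downstream node run with whatever upstream results exist is my assumption about how to synthesize from partial output.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple


def run_graph(
    dag: Dict[str, Set[str]],  # node -> upstream nodes it depends on
    needs: Dict[str, str],  # node -> required capability
    handlers: Dict[str, Callable[[dict], object]],  # capability -> worker stub
) -> Tuple[dict, List[Tuple[str, str]]]:
    results: dict = {}
    skipped: List[Tuple[str, str]] = []
    for node in TopologicalSorter(dag).static_order():
        handler = handlers.get(needs[node])
        if handler is None:
            skipped.append((node, "worker_absent"))  # log the gap, keep going
            continue
        # Pass along only the upstream results that actually exist.
        upstream = {dep: results[dep] for dep in dag[node] if dep in results}
        results[node] = handler(upstream)
    return results, skipped
```

The skipped list is the to-do list: each entry names a node whose capability nobody advertises yet.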

Trust the bus like a network boundary

Worker output is untrusted input crossing a boundary, and the bus treats it that way. Every result is parsed into a strict schema before absorption; mismatches (say, casing differences between the wire format and internal enums) are coerced and normalized at the seam. Load-bearing claims carry a provenance label and must include evidence — a claim that arrives marked FACT with no evidence IDs is rejected at parse time, not trusted. The typed contract is what lets independent workers, written in different languages by you at different times, interoperate without the conductor having to trust any of them blindly.
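
A sketch of that seam, using stdlib dataclasses as a stand-in for the pydantic schema so it runs without dependencies. The field names text, provenance, and evidence_ids on the wire are assumptions about the record shape.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class Provenance(Enum):
    FACT = "FACT"
    INFERENCE = "INFERENCE"


@dataclass(frozen=True)
class Claim:
    text: str
    provenance: Provenance
    evidence_ids: Tuple[str, ...]

    def __post_init__(self) -> None:
        # The invariant lives on the type, so no code path can skip it.
        if self.provenance is Provenance.FACT and not self.evidence_ids:
            raise ValueError("FACT claim requires at least one evidence id")


def parse_claim(raw: dict) -> Claim:
    """Parse untrusted worker output into a Claim, or raise ValueError."""
    try:
        # Normalize wire casing ("fact", " Fact ") to the internal enum.
        provenance = Provenance(str(raw["provenance"]).strip().upper())
    except (KeyError, ValueError) as exc:
        raise ValueError(f"invalid provenance: {exc}") from exc
    text = raw.get("text")
    if not isinstance(text, str) or not text:
        raise ValueError("claim text must be a non-empty string")
    evidence = raw.get("evidence_ids", [])
    if not isinstance(evidence, list) or not all(isinstance(e, str) for e in evidence):
        raise ValueError("evidence_ids must be a list of strings")
    return Claim(text, provenance, tuple(evidence))
```

Casing is coerced, but a missing or unknown provenance is not: normalization is for harmless format drift, rejection is for anything that changes what the claim means.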

The honest limitation

⚠️ No stop condition on re-routing. Capability-based routing has a sharp edge: if a node can be re-routed to "any worker advertising capability C" and results keep failing validation, a naive conductor can re-route in an unbounded loop. A file work-bus needs an explicit per-node attempt budget (and a dead-letter outcome) or it can spin. Durability and decoupling are the wins; a bounded retry policy is the cost you must pay to claim them safely.
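
A bounded version is only a few lines. In this sketch the budget of three attempts, the round-robin choice of worker, and the outcome names are my assumptions; the only part that matters is that the loop has a hard ceiling and a named terminal state.

```python
from typing import Callable, Sequence


def run_with_budget(
    task: dict,
    workers: Sequence[Callable[[dict], object]],  # candidates for one capability
    validate: Callable[[object], bool],
    max_attempts: int = 3,  # hypothetical per-node budget
) -> dict:
    if not workers:
        return {"outcome": "worker_absent", "attempts": 0, "result": None}
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]  # re-route round-robin
        result = worker(task)
        if validate(result):
            return {"outcome": "ok", "attempts": attempt + 1, "result": result}
    # Budget spent: park the task for a human instead of spinning.
    return {"outcome": "dead_letter", "attempts": max_attempts, "result": None}
```

The dead_letter outcome gives the conductor something to log and move past, the same way worker_absent does for a missing capability.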

When to reach for a real broker

This pattern fits a small, heterogeneous fleet running tasks that take seconds to minutes, coordinated by one operator. If you need high-throughput, low-latency fan-out across many producers and consumers, run a real message bus — the file-bus's polling and single-conductor model won't keep up. Match the tool to the failure that hurts: for a solo fleet, the pain is operational overhead and brittle coupling, and a directory of atomic files removes both.

More notes on building an AI agent fleet — why I rejected an event bus for monitoring, labeling facts vs inferences, reusable decision units — at hexisteme.github.io/notes.

How to make an AI research agent label facts vs inferences — a deterministic provenance pipeline

John — Mon, 22 Jun 2026 23:23:16 +0000

Originally published on hexisteme notes, part of a series on building and running an AI agent fleet.

To stop an AI research or RAG agent from presenting its own inferences as retrieved facts, split the work so the LLM never decides what is a fact: let the LLM only extract and summarize, and let a deterministic, non-LLM pipeline do all scoring, cross-checking, and labeling. Tag a claim FACT only when a rule is satisfied — corroboration by ≥2 independent sources, or one official API — and downgrade everything else to INFERENCE. Because labeling is rule-based, the agent can't launder a guess into a fact, and the same query produces the same labels every run.

An AI agent that gathers information has two kinds of output tangled together: things it retrieved and things it concluded. A web page said the market was 1.2 trillion won (retrieved); the agent inferred the market is "growing fast" (concluded). Both come out in the same confident prose. For anything you'll act on, that blend is the problem — you can't tell which sentence is grounded and which is the model filling a gap.

The fix isn't a better prompt ("only state facts you can cite"). Prompts are probabilistic; under pressure the model reverts. The fix is structural: take the fact/inference decision away from the model entirely and put it in code.

The split: LLM extracts, code judges

Draw a hard line through the pipeline:

The LLM does	Deterministic code does
Extract claims from a fetched page; summarize a passage	Score, cross-check, sort, deduplicate, label FACT/INFERENCE, decide freshness

The LLM is excellent at reading messy text and pulling out a structured claim. It is unreliable at judging that claim — ask it to "rate confidence 0–1" and it will turn a guess into 0.85, and give a different number next run. So nothing downstream of extraction is allowed to be an LLM call. Scores are token matches, source counts, and recency math. Labels are rule outputs. This buys two things at once: reproducibility (same query → same labels, which you can unit-test) and no laundering (the model can't promote its own inference to a fact, because it never holds the pen on labeling).

Reproducibility is the tell. If your research agent gives different confidence on the same question across runs, an LLM is scoring somewhere in the pipeline. Find it and replace it with a function. The goal is: re-run the exact query, get the exact same FACT/INFERENCE split.

A six-phase pipeline

Make the stages explicit so each is testable in isolation:

PLAN → HARVEST → NORMALIZE → CORROBORATE → SCORE → RENDER

PLAN — turn the question into concrete sub-queries and the sources to try.
HARVEST — fetch from multiple paths (see below). LLM-free; just collection.
NORMALIZE — LLM extracts structured claims from each fetched item. This is the only place the model touches the data.
CORROBORATE — group claims; count independent sources per claim.
SCORE — assign labels and scores by rule.
RENDER — emit FACTs, INFERENCEs, and an explicit gap list.

The FACT gate: earn the label

FACT is not a default; it's a status a claim must earn, enforced as a type invariant:

# A claim constructed as FACT without evidence is a bug, not a soft warning.
Claim(provenance=FACT, evidence_ids=[])   # -> raises

# The corroboration rule (the knob is the count; the principle is independence)
def label(claim):
    independent = count_independent_sources(claim)   # distinct domains, not pages
    if independent >= 2 or claim.from_official_api:
        return FACT          # carries the evidence_ids that corroborated it
    return INFERENCE         # single-source or model-derived
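
The count_independent_sources helper is not shown above. One simple way to sketch it is to count distinct hostnames across the evidence URLs. That is an approximation I am assuming here: it treats two subdomains of one site as separate sources, and it cannot detect one blog quoting another, so a real implementation needs more than this.

```python
from typing import Iterable
from urllib.parse import urlsplit


def count_independent_sources(urls: Iterable[str]) -> int:
    """Count distinct hostnames, so many pages on one domain count once."""
    hosts = set()
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host.startswith("www."):
            host = host[4:]  # www.example.com and example.com are one source
        if host:
            hosts.add(host)
    return len(hosts)


def label_from_urls(urls: Iterable[str], from_official_api: bool = False) -> str:
    """Same rule as label() above, applied to a plain list of evidence URLs."""
    if count_independent_sources(urls) >= 2 or from_official_api:
        return "FACT"
    return "INFERENCE"
```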

"Independent" is doing real work: one blog quoting another blog is one source, not two. Two different domains, or a single authoritative API (a government dataset, an exchange's own endpoint), clear the bar. Everything else is rendered as INFERENCE — visible to the reader as exactly that.

⚠️ Watch for order-dependence. An early version of this scored a cross-corroborated FACT lower than a single-source INFERENCE because the score depended on processing order. That silently breaks reproducibility. Scores must be a pure function of the claim and its evidence, independent of the order claims were processed.
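
You can pin this property with a permutation test. The sketch below uses an invented integer scoring formula (the real one is not shown here); what it demonstrates is the shape of the check: score each claim from its own fields only, rank with a deterministic tie-break, and assert that every input order yields the same ranking.

```python
from itertools import permutations
from typing import List


def score(claim: dict) -> int:
    """Pure function of the claim's own label and evidence (toy formula)."""
    base = 60 if claim["label"] == "FACT" else 20
    return base + 10 * min(len(set(claim["domains"])), 4)


def rank(claims: List[dict]) -> List[dict]:
    # Tie-break on text so equal scores never fall back to input order.
    return sorted(claims, key=lambda c: (-score(c), c["text"]))


def is_order_independent(claims: List[dict]) -> bool:
    expected = rank(claims)
    return all(rank(list(p)) == expected for p in permutations(claims))
```

Trying every permutation is only practical for a handful of claims, which is all a regression test needs.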

Multi-path harvest, without redundancy

Diversity of sources is what makes corroboration meaningful, but firing every source at once is wasteful and noisy. Use escalation, not broadcast: try a primary search, and only escalate to the next path when the first is insufficient.

Path	Order
Web search	primary → escalate to a news-grade engine (ad/spam pollution) → escalate to a semantic engine (papers, near-duplicates)
Official API	a government/first-party dataset; one official source may stand alone as FACT

Never send the same query to three engines simultaneously — read the first result, then decide whether to escalate. And when a source fails or is rate-limited, log the failure and the escalation; never substitute a guess for a missing fetch.
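
Escalation is a short loop once each path is a plain callable. In this sketch the paths list, the sufficient predicate, and the log format are assumptions; the behavior it shows is the one described above: try paths one at a time, stop as soon as the results are enough, and log failures instead of papering over them.

```python
from typing import Callable, List, Sequence, Tuple


def harvest(
    query: str,
    paths: Sequence[Tuple[str, Callable[[str], list]]],  # (name, fetch) in order
    sufficient: Callable[[list], bool],
) -> Tuple[list, List[Tuple[str, str]]]:
    collected: list = []
    log: List[Tuple[str, str]] = []
    for name, fetch in paths:
        try:
            results = fetch(query)
        except Exception as exc:  # failed or rate-limited: log it and escalate
            log.append((name, f"failed: {exc}"))
            continue
        log.append((name, f"ok: {len(results)} results"))
        collected.extend(results)
        if sufficient(collected):
            break  # enough evidence: do not touch the remaining paths
    return collected, log
```

If every path fails, the function returns an empty list and a full log, which is exactly the input the gap list needs.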

Freshness and gaps are first-class

Two more rules complete the provenance picture. Freshness: every datum carries a confirmation date, and a rule marks it stale when it ages past a threshold — a fact true last quarter is labeled as such, not silently presented as current. Gaps: the render step emits an explicit list of what was asked but not found or not corroborated. A silent gap reads as completeness and is the most dangerous output a research agent can produce; surfacing it is what makes the FACT list trustworthy.
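
Both rules are small enough to be plain functions. The 90-day threshold below is a placeholder, not a recommendation, and the gap check assumes sub-queries and corroborated answers share the same keys.

```python
from datetime import date, timedelta
from typing import Iterable, List


def is_stale(confirmed_on: date, today: date, max_age_days: int = 90) -> bool:
    """A datum is stale once its confirmation date is older than the threshold."""
    return (today - confirmed_on) > timedelta(days=max_age_days)


def gaps(asked: Iterable[str], corroborated: Iterable[str]) -> List[str]:
    """Everything that was asked but never found or never corroborated."""
    found = set(corroborated)
    return [question for question in asked if question not in found]
```

Passing today in as an argument, instead of reading the clock inside the function, keeps the freshness rule deterministic and unit-testable like the rest of the pipeline.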

Why this is worth the structure

The payoff is a research output a reader (or a downstream AI) can trust per-claim: every FACT points at the independent sources that earned it, every INFERENCE is flagged as the agent's own leap, stale data says so, and the gaps are named. The model still does what it's good at — reading and extracting — but it never gets to decide what's true. In an era where AI answers are increasingly cited as sources themselves, the agents worth citing are the ones that label their own confidence honestly, by rule, and reproducibly.

More notes on building an AI agent fleet — falsifier-driven AI decisions, reusable decision units, a file-based agent work-bus — at hexisteme.github.io/notes.