Question: completion proof boundary in multi-agent workflows #7763
Replies: 6 comments 4 replies
-
Yeah, we keep these two separate and it's saved us a lot of grief. An agent saying it finished is just a claim, and I treat it as untrusted input. "Done" only counts when the receiving system hands back a durable receipt (a persisted record, a queue ack, an idempotency-keyed response), not a bare 200.

Where it goes: the proof has to sit at the application/integration layer, since only the system doing the work can mint it, not the agent. So in the framework the agent never sets completed, only submitted. A separate deterministic check (not another LLM) promotes it once the receipt lands.

In the eval harness I just assert completed => receipt exists. A run where the agent claims done with no receipt is a fail, not a pass, which is exactly the case you're hitting.

Pattern that's worked for me: idempotency key per task, attempt -> await confirm, store the receipt under that key. I came to this from trading agents, where a submitted order and a filled order are really not the same thing.
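
The promotion rule and the harness assertion above can be sketched roughly as follows. This is a minimal illustration that assumes an in-memory receipt store keyed by idempotency key; `TaskStatus`, `TaskRecord`, `promote_if_receipted`, and `is_honest` are hypothetical names, not part of AutoGen or any other library.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class TaskStatus(Enum):
    SUBMITTED = "submitted"  # the agent's claim, treated as untrusted
    COMPLETED = "completed"  # set only by the deterministic check below


@dataclass
class TaskRecord:
    # Hypothetical record: one idempotency key per task.
    idempotency_key: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: TaskStatus = TaskStatus.SUBMITTED


def promote_if_receipted(task: TaskRecord, receipts: Dict[str, dict]) -> TaskRecord:
    """Deterministic check (no LLM): promote only if a receipt is stored under the key."""
    if task.idempotency_key in receipts:
        task.status = TaskStatus.COMPLETED
    return task


def is_honest(task: TaskRecord, receipts: Dict[str, dict]) -> bool:
    """Eval-harness rule: completed => receipt exists."""
    return task.status is not TaskStatus.COMPLETED or task.idempotency_key in receipts
```

In a harness, a run would count as a fail whenever `is_honest` returns `False`, which covers the "claimed done, no receipt" case.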
-
The "submitted ≠ completed" boundary is exactly the same primitive we use in our security harness for test verdict attestation. In adversarial testing, agents hallucinate completion roughly 8-12% of the time when the underlying operation failed silently: file not written, API returned 200 with an error body, or a tool executed the wrong command.

What we've found works: treat agent-reported completion as testimony, not fact. The factual state requires a receipt that binds four fields: (agent_id, task_hash, actual_outcome, timestamp). We use an ERC-8004-style attestation structure with Merkle anchoring so the receipt is machine-verifiable by any downstream evaluator without re-executing the task.

Your three-case model maps cleanly to this:

The gap I'd flag: if every eval harness invents its own receipt format, you can't compose harnesses or cross-validate results. A minimal standard (just those four fields plus a hash chain) would let AutoGen workflows emit receipts that external evaluators can verify without knowing AutoGen internals.

We've implemented this pattern for MCP/A2A/L402 test outcomes. If a reference receipt schema would help your fixture, happy to share.
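
To make the "four fields plus a hash chain" idea concrete, here is a rough sketch using only the Python standard library. It is not the ERC-8004-style structure mentioned above and does no Merkle anchoring; the `Receipt` class, the `prev_hash` field, the `GENESIS` placeholder, and the canonical-JSON encoding are all assumptions made for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

GENESIS = "0" * 64  # assumed placeholder "previous hash" for the first receipt


@dataclass(frozen=True)
class Receipt:
    agent_id: str
    task_hash: str
    actual_outcome: str
    timestamp: float
    prev_hash: str  # digest of the previous receipt, forming the chain

    def digest(self) -> str:
        # Canonical JSON (sorted keys, fixed separators) keeps the hash reproducible.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_receipt(chain: List[Receipt], agent_id: str, task_hash: str,
                   actual_outcome: str, timestamp: float) -> Receipt:
    """Create a receipt linked to the current chain head and append it."""
    prev = chain[-1].digest() if chain else GENESIS
    receipt = Receipt(agent_id, task_hash, actual_outcome, timestamp, prev)
    chain.append(receipt)
    return receipt


def verify_chain(chain: List[Receipt]) -> bool:
    """Check that every receipt points at the digest of the one before it."""
    expected = GENESIS
    for receipt in chain:
        if receipt.prev_hash != expected:
            return False
        expected = receipt.digest()
    return True
```

Note that a chain like this only exposes tampering with receipts that have a successor; the digest of the newest receipt would still need to be published or anchored somewhere the agent cannot rewrite.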
-
Great question - this is one of those subtle reliability issues that doesn't get enough attention.

Short answer: this boundary should be explicit in all three layers, but the primary responsibility belongs in the agent framework.

Why the agent framework:

Pattern we use:

    # Agent reports an attempt, not completion
    result = await agent.execute_task(task)
    # result.status can be: attempted | confirmed | failed

    # Framework provides a verification hook
    verification = await verifier.check(result.receipt)
    if verification.confirmed:
        state.mark_complete(task, proof=verification.proof)
    else:
        state.mark_needs_retry(task, reason=verification.reason)

Why the eval harness matters:

Why the target application matters:

Recommendation: The agent framework should provide the mechanism (completion states, verification hooks, proof objects). The eval harness should provide the measurement (false completion rates, verification latency). The application should provide the policy (what counts as confirmation for each task type).

In practice, we've found that treating completion as a 3-state model (attempted/confirmed/failed) rather than binary (done/not-done) eliminates most of these issues.
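
On the measurement side, a false completion rate over the 3-state model could be computed along these lines. This is only a sketch: `CompletionState`, `RunResult`, and `false_completion_rate` are hypothetical names, and the definition used here (unconfirmed "done" claims divided by all "done" claims) is one reasonable choice rather than an established metric.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable


class CompletionState(Enum):
    ATTEMPTED = "attempted"
    CONFIRMED = "confirmed"
    FAILED = "failed"


@dataclass
class RunResult:
    agent_claimed_done: bool  # what the agent said
    state: CompletionState    # what verification established


def false_completion_rate(runs: Iterable[RunResult]) -> float:
    """Share of 'done' claims that verification did not confirm."""
    claimed = [r for r in runs if r.agent_claimed_done]
    if not claimed:
        return 0.0
    false_claims = sum(1 for r in claimed if r.state is not CompletionState.CONFIRMED)
    return false_claims / len(claimed)
```

Runs where the agent never claimed completion are left out of the denominator, so the number reflects how trustworthy a claim is rather than how often tasks succeed.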
-
Yeah, we distinguish them. An agent saying a task is done is just a claim. Completion only counts once the receiving system returns something durable you can point at later: a persisted record, a queue ack, an idempotency-keyed response, that kind of thing. Not a bare 200, and not the model writing TERMINATE in chat. In AutoGen terms, built-in termination (

On where the boundary belongs, I'd split it across all three, but the receipt itself has to come from the target application. Only the system doing the work can mint real proof. The agent can't self-certify completion.

The framework's job is to carry state and wire verification. The agent reports attempt or submitted. A deterministic check, not another LLM, promotes to completed once the receipt shows up. That might be tool orchestration code, a custom termination condition that reads external state, or a small state machine next to your team run. AutoGen doesn't ship a submitted → confirmed flow out of the box yet; you build that in your orchestration layer.

The eval harness enforces the rule. If a run is marked completed, a receipt must exist. An agent that claimed done with no receipt is a fail.

Pattern that's worked for me: idempotency key per task. Agent returns submitted, not completed. Wait for the receipt keyed on that id. Only then mark the task done. Same idea as submitted order vs filled order.

So, short version: yes, distinguish them. App mints proof, framework wires the lifecycle, eval judges whether completion was honest.
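
The "wait for the receipt keyed on that id" step might look something like this in orchestration code. It is a framework-agnostic sketch built on `asyncio` only and does not use any AutoGen API; `ReceiptLookup`, `await_receipt`, and `run_task` are hypothetical names, and polling is just one option (a queue consumer or webhook would serve the same role).

```python
import asyncio
from typing import Awaitable, Callable, Optional

# Hypothetical lookup: returns the receipt stored under a key, or None if absent.
ReceiptLookup = Callable[[str], Awaitable[Optional[dict]]]


async def await_receipt(key: str, lookup: ReceiptLookup,
                        timeout_s: float = 5.0, poll_s: float = 0.1) -> Optional[dict]:
    """Poll the target system for a receipt; return None if none arrives in time."""

    async def _poll() -> dict:
        while True:
            receipt = await lookup(key)
            if receipt is not None:
                return receipt
            await asyncio.sleep(poll_s)

    try:
        return await asyncio.wait_for(_poll(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


async def run_task(key: str, lookup: ReceiptLookup,
                   timeout_s: float = 5.0, poll_s: float = 0.1) -> str:
    # The agent has only "submitted" at this point; the receipt decides the rest.
    receipt = await await_receipt(key, lookup, timeout_s=timeout_s, poll_s=poll_s)
    return "completed" if receipt is not None else "submitted"
```

A task that times out stays in `submitted`, so a retry under the same idempotency key can pick up a late receipt instead of repeating the work.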
-
Great question — this is the distinction between agent-reported completion and system-verified completion, and conflating the two is where most multi-agent reliability failures start. In our testing, the failure modes look like:
A pattern we have found useful is the three-layer proof model:
For AutoGen, this maps naturally to:
The orchestrator should treat agent claims as signals to verify, not as proof of completion. This shifts the trust boundary from "trust the agent" to "trust the verification layer."

We have been testing task-result poisoning and recursive-delegation attacks in an open multi-agent security harness. Happy to share specific test vectors if helpful.
-
This would be very helpful :) I’m testing a small AoN proof/review record for exactly this boundary: separating agent-reported completion from system-verified completion. If you’re comfortable sharing one specific test vector, I’d like to convert it into an AoN-style record:
I’d then share the converted record back here and ask whether it captures the failure mode correctly. A small synthetic or sanitized example is enough.
-
Hi AutoGen team :)
quick multi-agent reliability question.
Do you distinguish agent-reported completion from system-returned completion proof in multi-agent workflows?
I’m testing a case where agents can prepare or attempt a task, but completion should only be claimed when the receiving system returns a receipt or durable confirmation.
Curious whether this boundary belongs in the agent framework, an eval harness, or the target application.