Question: completion proof boundary in multi-agent workflows #7763

productmakerjason · 2026-05-28T07:37:17Z

Hi AutoGen team :)

Quick multi-agent reliability question.

Do you distinguish agent-reported completion from system-returned completion proof in multi-agent workflows?

I’m testing a case where agents can prepare or attempt a task, but completion should only be claimed when the receiving system returns a receipt or durable confirmation.

Curious whether this boundary belongs in the agent framework, an eval harness, or the target application.

yebof · 2026-05-28T22:49:33Z

Yeah, we keep these two separate and it's saved us a lot of grief. An agent saying it finished is just a claim, and I treat it as untrusted input. "Done" only counts when the receiving system hands back a durable receipt (a persisted record, a queue ack, an idempotency-keyed response), not a bare 200.

Where it goes: the proof has to sit at the application/integration layer, since only the system doing the work can mint it, not the agent. So in the framework the agent never sets completed, only submitted. A separate deterministic check (not another LLM) promotes it once the receipt lands.

In the eval harness I just assert completed => receipt exists. A run where the agent claims done with no receipt is a fail, not a pass, which is exactly the case you're hitting.

Pattern that's worked for me: idempotency key per task, attempt -> await confirm, store the receipt under that key. I came to this from trading agents, where a submitted order and a filled order are really not the same thing.
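
A minimal sketch of the pattern described above, assuming an in-memory ledger. `TaskLedger`, `TaskRecord`, and the receipt shape are hypothetical names for illustration, not part of AutoGen. The agent side can only create a `submitted` record; promotion to `completed` happens in plain deterministic code when a receipt carrying the same idempotency key arrives.

```python
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class TaskState(Enum):
    SUBMITTED = "submitted"
    COMPLETED = "completed"


@dataclass
class TaskRecord:
    idempotency_key: str
    state: TaskState = TaskState.SUBMITTED
    receipt: Optional[dict] = None


@dataclass
class TaskLedger:
    """Hypothetical in-memory ledger; a real one would sit on durable storage."""
    records: Dict[str, TaskRecord] = field(default_factory=dict)

    def submit(self) -> str:
        # The agent side may only ever create a SUBMITTED record.
        key = str(uuid.uuid4())
        self.records[key] = TaskRecord(idempotency_key=key)
        return key

    def record_receipt(self, key: str, receipt: dict) -> None:
        # Called by integration code when the receiving system answers.
        # Promotion is deterministic: the receipt must echo the same key.
        record = self.records[key]
        if receipt.get("idempotency_key") != key:
            raise ValueError("receipt does not match idempotency key")
        record.receipt = receipt
        record.state = TaskState.COMPLETED


ledger = TaskLedger()
key = ledger.submit()
state_before = ledger.records[key].state  # still SUBMITTED: no receipt yet
ledger.record_receipt(key, {"idempotency_key": key, "record_id": "order-1"})
state_after = ledger.records[key].state   # COMPLETED only after the receipt lands
```

Nothing on the agent's code path can write `COMPLETED`, which is what makes a claimed-but-unconfirmed task visible in the ledger.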

1 reply

productmakerjason May 29, 2026
Author

Thanks @yebof

Your “submitted order ≠ filled order” analogy is the clearest version of the rule I’m making executable.

I’m narrowing v1 to three cases:

agent claims done → claim / untrusted input
agent submits or attempts, but no durable receipt returns → submitted_or_attempted, not completed
receiving system returns a durable receipt / persisted record / queue ack / idempotency-keyed response → completed_with_receipt

The eval rule is:

completed ⇒ receipt exists
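
As a rough illustration, the three cases and the eval rule could be written as two small functions. The function names and the extra `no_activity` label are assumptions added for this sketch; only the three case labels come from the post above.

```python
from typing import Optional


def classify(agent_claims_done: bool, attempted: bool, receipt: Optional[dict]) -> str:
    """Map one run onto the three cases listed above."""
    if receipt is not None:
        return "completed_with_receipt"
    if attempted:
        return "submitted_or_attempted"
    if agent_claims_done:
        return "claim"
    return "no_activity"  # assumption: label for runs with nothing to classify


def eval_passes(reported_state: str, receipt: Optional[dict]) -> bool:
    """Eval rule: completed implies that a receipt exists."""
    if reported_state.startswith("completed"):
        return receipt is not None
    return True


case = classify(agent_claims_done=True, attempted=True, receipt=None)
verdict = eval_passes("completed_with_receipt", receipt=None)  # fails the rule
```

A run that claims done after a real attempt still lands in `submitted_or_attempted`, because only the receipt moves it further.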

I’m preparing this as a small fixture/eval pattern rather than a broad standard.

Would this three-case shape be useful enough for AutoGen-style workflows or eval harnesses to run against, adapt, or reference?

msaleme · 2026-06-01T14:19:49Z

The "submitted ≠ completed" boundary is exactly the same primitive we use in our security harness for test verdict attestation. In adversarial testing, agents hallucinate completion roughly 8-12% of the time when the underlying operation failed silently — file not written, API returned 200 with an error body, or a tool executed the wrong command.

What we've found works: treat agent-reported completion as testimony, not fact. The factual state requires a receipt that binds four fields: (agent_id, task_hash, actual_outcome, timestamp). We use an ERC-8004-style attestation structure with Merkle anchoring so the receipt is machine-verifiable by any downstream evaluator without re-executing the task.

Your three-case model maps cleanly to this:

claim → testimony, untrusted
submitted_or_attempted → pending attestation
completed_with_receipt → attested outcome with cryptographic binding

The gap I'd flag: if every eval harness invents its own receipt format, you can't compose harnesses or cross-validate results. A minimal standard — just those four fields plus a hash chain — would let AutoGen workflows emit receipts that external evaluators can verify without knowing AutoGen internals.
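
To make "four fields plus a hash chain" concrete, here is a minimal sketch using only the standard library. It is not the ERC-8004-style structure or the Merkle anchoring mentioned above; the `Receipt` class, `GENESIS` value, and helper names are assumptions for illustration.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace
from typing import List

GENESIS = "0" * 64  # assumed starting value for the chain


@dataclass(frozen=True)
class Receipt:
    agent_id: str
    task_hash: str
    actual_outcome: str
    timestamp: str
    prev_hash: str  # digest of the previous receipt, or GENESIS

    def digest(self) -> str:
        # Canonical JSON (sorted keys) so every verifier hashes identical bytes.
        payload = json.dumps(asdict(self), sort_keys=True, separators=(",", ":"))
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def append_receipt(chain: List[Receipt], agent_id: str, task_hash: str,
                   actual_outcome: str, timestamp: str) -> Receipt:
    prev = chain[-1].digest() if chain else GENESIS
    receipt = Receipt(agent_id, task_hash, actual_outcome, timestamp, prev)
    chain.append(receipt)
    return receipt


def verify_chain(chain: List[Receipt]) -> bool:
    # Each receipt must point at the digest of the one before it.
    expected = GENESIS
    for receipt in chain:
        if receipt.prev_hash != expected:
            return False
        expected = receipt.digest()
    return True


chain: List[Receipt] = []
append_receipt(chain, "worker-a", "sha256:abc", "file_written", "2026-06-01T14:00:00Z")
append_receipt(chain, "worker-b", "sha256:def", "row_inserted", "2026-06-01T14:05:00Z")
ok = verify_chain(chain)

# Editing the first receipt breaks the link stored in the second.
tampered = [replace(chain[0], actual_outcome="failed"), chain[1]]
tampered_ok = verify_chain(tampered)
```

One limitation of this sketch: the last receipt in the chain is only protected if its digest is stored somewhere the writer cannot change, which is the role an external anchor would play.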

We've implemented this pattern for MCP/A2A/L402 test outcomes. If a reference receipt schema would help your fixture, happy to share.

https://github.com/msaleme/red-team-blue-team-agent-fabric

2 replies

productmakerjason Jun 1, 2026
Author

This is extremely helpful, thank you @msaleme :)

I took a look at your Agent Security Harness, and the framing makes more sense now.

The distinction you make between protocol integrity, operational governance, and decision governance maps closely to the boundary I’m trying to isolate.

“Agent-reported completion as testimony, not fact” is exactly the primitive I’m trying to make executable.

Your three-state model also maps directly to my current fixture:

claim → untrusted testimony
submitted_or_attempted → pending attestation
completed_with_receipt → attested outcome

The receipt structure you mentioned (agent_id, task_hash, actual_outcome, timestamp) is especially useful because it turns completion evidence into something an external evaluator can verify without re-running the task.

I’d be very interested in seeing the reference receipt schema if you’re happy to share it.

For now, I’m keeping my v1 fixture intentionally minimal:

v1: completion requires evidence
v2: evidence should be externally attestable / verifiable

Does that framing match how you think about the boundary?

productmakerjason Jun 4, 2026
Author

This helps sharpen the boundary.

What I’m taking from the replies is:

the agent should report an attempt, not completion
the framework or verifier should run a completion check
only then should the system promote state to done
if every harness has its own receipt/proof shape, cross-validation becomes hard

I built a small Bridge v0 experiment to test that last part:

https://github.com/productmakerjason/the-agents-of-nations/tree/main/experiments/bridge-v0

It maps two representative completion-proof artifacts into one minimal profile.

This is not a standard or official adapter.

The question I’m testing is:

if another evaluator, workflow, or future agent market had to consume a completion proof, what field would be missing?

Would a minimal profile like claim / task_id / evidence_type / proof_source / target_reference / validation_result / timestamp / issuer be enough to start, or is the state-transition context essential?
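
One way to probe the "what field would be missing?" question is to map a source artifact into the profile and list what it could not fill. The sketch below is not taken from the Bridge v0 repository; the dataclass, the `from_queue_ack` mapper, and the shape of the queue acknowledgement are all assumptions for illustration.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class CompletionProofProfile:
    claim: Optional[str] = None
    task_id: Optional[str] = None
    evidence_type: Optional[str] = None
    proof_source: Optional[str] = None
    target_reference: Optional[str] = None
    validation_result: Optional[str] = None
    timestamp: Optional[str] = None
    issuer: Optional[str] = None


def missing_fields(profile: CompletionProofProfile) -> List[str]:
    # Report every profile field the source artifact could not populate.
    return [f.name for f in fields(profile) if getattr(profile, f.name) in (None, "")]


def from_queue_ack(ack: dict) -> CompletionProofProfile:
    # Hypothetical source artifact: a queue acknowledgement.
    return CompletionProofProfile(
        task_id=ack.get("task_id"),
        evidence_type="queue_ack",
        proof_source=ack.get("broker"),
        target_reference=ack.get("message_id"),
        validation_result="accepted" if ack.get("acked") else "rejected",
        timestamp=ack.get("acked_at"),
        issuer=ack.get("broker"),
    )


profile = from_queue_ack({
    "task_id": "t-1",
    "broker": "orders-queue",
    "message_id": "m-42",
    "acked": True,
    "acked_at": "2026-06-04T09:00:00Z",
})
gaps = missing_fields(profile)
```

In this example the acknowledgement fills every field except `claim`, since a system-issued ack carries no record of what the agent asserted. That suggests the claim, and possibly the state transition it triggered, has to be joined in from the workflow side.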

jingchang0623-crypto · 2026-06-03T06:10:04Z

Great question - this is one of those subtle reliability issues that doesn't get enough attention.

Short answer: This boundary should be explicit in all three layers, but the primary responsibility belongs in the agent framework.

Why the agent framework:

Agents need to know the difference between 'I attempted the task' and 'the task is confirmed done'
Without this distinction, agents make downstream decisions based on unverified assumptions
The framework should provide a standard way to express 'awaiting confirmation' vs 'confirmed'

Pattern we use:

# Agent reports attempt, not completion
result = await agent.execute_task(task)
# result.status can be: attempted | confirmed | failed

# Framework provides verification hook
verification = await verifier.check(result.receipt)
if verification.confirmed:
    state.mark_complete(task, proof=verification.proof)
else:
    state.mark_needs_retry(task, reason=verification.reason)
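
The snippet above leaves `agent`, `verifier`, and `state` undefined. A self-contained version with stub implementations might look like the following. `StubAgent`, `StubVerifier`, and `WorkflowState` are hypothetical stand-ins, not AutoGen classes, and the verifier's lookup set stands in for a query against the target system.

```python
import asyncio
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class TaskResult:
    status: str  # "attempted" | "confirmed" | "failed"
    receipt: Optional[dict] = None


@dataclass
class Verification:
    confirmed: bool
    proof: Optional[dict] = None
    reason: str = ""


class StubAgent:
    async def execute_task(self, task: str) -> TaskResult:
        # The agent reports an attempt and forwards whatever the target returned.
        return TaskResult(status="attempted", receipt={"task": task, "record_id": "rec-1"})


class StubVerifier:
    def __init__(self, persisted_ids: Set[str]):
        self.persisted_ids = persisted_ids  # stands in for a lookup in the target system

    async def check(self, receipt: Optional[dict]) -> Verification:
        if receipt is None:
            return Verification(False, reason="no receipt returned")
        if receipt.get("record_id") not in self.persisted_ids:
            return Verification(False, reason="record not found in target system")
        return Verification(True, proof=receipt)


class WorkflowState:
    def __init__(self) -> None:
        self.completed: Dict[str, dict] = {}
        self.needs_retry: Dict[str, str] = {}

    def mark_complete(self, task: str, proof: dict) -> None:
        self.completed[task] = proof

    def mark_needs_retry(self, task: str, reason: str) -> None:
        self.needs_retry[task] = reason


async def run(task: str, agent: StubAgent, verifier: StubVerifier, state: WorkflowState) -> None:
    result = await agent.execute_task(task)
    verification = await verifier.check(result.receipt)
    if verification.confirmed:
        state.mark_complete(task, proof=verification.proof)
    else:
        state.mark_needs_retry(task, reason=verification.reason)


state = WorkflowState()
asyncio.run(run("upload-report", StubAgent(), StubVerifier({"rec-1"}), state))
asyncio.run(run("send-invoice", StubAgent(), StubVerifier(set()), state))
```

Both runs get the same agent report; only the verifier's view of the target system decides which one is marked complete.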

Why the eval harness matters:

You need to measure 'false completion rate' - agents claiming done when the system hasn't confirmed
This is a key metric for multi-agent reliability
Eval harness should track both agent-reported and system-confirmed completion separately
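
A small sketch of how a false completion rate could be computed from paired observations. The `RunRecord` shape is an assumption, and so is the choice of denominator: here the rate is taken over runs where the agent reported done, not over all runs.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunRecord:
    agent_reported_done: bool
    system_confirmed: bool


def false_completion_rate(runs: List[RunRecord]) -> float:
    """Share of agent-reported completions that the system never confirmed."""
    claimed = [r for r in runs if r.agent_reported_done]
    if not claimed:
        return 0.0
    false_claims = sum(1 for r in claimed if not r.system_confirmed)
    return false_claims / len(claimed)


runs = [
    RunRecord(agent_reported_done=True, system_confirmed=True),
    RunRecord(agent_reported_done=True, system_confirmed=False),
    RunRecord(agent_reported_done=False, system_confirmed=False),
    RunRecord(agent_reported_done=True, system_confirmed=True),
]
rate = false_completion_rate(runs)  # 1 of the 3 claims lacks confirmation
```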

Why the target application matters:

Some tasks have idempotent completion (safe to retry), others don't
The application knows the business rules for what 'confirmed' means
Example: 'payment processed' needs gateway confirmation, 'file uploaded' needs storage confirmation

Recommendation: The agent framework should provide the mechanism (completion states, verification hooks, proof objects). The eval harness should provide the measurement (false completion rates, verification latency). The application should provide the policy (what counts as confirmation for each task type).

In practice, we've found that treating completion as a 3-state model (attempted/confirmed/failed) rather than binary (done/not-done) eliminates most of these issues.

1 reply

productmakerjason Jun 4, 2026
Author

Thank you,

The attempted / confirmed / failed model is exactly the distinction I was trying to isolate.

I built a small Bridge v0 experiment here:

https://github.com/productmakerjason/the-agents-of-nations/tree/main/experiments/bridge-v0

It maps two representative completion-proof artifacts into one minimal profile.

This is not a standard or official adapter.

My question is:

if another evaluator, workflow, or future agent market had to consume this proof, is the proof object enough?

Or would it also need the state-transition context:

attempted → confirmed | failed?

Also, would you treat false completion rate as a useful core metric for this boundary?

false200 · 2026-06-21T10:44:51Z

Hi @productmakerjason

Yeah, we distinguish them. An agent saying a task is done is just a claim. Completion only counts once the receiving system returns something durable you can point at later: a persisted record, a queue ack, an idempotency-keyed response, that kind of thing. Not a bare 200, and not the model writing TERMINATE in chat.

In AutoGen terms, built-in termination (MaxMessageTermination, FunctionCallTermination, a custom TerminationCondition) stops the conversation. It doesn't prove the external task actually landed. Those are related but separate layers.

On where the boundary belongs, I'd split it across all three, but the receipt itself has to come from the target application. Only the system doing the work can mint real proof. The agent can't self-certify completion.

The framework's job is to carry state and wire verification. The agent reports attempted or submitted. A deterministic check, not another LLM, promotes to completed once the receipt shows up. That might be tool orchestration code, a custom termination condition that reads external state, or a small state machine next to your team run. AutoGen doesn't ship a submitted → confirmed flow out of the box yet, so you build that in your orchestration layer.

The eval harness enforces the rule. If a run is marked completed, a receipt must exist. A run where the agent claimed done with no receipt is a fail.

Pattern that's worked for me: idempotency key per task. The agent returns submitted, not completed. Wait for the receipt keyed on that id. Only then mark the task done. Same idea as submitted order vs filled order.

So, short version: yes, distinguish them. The app mints proof, the framework wires the lifecycle, and the eval judges whether completion was honest.

0 replies

msaleme · 2026-06-25T14:18:46Z

Great question — this is the distinction between agent-reported completion and system-verified completion, and conflating the two is where most multi-agent reliability failures start.

In our testing, the failure modes look like:

Downstream agent poisons the result — Worker-B returns "success" along with fabricated data. The orchestrator treats the success signal as proof the task completed correctly, not just that the agent stopped running.
Recursive delegation fake-out — A → B → A → B creates a circular chain where each agent reports "the other confirmed," manufacturing the appearance of multi-party consensus without any external anchor.
Premature completion claims — an agent reports completion before side effects are actually durable (e.g., "I wrote to the DB" before commit).
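
The recursive delegation case can be checked mechanically: follow the "who confirmed this" links and see whether they ever reach a source outside the agent pool. The sketch below assumes a simplified model in which each agent names at most one confirmer. The function name and data shapes are hypothetical.

```python
from typing import Dict, Optional, Set


def has_external_anchor(
    start: str,
    confirmed_by: Dict[str, Optional[str]],
    external_sources: Set[str],
) -> bool:
    """Follow 'X says Y confirmed it' links until an external source or a loop."""
    seen: Set[str] = set()
    current: Optional[str] = start
    while current is not None:
        if current in external_sources:
            return True
        if current in seen:
            return False  # circular chain: agents only vouch for each other
        seen.add(current)
        current = confirmed_by.get(current)
    return False  # chain ended without reaching an external source


external = {"storage-api"}
circular = {"A": "B", "B": "A"}
anchored = {"A": "B", "B": "storage-api"}

circular_ok = has_external_anchor("A", circular, external)
anchored_ok = has_external_anchor("A", anchored, external)
```

In the circular case the walk revisits `A` and stops, so the apparent two-party agreement counts as no evidence at all.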

A pattern we have found useful is the three-layer proof model:

| Layer | What it proves | Who provides it |
| --- | --- | --- |
| Agent claim | "I believe I am done" | Agent itself |
| System verification | "The side effects are durable & match spec" | Orchestrator / sandbox |
| External confirmation | "An independent witness observed the same result" | Audit / witness agent |

For AutoGen, this maps naturally to:

Agent claim → the agent's reply or terminate message
System verification → a post-execution check (did the file actually get written? does the DB row exist? does the output schema validate?)
External confirmation → a second, lightweight "witness" agent that re-executes or re-checks critical steps

The orchestrator should treat agent claims as signals to verify, not as proof of completion. This shifts the trust boundary from "trust the agent" to "trust the verification layer."
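
A rough sketch of how an orchestrator might turn the three layers into a single support level and a completion decision. The enum, the `Evidence` shape, and the rule that external confirmation only counts on top of a system check are assumptions for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum


class SupportLevel(IntEnum):
    NONE = 0
    CLAIM_ONLY = 1
    SYSTEM_VERIFIED = 2
    EXTERNALLY_CONFIRMED = 3


@dataclass
class Evidence:
    agent_claim: bool = False
    system_verification: bool = False
    external_confirmation: bool = False


def support_level(evidence: Evidence) -> SupportLevel:
    # Assumption: external confirmation without a system check does not raise
    # the level, so each layer builds on the one below it.
    if evidence.system_verification and evidence.external_confirmation:
        return SupportLevel.EXTERNALLY_CONFIRMED
    if evidence.system_verification:
        return SupportLevel.SYSTEM_VERIFIED
    if evidence.agent_claim:
        return SupportLevel.CLAIM_ONLY
    return SupportLevel.NONE


def is_complete(evidence: Evidence) -> bool:
    # The agent's own claim is never enough to mark a task complete.
    return support_level(evidence) >= SupportLevel.SYSTEM_VERIFIED


claim_only = is_complete(Evidence(agent_claim=True))
verified = is_complete(Evidence(agent_claim=True, system_verification=True))
```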

We have been testing task-result poisoning and recursive-delegation attacks in an open multi-agent security harness. Happy to share specific test vectors if helpful.

0 replies

productmakerjason · 2026-06-29T13:19:41Z

Author

@msaleme

This would be very helpful :)

I’m testing a small AoN proof/review record for exactly this boundary: separating agent-reported completion from system-verified completion.

If you’re comfortable sharing one specific test vector, I’d like to convert it into an AoN-style record:

agent claim
system verification
external confirmation, if present
support level
accepted scope / failure reason
receipt candidate or revision state

I’d then share the converted record back here and ask whether it captures the failure mode correctly.

A small synthetic or sanitized example is enough.

0 replies
