
feat: add stream_validate() hook to Requirement (#900)#925

Open
planetf1 wants to merge 8 commits into generative-computing:main from planetf1:feat/900-stream-validate

Conversation

Contributor

@planetf1 planetf1 commented Apr 24, 2026

Summary

Adds async stream_validate(chunk, *, backend, ctx) -> PartialValidationResult to the base Requirement class as a per-chunk streaming validation hook. The default implementation returns PartialValidationResult("unknown") — subclasses override to inspect accumulated chunks and signal "pass" or "fail" early. Stateful implementations may accumulate state on self; the orchestrator clones the requirement before each attempt so state does not bleed across retries.

Design decisions

Per the agreed Phase 1 spec:

No LLM-as-a-Judge logic is added here. This is a pure hook for custom validation overrides.

Test plan

  • test_default_returns_unknown — base class always returns "unknown"
  • test_default_returns_partial_validation_result_instance — correct return type
  • test_stream_validate_is_coroutine — method is async
  • test_subclass_can_return_pass — subclass override returning "pass" works
  • test_subclass_can_return_fail — subclass override returning "fail" with reason works
  • test_does_not_mutate_requirement — calling the method on base class leaves self unchanged
  • test_stream_validate_idempotent — multiple calls on base class leave self unchanged
  • test_stateful_subclass_accumulates_state — stateful subclass (bullet counter) correctly updates internal state across calls
  • test_stateful_subclass_clone_isolation — copy() of a stateful requirement produces an independent clone, confirming the orchestrator clone pattern
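The last two tests exercise the stateful-subclass and clone-isolation pattern. A rough sketch of that pattern (the class names and the `run_attempt` helper here are hypothetical, not the orchestrator's real code):

```python
import asyncio
import copy


class Requirement:
    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        return "unknown"   # base class: no opinion


class BulletCounter(Requirement):
    """Stateful subclass: counts bullets seen across per-chunk calls."""

    def __init__(self, limit=3):
        self.limit = limit
        self._count = 0

    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        self._count += chunk.count("\n- ")
        return "fail" if self._count > self.limit else "unknown"


async def run_attempt(original, chunks):
    req = copy.copy(original)   # orchestrator clones before each attempt
    verdict = "unknown"
    for chunk in chunks:
        verdict = await req.stream_validate(chunk)
    return verdict, req


original = BulletCounter()
verdict, clone = asyncio.run(
    run_attempt(original, ["\n- one\n- two", "\n- three\n- four"])
)
```

Because each attempt clones from the fresh original, `original._count` stays 0 no matter how many attempts run, which is exactly what the clone-isolation test checks.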

Related

Closes #900. Part of streaming epic #891. Builds on #924 (merged) and #923 (merged).

@github-actions github-actions Bot added the enhancement New feature or request label Apr 24, 2026
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 96f1919 to 0c030fb Compare April 24, 2026 12:18
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 24, 2026
…eam_validate

- Remove "In Phase 1" temporal qualifier from docstring — reworded to
  timeless statement about orchestrator responsibility
- Add type annotations (str, Backend, Context) to test subclass overrides
- Add idempotency test: multiple calls on the same Requirement instance
  leave state unchanged

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from d922a2c to 58128a7 Compare April 24, 2026 14:16
…#900)

Add an async `stream_validate(chunk, backend, ctx)` method to the base
`Requirement` class. The default implementation returns
`PartialValidationResult("unknown")`; subclasses override to inspect the
accumulated chunk and return `"pass"` or `"fail"` early.

Per the Phase 1 design: `"pass"` is informational and does not
short-circuit the final `validate()` call. The method must not mutate
`self` — state isolation is the orchestrator's responsibility.

Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Prevents positional confusion and makes future parameter additions
to the signature non-breaking for existing subclass overrides.

Assisted-by: Claude Code
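The keyword-only signature this commit refers to can be sketched as follows; the bare `*` forces `backend` and `ctx` to be passed by name, so a later parameter addition cannot silently shift positional arguments in existing overrides (names illustrative):

```python
import asyncio
import inspect


class Requirement:
    # Everything after the bare * is keyword-only.
    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        return "unknown"


sig = inspect.signature(Requirement.stream_validate)
kinds = {name: p.kind for name, p in sig.parameters.items()}

# Passing backend/ctx positionally is a TypeError, by design.
try:
    asyncio.run(Requirement().stream_validate("chunk", None, None))
    positional_call_rejected = False
except TypeError:
    positional_call_rejected = True
```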
@planetf1 planetf1 force-pushed the feat/900-stream-validate branch from 58128a7 to 358e4d1 Compare April 27, 2026 13:12
The docstring incorrectly stated that implementations must not mutate
self. Issue generative-computing#900 spec explicitly allows stateful accumulation and
requires the shallow-copy caveat to be documented. Fix the docstring
to match the spec.

Add two tests required by the issue acceptance criteria:
- test_stateful_subclass_accumulates_state: verifies a subclass can
  accumulate state (bullet counter) across stream_validate calls
- test_stateful_subclass_clone_isolation: verifies copy() gives an
  independent clone, confirming the orchestrator clone pattern

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
@planetf1 planetf1 marked this pull request as ready for review April 27, 2026 13:19
@planetf1 planetf1 requested review from a team, jakelorocco and nrfulton as code owners April 27, 2026 13:19
The previous implementation overwrote _bullet_count from the full
accumulated chunk on each call — equivalent to a pure function with
no real dependency on prior state.

Use _seen_len to extract only the new portion of each accumulated
chunk, accumulating the count additively. This genuinely requires
prior-call state to know where to slice, making the test name
"accumulates_state" accurate.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
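The `_seen_len` slicing this commit describes can be sketched like this (a simplified stand-in, not the actual test code): when the hook receives the full accumulated text each call, the counter remembers how much it already processed and counts only the new tail.

```python
class AccumulatedBulletCounter:
    """Toy version of the _seen_len trick: genuinely depends on prior-call state."""

    def __init__(self):
        self._seen_len = 0   # how much of the accumulated text was already processed
        self._count = 0

    def feed(self, accumulated: str) -> int:
        new = accumulated[self._seen_len:]   # slice off only the unseen portion
        self._seen_len = len(accumulated)
        self._count += new.count("\n- ")
        return self._count


counter = AccumulatedBulletCounter()
first = counter.feed("\n- one")             # sees "\n- one"
second = counter.feed("\n- one\n- two")     # only "\n- two" is new
```

Without `_seen_len`, re-counting the full accumulated text each call would double-count bullets; the slice is what makes the state genuinely load-bearing.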
In multi-line calls, # type: ignore only suppresses errors on its own
line. The backend=None argument was uncovered; add the ignore there too.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
Use the public API for imports: Backend and Context both appear in
mellea.core.__all__, so import from mellea.core rather than the
internal submodules.

Rewrite test_stateful_subclass_clone_isolation to simulate the correct
orchestrator pattern: the original requirement is never called directly;
each attempt clones from the fresh original, giving _calls == 0 at the
start of every attempt. The previous test cloned mid-stream, which
tested shallow-copy isolation but demonstrated the wrong usage pattern.

Assisted-by: Claude Code
Signed-off-by: Nigel Jones <jonesn@uk.ibm.com>
planetf1 added a commit to planetf1/mellea that referenced this pull request Apr 28, 2026
…eam_validate

@planetf1
Contributor Author

@jakelorocco how does this look for you - I have another stacked PR behind this one too. (I'll just go one level deep)

Contributor

@jakelorocco jakelorocco left a comment


@planetf1, I might've missed this in the original proposal, but passing the accumulated chunks to the requirements is different from what I thought the proposal was. Could you please elaborate more on the design choice?

Comment thread on mellea/core/requirement.py (Outdated)
isolation.

Args:
chunk: The accumulated model output so far (not just the latest token).
Contributor


This worries me. @nrfulton, your initial proposal was to have the requirement only see new chunks. This would show all accumulated chunks so far.

This forces all streaming requirements to be stateful; all requirements must now keep track of what chunks they have processed. The alternative would be to only provide new chunks to requirements; then streaming validation would be stateless except when needed. Requirements can choose whether they need to store and process multiple chunks, or just check each chunk independently.

I guess checking each chunk independently is unlikely to be helpful, so I can see why accumulating and forcing requirements to track their own progress doesn't actually add much complexity.

If so, I think we should actually pre-define functions to help with this (either as a new class of requirements or functions that implementors can draw on).

Also, if we are passing the accumulated chunk through, I almost think we should just pass in some point-in-time copy of the model output thunk. Ie one that doesn't get streamed the new chunks but has all the data fields from the point-in-time it was copied at.

Contributor Author


I would suggest reverting to the simplicity of the original proposal. The change was made incorrectly while reviewing the initial PR. The initial approach is simple - and requirements can maintain state if they need to work on accumulated chunks. If we need to support accumulated content generally we can consider it in a later phase.

Which works best will vary by use case. Per-chunk checks work better for

  • checking for forbidden words/phrases
  • ensuring a paragraph or sentence is coherent
  • structural checks - code fencing
  • format validation

especially when the MoT manages the semantic chunking (later)

There will be cases where accumulation is better -- these are probably more complex checks: does a story line flow, are we taking the response in an unexpected direction, do we have a complete enough response?

Importantly if we stick to per-chunk we could still implement this second approach - albeit not as cleanly.

So in summary - I'll revert to original -- but if you now think that's wrong we can adjust?
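A sketch of the fallback mentioned above: under per-chunk semantics, a requirement that wants accumulated-text checks can buffer the deltas itself. The `CompletenessCheck` class and its synchronous `feed` method (standing in for `stream_validate`) are hypothetical illustrations:

```python
class CompletenessCheck:
    """Buffers per-chunk deltas to emulate an accumulated-text check."""

    def __init__(self):
        self._buffer = ""

    def feed(self, chunk: str) -> str:
        self._buffer += chunk   # rebuild the accumulated view locally
        text = self._buffer.rstrip()
        # Toy completeness heuristic: the response ends a sentence.
        return "pass" if text.endswith((".", "!", "?")) else "unknown"


check = CompletenessCheck()
mid = check.feed("The answer is")   # incomplete so far
end = check.feed(" forty-two.")     # buffer now ends a sentence
```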

Restores the chunk-at-a-time semantics set out in the generative-computing#891 epic and
generative-computing#900 spec: stream_validate is called once per complete chunk produced
by the chunking strategy, and receives that single chunk. Requirements
that need history accumulate it on self.

Commit 315a98c inadvertently flipped this: the BulletCounter test was
rewritten to recover deltas from accumulated text via self._seen_len,
and the docstring was updated to match ("The accumulated model output
so far"). Neither change reflected a design decision — it was drift
during a test fix, and it buried a confusing workaround in what should be
a straightforward stateful override.

Changes:
- requirement.py: rewrite chunk Args description to name the
  chunking-strategy-produced delta, clarify that ctx does not contain
  the generated output during streaming, and note the MOT
  single-consumer constraint
- test_stream_validate.py: rewrite BulletCounter to accumulate its own
  running count (no self._seen_len); calls pass delta chunks
  ("\n- one\n- two") rather than re-sending accumulated text

The corresponding orchestrator fix in stream_with_chunking() -- pass
the chunk, iterate per chunk -- is in the stacked Wave 3 branch.

Assisted-by: Claude Code
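The per-chunk orchestrator loop this commit alludes to might look roughly like the following. This is a hypothetical sketch, not the real `stream_with_chunking()` in the Wave 3 branch; the `MaxLength` requirement is a toy stand-in:

```python
import asyncio
import copy


class MaxLength:
    """Toy requirement: fail once total streamed length exceeds a limit."""

    def __init__(self, limit):
        self.limit = limit
        self._seen = 0

    async def stream_validate(self, chunk, *, backend=None, ctx=None):
        self._seen += len(chunk)
        return "fail" if self._seen > self.limit else "unknown"


async def stream_with_chunking(chunks, requirements):
    reqs = [copy.copy(r) for r in requirements]   # clone per attempt
    for chunk in chunks:                          # one call per complete chunk
        for req in reqs:
            if (await req.stream_validate(chunk)) == "fail":
                return "fail"                     # abort this attempt early
    return "done"


ok = asyncio.run(stream_with_chunking(["ab", "cd"], [MaxLength(10)]))
bad = asyncio.run(stream_with_chunking(["ab", "cd"], [MaxLength(3)]))
```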

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

feat(core): add stream_validate() to Requirement

2 participants