Completely rewrite add_messages_streaming #277
Open
gvanrossum wants to merge 26 commits into
Conversation
Split embedding strategy (uncached chunk, cached related terms).
Add precomputed-embedding write paths for message and related-term indexes, introducing explicit *_with_embeddings methods in interfaces and both memory/SQLite implementations. Refactor existing add methods to compute embeddings once and delegate, enabling pipeline commit paths to reuse worker-generated embeddings without recomputation.
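A minimal sketch of the pattern this commit describes: the plain add method computes embeddings once and delegates to a `*_with_embeddings` method, so a commit path that already has worker-generated embeddings can call the precomputed path directly. The class and method names below are illustrative, not the PR's actual API.

```python
from typing import Protocol, Sequence

Embedding = list[float]


class MessageIndex(Protocol):
    """Hypothetical interface showing the two write paths."""

    def add_messages(self, texts: Sequence[str]) -> None: ...
    def add_messages_with_embeddings(
        self, texts: Sequence[str], embeddings: Sequence[Embedding]
    ) -> None: ...


class MemoryMessageIndex:
    def __init__(self, embed_fn):
        # embed_fn: Sequence[str] -> list[Embedding] (batch embedding call)
        self._embed_fn = embed_fn
        self._rows: list[tuple[str, Embedding]] = []

    def add_messages(self, texts):
        # Compute embeddings once, then delegate to the precomputed path,
        # so pipeline commit code can reuse worker-generated embeddings
        # instead of recomputing them here.
        self.add_messages_with_embeddings(texts, self._embed_fn(texts))

    def add_messages_with_embeddings(self, texts, embeddings):
        assert len(texts) == len(embeddings)
        self._rows.extend(zip(texts, embeddings))
```

A SQLite implementation would follow the same shape, with `add_messages_with_embeddings` doing the actual INSERTs.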
Previously typechat.Failure from the extractor was a soft error: the message was still committed (with no knowledge) and the failure recorded. Since LLM responses are non-deterministic, a Failure is just as unreliable as a raised exception, so both now stop the pipeline at the failing message and propagate the error.

- Remove extraction_failure_msg from ChunkProcessingResult and _ChunkCommitResult; simplify _commit_batch_from_chunk_results
- Keep stop_state.exception in sync with stop_at_message_id so it always reflects the lowest-ordinal failing message
- Update tests accordingly

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
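The control flow described above can be sketched as follows. `Failure` here is a stand-in for `typechat.Failure` (which carries an error message rather than an exception), and `process_messages`/`ExtractionError` are hypothetical names for illustration.

```python
from dataclasses import dataclass


@dataclass
class Failure:
    """Stand-in for typechat.Failure: an error message, no exception."""
    message: str


class ExtractionError(Exception):
    pass


def process_messages(messages, extract):
    """Commit messages in order; stop at the first extractor Failure.

    Both a raised exception and a returned Failure abort the pipeline
    at the failing message, so nothing after it is committed.
    """
    committed = []
    for ordinal, msg in enumerate(messages):
        result = extract(msg)
        if isinstance(result, Failure):
            # Treat the soft error like a hard one: report the failing
            # ordinal and propagate instead of committing blindly.
            raise ExtractionError(f"message {ordinal}: {result.message}")
        committed.append((msg, result))
    return committed
```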
- Replace nested try/except* handling with ExceptionGroup handling.
- Preserve producer_state and stop_state exceptions and raise a combined ExceptionGroup when multiple distinct failures occur.
- Complete the ChunkProcessingResult docstring with all class fields and clarify success semantics.
…er ConversationSettings)
…ssages are added.
Collaborator (Author)
@KRRT7 If you still care about typeagent-py I'd appreciate your review!

Contributor
I'll review it in the morning
The throughput is now much higher: for example, with concurrency 10 and batch size 10, the Adrian podcast ingests in 40 seconds, compared to 90 on main (with the previous pipelining implementation).
A consequence of the new design is that the message index is now populated at the time messages are added -- the secondary index building no longer needs to do this.
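The batched, bounded-concurrency shape behind those numbers can be sketched with asyncio. Everything here is illustrative (function names, parameters, and the batch-then-commit structure are assumptions, not the PR's actual API):

```python
import asyncio


async def ingest(messages, extract, commit, *, concurrency=10, batch_size=10):
    """Extract batches concurrently, then commit each batch in order."""
    sem = asyncio.Semaphore(concurrency)

    async def extract_one(msg):
        async with sem:  # at most `concurrency` extractions in flight
            return await extract(msg)

    for start in range(0, len(messages), batch_size):
        batch = messages[start:start + batch_size]
        # Fan out extraction for the whole batch, preserving order.
        results = await asyncio.gather(*(extract_one(m) for m in batch))
        # Committing per batch is where precomputed embeddings would be
        # handed to the *_with_embeddings write paths.
        await commit(list(zip(batch, results)))
```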