fix: harden long PDF page extraction by gwokhou · Pull Request #85 · VectifyAI/OpenKB

gwokhou · 2026-06-03T10:30:04Z

Summary

This PR hardens long PDF page extraction in index_long_document().

It normalizes page content returned by PageIndex Cloud or the local PDF fallback into OpenKB's expected source JSON shape, including support for common page fields such as page, page_number, page_num, content, markdown, and text.
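
As a rough illustration of the normalization described above, here is a minimal sketch. It is not the PR's _normalize_page_content(); the function name normalize_pages_sketch, the output keys (page, content, images), and the rule of falling back to the item's position when no page number is present are all assumptions made for this example.

```python
from typing import Any

# Field names listed in the PR description.
PAGE_KEYS = ("page", "page_number", "page_num")
CONTENT_KEYS = ("content", "markdown", "text")


def normalize_pages_sketch(raw: Any) -> list[dict[str, Any]]:
    """Coerce extractor output into a list of page dicts.

    Hypothetical output shape: {"page": int, "content": str, "images": list}.
    """
    # Accept a single string or dict as a one-page document.
    if isinstance(raw, (str, dict)):
        raw = [raw]
    if not isinstance(raw, list):
        return []

    pages: list[dict[str, Any]] = []
    for index, item in enumerate(raw, start=1):
        if isinstance(item, str):
            item = {"content": item}
        if not isinstance(item, dict):
            continue  # unusable entry

        # Use the first integer-like page field; otherwise use the position.
        page_no = index
        for key in PAGE_KEYS:
            value = item.get(key)
            if isinstance(value, bool):
                continue
            if isinstance(value, int):
                page_no = value
                break
            if isinstance(value, str) and value.strip().isdecimal():
                page_no = int(value)
                break

        # Use the first non-empty content field.
        content = ""
        for key in CONTENT_KEYS:
            value = item.get(key)
            if isinstance(value, str) and value.strip():
                content = value.strip()
                break
        if not content:
            continue  # drop pages with no usable text

        # Discard image metadata that is not a list of dicts.
        images = item.get("images")
        if not isinstance(images, list):
            images = []
        images = [img for img in images if isinstance(img, dict)]

        pages.append({"page": page_no, "content": content, "images": images})
    return pages
```

In this sketch a bare string becomes a page numbered by its position, and a page whose content fields are all empty is dropped rather than written out.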

Why

Long PDF indexing currently assumes that page extraction returns data in the exact shape OpenKB later writes to wiki/sources/*.json. In practice, cloud/local extractors may return strings, alternate page-number fields, alternate content fields, invalid image metadata, or empty/unusable page data.

That can lead to brittle downstream behavior during wiki compilation, especially for complex or lengthy PDFs.

Related to #77. This does not replace the default PDF parser, but it improves the resilience of the existing PageIndex/local PDF extraction path.

Changes

Add _normalize_page_content() for PageIndex/local PDF page outputs.
Normalize cloud get_page_content() responses before writing source JSON.
Normalize local PDF fallback output as well.
Fall back to local extraction when cloud page content is empty or invalid.
Raise a clear RuntimeError when both cloud and local extraction produce no usable page content.
Add tests for normalized page shapes, invalid cloud fallback, and empty extraction failure.
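
The fallback and failure behavior in the list above could look roughly like the following sketch. The names extract_pages_sketch, fetch_cloud, and extract_local are hypothetical stand-ins, not the functions in this PR, and the "usable" check (at least one page with non-empty string content) is an assumption about what counts as valid.

```python
from typing import Any, Callable

Pages = list[dict[str, Any]]


def _has_usable_pages(pages: Any) -> bool:
    """True if at least one page dict carries non-empty string content."""
    if not isinstance(pages, list):
        return False
    return any(
        isinstance(p, dict)
        and isinstance(p.get("content"), str)
        and p["content"].strip() != ""
        for p in pages
    )


def extract_pages_sketch(
    fetch_cloud: Callable[[], Any],
    extract_local: Callable[[], Any],
) -> Pages:
    """Prefer cloud pages, fall back to local, fail loudly if both are empty."""
    cloud_pages = fetch_cloud()
    if _has_usable_pages(cloud_pages):
        return cloud_pages

    # Cloud result was empty or malformed: try the local PDF fallback.
    local_pages = extract_local()
    if _has_usable_pages(local_pages):
        return local_pages

    raise RuntimeError(
        "No usable page content from cloud or local PDF extraction"
    )
```

Raising here, instead of writing an empty source JSON file, surfaces the problem at indexing time rather than later during wiki compilation.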

Testing

Added focused unit coverage in tests/test_indexer.py.

fix: harden long PDF page extraction

f602f9d

gwokhou wants to merge 1 commit into VectifyAI:main from gwokhou:pr/pdf-page-extraction
