fix: harden long PDF page extraction #85

Merged: KylinMountain merged 1 commit into VectifyAI:main from gwokhou:pr/pdf-page-extraction on Jun 4, 2026

@gwokhou (Contributor) commented Jun 3, 2026

Summary

This PR hardens long PDF page extraction in index_long_document().

It normalizes page content returned by PageIndex Cloud or the local PDF fallback into OpenKB's expected source JSON shape. It accepts the common page-number fields page, page_number, and page_num, and the common content fields content, markdown, and text.
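As an illustration of this kind of normalization, here is a minimal sketch. It is not the PR's actual _normalize_page_content(); the function name normalize_page_sketch, the fallback_page parameter, and the output shape {"page": ..., "content": ...} are assumptions made for the example. Only the field aliases come from the summary above.

```python
from typing import Any, Optional

# Field aliases named in the PR summary.
PAGE_KEYS = ("page", "page_number", "page_num")
CONTENT_KEYS = ("content", "markdown", "text")


def _as_positive_int(value: Any) -> Optional[int]:
    """Coerce ints and decimal strings to a positive int, else None."""
    if isinstance(value, bool):  # bool is an int subclass; reject it
        return None
    if isinstance(value, int):
        return value if value > 0 else None
    if isinstance(value, str) and value.strip().isdecimal():
        number = int(value.strip())
        return number if number > 0 else None
    return None


def normalize_page_sketch(raw: Any, fallback_page: int) -> Optional[dict]:
    """Hypothetical normalizer: returns {"page": int, "content": str},
    or None when the input carries no usable content."""
    if isinstance(raw, str):
        raw = {"content": raw}  # bare strings are treated as page content
    if not isinstance(raw, dict):
        return None

    # Use the first alias holding a valid page number, else the fallback.
    page = fallback_page
    for key in PAGE_KEYS:
        number = _as_positive_int(raw.get(key))
        if number is not None:
            page = number
            break

    # Use the first alias holding non-blank text.
    for key in CONTENT_KEYS:
        value = raw.get(key)
        if isinstance(value, str) and value.strip():
            return {"page": page, "content": value.strip()}
    return None
```

In this sketch a page with no usable text yields None rather than an empty record, so a caller can tell "nothing extracted" apart from "extracted but blank".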

Why

Long PDF indexing currently assumes that page extraction returns data in the exact shape OpenKB later writes to wiki/sources/*.json. In practice, cloud/local extractors may return strings, alternate page-number fields, alternate content fields, invalid image metadata, or empty/unusable page data.

That can lead to brittle downstream behavior during wiki compilation, especially for complex or lengthy PDFs.

Related to #77. This does not replace the default PDF parser, but it improves the resilience of the existing PageIndex/local PDF extraction path.

Changes

  • Add _normalize_page_content() for PageIndex/local PDF page outputs.
  • Normalize cloud get_page_content() responses before writing source JSON.
  • Normalize local PDF fallback output as well.
  • Fall back to local extraction when cloud page content is empty or invalid.
  • Raise a clear RuntimeError when both cloud and local extraction produce no usable page content.
  • Add tests for normalized page shapes, invalid cloud fallback, and empty extraction failure.
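The cloud-then-local fallback and the final error described in the list above could look roughly like the sketch below. The names usable_pages and extract_pages_sketch, the callables passed in, and the error message are all hypothetical; the real logic lives in index_long_document() and may be structured differently.

```python
from typing import Any, Callable, List


def usable_pages(raw_pages: Any) -> List[dict]:
    """Simplified validity check (an assumption for this sketch):
    keep only dict pages whose "content" is a non-blank string."""
    if not isinstance(raw_pages, list):
        return []
    return [
        page
        for page in raw_pages
        if isinstance(page, dict)
        and isinstance(page.get("content"), str)
        and page["content"].strip()
    ]


def extract_pages_sketch(
    fetch_cloud: Callable[[], Any],
    extract_local: Callable[[], Any],
) -> List[dict]:
    """Try the cloud result first, then the local fallback, then fail loudly."""
    pages = usable_pages(fetch_cloud())
    if not pages:
        # Cloud content was empty or invalid: fall back to local extraction.
        pages = usable_pages(extract_local())
    if not pages:
        raise RuntimeError(
            "No usable page content from cloud or local PDF extraction"
        )
    return pages
```

Passing the two extractors in as callables keeps the sketch free of any real PageIndex API calls and makes both branches easy to exercise with stubs.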

Testing

  • Added focused unit coverage in tests/test_indexer.py.

@quanqigu left a comment

LGTM, thanks! 🚀

@KylinMountain (Collaborator) left a comment

LGTM, thanks! 🚀

@KylinMountain merged commit 85dbcb8 into VectifyAI:main on Jun 4, 2026
