GeneralParsers

Official LangBot parser plugin that extracts structured text from files for KnowledgeEngine plugins (e.g. LangRAG).

Supported Formats

| Format | MIME Type | Parser |
| --- | --- | --- |
| PDF | `application/pdf` | PyMuPDF-based layout-aware extraction with tables, page markers, and optional vision enhancement |
| DOCX | `application/vnd.openxmlformats-officedocument.wordprocessingml.document` | python-docx extraction with paragraph/table parsing and optional embedded-image recognition |
| Markdown | `text/markdown` | Convert to HTML, then structured extraction (headings, lists, code blocks, tables) |
| HTML | `text/html` | BeautifulSoup extraction (auto-removes script/style) |
| TXT | `text/plain` | Auto encoding detection (chardet) |
| Images | `image/png`, `image/jpeg`, `image/webp`, `image/gif`, `image/bmp`, `image/tiff` | Direct vision-based recognition when a vision model is configured |
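
As an illustration of how a MIME type could be mapped to one of the parsers listed above, here is a minimal sketch. The function name `select_parser` and the short parser labels are hypothetical and are not taken from the plugin's code; the plain-text fallback for unknown types mirrors the Format Fallback feature.

```python
# Hypothetical mapping from MIME type to a parser label.
# The labels ("pdf", "docx", ...) are illustrative, not the plugin's real names.
PARSER_BY_MIME = {
    "application/pdf": "pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document": "docx",
    "text/markdown": "markdown",
    "text/html": "html",
    "text/plain": "txt",
}

IMAGE_MIME_TYPES = {
    "image/png", "image/jpeg", "image/webp",
    "image/gif", "image/bmp", "image/tiff",
}


def select_parser(mime_type: str) -> str:
    """Return a parser label for a MIME type, falling back to plain text."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    mime = mime_type.split(";", 1)[0].strip().lower()
    if mime in IMAGE_MIME_TYPES:
        return "image"
    # Unsupported formats are tried as plain text.
    return PARSER_BY_MIME.get(mime, "txt")
```

For example, `select_parser("text/html; charset=utf-8")` selects the HTML parser, and an unrecognized type such as `application/x-unknown` falls through to the plain-text parser.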

Architecture

┌──────────────────────────────────────────────┐
│  KnowledgeEngine Plugin (e.g. LangRAG)       │
│  Chunk → Embedding → Store → Retrieve        │
└──────────────────┬───────────────────────────┘
                   │ invoke_parser (RPC)
┌──────────────────▼───────────────────────────┐
│          GeneralParsers                      │
│                                              │
│  File bytes → Format detection → Parse       │
│                                              │
│  ParseResult:                                │
│    ├── text: Full extracted text              │
│    ├── sections: Heading-split sections       │
│    │   └── TextSection(content, heading,      │
│    │                   level)                 │
│    └── metadata: filename, MIME type, etc.    │
└──────────────────────────────────────────────┘
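
The ParseResult shape in the diagram could be modeled roughly as follows. This is a minimal sketch: the field names come from the diagram, while the types and default values are assumptions, and the plugin's real classes may differ.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass
class TextSection:
    content: str                   # text belonging to this section
    heading: Optional[str] = None  # heading text; None is an assumed value for "no heading"
    level: int = 0                 # heading depth; 0 is an assumed value for "no heading"


@dataclass
class ParseResult:
    text: str                                                  # full extracted text
    sections: List[TextSection] = field(default_factory=list)  # heading-split sections
    metadata: Dict[str, Any] = field(default_factory=dict)     # filename, MIME type, etc.
```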

Features

Optional Vision Model Support - Configure a vision-capable LLM to OCR scanned PDF pages, recognize embedded PDF/DOCX images, and parse direct image uploads
Improved PDF Parsing - PyMuPDF-based extraction preserves page boundaries, merges tables into output, and emits richer document metadata
Scanned PDF Handling - Detects likely scanned pages and uses the vision model for OCR when configured
Cross-Format Image Recognition - Embedded PDF/DOCX images and direct image uploads can be turned into inline recognition text for downstream retrieval
Header/Footer Filtering - Repeated page headers and footers are detected and removed from PDF output
Section Structure Recognition - Detects Markdown-style headings (# ~ ######) and splits output into leveled sections
Table to Markdown - Tables in PDF/HTML/Markdown are converted to Markdown table format
Async Parsing - File parsing runs in a thread pool to avoid blocking the event loop
Auto Encoding Detection - Uses chardet for encoding detection, supports GBK, UTF-8, etc.
Format Fallback - Unsupported formats are automatically tried as plain text
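
To make Section Structure Recognition concrete, the sketch below splits text on Markdown-style headings (# through ######) into leveled sections. It is an illustration only, not the plugin's implementation: the function name `split_sections` and the tuple layout are hypothetical, and it does not skip `#` lines inside fenced code blocks.

```python
import re
from typing import List, Optional, Tuple

# One to six '#' characters, whitespace, then the heading text.
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")

Section = Tuple[Optional[str], int, str]  # (heading, level, content)


def split_sections(text: str) -> List[Section]:
    """Split text into (heading, level, content) tuples at heading lines."""
    sections: List[Section] = []
    heading: Optional[str] = None
    level = 0
    buf: List[str] = []

    def flush() -> None:
        # Emit the pending section unless it is an empty, heading-less preamble.
        if heading is not None or any(line.strip() for line in buf):
            sections.append((heading, level, "\n".join(buf).strip()))

    for line in text.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            heading, level, buf = match.group(2), len(match.group(1)), []
        else:
            buf.append(line)
    flush()
    return sections
```

Text that appears before the first heading is kept as a section with no heading and level 0, so no content is dropped.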

Configuration

The plugin exposes one optional config item:

vision_llm_model_uuid: a vision-capable LLM used for scanned-page OCR, embedded PDF/DOCX image recognition, and direct image parsing

If this option is left empty, GeneralParsers still works normally, but image understanding falls back to placeholders and PDF parsing uses text/layout extraction only.
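
The fallback behavior can be pictured with a small sketch. Everything here is hypothetical apart from the option name `vision_llm_model_uuid`: the helper `describe_image`, the `recognize` callback, and the placeholder string `[image]` are stand-ins chosen for illustration, not the plugin's actual code or output.

```python
from typing import Callable, Optional


def describe_image(
    image_bytes: bytes,
    vision_llm_model_uuid: Optional[str],
    recognize: Callable[[bytes, str], str],
) -> str:
    """Return recognition text, or a placeholder when no vision model is set."""
    if not vision_llm_model_uuid:
        # No vision model configured: keep a placeholder instead of failing.
        return "[image]"
    # A vision model is configured: delegate to the (hypothetical) recognizer.
    return recognize(image_bytes, vision_llm_model_uuid)
```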

Usage

Install this plugin in LangBot
Optionally configure a vision model if you want OCR for scanned PDFs, DOCX/PDF image recognition, or direct image parsing
When uploading files to a knowledge base, select GeneralParsers as the parser
Parse results are automatically passed to the KnowledgeEngine plugin for further processing

Output Shape

GeneralParsers returns a structured ParseResult containing:

text: the full extracted text
sections: heading-aware text sections for chunking strategies that prefer structure
metadata: document metadata such as filename, MIME type, page count, table presence, scanned-page flags, and vision usage stats
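
As a sketch of how a downstream chunking strategy might use the sections, the helper below turns each section into one or more chunks and repeats the heading at the top of every chunk. It assumes sections are available as plain dicts with `heading` and `content` keys; that representation, the function name `sections_to_chunks`, and the `max_chars` parameter are illustrative assumptions rather than part of the plugin or of any KnowledgeEngine API.

```python
from typing import Any, Dict, List


def sections_to_chunks(sections: List[Dict[str, Any]], max_chars: int = 500) -> List[str]:
    """Split each section into chunks of at most max_chars, prefixed by its heading."""
    chunks: List[str] = []
    for section in sections:
        heading = section.get("heading") or ""
        content = section.get("content", "")
        prefix = heading + "\n" if heading else ""
        # Space left for content once the heading prefix is accounted for.
        step = max(1, max_chars - len(prefix))
        for start in range(0, len(content), step):
            chunks.append(prefix + content[start:start + step])
    return chunks
```

Repeating the heading keeps each chunk self-describing at retrieval time, at the cost of some duplicated text.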

In recent versions, the PDF parser's metadata includes fields such as:

page_count
word_count
has_tables
has_scanned_pages
headers_footers_removed
vision_used
vision_tasks_count
vision_scanned_pages_count
vision_images_described_count
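
A consumer could read these fields defensively, since not every parser necessarily sets all of them. The sketch below is hypothetical: the helper name `summarize_pdf_metadata` is invented, and the value types (integers for counts, booleans for flags) are assumptions, because this README lists only the field names.

```python
from typing import Any, Dict


def summarize_pdf_metadata(metadata: Dict[str, Any]) -> str:
    """Build a one-line summary from PDF metadata, tolerating missing fields."""
    parts = [
        f"{metadata.get('page_count', 0)} pages",
        f"{metadata.get('word_count', 0)} words",
    ]
    if metadata.get("has_tables"):
        parts.append("tables")
    if metadata.get("has_scanned_pages"):
        # Assumed meaning: number of scanned pages handled by the vision model.
        ocr_pages = metadata.get("vision_scanned_pages_count", 0)
        parts.append(f"scanned pages ({ocr_pages} via vision)")
    return ", ".join(parts)
```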

Development

pip install -r requirements.txt
cp .env.example .env

Configure DEBUG_RUNTIME_WS_URL and PLUGIN_DEBUG_KEY in .env, then launch with your IDE debugger.

Contributing

We welcome contributions! Feel free to:

Submit issues for bugs or feature requests
Fork the repo and submit pull requests
Improve documentation or add examples
Share your ideas and feedback

Star the repo if you find it useful!
