seconv: VobSub OCR + --time-codes-only for image-based subtitles #11629

Merged
niksedk merged 1 commit into main from feature/seconv-vobsub-ocr-and-timecodes-only
Jun 15, 2026

@niksedk niksedk commented Jun 15, 2026

Addresses the request in #10068: a way to produce a timing-only output file from image-based subtitles without running full OCR.

--time-codes-only

Extracts time codes from image-based sources into any text format, skipping OCR entirely — each entry keeps its timing with empty text, and no OCR engine is created, so it works without Tesseract/Paddle/nOCR/etc. installed.

seconv movie.sup subrip --time-codes-only

1
00:00:01,000 --> 00:00:03,500

2
00:00:04,000 --> 00:00:06,200

The empty-text output re-opens cleanly in Subtitle Edit itself (verified against the actual SubRip and AdvancedSubStationAlpha parsers — both detect the format and reload all cues with timing intact). A few stricter third-party players may drop empty cues; switching the placeholder to e.g. - would be a one-line change if that's ever wanted.
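
For readers who want to see the shape of that output outside seconv, here is a minimal Python sketch that builds the same timing-only SubRip text from a list of cue times. It is not the seconv implementation (which is C#); the function names and the millisecond-pair input are assumptions made for illustration.

```python
def format_srt_time(total_ms: int) -> str:
    """Format a millisecond count as an SRT timestamp (HH:MM:SS,mmm)."""
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def timing_only_srt(cues):
    """Build SRT text from (start_ms, end_ms) pairs, leaving every cue's text empty.

    Hypothetical helper: each block is the cue index plus the timing line,
    and blocks are separated by a blank line, mirroring the sample output.
    """
    blocks = []
    for index, (start_ms, end_ms) in enumerate(cues, start=1):
        timing = f"{format_srt_time(start_ms)} --> {format_srt_time(end_ms)}"
        blocks.append(f"{index}\n{timing}\n")
    return "\n".join(blocks)


if __name__ == "__main__":
    # Same two cues as the sample above.
    print(timing_only_srt([(1000, 3500), (4000, 6200)]), end="")
```

Swapping the empty text for a placeholder such as - would only mean appending one more line to each block in this sketch.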

VobSub wired into the text/OCR pipeline

VobSub previously errored with "use the Subtitle Edit UI for now". It's now supported in seconv for both full OCR and --time-codes-only, reusing the existing VobSub bitmap decoder:

  • .sub + .idx pairs (text target)
  • VobSub-in-MKV (S_VOBSUB)
  • VobSub-in-MP4 (handler subp)

seconv movie.sub subrip --ocr-engine:tesseract --ocr-language:eng   # .idx auto-detected
seconv movie.mkv subrip --time-codes-only                          # PGS + VobSub tracks, no OCR

.sub routing fix

A binary VobSub .sub with no .idx companion is now detected via its MPEG pack header (00 00 01 BA) and read directly, emitting a note rather than failing or being misparsed as MicroDVD. This works because VobSubParser.OpenSubIdx already falls back to the stream's own PTS timing with a default palette. A genuine text MicroDVD .sub still routes to the text loader. (Without the .idx, colors use a default palette, so OCR accuracy may be slightly lower; timing is accurate and --time-codes-only is unaffected.)
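
The routing decision can be sketched in a few lines of Python. This is an illustration only, not the C# code in this PR: it assumes the pack start code sits at offset 0 of the file, and the function names and the "vobsub"/"text" labels are hypothetical.

```python
# MPEG program stream pack start code, as quoted above: 00 00 01 BA.
MPEG_PACK_START = bytes([0x00, 0x00, 0x01, 0xBA])


def looks_like_binary_vobsub(head: bytes) -> bool:
    """Return True when the data begins with the MPEG pack start code.

    Assumption: the check is done at offset 0 of the .sub file.
    """
    return head.startswith(MPEG_PACK_START)


def route_sub(head: bytes) -> str:
    """Pick a loader label for a .sub file that has no .idx companion."""
    return "vobsub" if looks_like_binary_vobsub(head) else "text"


if __name__ == "__main__":
    print(route_sub(b"\x00\x00\x01\xba\x44\x00"))  # binary VobSub stream
    print(route_sub(b"{0}{25}Hello"))              # MicroDVD-style text line
```

Only the first four bytes are needed, so a caller would read a small prefix of the file rather than loading it whole.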

Tests

  • TimeCodesOnlyTest: .sup → SRT with timing and no recognised text, no OCR engine needed.
  • ContainerLoaderTest: replaced the old (CI-skipped) "OCRs PGS and skips VobSub" test with a deterministic --time-codes-only test proving both the PGS and VobSub tracks in container_image.mkv now convert.
  • VobSubRoutingTest: binary-vs-text .sub detection, and a MicroDVD .sub (no .idx) still converting as text.

163 tests pass, 0 skipped.

Note (out of scope here)

While testing I found a pre-existing latent bug: with --overwrite, two same-language tracks in one container resolve to the same output filename and the second silently clobbers the first (the track-number disambiguation only runs when !Overwrite). It affects text tracks too and is now easier to hit since VobSub tracks are no longer skipped. Happy to fix in a follow-up — the clean fix is to track output paths written within a single run and disambiguate even under --overwrite.
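
A rough Python sketch of what that follow-up could look like, assuming a per-run set of already-written paths. The helper name and the _2, _3 suffix scheme are hypothetical; the real fix would presumably reuse seconv's existing track-number disambiguation.

```python
from pathlib import PurePosixPath


def unique_output_path(candidate: str, written: set) -> str:
    """Return a path not yet written in this run, and record it.

    'written' holds every output path produced so far in the current run,
    so a second same-language track gets a new name even when overwriting
    pre-existing files is allowed.
    """
    path = PurePosixPath(candidate)
    result = candidate
    counter = 2
    while result in written:
        # Hypothetical naming scheme: insert _2, _3, ... before the extension.
        result = str(path.with_name(f"{path.stem}_{counter}{path.suffix}"))
        counter += 1
    written.add(result)
    return result


if __name__ == "__main__":
    written_this_run = set()
    for _ in range(3):
        print(unique_output_path("movie.eng.srt", written_this_run))
```

Because the set only covers the current run, files left over from earlier runs are still overwritten as --overwrite intends.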

🤖 Generated with Claude Code

Add a --time-codes-only flag to seconv that extracts time codes from
image-based subtitles into a text format without OCR: each entry keeps
its timing with empty text and no OCR engine is created, so it works
without Tesseract/Paddle/etc. installed. Verified that SE re-opens the
resulting empty-text SRT/ASSA files (timing preserved).

Wire VobSub into the text/OCR pipeline (previously "use the UI"):
- .sub + .idx pairs (text target)
- VobSub-in-MKV (S_VOBSUB)
- VobSub-in-MP4 (handler subp)
Both full OCR and --time-codes-only are supported for all of these,
reusing the existing VobSub bitmap decoder.

Fix .sub routing: a binary VobSub .sub with no .idx companion is now
detected (MPEG pack header) and read directly (stream PTS timing +
default palette, with a note) instead of falling through to the
MicroDVD text loader; a genuine text MicroDVD .sub still routes to the
text loader.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@niksedk niksedk merged commit 770c662 into main Jun 15, 2026
1 of 3 checks passed
@niksedk niksedk deleted the feature/seconv-vobsub-ocr-and-timecodes-only branch June 15, 2026 02:54