Fix: catch single-script IDN homograph links (#203) by twschiller · Pull Request #215 · pixiebrix/agent-browser-shield

twschiller · 2026-06-07T20:06:31Z

Summary

Addresses red-team audit item #16 from #203 — "link-spoof-annotate single-script homograph and IDN visible text."

The previous rule caught only intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and ASCII-only domain mismatches. A fully-Cyrillic spoof like аррӏе.com slipped past both: there's no Latin letter to be adjacent to, and the ASCII-only DOMAIN_RE doesn't extract a Unicode candidate.

Changes

`extension/src/lib/confusables.ts` (new)

A curated subset of the Unicode TR39 confusables table (Cyrillic, Greek, Armenian → Latin). The full table has ~12k entries; the subset here covers the codepoints actually observed in URL-bar homograph attacks (Cyrillic а/е/о/р/с/у/х and their uppercase forms account for the bulk).

skeleton("аррӏе.com") === "apple.com"           // every letter Cyrillic
skeleton("Οmega.example") === "omega.example"   // Greek Ο
skeleton("президент.рф")  // still contains Cyrillic — non-confusable chars
                          // (п, з, и, д, н, т, ф) pass through

Confusables without a clear Latin target (Devanagari, Hebrew) are intentionally omitted: including them risks false positives on legitimate non-Latin text without any phishing-defense win.
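
To make the pass-through behaviour concrete, here is a minimal sketch of how a skeleton function of this kind could work. The nine-entry map, the lowercasing step, and the function body are assumptions for illustration only; the actual table and implementation in `confusables.ts` are not reproduced on this page.

```typescript
// Hypothetical, heavily reduced confusables map (assumption: the real table is larger).
const CONFUSABLES: ReadonlyMap<string, string> = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u0435", "e"], // Cyrillic е
  ["\u043E", "o"], // Cyrillic о
  ["\u0440", "p"], // Cyrillic р
  ["\u0441", "c"], // Cyrillic с
  ["\u0443", "y"], // Cyrillic у
  ["\u0445", "x"], // Cyrillic х
  ["\u04CF", "l"], // Cyrillic palochka ӏ
  ["\u03BF", "o"], // Greek ο
]);

// Replace each confusable codepoint with its Latin target; leave everything else alone.
// Assumption: input is lowercased first so uppercase forms share the same entries.
function skeleton(text: string): string {
  let out = "";
  for (const ch of text.toLowerCase()) {
    out += CONFUSABLES.get(ch) ?? ch;
  }
  return out;
}

const allCyrillic = skeleton("\u0430\u0440\u0440\u04CF\u0435.com"); // "apple.com"
const greekInitial = skeleton("\u039Fmega.example"); // "omega.example"
const legitimate = skeleton("президент.рф"); // still contains п, з, и, д, н, т, ф
```

Because non-confusable codepoints survive, a later pure-ASCII check on the result can separate "reads as Latin" from "genuinely non-Latin" text.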

`extension/src/rules/link-spoof-annotate.ts`

- DOMAIN_RE is now Unicode-aware. It uses \p{L}\p{N} for letter/digit runs with explicit lookbehind/lookahead anchors, because \b is ASCII-only even under /u, so the boundary check has to be done manually.
- New skeleton-based homograph trigger. When a visible domain candidate skeletons to a pure-ASCII Latin string that differs from the input, the rule sets homoglyphSkeleton so the chip can surface which Latin shape the domain mimics.
- Punycode normalization before PSL comparison. The visible candidate runs through new URL("https://" + d + "/").hostname first, so the registrable-domain (RD) comparison is apples-to-apples regardless of input form. A legitimate IDN link (visible Unicode ↔ xn-- href) collapses to the same RD on both sides and isn't flagged; an attacker redirect (visible IDN ↔ unrelated ASCII href) still surfaces.
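
The manual boundary check can be illustrated with a short sketch. The pattern below is not the PR's actual DOMAIN_RE; `DOMAIN_SKETCH_RE` and `extractDomainCandidates` are hypothetical names, and the label grammar is a simplified assumption.

```typescript
// A single letter-or-digit character in any script.
const LD = String.raw`[\p{L}\p{N}]`;
// One label: starts and ends with a letter/digit, hyphens allowed inside.
const LABEL = `${LD}(?:(?:${LD}|-)*${LD})?`;

// Hypothetical pattern: two or more dot-separated labels, with lookbehind and
// lookahead standing in for \b so a match cannot start or end mid-word.
const DOMAIN_SKETCH_RE = new RegExp(
  `(?<!${LD})${LABEL}(?:\\.${LABEL})+(?!${LD})`,
  "gu",
);

function extractDomainCandidates(text: string): string[] {
  return text.match(DOMAIN_SKETCH_RE) ?? [];
}

// \b only sees ASCII word characters, so it finds no boundary around Cyrillic text.
const asciiBoundaryFindsTld = new RegExp("\\bрф\\b", "u").test("президент.рф"); // false

const candidates = extractDomainCandidates("Visit \u0430\u0440\u0440\u04CF\u0435.com today");
// candidates holds the single all-Cyrillic domain
```

Building the pattern from string parts keeps the letter/digit class in one place, so the lookarounds and the label body cannot drift apart.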

Three coverage cases now handled, each independently:

| visible text | href | trigger |
| --- | --- | --- |
| `аррӏе.com` | `https://evil.example/` | both skeleton + text/href mismatch |
| `аррӏе.com` | `https://xn--80ak6aa92e.com/` | skeleton only (own-IDN attack) |
| `президент.рф` | `https://xn--d1abbgf6aiiy.xn--p1ai/` | no flag (legitimate IDN) |
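
The three rows can be reproduced with a rough sketch of the two independent checks. Everything here is an assumption for illustration: the four-entry map, the `classify` and `toAsciiHost` helpers, and the `Verdict` shape are hypothetical, and whole hostnames are compared where the PR compares registrable domains via the PSL.

```typescript
// Hypothetical four-entry map; just enough for the example rows.
const CONFUSABLES: ReadonlyMap<string, string> = new Map([
  ["\u0430", "a"], ["\u0440", "p"], ["\u04CF", "l"], ["\u0435", "e"],
]);

function skeleton(text: string): string {
  return Array.from(text.toLowerCase(), (ch) => CONFUSABLES.get(ch) ?? ch).join("");
}

// Normalize a visible domain to its ASCII (punycode) hostname via the URL parser.
function toAsciiHost(domain: string): string | null {
  try {
    return new URL("https://" + domain + "/").hostname;
  } catch {
    return null; // not parseable as a hostname
  }
}

interface Verdict {
  homoglyphSkeleton: string | null; // Latin shape the text mimics, if any
  hrefMismatch: boolean; // visible host differs from href host
}

function classify(visible: string, href: string): Verdict {
  const skel = skeleton(visible);
  // Skeleton check: the text changed under skeleton() and now reads as pure ASCII.
  const mimicsLatin = skel !== visible.toLowerCase() && /^[a-z0-9.-]+$/.test(skel);

  // Mismatch check: compare both sides in punycode form.
  // Simplification: whole hostnames, not PSL registrable domains.
  const visibleHost = toAsciiHost(visible);
  const hrefHost = new URL(href).hostname;
  const hrefMismatch = visibleHost !== null && visibleHost !== hrefHost;

  return { homoglyphSkeleton: mimicsLatin ? skel : null, hrefMismatch };
}

const spoof = "\u0430\u0440\u0440\u04CF\u0435.com";
const redirect = classify(spoof, "https://evil.example/"); // skeleton + mismatch
const ownIdn = classify(spoof, "https://xn--80ak6aa92e.com/"); // skeleton only
const legit = classify("президент.рф", "https://xn--d1abbgf6aiiy.xn--p1ai/"); // no flag
```

Keeping the two results as separate fields mirrors the "each independently" wording: either check can fire without the other.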

Docs

docs/src/content/docs/rules.md updated from "two checks" to "three checks" to reflect the new skeleton trigger. Per repo convention the example phrasings stay abstract; the published doc doesn't enumerate the confusables table.

Test plan

- `node_modules/.bin/jest src/lib/__tests__/confusables.test.ts src/rules/__tests__/link-spoof-annotate.test.ts`: 29/29 pass (including 6 confusables, 9 single-script homograph cases, 2 IDN text/href cases, 1 chip-content case)
- `node_modules/.bin/jest`: full extension suite (1767 tests) passes
- `bun run check`: biome + eslint clean
- `bun run typecheck`: clean
- `bun run knip`: clean
- `pre-commit run --files docs/src/content/docs/rules.md`: mdformat + markdownlint clean

🤖 Generated with Claude Code

Addresses audit item #16. The previous link-spoof-annotate caught only intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and ASCII-only domain mismatches, so a fully-Cyrillic spoof like "аррӏе.com" passed both checks: no Latin letter to be adjacent to, and ASCII-only DOMAIN_RE didn't extract a candidate.

Two changes:

- New lib/confusables.ts: curated subset of Unicode TR39 confusables for Cyrillic/Greek/Armenian → Latin. `skeleton(text)` collapses each confusable codepoint to its Latin target so "аррӏе.com" becomes "apple.com" while non-confusable codepoints pass through (e.g. "президент.рф" still contains Cyrillic, so it isn't read as Latin).
- link-spoof-annotate.ts:
  - DOMAIN_RE now Unicode-aware (\p{L}\p{N} with explicit letter/digit lookbehind+lookahead; \b is ASCII-only even under /u).
  - New skeleton-based homograph trigger fires when a visible domain candidate skeletons to a pure-ASCII Latin string and differs from the input. This catches single-script attacks the intra-word regex misses. Chip surfaces the Latin mimic ("аррӏе.com" mimics "apple.com").
  - Visible candidate is normalized to punycode via the URL parser before the PSL comparison, so a legitimate IDN link (visible Unicode ↔ xn-- href) doesn't surface as a mismatch, while attacker-redirect cases still do.

Docs: rules.md updated from "two checks" to "three checks" to reflect the new skeleton trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-06-07T20:06:37Z

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agent-browser-shield-demo-site | Ready | Preview, Comment | Jun 7, 2026 8:11pm |

confusables.property.test.ts pins:

- Idempotence: skeleton(skeleton(s)) === skeleton(s)
- Pure-ASCII Latin passthrough: guards against accidental Latin codepoints in the confusables map.
- Confusable-only inputs produce pure-ASCII Latin: the load-bearing invariant the rule's /^[a-z0-9.-]+$/ skeleton check relies on.
- Non-confusable codepoints pass through.

link-spoof-annotate.property.test.ts pins:

- Skeleton trigger never fires on pure-ASCII visible text.
- Same-host text/href never flags (apex form and www-prefixed href).
- Cross-form IDN equivalence: small fixture set of legitimate IDN domains (президент.рф, bücher.de, mañana.es, 香港.hk) never trigger the text/href mismatch branch when href is the punycode form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
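
The four skeleton invariants can be checked without any test library by enumerating every short string over a small alphabet. This is a sketch under stated assumptions: the map, `stringsOver`, and `holdsForAll` are hypothetical, and the PR's property tests may be written quite differently (for example with a property-testing library and random inputs).

```typescript
// Hypothetical four-entry map standing in for the real confusables table.
const CONFUSABLES: ReadonlyMap<string, string> = new Map([
  ["\u0430", "a"], ["\u0435", "e"], ["\u043E", "o"], ["\u0440", "p"],
]);

function skeleton(text: string): string {
  return Array.from(text.toLowerCase(), (ch) => CONFUSABLES.get(ch) ?? ch).join("");
}

// Every string of length 1..maxLen over the given alphabet.
function stringsOver(alphabet: readonly string[], maxLen: number): string[] {
  const all: string[] = [];
  let frontier: string[] = [""];
  for (let len = 1; len <= maxLen; len++) {
    const next: string[] = [];
    for (const prefix of frontier) {
      for (const ch of alphabet) {
        next.push(prefix + ch);
      }
    }
    all.push(...next);
    frontier = next;
  }
  return all;
}

function holdsForAll(inputs: readonly string[], property: (s: string) => boolean): boolean {
  return inputs.every(property);
}

const confusableOnly = Array.from(CONFUSABLES.keys());
const ascii = ["a", "z", "0", ".", "-"];
const nonConfusable = ["\u043F", "\u0444"]; // п, ф
const mixed = [...confusableOnly, ...ascii, ...nonConfusable];

const idempotent = holdsForAll(stringsOver(mixed, 3), (s) => skeleton(skeleton(s)) === skeleton(s));
const asciiPassthrough = holdsForAll(stringsOver(ascii, 3), (s) => skeleton(s) === s);
const confusableToAscii = holdsForAll(stringsOver(confusableOnly, 3), (s) => /^[a-z0-9.-]+$/.test(skeleton(s)));
const nonConfusablePassthrough = holdsForAll(stringsOver(nonConfusable, 3), (s) => skeleton(s) === s);
```

Exhaustive enumeration over a tiny alphabet is deterministic and easy to debug, at the cost of covering far fewer codepoints than randomized generation would.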

unblocked Bot approved these changes Jun 7, 2026

vercel Bot deployed to Preview June 7, 2026 20:11 View deployment

twschiller merged commit 633a262 into main Jun 7, 2026
7 checks passed

twschiller deleted the fix/link-spoof-idn-homograph-203 branch June 7, 2026 20:14

twschiller merged 2 commits into main from fix/link-spoof-idn-homograph-203
