Fix: catch single-script IDN homograph links (#203) #215

Merged: twschiller merged 2 commits into main from fix/link-spoof-idn-homograph-203 on Jun 7, 2026
@twschiller

Contributor

Summary

Addresses red-team audit item #16 from #203 — "link-spoof-annotate single-script homograph and IDN visible text."

The previous rule caught only intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and ASCII-only domain mismatches. A fully-Cyrillic spoof like аррӏе.com slipped past both: there's no Latin letter to be adjacent to, and the ASCII-only DOMAIN_RE doesn't extract a Unicode candidate.

Changes

extension/src/lib/confusables.ts (new)

A curated subset of the Unicode TR39 confusables table (Cyrillic, Greek, Armenian → Latin). The full table has ~12k entries; the subset here covers the codepoints actually observed in URL-bar homograph attacks (Cyrillic а/е/о/р/с/у/х and their uppercase forms account for the bulk).

skeleton("аррӏе.com") === "apple.com"           // every letter Cyrillic
skeleton("Οmega.example") === "omega.example"   // Greek Ο
skeleton("президент.рф")  // still contains Cyrillic — non-confusable chars
                          // (п, з, и, д, н, т, ф) pass through

Confusables without a clear Latin target (Devanagari, Hebrew) are intentionally omitted: including them risks false positives on legitimate non-Latin text without any phishing-defense win.
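
As a rough illustration of the skeleton mapping described above, here is a minimal sketch. The `CONFUSABLES` table below is a hypothetical, heavily abridged stand-in (the actual contents of confusables.ts are not shown in this PR description), and mapping Greek Ο straight to lowercase "o" is an assumption chosen to match the example output above.

```typescript
// Hypothetical, abridged confusables table. The real confusables.ts is a
// larger curated subset of the TR39 data and may be shaped differently.
const CONFUSABLES: Record<string, string> = {
  "\u0430": "a", // Cyrillic а
  "\u0435": "e", // Cyrillic е
  "\u043e": "o", // Cyrillic о
  "\u0440": "p", // Cyrillic р
  "\u0441": "c", // Cyrillic с
  "\u0443": "y", // Cyrillic у
  "\u0445": "x", // Cyrillic х
  "\u04cf": "l", // Cyrillic ӏ (palochka)
  "\u039f": "o", // Greek Ο (assumption: mapped directly to lowercase)
};

// Replace each confusable character with its Latin target; everything
// else passes through unchanged. All keys in this sketch are BMP
// characters, so splitting by UTF-16 code unit is sufficient here.
function skeleton(text: string): string {
  return text
    .split("")
    .map((ch) => CONFUSABLES[ch] ?? ch)
    .join("");
}

// Fully Cyrillic spoof: every letter has a Latin look-alike.
const spoof = skeleton("\u0430\u0440\u0440\u04cf\u0435.com");

// Genuine Cyrillic domain: п, з, и, д, н, т, ф have no entry, so the
// result still contains Cyrillic and is not read as Latin.
const genuine = skeleton("\u043f\u0440\u0435\u0437\u0438\u0434\u0435\u043d\u0442.\u0440\u0444");
```

In this sketch `spoof` collapses to a pure-ASCII string while `genuine` keeps its non-confusable Cyrillic letters, which is the distinction the rule relies on.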

extension/src/rules/link-spoof-annotate.ts

  • DOMAIN_RE is now Unicode-aware. It uses \p{L}\p{N} for letter/digit runs with explicit lookbehind/lookahead anchors, because \b is ASCII-only even under /u and the boundary check has to be done manually.
  • New skeleton-based homograph trigger. When a visible domain candidate skeletons to a pure-ASCII Latin string and differs from the input, set homoglyphSkeleton so the chip can surface what Latin shape the domain mimics.
  • Punycode normalization before PSL comparison. Visible candidate runs through new URL("https://" + d + "/").hostname first so the registrable-domain comparison is apples-to-apples regardless of input form. A legitimate IDN link (visible Unicode ↔ xn-- href) collapses to the same RD on both sides and isn't flagged; an attacker-redirect (visible IDN ↔ unrelated ASCII href) still surfaces.
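
The Unicode-aware extraction in the first bullet might look roughly like the sketch below. The exact pattern used by DOMAIN_RE is not shown in this PR description, so the label grammar and the `extractDomains` helper are assumptions made for illustration only.

```typescript
// One DNS-style label: starts and ends with a letter or digit, with
// letters, digits, or hyphens in between. Built from a string so the
// pattern is compiled at runtime with the "u" flag.
const LABEL = "[\\p{L}\\p{N}](?:[\\p{L}\\p{N}-]*[\\p{L}\\p{N}])?";

// Hypothetical stand-in for DOMAIN_RE. Because \b only understands
// ASCII word characters, the boundaries are expressed as an explicit
// lookbehind and lookahead on Unicode letters and digits instead.
const DOMAIN_RE = new RegExp(
  "(?<![\\p{L}\\p{N}.-])" + // not preceded by a letter, digit, dot, or hyphen
    LABEL +
    "(?:\\." + LABEL + ")+" + // at least one dot-separated label after the first
    "(?![\\p{L}\\p{N}-])", // not followed by a letter, digit, or hyphen
  "gu",
);

// Hypothetical helper: return every domain-shaped candidate in the text.
function extractDomains(text: string): string[] {
  return text.match(DOMAIN_RE) ?? [];
}

const found = extractDomains("Visit \u0430\u0440\u0440\u04cf\u0435.com today");
```

A trailing sentence period is left out of the match because the final label must end on a letter or digit, while the lookahead still rejects candidates that run straight into more letters.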

Three coverage cases now handled, each independently:

| visible text | href | trigger |
| --- | --- | --- |
| аррӏе.com | https://evil.example/ | both skeleton + text/href mismatch |
| аррӏе.com | https://xn--80ak6aa92e.com/ | skeleton only (own-IDN attack) |
| президент.рф | https://xn--d1abbgf6aiiy.xn--p1ai/ | no flag (legitimate IDN) |
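
The way the two triggers combine across these cases can be sketched as follows. This is a simplified illustration, not the rule's code: `checkLink`, `hostOf`, and the abridged `CONFUSABLES` table are hypothetical names, and the registrable-domain comparison via the PSL is replaced by a plain hostname comparison with a leading "www." stripped.

```typescript
// Abridged, hypothetical confusables table (see lib/confusables.ts for
// the real one, which this PR description does not reproduce).
const CONFUSABLES: Record<string, string> = {
  "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
  "\u0441": "c", "\u0443": "y", "\u0445": "x", "\u04cf": "l",
};

function skeleton(text: string): string {
  return text.split("").map((ch) => CONFUSABLES[ch] ?? ch).join("");
}

// The URL parser converts Unicode hostnames to their punycode (xn--)
// form, so both sides of the comparison end up in the same shape.
function hostOf(url: string): string | null {
  try {
    return new URL(url).hostname;
  } catch {
    return null;
  }
}

// Assumption: stands in for the PSL-based registrable-domain lookup.
function stripWww(host: string): string {
  return host.replace(/^www\./, "");
}

interface SpoofCheck {
  homoglyphSkeleton: string | null; // Latin shape the text mimics, if any
  hostMismatch: boolean; // visible domain and href point at different hosts
}

function checkLink(visible: string, href: string): SpoofCheck {
  // Trigger 1: skeleton is pure-ASCII Latin and differs from the input.
  const sk = skeleton(visible);
  const homoglyphSkeleton =
    /^[a-z0-9.-]+$/.test(sk) && sk !== visible ? sk : null;

  // Trigger 2: normalized visible host differs from the href host.
  const visibleHost = hostOf("https://" + visible + "/");
  const hrefHost = hostOf(href);
  const hostMismatch =
    visibleHost !== null &&
    hrefHost !== null &&
    stripWww(visibleHost) !== stripWww(hrefHost);

  return { homoglyphSkeleton, hostMismatch };
}

// Legitimate IDN: visible Unicode text linking to its own xn-- form.
const legit = checkLink("b\u00fccher.de", "https://xn--bcher-kva.de/");
```

Keeping the two results separate mirrors the independence of the cases in the table: a spoof that links to its own punycode form sets only the skeleton field, while a redirect to an unrelated host sets both.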

Docs

docs/src/content/docs/rules.md updated from "two checks" to "three checks" to reflect the new skeleton trigger. Per repo convention the example phrasings stay abstract; the published doc doesn't enumerate the confusables table.

Test plan

  • node_modules/.bin/jest src/lib/__tests__/confusables.test.ts src/rules/__tests__/link-spoof-annotate.test.ts — 29/29 pass (6 confusables, 9 single-script homograph cases, 2 IDN text/href cases, 1 chip-content case)
  • node_modules/.bin/jest — full extension suite (1767 tests) passes
  • bun run check — biome + eslint clean
  • bun run typecheck — clean
  • bun run knip — clean
  • pre-commit run --files docs/src/content/docs/rules.md — mdformat + markdownlint clean

🤖 Generated with Claude Code

Addresses audit item #16. The previous link-spoof-annotate caught only
intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and
ASCII-only domain mismatches, so a fully-Cyrillic spoof like
"аррӏе.com" passed both checks: no Latin letter to be adjacent to, and
ASCII-only DOMAIN_RE didn't extract a candidate.

Two changes:

- New lib/confusables.ts: curated subset of Unicode TR39 confusables
  for Cyrillic/Greek/Armenian → Latin. `skeleton(text)` collapses each
  confusable codepoint to its Latin target so "аррӏе.com" becomes
  "apple.com" while non-confusable codepoints pass through (e.g.
  "президент.рф" still contains Cyrillic, so it isn't read as Latin).

- link-spoof-annotate.ts:
  - DOMAIN_RE now Unicode-aware (\p{L}\p{N} with explicit
    letter/digit lookbehind+lookahead — \b is ASCII-only even under /u).
  - New skeleton-based homograph trigger fires when a visible domain
    candidate skeletons to a pure-ASCII Latin string and differs from
    the input — catches single-script attacks the intra-word regex
    misses. Chip surfaces the Latin mimic ("аррӏе.com" mimics
    "apple.com").
  - Visible candidate is normalized to punycode via the URL parser
    before the PSL comparison, so a legitimate IDN link (visible
    Unicode ↔ xn-- href) doesn't surface as a mismatch, while
    attacker-redirect cases still do.

Docs: rules.md updated from "two checks" to "three checks" to reflect
the new skeleton trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@vercel

vercel Bot commented Jun 7, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| agent-browser-shield-demo-site | Ready | Preview, Comment | Jun 7, 2026 8:11pm |

confusables.property.test.ts pins:
- Idempotence: skeleton(skeleton(s)) === skeleton(s)
- Pure-ASCII Latin passthrough — guards against accidental Latin
  codepoints in the confusables map.
- Confusable-only inputs produce pure-ASCII Latin — the load-bearing
  invariant the rule's /^[a-z0-9.-]+$/ skeleton check relies on.
- Non-confusable codepoints pass through.
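
The four invariants above could be exercised along these lines. The commit message does not say which property-testing library the real tests use, so this sketch substitutes a plain loop with a small seeded generator; `CONFUSABLES`, `makeRng`, `randomString`, and `checkProperties` are all hypothetical names.

```typescript
// Abridged, hypothetical confusables table and skeleton function.
const CONFUSABLES: Record<string, string> = {
  "\u0430": "a", "\u0435": "e", "\u043e": "o", "\u0440": "p",
  "\u0441": "c", "\u0443": "y", "\u0445": "x", "\u04cf": "l",
};

function skeleton(text: string): string {
  return text.split("").map((ch) => CONFUSABLES[ch] ?? ch).join("");
}

// Park-Miller "minimal standard" generator, so a failure is reproducible
// from its seed. Returns values strictly between 0 and 1.
function makeRng(seed: number): () => number {
  let state = (seed % 2147483646) + 1;
  return () => {
    state = (state * 48271) % 2147483647;
    return state / 2147483647;
  };
}

function randomString(rng: () => number, alphabet: string[], maxLen: number): string {
  const len = 1 + Math.floor(rng() * maxLen);
  let out = "";
  for (let i = 0; i < len; i++) {
    out += alphabet[Math.floor(rng() * alphabet.length)];
  }
  return out;
}

const CONFUSABLE_CHARS = Object.keys(CONFUSABLES);
const ASCII_CHARS = "abcdefghijklmnopqrstuvwxyz0123456789.-".split("");
// п з и д н т ф: Cyrillic letters with no entry in the table above.
const PLAIN_CYRILLIC = "\u043f\u0437\u0438\u0434\u043d\u0442\u0444".split("");
const MIXED = CONFUSABLE_CHARS.concat(ASCII_CHARS, PLAIN_CYRILLIC);

// Run each property on `runs` random inputs; return descriptions of failures.
function checkProperties(runs: number, seed: number): string[] {
  const rng = makeRng(seed);
  const failures: string[] = [];
  for (let i = 0; i < runs; i++) {
    const mixed = randomString(rng, MIXED, 20);
    if (skeleton(skeleton(mixed)) !== skeleton(mixed)) failures.push("idempotence: " + mixed);

    const ascii = randomString(rng, ASCII_CHARS, 20);
    if (skeleton(ascii) !== ascii) failures.push("ascii passthrough: " + ascii);

    const confusable = randomString(rng, CONFUSABLE_CHARS, 20);
    if (!/^[a-z0-9.-]+$/.test(skeleton(confusable))) failures.push("pure ascii: " + confusable);

    const plain = randomString(rng, PLAIN_CYRILLIC, 20);
    if (skeleton(plain) !== plain) failures.push("non-confusable passthrough: " + plain);
  }
  return failures;
}

const failures = checkProperties(500, 42);
```

An empty `failures` array means all four invariants held for every generated input; a dedicated property-testing library would add shrinking of failing inputs, which this loop does not attempt.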

link-spoof-annotate.property.test.ts pins:
- Skeleton trigger never fires on pure-ASCII visible text.
- Same-host text/href never flags (apex form and www-prefixed href).
- Cross-form IDN equivalence: small fixture set of legitimate IDN
  domains (президент.рф, bücher.de, mañana.es, 香港.hk) never trigger
  the text/href mismatch branch when href is the punycode form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@twschiller twschiller merged commit 633a262 into main Jun 7, 2026
7 checks passed
@twschiller twschiller deleted the fix/link-spoof-idn-homograph-203 branch June 7, 2026 20:14