Fix: catch single-script IDN homograph links (#203) #215
Merged
Addresses audit item #16. The previous link-spoof-annotate rule caught only intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and ASCII-only domain mismatches, so a fully-Cyrillic spoof like "аррӏе.com" passed both checks: there was no Latin letter to be adjacent to, and the ASCII-only DOMAIN_RE didn't extract a candidate.

Two changes:

- New lib/confusables.ts: curated subset of Unicode TR39 confusables for Cyrillic/Greek/Armenian → Latin. `skeleton(text)` collapses each confusable codepoint to its Latin target, so "аррӏе.com" becomes "apple.com" while non-confusable codepoints pass through (e.g. "президент.рф" still contains Cyrillic, so it isn't read as Latin).
- link-spoof-annotate.ts:
  - DOMAIN_RE is now Unicode-aware (\p{L}\p{N} with explicit letter/digit lookbehind and lookahead, because \b is ASCII-only even under /u).
  - New skeleton-based homograph trigger fires when a visible domain candidate skeletons to a pure-ASCII Latin string that differs from the input. This catches single-script attacks the intra-word regex misses. The chip surfaces the Latin mimic ("аррӏе.com" mimics "apple.com").
  - The visible candidate is normalized to punycode via the URL parser before the PSL comparison, so a legitimate IDN link (visible Unicode ↔ xn-- href) doesn't surface as a mismatch, while attacker-redirect cases still do.

Docs: rules.md updated from "two checks" to "three checks" to reflect the new skeleton trigger.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
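The boundary problem behind the DOMAIN_RE change can be illustrated with a small sketch. The pattern and the helper names (`UNICODE_DOMAIN_RE`, `ASCII_DOMAIN_RE`, `extractDomains`, `extractAsciiDomains`) are assumptions made for illustration; the rule's actual regex is not shown in this PR description and may differ.

```typescript
// Hypothetical patterns for illustration only; not the rule's real DOMAIN_RE.

// ASCII-only shape: \b and [a-z0-9-] know nothing about Cyrillic letters.
const ASCII_DOMAIN_RE = /\b[a-z0-9-]+(?:\.[a-z0-9-]+)+\b/gi;

// Unicode-aware shape: labels are runs of letters/digits (optionally joined by
// hyphens), and the word boundary is replaced by an explicit lookbehind and
// lookahead that reject an adjacent letter or digit in any script.
const UNICODE_DOMAIN_RE =
  /(?<![\p{L}\p{N}])[\p{L}\p{N}]+(?:-+[\p{L}\p{N}]+)*(?:\.[\p{L}\p{N}]+(?:-+[\p{L}\p{N}]+)*)+(?![\p{L}\p{N}])/gu;

function extractAsciiDomains(text: string): string[] {
  return text.match(ASCII_DOMAIN_RE) ?? [];
}

function extractDomains(text: string): string[] {
  return text.match(UNICODE_DOMAIN_RE) ?? [];
}

// "аррӏе.com" written with escapes so the codepoints are unambiguous:
// U+0430 U+0440 U+0440 U+04CF U+0435, all Cyrillic.
const spoofDomain = "\u0430\u0440\u0440\u04CF\u0435.com";
const sampleText = "Visit " + spoofDomain + " today";

// \b sees Cyrillic "е" (U+0435) as a non-word character, so it reports a
// boundary in the middle of what a reader perceives as one word.
const boundaryInsideWord = /\bcom/u.test("\u0435com");

console.log(extractAsciiDomains(sampleText)); // no candidate extracted
console.log(extractDomains(sampleText)); // one candidate: the full spoofed domain
console.log(boundaryInsideWord); // true
```

The explicit lookarounds cost a little readability, but they make the boundary definition script-independent, which is what a single-script spoof requires.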
confusables.property.test.ts pins:

- Idempotence: skeleton(skeleton(s)) === skeleton(s)
- Pure-ASCII Latin passthrough, which guards against accidental Latin codepoints in the confusables map.
- Confusable-only inputs produce pure-ASCII Latin. This is the load-bearing invariant that the rule's /^[a-z0-9.-]+$/ skeleton check relies on.
- Non-confusable codepoints pass through.

link-spoof-annotate.property.test.ts pins:

- Skeleton trigger never fires on pure-ASCII visible text.
- Same-host text/href never flags (apex form and www-prefixed href).
- Cross-form IDN equivalence: a small fixture set of legitimate IDN domains (президент.рф, bücher.de, mañana.es, 香港.hk) never triggers the text/href mismatch branch when href is the punycode form.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
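The cross-form IDN equivalence property rests on the URL parser converting Unicode hostnames to their punycode form. A minimal sketch of that idea follows, assuming hypothetical helpers `toAsciiHost` and `sameHost`. It compares hostnames after stripping a leading `www.` as a simplified stand-in for the rule's PSL-based registrable-domain comparison, which is not reproduced here.

```typescript
// Hypothetical helpers for illustration; the rule's real comparison uses the
// Public Suffix List, which this sketch replaces with a plain hostname check.

// Normalize a bare domain to the parser's ASCII (punycode) hostname.
function toAsciiHost(domain: string): string | null {
  try {
    return new URL("https://" + domain + "/").hostname;
  } catch {
    return null; // not parseable as a hostname
  }
}

// Extract the hostname from a full href, or null if it does not parse.
function hrefHost(href: string): string | null {
  try {
    return new URL(href).hostname;
  } catch {
    return null;
  }
}

// Stand-in for the registrable-domain comparison: drop a leading "www."
// and compare the ASCII forms.
function sameHost(visibleDomain: string, href: string): boolean {
  const visible = toAsciiHost(visibleDomain);
  const target = hrefHost(href);
  if (visible === null || target === null) return false;
  const stripWww = (host: string): string => host.replace(/^www\./, "");
  return stripWww(visible) === stripWww(target);
}

// Legitimate IDN link: visible Unicode text, punycode href.
console.log(sameHost("президент.рф", "https://xn--d1abbgf6aiiy.xn--p1ai/"));
// Attacker redirect: visible IDN text, unrelated ASCII href.
console.log(sameHost("президент.рф", "https://evil.example/"));
```

Normalizing only the visible side is enough here because the URL parser already yields the ASCII hostname for an href, whether the href was written in Unicode or in `xn--` form.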
Summary
Addresses red-team audit item #16 from #203: "`link-spoof-annotate` single-script homograph and IDN visible text."

The previous rule caught only intra-word mixed-script homoglyphs (Latin + non-Latin adjacent) and ASCII-only domain mismatches. A fully-Cyrillic spoof like `аррӏе.com` slipped past both: there's no Latin letter to be adjacent to, and the ASCII-only `DOMAIN_RE` doesn't extract a Unicode candidate.

Changes
`extension/src/lib/confusables.ts` (new)

A curated subset of the Unicode TR39 confusables table (Cyrillic, Greek, Armenian → Latin). The full table has ~12k entries; the subset here covers the codepoints actually observed in URL-bar homograph attacks (Cyrillic `а/е/о/р/с/у/х` and their uppercase forms account for the bulk).

Confusables without a clear Latin target (Devanagari, Hebrew) are intentionally omitted: including them risks false positives on legitimate non-Latin text without any phishing-defense win.
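The skeleton collapse can be sketched as follows. The map is a tiny hand-picked illustration, not the PR's curated table, and `CONFUSABLES`, `skeleton`, and `latinMimic` are written here as assumptions about shape only. The palochka entry (U+04CF → `l`) is chosen so that the PR's `аррӏе.com` example yields `apple.com`. The pure-ASCII pattern is the `/^[a-z0-9.-]+$/` check mentioned in the property-test notes.

```typescript
// Illustrative subset only; the real table in confusables.ts is larger.
const CONFUSABLES: ReadonlyMap<string, string> = new Map([
  ["\u0430", "a"], // Cyrillic а
  ["\u0435", "e"], // Cyrillic е
  ["\u043E", "o"], // Cyrillic о
  ["\u0440", "p"], // Cyrillic р
  ["\u0441", "c"], // Cyrillic с
  ["\u0443", "y"], // Cyrillic у
  ["\u0445", "x"], // Cyrillic х
  ["\u04CF", "l"], // Cyrillic palochka ӏ (assumed target, per the PR's example)
  ["\u03BF", "o"], // Greek omicron ο
  ["\u0585", "o"], // Armenian օ
]);

// Replace each confusable codepoint with its Latin target; everything else
// passes through unchanged.
function skeleton(text: string): string {
  let out = "";
  for (const ch of text) {
    out += CONFUSABLES.get(ch) ?? ch;
  }
  return out;
}

// Hypothetical trigger: report the Latin mimic when the skeleton is pure
// ASCII and differs from the (lowercased) input; otherwise return null.
function latinMimic(domain: string): string | null {
  const lower = domain.toLowerCase();
  const skel = skeleton(lower);
  return skel !== lower && /^[a-z0-9.-]+$/.test(skel) ? skel : null;
}

// "аррӏе.com" spelled with escapes so the codepoints are unambiguous.
const cyrillicSpoof = "\u0430\u0440\u0440\u04CF\u0435.com";

console.log(latinMimic(cyrillicSpoof)); // "apple.com"
console.log(latinMimic("apple.com")); // null: already ASCII, nothing changed
console.log(latinMimic("президент.рф")); // null: skeleton still contains Cyrillic
```

Because every mapped target is plain ASCII and no ASCII character appears as a key, applying `skeleton` twice gives the same result as applying it once.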
`extension/src/rules/link-spoof-annotate.ts`

- `DOMAIN_RE` is now Unicode-aware: `\p{L}\p{N}` for letter/digit runs with explicit lookbehind/lookahead anchors. `\b` is ASCII-only even under `/u`, so the boundary check has to be done manually.
- A new skeleton-based homograph trigger exposes `homoglyphSkeleton` so the chip can surface what Latin shape the domain mimics.
- The visible candidate is normalized via `new URL("https://" + d + "/").hostname` first so the registrable-domain comparison is apples-to-apples regardless of input form. A legitimate IDN link (visible Unicode ↔ `xn--` href) collapses to the same registrable domain on both sides and isn't flagged; an attacker-redirect (visible IDN ↔ unrelated ASCII href) still surfaces.

Three coverage cases now handled, each independently:

| Visible text | href |
| --- | --- |
| `аррӏе.com` | `https://evil.example/` |
| `аррӏе.com` | `https://xn--80ak6aa92e.com/` |
| `президент.рф` | `https://xn--d1abbgf6aiiy.xn--p1ai/` |

Docs
`docs/src/content/docs/rules.md` updated from "two checks" to "three checks" to reflect the new skeleton trigger. Per repo convention the example phrasings stay abstract; the published doc doesn't enumerate the confusables table.

Test plan
- `node_modules/.bin/jest src/lib/__tests__/confusables.test.ts src/rules/__tests__/link-spoof-annotate.test.ts`: 29/29 pass (6 confusables, 9 single-script homograph cases, 2 IDN text/href cases, 1 chip-content case)
- `node_modules/.bin/jest`: full extension suite (1767 tests) passes
- `bun run check`: biome + eslint clean
- `bun run typecheck`: clean
- `bun run knip`: clean
- `pre-commit run --files docs/src/content/docs/rules.md`: mdformat + markdownlint clean

🤖 Generated with Claude Code