chore: add internal markdown link check#21831
chore: add internal markdown link check#21831Geethapranay1 wants to merge 5 commits intoapache:mainfrom
Conversation
|
@comphead PTAL |
|
|
||
| Run the internal markdown link check locally: | ||
|
|
||
| ```shell |
There was a problem hiding this comment.
for local we need to document mapfile is needed
There was a problem hiding this comment.
and also mapfile is not available on macOS
comphead
left a comment
There was a problem hiding this comment.
Thanks @Geethapranay1 for the PR
I tried to check it locally but with macOS I'm missing some bash commands. Would you help to attach to the PR how bad link would look like.
I'm assuming it should be very clear to the user what link needs to be fixed
|
Thanks @comphead for the detailed review and for testing this on macOS. I will:
|
|
@comphead PTAL, i have changed according to the review |
|
|
||
| ```text | ||
| [docs/source/user-guide/cli/overview.md]: | ||
| [ERROR] file:///.../docs/source/user-guide/cli/missing-page.md | Cannot find file: File not found. Check if file exists and path is correct |
There was a problem hiding this comment.
oh, lychee doesn't refer to a specific line in the md file?
There was a problem hiding this comment.
This is some sample file name, but when there is link broken it will not show line number, but exactly file name and link broken line, but not number.
comphead
left a comment
There was a problem hiding this comment.
Thanks @Geethapranay1 this pr makes a lot of sense
| - name: Install lychee | ||
| run: | | ||
| source ci/scripts/utils/tool_versions.sh | ||
| cargo install lychee --locked --version "${LYCHEE_VERSION}" |
There was a problem hiding this comment.
Looks like it takes 3m (on a beefy machine) to install lychee
https://github.com/apache/datafusion/actions/runs/24900880747/job/72917950438?pr=21831
Maybe we can install a debug version or somehting to speed up the process
There was a problem hiding this comment.
good point @alamb on the installation time. Instead of building a debug version from source, I can update the action to just download the pre-compiled lychee release binary for linux. That should bring the installation step down from 3 minutes to just a couple of seconds! I'll push an update for this shortly.
| "datafusion/core/benches/tpch-csv", | ||
| ] | ||
|
|
||
| exclude = [ |
There was a problem hiding this comment.
why not check http and https links?
There was a problem hiding this comment.
i already had talk with @comphead in the #21747 (comment) i excluded external links because they frequently cause flaky CI runs. Things like temporary network hiccups, third-party site downtime, or github API rate limits (429 Too Many Requests) can lead to false positives and unnecessarily block PRs.
anyway, if you prefer stricter validation and don't mind the occasional flakiness, I'm happy to remove these exclusions, we can always just ignore specific flaky domains as they pop up. Let me know how you'd like to proceed.
|
@alamb PTAL, I updated the workflow to use taiki-e/install-action. It now downloads the pre-built binary instead of compiling it from source, so the installation drops from 3 minutes to a few seconds! https://github.com/apache/datafusion/actions/runs/24936080575 |
Which issue does this PR close?
Rationale for this change
datafusion did not have a CI check for broken links in markdown content, docs workflows build and deploy docs, and dev checks formatting and spelling, but none of them validate link targets.
This pr adds a dedicated link check for internal markdown links so broken references fail early in PRs.
I kept the scope internal-only to avoid flaky CI failures from external websites and rate limits.
Rust doc comments remain covered by the existing rustdoc CI job.
What changes are included in this PR?
dev.yml.LYCHEE_VERSIONpin intool_versions.sh.markdown_link_check.shto run lychee on the selected markdown paths.lychee.tomlwith internal-link policy and exclusions..asf.yaml.roadmap.md49.0.0.mdoverview.mddataframe.mdformat_options.mdAre these changes tested?
Yes,
python3 ci/scripts/check_asf_yaml_status_checks.pypassed.bash -n ci/scripts/markdown_link_check.shpassed.bash ci/scripts/markdown_link_check.shpassed with 0 errors.cargo fmt --all --checkpassed.OK: All 5 required_status_checks match existing GitHub Actions jobs.
🔍 12824 Total (in 0s) ✅ 490 OK 🚫 0 Errors 👻 12334 Excluded
Are there any user-facing changes?
No,
There is one contributor-facing CI change: PRs now fail when internal markdown links break in the checked markdown files.