Merged
1 change: 1 addition & 0 deletions .asf.yaml
@@ -55,6 +55,7 @@ github:
contexts:
- "Check License Header"
- "Use prettier to check formatting of documents"
- "Check Markdown Links"
- "Validate required_status_checks in .asf.yaml"
- "Spell Check with Typos"
# needs to be updated as part of the release process
19 changes: 19 additions & 0 deletions .github/workflows/dev.yml
@@ -23,6 +23,9 @@ on:
  pull_request:
  merge_group:

permissions:
  contents: read

concurrency:
  group: ${{ github.repository }}-${{ github.head_ref || github.sha }}-${{ github.workflow }}
  cancel-in-progress: true
@@ -51,6 +54,22 @@ jobs:
        # if you encounter error, see instructions inside the script
        run: ci/scripts/doc_prettier_check.sh

  markdown-link-check:
    name: Check Markdown Links
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
      - name: Load tool versions
        run: |
          source ci/scripts/utils/tool_versions.sh
          echo "LYCHEE_VERSION=${LYCHEE_VERSION}" >> "$GITHUB_ENV"
      - name: Install lychee
        uses: taiki-e/install-action@055f5df8c3f65ea01cd41e9dc855becd88953486 # v2.75.18
        with:
          tool: lychee@${{ env.LYCHEE_VERSION }}
      - name: Run markdown link check
        run: bash ci/scripts/markdown_link_check.sh

  asf-yaml-check:
    name: Validate required_status_checks in .asf.yaml
    runs-on: ubuntu-latest

Comment thread: github-advanced-security[bot] marked this conversation as resolved. (Fixed)
33 changes: 33 additions & 0 deletions ci/scripts/markdown_link_check.sh
@@ -0,0 +1,33 @@
#!/usr/bin/env bash
#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

set -euo pipefail

ROOT_DIR="$(git rev-parse --show-toplevel)"

cd "${ROOT_DIR}"

MARKDOWN_FILES=()
while IFS= read -r file; do
  MARKDOWN_FILES+=("${file}")
done < <(
  git -C "${ROOT_DIR}" ls-files 'README.md' 'CONTRIBUTING.md' 'docs/**/*.md' 'datafusion-cli/README.md' 'datafusion-examples/README.md' 'dev/**/*.md'
)

lychee --no-progress --config "${ROOT_DIR}/lychee.toml" "${MARKDOWN_FILES[@]}"
Member

nit:

These paths are relative to the repository root, but lychee resolves them from the current working directory. CI is fine because it runs from the repo root, but local runs from subdirectories can silently under-check files. Consider cd "${ROOT_DIR}" before this line, or pass absolute paths.

Contributor Author

Makes sense @Weijun-H. I just added cd "${ROOT_DIR}" so it always runs from the repo root, even locally. Pushed the fix.

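The other option raised in the review, passing absolute paths, could look roughly like the sketch below. It is illustrative only and is not part of the PR: `to_absolute` is a hypothetical helper, and a throwaway directory stands in for the repository so that neither git nor lychee is required to run it.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical helper (not in the PR): prefix each repo-relative path with
# the repository root so it resolves from any working directory.
to_absolute() {
  local root="$1"
  shift
  local path
  for path in "$@"; do
    printf '%s/%s\n' "${root}" "${path}"
  done
}

# Throwaway layout standing in for the repository root.
demo_root="$(mktemp -d)"
mkdir -p "${demo_root}/docs/source"
touch "${demo_root}/README.md" "${demo_root}/docs/source/index.md"

# Simulate a local run from a subdirectory.
cd "${demo_root}/docs"

ABS_FILES=()
while IFS= read -r file; do
  ABS_FILES+=("${file}")
done < <(to_absolute "${demo_root}" 'README.md' 'docs/source/index.md')

# The repo-relative path does not resolve from here; the absolute one does.
if [ -e 'README.md' ]; then echo "relative: found"; else echo "relative: missing"; fi
if [ -e "${ABS_FILES[0]}" ]; then echo "absolute: found"; else echo "absolute: missing"; fi
```

The merged change took the simpler route of `cd "${ROOT_DIR}"`, which keeps the file list readable and needs no path rewriting.
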
1 change: 1 addition & 0 deletions ci/scripts/utils/tool_versions.sh
@@ -21,3 +21,4 @@
# It is intended to be sourced by other scripts and should not be executed directly.

PRETTIER_VERSION="2.7.1"
LYCHEE_VERSION="0.23.0"
2 changes: 1 addition & 1 deletion docs/source/contributor-guide/roadmap.md
@@ -19,7 +19,7 @@ under the License.

# Roadmap and Improvement Proposals

-The [project introduction](../user-guide/introduction) explains the
+The [project introduction](../user-guide/introduction.md) explains the
overview and goals of DataFusion, and our development efforts largely
align to that vision.

28 changes: 28 additions & 0 deletions docs/source/contributor-guide/testing.md
@@ -186,6 +186,34 @@ tested in the same way using the [doc_comment] crate. See the end of
[doc_comment]: https://docs.rs/doc-comment/latest/doc_comment
[core/src/lib.rs]: https://github.com/apache/datafusion/blob/main/datafusion/core/src/lib.rs#L583

## Documentation Link Checks

Run the internal markdown link check locally:

```shell
source ci/scripts/utils/tool_versions.sh
cargo install lychee --locked --version "${LYCHEE_VERSION}"
bash ci/scripts/markdown_link_check.sh
```

Contributor

This is very nice.

Contributor

For local runs, we need to document that `mapfile` is needed.

Contributor

Also, `mapfile` is not available in the default Bash on macOS.

Contributor

And we should document `cargo install lychee`.

Notes:

- The script is run with `bash` and is compatible with the default Bash on macOS (no `mapfile` dependency).
- The CI configuration currently checks internal markdown links only. External `http(s)` and `mailto` links are excluded to avoid flaky failures.

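To illustrate the `mapfile` concern from the review: `mapfile` is a Bash 4+ builtin, while a `while read` loop also works on older Bash. The sketch below compares the two forms. `list_files` is a hypothetical stand-in for `git ls-files`, and the file names are made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for `git ls-files ...`: one path per line.
list_files() {
  printf '%s\n' 'README.md' 'docs/source/my page.md' 'dev/release/README.md'
}

# Portable form: read line by line and append to an array.
# IFS= and -r keep leading whitespace and backslashes intact.
FILES=()
while IFS= read -r file; do
  FILES+=("${file}")
done < <(list_files)

# Bash 4+ equivalent, guarded so the script still runs on older Bash.
if [ "${BASH_VERSINFO[0]}" -ge 4 ]; then
  mapfile -t FILES_MAPFILE < <(list_files)
  if [ "${#FILES_MAPFILE[@]}" -eq "${#FILES[@]}" ]; then
    echo "mapfile and the read loop collected the same number of files"
  fi
fi

echo "collected ${#FILES[@]} files"
printf '  %s\n' "${FILES[@]}"
```

Both forms keep a path containing spaces as a single array element, which is why the result can be passed on as `"${FILES[@]}"`.
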
When a link is broken, lychee prints the file and URL/path that failed. For example:

```text
[docs/source/user-guide/cli/overview.md]:
[ERROR] file:///.../docs/source/user-guide/cli/missing-page.md | Cannot find file: File not found. Check if file exists and path is correct
```

Contributor

Oh, lychee doesn't refer to a specific line in the md file?

Contributor Author (@Geethapranay1, Apr 25, 2026)

This is just a sample file name. When a link is broken, lychee does not show a line number; it reports the file name and the broken link itself.

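Because the report names the file and the failing target but not the line, one way to find the offending line is to search the reported file for the target. A minimal sketch, assuming a throwaway file with made-up contents; `grep -n -F` does all the work:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Throwaway file standing in for the markdown file named in the report.
demo_file="$(mktemp)"
cat > "${demo_file}" <<'EOF'
# Overview

See the [Installation](installation.md) section.
See the [Missing Page](missing-page.md) section.
EOF

# The failing target, taken from the end of the reported path.
broken_target='missing-page.md'

# -n prefixes each match with its line number; -F matches the target literally.
match="$(grep -n -F "${broken_target}" "${demo_file}")"
echo "${match}"
```

Against a real report, the same command would be run with the file from the bracketed header line and the last path component of the `[ERROR]` entry.
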
Rust doc comments are validated by rustdoc in CI and can be checked locally with:

```shell
bash ci/scripts/rust_docs.sh
```

## Benchmarks

### Criterion Benchmarks
2 changes: 1 addition & 1 deletion docs/source/library-user-guide/upgrading/49.0.0.md
@@ -123,7 +123,7 @@ Or via SQL:
SET datafusion.execution.spill_compression = 'zstd';
```

-For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../../user-guide/configs) documentation.
+For more details about this configuration option, including performance trade-offs between different compression codecs, see the [Configuration Settings](../../user-guide/configs.md) documentation.

### Deprecated `map_varchar_to_utf8view` configuration option

4 changes: 2 additions & 2 deletions docs/source/user-guide/cli/overview.md
@@ -41,5 +41,5 @@ DataFusion CLI v37.0.0
Elapsed 1.969 seconds.
```
-For more information, see the [Installation](installation), [Usage Guide](usage)
-and [Data Sources](datasources) sections.
+For more information, see the [Installation](installation.md), [Usage Guide](usage.md)
+and [Data Sources](datasources.md) sections.
2 changes: 1 addition & 1 deletion docs/source/user-guide/dataframe.md
@@ -122,4 +122,4 @@ async fn main() -> Result<()> {
[`collect`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect
[library users guide]: ../library-user-guide/using-the-dataframe-api.md
[api reference on docs.rs]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html
-[expressions reference]: expressions
+[expressions reference]: expressions.md
2 changes: 1 addition & 1 deletion docs/source/user-guide/sql/format_options.md
@@ -29,7 +29,7 @@ Format-related options can be specified in three ways, in decreasing order of pr
- `COPY` option tuples
- Session-level config defaults

-For a list of supported session-level config defaults, see [Configuration Settings](../configs). These defaults apply to all operations but have the lowest level of precedence.
+For a list of supported session-level config defaults, see [Configuration Settings](../configs.md). These defaults apply to all operations but have the lowest level of precedence.

If creating an external table, table-specific format options can be specified when the table is created using the `OPTIONS` clause:

32 changes: 32 additions & 0 deletions lychee.toml
@@ -0,0 +1,32 @@
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

timeout = 20
max_retries = 2
retry_wait_time = 2

exclude_path = [
"target",
"docs/build",
"datafusion/core/benches/tpch-csv",
]

exclude = [
"^http://",
"^https://",
"^mailto:",
]

Contributor

Why not check http and https links?

Contributor Author

I already discussed this with @comphead in #21747 (comment). I excluded external links because they frequently cause flaky CI runs: temporary network hiccups, third-party site downtime, or GitHub API rate limits (429 Too Many Requests) can lead to false positives and unnecessarily block PRs.

Anyway, if you prefer stricter validation and don't mind the occasional flakiness, I'm happy to remove these exclusions; we can always ignore specific flaky domains as they pop up. Let me know how you'd like to proceed.
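To make the effect of the three `exclude` patterns concrete, the sketch below sorts a few sample links into checked and skipped. It only approximates the behavior: Bash's `=~` operator stands in for lychee's own regex matching, `is_excluded` is a hypothetical helper, and the `mailto:` address is made up.

```shell
#!/usr/bin/env bash
set -euo pipefail

# The three patterns from the `exclude` list in lychee.toml.
EXCLUDE_PATTERNS=('^http://' '^https://' '^mailto:')

# Hypothetical helper: returns 0 if the link matches any exclude pattern.
is_excluded() {
  local link="$1"
  local pattern
  for pattern in "${EXCLUDE_PATTERNS[@]}"; do
    if [[ "${link}" =~ ${pattern} ]]; then
      return 0
    fi
  done
  return 1
}

CHECKED=()
SKIPPED=()
for link in \
  '../user-guide/configs.md' \
  'https://docs.rs/doc-comment/latest/doc_comment' \
  'mailto:someone@example.com' \
  'installation.md'; do
  if is_excluded "${link}"; then
    SKIPPED+=("${link}")
  else
    CHECKED+=("${link}")
  fi
done

echo "checked: ${CHECKED[*]}"
echo "skipped: ${SKIPPED[*]}"
```

Only the two relative paths end up in the checked set, which matches the stated intent of validating internal markdown links. Narrowing the patterns to specific flaky domains, as offered in the reply above, would move the remaining external links back into that set.
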