
Fix IPartitionStrategy race condition #1517

Merged
zvonand merged 9 commits into antalya-26.1 from fix_race_condition_partition_strategy
Mar 17, 2026

Conversation

Collaborator

@arthurpassos arthurpassos commented Mar 12, 2026

IPartitionStrategy::computePartitionKey might be called from different threads, and it writes to cached_result concurrently without any synchronization. The easy fix would be to add a mutex around the cache, but we can make it lock-free instead by moving the cache write to the constructor.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fix race condition on IPartitionStrategy found in https://altinity-build-artifacts.s3.amazonaws.com/json.html?PR=1463&sha=5e3b05142815f9d4dbbb89497badc85d455079e4&name_0=PR&name_1=Stateless+tests+%28amd_tsan%2C+s3+storage%2C+sequential%2C+2%2F2%29&name_2=Tests.

This was introduced in ClickHouse#92844; I'll ship a PR to upstream as well.

Documentation entry for user-facing changes

...

CI/CD Options

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

Regression jobs to run:

  • Fast suites (mostly <1h)
  • Aggregate Functions (2h)
  • Alter (1.5h)
  • Benchmark (30m)
  • ClickHouse Keeper (1h)
  • Iceberg (2h)
  • LDAP (1h)
  • Parquet (1.5h)
  • RBAC (1.5h)
  • SSL Server (1h)
  • S3 (2h)
  • Tiered Storage (2h)


github-actions Bot commented Mar 12, 2026

Workflow [PR], commit [c1d0f74]

@arthurpassos arthurpassos changed the title from "add mutex around cached_result" to "Fix IPartitionStrategy race condition" Mar 12, 2026
@arthurpassos
Collaborator Author

@codex review

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Keep it up!


@arthurpassos arthurpassos added antalya-26.1 antalya port-antalya PRs to be ported to all new Antalya releases labels Mar 13, 2026
mkmkme
mkmkme previously approved these changes Mar 16, 2026
@zvonand zvonand merged commit 50d2a70 into antalya-26.1 Mar 17, 2026
228 of 237 checks passed
@arthurpassos
Collaborator Author

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1517 ([Fix IPartitionStrategy race condition](https://github.com/Altinity/ClickHouse/pull/1517)):

Confirmed defects:

No confirmed defects in reviewed scope.

Coverage summary:

  • Scope reviewed: src/Storages/IPartitionStrategy.{h,cpp} plus related test updates in tests/integration/test_export_merge_tree_part_to_object_storage/test.py, tests/integration/test_export_replicated_mt_partition_to_object_storage/test.py, and tests/queries/0_stateless/03572_export_merge_tree_part_limits_and_table_functions.{sh,reference} across merge-base 2fb6ff68..c1d0f747.
  • Categories failed: none confirmed.
  • Categories passed: shared-state race path around cached_result publication/use; call-graph entrypoints (PartitionedSink, StorageObjectStorage::import, DeltaLake sink) to partition-key computation; exception/error propagation parity; deterministic vs non-deterministic expression handling; concurrency interleaving checks for concurrent computePartitionKey calls; rollback/partial-update impact (no new mutable state transition in hot path); C++ hazard sweep (lifetime, UB, iterator invalidation, lock-order/deadlock) for changed code.
  • Assumptions/limits: static code audit only (no runtime fault-injection execution or TSAN run in this pass).

@Selfeer
Collaborator

Selfeer commented Apr 1, 2026

AI audit note: This review comment was generated by AI (gpt-5.3-codex).

Audit update for PR #1517 (Fix IPartitionStrategy race condition):

Confirmed defects:

No confirmed defects in reviewed scope.

@Selfeer
Collaborator

Selfeer commented Apr 1, 2026

PR #1517 CI Triage — Fix IPartitionStrategy race condition

| Field | Value |
| --- | --- |
| PR | #1517 (fix_race_condition_partition_strategy → antalya-26.1) |
| Author | @arthurpassos |
| Commit | c1d0f747746a05801ad58493cdf431dfbcd77264 |
| Regression Run | 23850885992 |
| ClickHouse Version | 26.1.3.20001 (docker://altinityinfra/clickhouse-server:1517-26.1.3.20001.altinityantalya) |
| Run Flags | --use-keeper --with-analyzer |
| Date | 2026-04-01 |
| Labels | antalya, port-antalya, antalya-26.1, antalya-26.1.6.20001 |
| Merge Status | MERGED |

PR Description

Bug fix for a race condition in IPartitionStrategy::computePartitionKey, which could be called from different threads that wrote to cached_result concurrently without protection. The fix moves the cache write to the constructor, making the read path lock-free. Originally introduced by ClickHouse#92844.

Changed files (6):

  • src/Storages/IPartitionStrategy.cpp (+63, -21)
  • src/Storages/IPartitionStrategy.h (+7, -7)
  • tests/integration/test_export_merge_tree_part_to_object_storage/test.py (+28, -14)
  • tests/integration/test_export_replicated_mt_partition_to_object_storage/test.py (+38, -2)
  • tests/queries/0_stateless/03572_export_merge_tree_part_limits_and_table_functions.reference (+2)
  • tests/queries/0_stateless/03572_export_merge_tree_part_limits_and_table_functions.sh (+54, -7)

Summary

| Category | Count | Details |
| --- | --- | --- |
| PR-caused regression | 0 | |
| Pre-existing flaky / known | 5 | settings, aggregate_functions_3, tiered_storage_minio, swarms, ssl_server_3 |
| Infrastructure / GCS performance | 1 | s3_gcs_2 table function wildcard timeouts |
| Cancelled (3h timeout) | 11 | alter_attach_partition_1/2, alter_replace_partition, aggregate_functions_2, rbac_2/3, iceberg_1/2, tiered_storage_gcs, lightweight_delete, parquet |
| Still running | 1 | s3_aws_2 |

Verdict: No failures are caused by this PR. All CI failures are pre-existing or infrastructure-related.

PR-targeted regression suites both passed: s3_minio_export_partition ✅ and s3_minio_export_part ✅ — the two suites most directly relevant to the IPartitionStrategy race condition fix.


CI Jobs Status

| Job Name | Status | Failure Type |
| --- | --- | --- |
| settings | ❌ failure | Pre-existing: snapshot mismatch (input_format_parquet_use_metadata_cache) |
| aggregate_functions_3 | ❌ failure | Pre-existing: sumCountState serialization changed |
| s3_gcs_2 | ❌ failure | Infrastructure: GCS wildcard query timeouts |
| tiered_storage_minio | ❌ failure | Pre-existing: round robin disk selection after adding disk |
| swarms | ❌ failure | Pre-existing: known 26.1 issue (#1601) |
| ssl_server_3 | ❌ failure | Pre-existing: openssl signing interrupted (exit 130) |
| alter_attach_partition_1 | ⛔ cancelled | 3h CI timeout |
| alter_attach_partition_2 | ⛔ cancelled | 3h CI timeout |
| alter_replace_partition | ⛔ cancelled | 3h CI timeout |
| aggregate_functions_2 | ⛔ cancelled | 3h CI timeout |
| rbac_2 | ⛔ cancelled | 3h CI timeout |
| rbac_3 | ⛔ cancelled | 3h CI timeout |
| iceberg_1 | ⛔ cancelled | 3h CI timeout |
| iceberg_2 | ⛔ cancelled | 3h CI timeout |
| tiered_storage_gcs | ⛔ cancelled | 3h CI timeout |
| lightweight_delete | ⛔ cancelled | 3h CI timeout |
| parquet | ⛔ cancelled | 3h CI timeout |
| s3_minio_export_partition | ✅ success | PR-targeted suite |
| s3_minio_export_part | ✅ success | PR-targeted suite |
| All other jobs (51 total) | ✅ success | |

Detailed Failure Analysis

1. settings — Pre-existing Snapshot Mismatch

  • Failing test: /settings/default values/input_format_parquet_use_metadata_cache
  • Stats: 1523 scenarios — 1518 ok, 1 failed, 4 xfail (1m 11s)
  • Cause: Snapshot expects {"default":"0"} but server returns {"default":"1"} for input_format_parquet_use_metadata_cache.
  • Relation to PR: None. Parquet input format setting default is unrelated to IPartitionStrategy.
  • Confirmed pre-existing: Same failure on PR #1529 run (23655067918) and scheduled ClickHouse latest run (23773974950). Snapshot needs updating for the new default.

2. aggregate_functions_3 — Pre-existing Serialization Change

  • Failing tests: sumCountState (25 Nullable/LowCardinality type variants) and sumCountMerge (LowCardinality Nullable Float32)
  • Stats: 360 scenarios — 310 ok, 2 failed, 45 skipped, 3 xfail (1h 37m)
  • Cause: sumCountState() serialized output gained a leading 01 byte (e.g. expected 1C04... but got 011C04...). Aggregate state format changed in the ClickHouse build.
  • Relation to PR: None. Aggregate function state serialization is completely unrelated to IPartitionStrategy/export partition.
  • Confirmed pre-existing: Same failure on PR #1529 run (23655067918), scheduled ClickHouse head (23774629703), and scheduled ClickHouse latest (23773974950).

3. s3_gcs_2 — Infrastructure / GCS Performance

  • Failing tests: /s3/gcs/part 2/table function performance/wildcard/nums, nums one invalid, range (120s ExpectTimeoutError), star no match (elapsed 2.017s > 2s threshold)
  • Stats: 70 scenarios — 37 ok, 1 errored, 31 xfail, 1 xerror (1h 35m)
  • Cause: GCS s3() table function queries with brace/glob wildcards hanging or exceeding tight performance thresholds.
  • Relation to PR: None. S3/GCS table function glob performance is unrelated to IPartitionStrategy.
  • Pre-existing signal: s3_gcs_1 failed in scheduled head (23774629703) and scheduled latest (23773974950) runs with similar GCS performance issues. The s3_gcs_2 job specifically did not fail in those runs, but the underlying GCS performance instability is a known pattern.

4. tiered_storage_minio — Pre-existing Disk Selection Issue

  • Failing tests: /tiered storage/with minio/change config norestart/check round robin after adding new disk/ — 4 scenario variants (one_small_disk_main_new_jbod4*)
  • Stats: 94 scenarios — 84 ok, 6 failed, 4 xfail
  • Cause: After hot-adding a disk without restart, system.parts disk_name set doesn't match expected disks. AssertionError: assert disks == used_disks.
  • Relation to PR: None. Tiered storage JBOD disk selection is unrelated to IPartitionStrategy/export partition.
  • Confirmed pre-existing: Same failure on scheduled ClickHouse head (23774629703) and scheduled ClickHouse latest (23773974950).

5. swarms — Pre-existing Known Issue

  • Failing tests:
    • /swarms/feature/node failure/network failure — exit code 0 instead of expected 138
    • /swarms/feature/node failure/initiator out of disk space: UNKNOWN_DATABASE error (Code: 81)
  • Stats: 1520 scenarios — 1486 ok, 2 failed, 32 xfail (24m 11s)
  • Relation to PR: None. Swarms node failure resilience is completely unrelated to IPartitionStrategy.
  • Confirmed pre-existing: Tracked in Altinity/ClickHouse#1601. Same failure on PR #1529, PR #1576 (23775422047), and PR #1577 (23759752139).

6. ssl_server_3 — Pre-existing OpenSSL Timeout

  • Failing tests: /ssl server/part 3/ca chain/use second intermediate ca/ — 3 variants of all certificates present under server certificate with chain (with/without caconfig) and server certificate without chain (without caconfig)
  • Stats: 34 scenarios — 19 ok, 3 failed, 12 xfail (1h 42m)
  • Cause: openssl x509 -sha256 -req ... -CA sub2/ca.crt process interrupted with exit code 130 (SIGINT) instead of completing successfully. The certificate signing step hangs and gets killed.
  • Relation to PR: None. SSL certificate chain generation is unrelated to IPartitionStrategy.
  • Confirmed pre-existing: Same failure on scheduled ClickHouse head (23774629703, arm rows) and scheduled ClickHouse latest (23773974950, arm rows).

Cancelled Jobs (3h Timeout)

All 11 cancelled jobs exhibit the same pattern: started around 13:30 UTC, cancelled at exactly 16:35 UTC (~3h 5m), with the "Run suite" step cancelled and subsequent steps left pending. This is the CI runner's 3-hour job timeout.

| Job | Started | Cancelled |
| --- | --- | --- |
| alter_attach_partition_2 | 13:30:07 | 16:35:07 |
| alter_replace_partition | 13:31:28 | 16:36:28 |
| alter_attach_partition_1 | 13:33:04 | 16:38:05 |
| aggregate_functions_2 | ~13:30 | ~16:35 |
| rbac_2, rbac_3 | ~13:30 | ~16:35 |
| iceberg_1, iceberg_2 | ~13:30 | ~16:35 |
| tiered_storage_gcs | ~13:30 | ~16:35 |
| lightweight_delete | ~13:30 | ~16:35 |
| parquet | ~13:30 | ~16:35 |

These suites commonly take >3h on single-threaded --parallel 1 keeper+analyzer runs. The timeouts are not related to PR changes.


PR-Targeted Regression Suites

The regression suites most directly relevant to the IPartitionStrategy race condition fix both passed:

| Suite | Status | Details |
| --- | --- | --- |
| s3_minio_export_partition | ✅ success | Export partition suite: directly tests the fixed code path |
| s3_minio_export_part | ✅ success | Export part suite: tests MergeTree part export to S3 |

@Selfeer Selfeer added the verified Approved for release label Apr 1, 2026
zvonand added a commit that referenced this pull request Apr 21, 2026
Fix IPartitionStrategy race condition

Source-PR: #1517 (#1517)

Labels

antalya antalya-26.1 antalya-26.1.6.20001 port-antalya PRs to be ported to all new Antalya releases verified Approved for release
