
HBASE-30073 Test fixes for some flappers and a reproducible error#8057

Open
haridsv wants to merge 6 commits into apache:master from haridsv:HBASE-30073

Conversation

Contributor

@haridsv haridsv commented Apr 10, 2026

Jira: HBASE-30073

  • Fixed TestZstdDictionarySplitMerge by adding a missing test dependency
  • Fixes for TestBlockBytesScannedQuota, TestHRegionWithInMemoryFlush, TestZooKeeper, TestStochasticLoadBalancerRegionReplicaSameHosts and TestRollbackSCP were provided by AI and verified by running each 20 times in a loop.

Here are the AI summaries:


TestStochasticLoadBalancerRegionReplicaSameHosts

Root cause

StochasticLoadBalancer.balanceTable() set cluster.setStopRequestedAt(startTime + maxRunningTime) at the very beginning of the method, before building costs (initCosts), the first computeCost, needsBalance, and other setup.

So the entire maxRunningTime budget (e.g. 250 ms in StochasticBalancerTestBase) was counting initialization + search, not just the stochastic walk.

On a slow or busy CI host, setup alone could take ~250 ms or more. Then when the walk started, isStopRequested() was already true (or became true after a single rejected move). The loop could exit with step still 0, no accepted moves, and balanceCluster returning null even though region replicas were still colocated on the same host—so TestStochasticLoadBalancerRegionReplicaSameHosts failed. Locally, setup is usually only tens of ms, so the bug rarely shows up.

Fix

Set setStopRequestedAt only after initialization is done, immediately before the stochastic for loop, using a new timestamp (searchStartTime + maxRunningTime). That way maxRunningTime applies to the search phase only, which matches the intended meaning of the limit and avoids “no search time left” on loaded agents.
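The effect of arming the deadline after initialization can be seen in a minimal, self-contained sketch. This is not the actual HBase code; the clock, step size, and method names are stand-ins used only to illustrate the budget shift:

```java
// Hedged sketch: how moving the stop-request deadline from method entry to
// just before the search loop changes the number of search steps available.
public class SearchBudgetSketch {
  static long stopRequestedAt;

  static boolean isStopRequested(long now) {
    return now >= stopRequestedAt;
  }

  // Old behavior: deadline armed at method entry, so initialization time
  // (initCosts, computeCost, needsBalance, ...) eats into the budget.
  static int stepsWithoutFix(long initMillis, long maxRunningTime, long stepMillis) {
    long clock = 0;
    stopRequestedAt = maxRunningTime; // armed before init
    clock += initMillis;              // setup happens on the budget's clock
    int steps = 0;
    while (!isStopRequested(clock)) {
      clock += stepMillis;            // one stochastic move
      steps++;
    }
    return steps;
  }

  // Fixed behavior: deadline armed after init, from a fresh timestamp.
  static int stepsWithFix(long initMillis, long maxRunningTime, long stepMillis) {
    long clock = 0;
    clock += initMillis;
    long searchStartTime = clock;
    stopRequestedAt = searchStartTime + maxRunningTime; // armed after init
    int steps = 0;
    while (!isStopRequested(clock)) {
      clock += stepMillis;
      steps++;
    }
    return steps;
  }

  public static void main(String[] args) {
    // Slow CI host: 250 ms of setup against a 250 ms budget.
    System.out.println("old: " + stepsWithoutFix(250, 250, 10)); // 0 steps
    System.out.println("fix: " + stepsWithFix(250, 250, 10));    // 25 steps
  }
}
```

With the old arming point, a loaded host gets zero search steps; with the fix, the full budget goes to the walk.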

TestRollbackSCP

Root cause

Surefire reruns failed tests in the same JVM without running @BeforeClass again. The test used a static INJECTED flag with compareAndSet(false, true) so fault injection (and setKillAndToggleBeforeStoreUpdateInRollback) only ran once per JVM.

After the first failure (e.g. IllegalArgumentException: scheduler queue not empty at restartMasterProcedureExecutor), INJECTED stayed true. On reruns, compareAndSet did nothing, so the executor was never killed in rollback and procExec.isRunning() stayed true, which made waitFor(30000, () -> !procExec.isRunning()) always time out.

So the “three different failures” were really: first attempt hit the real error at line 180; attempts 2–3 were misleading timeouts at line 177 caused by broken rerun state, not the original bug.

Fix

In @Before (which does run before each attempt, including reruns), reset test-only state:

  1. INJECTED.set(false) so injection can arm again on each run.
  2. ProcedureTestingUtility.setKillAndToggleBeforeStoreUpdateInRollback(procExec, false) so procedure-executor kill flags from the previous attempt are cleared.

That makes each Surefire attempt behave like a fresh test with respect to
injection, so reruns can pass when the underlying issue is fixed or flaky.
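The once-per-JVM pitfall and the per-attempt reset can be sketched in isolation. This is a stand-in, not the test's actual code; the injection hook and reset method are simplified stand-ins for the real fault-injection and kill-flag logic:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch: a static AtomicBoolean armed with compareAndSet fires only
// once per JVM, so a Surefire rerun (same JVM, no @BeforeClass) skips
// injection unless a per-attempt reset clears the flag first.
public class RerunStateSketch {
  static final AtomicBoolean INJECTED = new AtomicBoolean(false);

  // Stands in for the fault-injection hook inside the procedure.
  static boolean tryInject() {
    return INJECTED.compareAndSet(false, true);
  }

  // Stands in for the new @Before reset added by the fix
  // (the real reset also clears the procedure-executor kill flags).
  static void beforeEachAttempt() {
    INJECTED.set(false);
  }

  public static void main(String[] args) {
    boolean firstAttempt = tryInject();    // attempt 1: injection arms
    boolean rerunNoReset = tryInject();    // rerun without reset: skipped
    beforeEachAttempt();
    boolean rerunWithReset = tryInject();  // rerun with reset: arms again
    System.out.println(firstAttempt + " " + rerunNoReset + " " + rerunWithReset);
  }
}
```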

TestZooKeeper

Problem Identified:

  • Line 78 had a duplicate call to TEST_UTIL.startMiniDFSCluster(2)
  • This violated the "can only create one" constraint and caused HDFS namespace conflicts when tests ran in parallel

Why it was flaky:

  • Local/isolated runs: Worked because no leftover HDFS state
  • CI/parallel runs: Failed when other ZooKeeper tests ran first, leaving behind HDFS data with incompatible namespaceID

Fix Applied:
Removed the duplicate call on line 78. The test now properly:

  1. Starts Mini DFS Cluster (line 71)
  2. Starts Mini ZK Cluster (line 72)
  3. Configures settings (lines 73-77)
  4. Done! (no duplicate cluster start)

This matches the pattern used in other similar tests like TestClusterBootOrder and aligns with the comment's intent: "Test we can first start the ZK cluster by itself."

The test should now be stable across both local and CI environments. You can verify by running the test suite again to ensure TestZooKeeper no longer fails with InconsistentFSStateException.

TestHRegionWithInMemoryFlush

Root Cause Analysis

The TestHRegionWithInMemoryFlush test is flapping due to zombie background threads that continue running after tests timeout, causing directory race conditions with subsequent tests.

What's Happening

  1. testParallelIncrementWithMemStoreFlush and testParallelAppendWithMemStoreFlush create background flush threads
  2. These threads run in a busy loop: while (!incrementDone.get())
  3. When the test times out (after ~4 minutes in CI), an InterruptedException is thrown during thread.join()
  4. The code never reaches the lines that set the done flag and stop the flush thread
  5. The zombie flush thread keeps running, busy-looping trying to flush
  6. Subsequent tests' tearDown() methods clean up directories
  7. The zombie thread tries to access deleted directories → race condition errors

Evidence

  • Bad run shows only 5 tests completed (vs 136 in good run)
  • Thread-4940 logs "Skipping flush because closing" repeatedly for over a minute
  • 6 flaky errors about missing directories in unrelated tests like testGet_Basic and testScanner_JoinedScanners
  • Error messages: "Parent directory doesn't exist", "chmod: cannot access"

The Fix

I've fixed both problematic tests in hbase-server/src/test/java/org/apache/hadoop/hbase/regionserver/TestHRegion.java:

Changes made:

  1. Added interrupt check to flush thread loop: while (!done.get() && !Thread.currentThread().isInterrupted())
  2. Wrapped test logic in try-finally to ensure cleanup even on timeout/failure
  3. In finally block:
  • Set done flag to true
  • Interrupt the flush thread
  • Wait up to 5 seconds for thread to stop
  • Log warning if thread doesn't stop
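The cleanup pattern described above can be sketched in a self-contained form. This is not the actual TestHRegion code; the test body and flush work are stubbed out, and only the interrupt-aware loop plus try-finally shutdown are shown:

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hedged sketch of the fix: the background loop also watches its interrupt
// flag, and the test body is wrapped in try/finally so the flush thread is
// always stopped, even when the body throws or times out.
public class FlushThreadCleanupSketch {
  static boolean runWithCleanup() {
    final AtomicBoolean done = new AtomicBoolean(false);
    Thread flusher = new Thread(() -> {
      // Fix #1: exit on interrupt as well as on the done flag.
      while (!done.get() && !Thread.currentThread().isInterrupted()) {
        Thread.onSpinWait(); // region.flush(true) in the real test
      }
    });
    flusher.start();
    try {
      // ...test body: parallel increments/appends, then join worker threads...
    } finally {
      // Fix #2/#3: always stop the flusher, even if the body threw.
      done.set(true);
      flusher.interrupt();
      try {
        flusher.join(5000); // wait up to 5 seconds for the thread to stop
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      if (flusher.isAlive()) {
        System.err.println("flush thread did not stop"); // LOG.warn in the real test
      }
    }
    return !flusher.isAlive();
  }

  public static void main(String[] args) {
    System.out.println("stopped=" + runWithCleanup());
  }
}
```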

Files modified:

  • TestHRegion.java:5322-5368 - testParallelIncrementWithMemStoreFlush
  • TestHRegion.java:5407-5459 - testParallelAppendWithMemStoreFlush

Why This Only Happened in CI

CI environments are typically slower and more heavily loaded, causing tests to hit the 13-minute timeout threshold. Locally, the tests complete faster, so the threads finish normally before cleanup.

The fix ensures that background threads are always properly stopped, preventing zombie threads from interfering with subsequent tests. If timeouts persist, a follow-up could reduce thread counts or test iterations in the TestHRegionWithInMemoryFlush override methods (lines 67-78).

TestBlockBytesScannedQuota

The test is flapping due to a timing/race condition in the quota system:

  1. 5-second timeout too short: The testTraffic method only waited 5 seconds for quotas to take effect
  2. Quota cache not fully propagated: On slower systems (like CI), the quota cache refresh can be asynchronous and may not fully propagate in time
  3. Quotas bypassed: When cache isn't refreshed, the logs show "bypass expected false, actual true", meaning all requests succeed instead of being throttled
  4. Insufficient retries: Each iteration takes ~1.3 seconds, so only 3-4 retries fit in 5 seconds, not enough for the quota system to stabilize

Bad run pattern:

  • Test expects 1 successful request but gets 5 (all succeed because quotas not enforced)
  • Retries every ~1.3 seconds for 4 attempts
  • Times out after 5 seconds with "Waiting timed out after [5,000] msec"

Good run pattern:

  • Quotas enforced immediately
  • Tests pass quickly (36.97s total vs 63.14s for failed run)

Increased the timeout in testTraffic() from 5,000ms to 30,000ms (line 263). This gives the quota system sufficient time to:

  • Complete cache refresh
  • Propagate quota settings across all components
  • Handle slower CI environments

This is a conservative fix that maintains the retry logic while allowing adequate time for the distributed quota system to stabilize. The 30-second timeout is still reasonable for a test and should handle the asynchronous nature of quota enforcement.
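The retry arithmetic behind the timeout bump can be checked with a small sketch. This is not the test's actual waitFor code; it just models a fixed-cost iteration against a deadline, using the ~1.3 s per-iteration figure from the logs:

```java
// Hedged sketch: with ~1.3 s per iteration, a 5 s deadline allows only
// about 4 attempts, while a 30 s deadline allows roughly two dozen,
// giving the asynchronous quota cache refresh time to land.
public class QuotaWaitSketch {
  static int attemptsWithin(long timeoutMillis, long iterationMillis) {
    long clock = 0;
    int attempts = 0;
    while (clock < timeoutMillis) { // waitFor-style: keep retrying until deadline
      attempts++;
      clock += iterationMillis;     // one throttled-request check
    }
    return attempts;
  }

  public static void main(String[] args) {
    System.out.println("5s budget:  " + attemptsWithin(5_000, 1_300));
    System.out.println("30s budget: " + attemptsWithin(30_000, 1_300));
  }
}
```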


haridsv added 6 commits April 10, 2026 14:02
…s and TestRollbackSCP, provided by Cursor AI

Summary of the fix by AI:

Fix directory race condition in parallel test execution

When tests run in parallel, HBaseTestingUtil.createSubDirAndSystemProperty()
would reuse system property values (like hadoop.log.dir and hadoop.tmp.dir)
set by other parallel test instances. This caused a race condition:

1. Test A calls setupDataTestDir(), creates unique directory UUID-A
2. Test A sets System.setProperty("hadoop.log.dir", UUID-A/hadoop-log-dir)
3. Test B calls setupDataTestDir(), creates unique directory UUID-B
4. Test B calls createSubDirAndSystemProperty() for hadoop.log.dir
5. System property already set, so Test B reuses UUID-A/hadoop-log-dir
6. Test A finishes, calls cleanupTestDir(), deletes UUID-A directory
7. Test B fails with "chmod: cannot access '.../WALs': No such file"

The original logic was added in HBASE-5064 (2012) to support parallel tests,
with the comment "we do nothing but hope that there will be no conflicts"
suggesting the authors knew this was imperfect but necessary for mini cluster
compatibility.

Fixed by checking if the directory referenced by the system property still
exists before reusing it:
- If exists: reuse it (preserves original behavior for mini clusters)
- If deleted: create new unique directory for this test instance

This prevents tests from using deleted directories while maintaining
compatibility with components that may require shared system properties.
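The existence check can be sketched in a self-contained form. This is not the actual HBaseTestingUtil code; the method and property names below are simplified stand-ins, and the unique-directory creation is approximated with a temp-directory helper:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

// Hedged sketch of the fix: reuse the directory named by an existing system
// property only if it still exists on disk; otherwise create a fresh unique
// directory for this test instance and repoint the property at it.
public class SubDirPropertySketch {
  static File subDirForProperty(String propertyName, File parentDir, String prefix)
      throws IOException {
    String existing = System.getProperty(propertyName);
    if (existing != null && new File(existing).exists()) {
      // Preserves the HBASE-5064 sharing behavior for mini clusters.
      return new File(existing);
    }
    // Property unset, or it points at a directory another parallel test
    // instance already deleted: make a new unique directory instead.
    File fresh = Files.createTempDirectory(parentDir.toPath(), prefix).toFile();
    System.setProperty(propertyName, fresh.getAbsolutePath());
    return fresh;
  }

  // Replays the race: Test A's cleanup deletes the shared directory, and the
  // next lookup must create a fresh one rather than reuse the dead path.
  static boolean demo() {
    try {
      File parent = Files.createTempDirectory("test-data").toFile();
      File first = subDirForProperty("demo.hadoop.log.dir", parent, "hadoop-log-dir");
      boolean firstExisted = first.exists();
      first.delete(); // simulate cleanupTestDir() from the other test
      File second = subDirForProperty("demo.hadoop.log.dir", parent, "hadoop-log-dir");
      return firstExisted && second.exists() && !second.equals(first);
    } catch (IOException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    System.out.println("fresh dir after deletion: " + demo());
  }
}
```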