Skip to content

Discard rule results from runs superseded by Shake restarts #4988

Closed
crtschin wants to merge 2 commits into
haskell:masterfrom
crtschin:crtschin/stale-diagnostics
Closed

Discard rule results from runs superseded by Shake restarts #4988
crtschin wants to merge 2 commits into
haskell:masterfrom
crtschin:crtschin/stale-diagnostics

Conversation

@crtschin

@crtschin crtschin commented Jun 23, 2026

Copy link
Copy Markdown
Collaborator

Closes #4985.

When any rule is (re)ran, do to being invalidated (in the above issue, due to documentDidChange notifications from typing), we track in our shake setup which shake session rule and rule results belong to. The step bookkeeping is an implementation detail of our shake setup to track whether keys are dirty.

What went wrong here is that a rule result being put in the database for a rule build demand, can interleave with the step incrementing done on shake restarts. This has the consequence that a shake rule may write it's result under a new step value, or leave a now outdated result.

For a single rule R, the following can occur:
    step 1   build A starts rule R, reading its inputs at step 1
    step 2   a restart bumps the step to 2 and marks R dirty
     ...     A finally finishes and, unguarded, writes Clean@2,
               a stale result, but its timestamp claims it is fresh

This PR adds a guard against that by checking that the step value is that same as when the rule started, implying no restart occurred. If one did, we should not overwrite it with the old result, as a new build may have raced and written a new one.

I added the (in the previous setup) flaky test, and reran it multiple times.

@fendor fendor requested a review from soulomoon June 24, 2026 07:14
A restart cancels the running build with an async exception, but the computation
it spawned can outlive it and finish under a later build. 'compute' stamps the
result with whatever the step is *now*, so a value read from stale inputs is
recorded as freshly built and dependents skip recomputing it:

@soulomoon soulomoon Jun 24, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought asyncWithCleanUp would cleanup all the chilren ? So the spawned one won't live after we stop the old session ? Where does it go wrong 🤔 should we fix it instead ?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought asyncWithCleanUp would cleanup all the chilren ?

That's the intention, and it does.

But notice that incDatabase is called in a separate thread relative to all of the reads in compute, which means it can interleave and cause a rule's computation to assume the step of a latter rule build action. This is what causes the superceded diagnostic rule result to survive.

I don't think this is savable from the level of asyncWithCleanup. It's compute that writes the incorrect Clean with the new step value. A check has to occur in the same transaction that sets the status.

@soulomoon soulomoon Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By design, incDatabase should only be called after we cancelShakeSession and it should stop all the session's running threads. I think it is bug if any compute interleave incDatabase.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that sounds reasonable, I'll try this again later today.

@crtschin crtschin changed the title WIP: Discard rule results from runs superseded by Shake restarts Discard rule results from runs superseded by Shake restarts Jun 24, 2026
@crtschin crtschin marked this pull request as ready for review June 24, 2026 19:00
@crtschin crtschin requested a review from wz1000 as a code owner June 24, 2026 19:00
@crtschin crtschin requested a review from soulomoon June 24, 2026 19:00
@crtschin crtschin added the status: needs review This PR is ready for review label Jun 24, 2026
@crtschin crtschin force-pushed the crtschin/stale-diagnostics branch from d09584c to 42382e0 Compare June 24, 2026 19:03
@crtschin

Copy link
Copy Markdown
Collaborator Author

Another data point for consideration my prefrontal cortex whipt up. If some rules survive past restarts attempts, due to (intended) laziness in shake database values. It's conceivable these thunks which capture parts of the state, never get evaluated or cleaned up. Which would also explain some memory usage/space leaks we've been observing.

@soulomoon

soulomoon commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Another data point for consideration my prefrontal cortex whipt up. If some rules survive past restarts attempts, due to (intended) laziness in shake database values. It's conceivable these thunks which capture parts of the state, never get evaluated or cleaned up. Which would also explain some memory usage/space leaks we've been observing.

I believe shake database values should always be strict. It must be a bug if laziness happens.

@crtschin

Copy link
Copy Markdown
Collaborator Author

I believe shake database values should always be strict. It must be a bug if laziness happens.

I misread/misinterpreted the splitIO call in builder. It's only lazy while the computation is running, which is triggered right afterwards. Good one, then I still have no clue yet where the orphaned/stale write comes from 😞.

@crtschin

Copy link
Copy Markdown
Collaborator Author

What happens when a shake restart occurs and cancels a running builder call in between the following statements. So after the lazy value is put in the database, but before it is forced?

results <- liftIO $ atomically $ for keys $ \id -> do
-- Spawn the id if needed
status <- SMap.lookup id databaseValues
val <- case viewDirty current $ maybe (Dirty Nothing) keyStatus status of
Clean r -> pure r
Running _ force val _ -> do
modifyTVar' toForce (Wait force :)
pure val
Dirty s -> do
let act = run (refresh db id s)
(force, val) = splitIO (join act)
SMap.focus (updateStatus $ Running current force val s) id databaseValues
modifyTVar' toForce (Spawn force:)
pure val
pure (id, val)
toForceList <- liftIO $ readTVarIO toForce
let waitAll = run $ mapConcurrentlyAIO_ id toForceList
case toForceList of
[] -> return $ Left results
_ -> return $ Right $ do
waitAll
pure results

The lazy value, which will insert threads into the list registered for cleanup by asyncWithCleanup potentially after the scope of asyncWithCleanup already exited through the shake restart's cancel. The unsafePerformIO here does that registration. The comment above builder warns against making this synchronous, which is what making these values fully strict (without the async thread dance) implies.

@soulomoon

soulomoon commented Jun 26, 2026

Copy link
Copy Markdown
Collaborator

Well spotted @crtschin, it does look like we have an scope escape problem here.

BTW, I was experimenting a new hls-graph implementation in #4927, we can have some finer control over those threads by centralize the management of them. Then have have some more goodies: do not need to kill a non-dirty running rule during restart etc. . Would you like to take a look ?

@crtschin

Copy link
Copy Markdown
Collaborator Author

Well spotted @crtschin, it does look like we have an scope escape problem here.

BTW, I was experimenting a new hls-graph implementation in #4927, we can have some finer control over those threads by centralize the management of them. Then have have some more goodies: do not need to kill a non-dirty running rule during restart etc. Would you like to take a look ?

I'll take a look 👍🏼

What do you think about the solution in this PR? I think it's still viable to merge this to solve #4985 in the short term. There's clearly other issues like unforced thunks containing IO-full async values and those being orphaned, which your rewrite tackles. But that doesn't look ready yet until then.

@soulomoon

soulomoon commented Jun 27, 2026

Copy link
Copy Markdown
Collaborator

For running rule that is dirty, we would stop the rule thread and mark it as dirty.
For running rule that is not dirty, we would stop the rule thread but does not mark it as dirty.

What has the problem is the two consective restart cases, for the first restart, it is not dirty, so it leaked the running AIO in the running state.
right before the second restart we force it again, but since it is running under the out of scope AIO. It does not kills this force, we first mark it as dirty, and the hanging force(which should already be killed) mark it clean.

So, the second force is fine even if we have a different step ? What makes it problematic is dirty state being incorrectly overwrite ?
perhaps a slightly better solution is to check if we are going from running -> clean,
or better we pass in the correct AIO state for the forcing so we make it surely killed ? @crtschin

@crtschin

Copy link
Copy Markdown
Collaborator Author

So, the second force is fine even if we have a different step ? What makes it problematic is dirty state being incorrectly overwrite ?

I think the second force is fine, the guard prevents any outdated value from being stored as clean. At worst I think some redundant computations may occur, but that's preferable to invalid results being persisted.

The issue is indeed the newly dirty state being marked clean by rule results from previous steps.

perhaps a slightly better solution is to check if we are going from running -> clean

I don't think only checking that would be enough, we'd still need to check the step numbers to ensure that the correct step's results does the running -> clean transition. Otherwise we still get the old overwrite issue.

or better we pass in the correct AIO state for the forcing so we make it surely killed?

I was curious and sketched this a little while previously investigating. Instead of lazy unsafePerformIO thunks, I tried keeping the async values explicit, to be tracked in the AIO context they get spawned in. This got messy in my attempt, so I dismissed it for this PR.

I do think this might be a good follow-up to continue on with though. The fact that thunks containing threads can just keep existing untouched while invalidated by shake restarts does not sound ideal.

@soulomoon

soulomoon commented Jun 28, 2026

Copy link
Copy Markdown
Collaborator

The propagation of leaked thread might actually create more problems even if the step is correct, e.g. if the step is being bumped and we are in the middle of marking dirty states. A propagation of leaked thread might capture a clean value that should be dirty in the new step. I do not think this can be fixed aftermath. It seems to me that the only possible fix is to prevent leaking thread even though it is not trival.

@crtschin

Copy link
Copy Markdown
Collaborator Author

A propagation of leaked thread might capture a clean value that should be dirty in the new step.

Apologies, I don't think I understand. That's exactly the situation that's fixed by this PR

@soulomoon

Copy link
Copy Markdown
Collaborator

A propagation of leaked thread might capture a clean value that should be dirty in the new step.

Apologies, I don't think I understand. That's exactly the situation that's fixed by this PR

For the leaked thread, a computation can start right after the step is being bumped and before we dirty the necessary rule values.

@crtschin

crtschin commented Jul 1, 2026

Copy link
Copy Markdown
Collaborator Author

For the leaked thread, a computation can start right after the step is being bumped and before we dirty the necessary rule values.

Oh, I see what you mean! Yea, bumping the step and dirtying should ideally be atomic.

@soulomoon

soulomoon commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

One more thing, there are hooks that would modify the shake state, should only be run only when the dirty rules are fully stop. We should really fix the leaking directly.

@crtschin

crtschin commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator Author

One more thing, there are hooks that would modify the shake state, should only be run only when the dirty rules are fully stop. We should really directly fix the leaking.

I'm convinced, I'll close this PR as the solution's incomplete. I'll give a stab at fixing the leaking threads specifically.

@crtschin crtschin closed this Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

status: needs review This PR is ready for review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Diagnostics don't always refresh on edits.

2 participants