Discard rule results from runs superseded by Shake restarts by crtschin · Pull Request #4988 · haskell/haskell-language-server

crtschin · 2026-06-23T21:18:36Z

Closes #4985.

When any rule is (re)ran, do to being invalidated (in the above issue, due to documentDidChange notifications from typing), we track in our shake setup which shake session rule and rule results belong to. The step bookkeeping is an implementation detail of our shake setup to track whether keys are dirty.

What went wrong here is that a rule result being put in the database for a rule build demand, can interleave with the step incrementing done on shake restarts. This has the consequence that a shake rule may write it's result under a new step value, or leave a now outdated result.

For a single rule R, the following can occur:
    step 1   build A starts rule R, reading its inputs at step 1
    step 2   a restart bumps the step to 2 and marks R dirty
     ...     A finally finishes and, unguarded, writes Clean@2,
               a stale result, but its timestamp claims it is fresh

This PR adds a guard against that by checking that the step value is that same as when the rule started, implying no restart occurred. If one did, we should not overwrite it with the old result, as a new build may have raced and written a new one.

I added the (in the previous setup) flaky test, and reran it multiple times.

soulomoon · 2026-06-24T16:15:33Z

+A restart cancels the running build with an async exception, but the computation
+it spawned can outlive it and finish under a later build. 'compute' stamps the
+result with whatever the step is *now*, so a value read from stale inputs is
+recorded as freshly built and dependents skip recomputing it:


I thought asyncWithCleanUp would cleanup all the chilren ? So the spawned one won't live after we stop the old session ? Where does it go wrong 🤔 should we fix it instead ?

I thought asyncWithCleanUp would cleanup all the chilren ?

That's the intention, and it does.

But notice that incDatabase is called in a separate thread relative to all of the reads in compute, which means it can interleave and cause a rule's computation to assume the step of a latter rule build action. This is what causes the superceded diagnostic rule result to survive.

I don't think this is savable from the level of asyncWithCleanup. It's compute that writes the incorrect Clean with the new step value. A check has to occur in the same transaction that sets the status.

By design, incDatabase should only be called after we cancelShakeSession and it should stop all the session's running threads. I think it is bug if any compute interleave incDatabase.

Hmm that sounds reasonable, I'll try this again later today.

crtschin · 2026-06-25T10:10:53Z

Another data point for consideration my prefrontal cortex whipt up. If some rules survive past restarts attempts, due to (intended) laziness in shake database values. It's conceivable these thunks which capture parts of the state, never get evaluated or cleaned up. Which would also explain some memory usage/space leaks we've been observing.

soulomoon · 2026-06-25T17:55:24Z

Another data point for consideration my prefrontal cortex whipt up. If some rules survive past restarts attempts, due to (intended) laziness in shake database values. It's conceivable these thunks which capture parts of the state, never get evaluated or cleaned up. Which would also explain some memory usage/space leaks we've been observing.

I believe shake database values should always be strict. It must be a bug if laziness happens.

crtschin · 2026-06-25T18:43:15Z

I believe shake database values should always be strict. It must be a bug if laziness happens.

I misread/misinterpreted the splitIO call in builder. It's only lazy while the computation is running, which is triggered right afterwards. Good one, then I still have no clue yet where the orphaned/stale write comes from 😞.

crtschin · 2026-06-25T20:25:05Z

What happens when a shake restart occurs and cancels a running builder call in between the following statements. So after the lazy value is put in the database, but before it is forced?

haskell-language-server/hls-graph/src/Development/IDE/Graph/Internal/Database.hs

Lines 96 to 119 in 18633c7

    
           results <- liftIO $ atomically $ for keys $ \id -> do 
        
                   -- Spawn the id if needed 
        
                   status <- SMap.lookup id databaseValues 
        
                   val <- case viewDirty current $ maybe (Dirty Nothing) keyStatus status of 
        
                       Clean r -> pure r 
        
                       Running _ force val _ -> do 
        
                           modifyTVar' toForce (Wait force :) 
        
                           pure val 
        
                       Dirty s -> do 
        
                           let act = run (refresh db id s) 
        
                               (force, val) = splitIO (join act) 
        
                           SMap.focus (updateStatus $ Running current force val s) id databaseValues 
        
                           modifyTVar' toForce (Spawn force:) 
        
                           pure val 
        
                   pure (id, val) 
        
           toForceList <- liftIO $ readTVarIO toForce 
        
           let waitAll = run $ mapConcurrentlyAIO_ id toForceList 
        
           case toForceList of 
        
               [] -> return $ Left results 
        
               _ -> return $ Right $ do 
        
                       waitAll 
        
                       pure results

The lazy value, which will insert threads into the list registered for cleanup by asyncWithCleanup potentially after the scope of asyncWithCleanup already exited through the shake restart's cancel. The unsafePerformIO here does that registration. The comment above builder warns against making this synchronous, which is what making these values fully strict (without the async thread dance) implies.

soulomoon · 2026-06-26T07:32:43Z

Well spotted @crtschin, it does look like we have an scope escape problem here.

BTW, I was experimenting a new hls-graph implementation in #4927, we can have some finer control over those threads by centralize the management of them. Then have have some more goodies: do not need to kill a non-dirty running rule during restart etc. . Would you like to take a look ?

crtschin · 2026-06-26T17:01:09Z

Well spotted @crtschin, it does look like we have an scope escape problem here.

BTW, I was experimenting a new hls-graph implementation in #4927, we can have some finer control over those threads by centralize the management of them. Then have have some more goodies: do not need to kill a non-dirty running rule during restart etc. Would you like to take a look ?

I'll take a look 👍🏼

What do you think about the solution in this PR? I think it's still viable to merge this to solve #4985 in the short term. There's clearly other issues like unforced thunks containing IO-full async values and those being orphaned, which your rewrite tackles. But that doesn't look ready yet until then.

soulomoon · 2026-06-27T15:44:07Z

For running rule that is dirty, we would stop the rule thread and mark it as dirty.
For running rule that is not dirty, we would stop the rule thread but does not mark it as dirty.

What has the problem is the two consective restart cases, for the first restart, it is not dirty, so it leaked the running AIO in the running state.
right before the second restart we force it again, but since it is running under the out of scope AIO. It does not kills this force, we first mark it as dirty, and the hanging force(which should already be killed) mark it clean.

So, the second force is fine even if we have a different step ? What makes it problematic is dirty state being incorrectly overwrite ?
perhaps a slightly better solution is to check if we are going from running -> clean,
or better we pass in the correct AIO state for the forcing so we make it surely killed ? @crtschin

crtschin · 2026-06-27T16:33:19Z

So, the second force is fine even if we have a different step ? What makes it problematic is dirty state being incorrectly overwrite ?

I think the second force is fine, the guard prevents any outdated value from being stored as clean. At worst I think some redundant computations may occur, but that's preferable to invalid results being persisted.

The issue is indeed the newly dirty state being marked clean by rule results from previous steps.

perhaps a slightly better solution is to check if we are going from running -> clean

I don't think only checking that would be enough, we'd still need to check the step numbers to ensure that the correct step's results does the running -> clean transition. Otherwise we still get the old overwrite issue.

or better we pass in the correct AIO state for the forcing so we make it surely killed?

I was curious and sketched this a little while previously investigating. Instead of lazy unsafePerformIO thunks, I tried keeping the async values explicit, to be tracked in the AIO context they get spawned in. This got messy in my attempt, so I dismissed it for this PR.

I do think this might be a good follow-up to continue on with though. The fact that thunks containing threads can just keep existing untouched while invalidated by shake restarts does not sound ideal.

soulomoon · 2026-06-28T07:57:32Z

The propagation of leaked thread might actually create more problems even if the step is correct, e.g. if the step is being bumped and we are in the middle of marking dirty states. A propagation of leaked thread might capture a clean value that should be dirty in the new step. I do not think this can be fixed aftermath. It seems to me that the only possible fix is to prevent leaking thread even though it is not trival.

crtschin · 2026-06-28T09:31:00Z

A propagation of leaked thread might capture a clean value that should be dirty in the new step.

Apologies, I don't think I understand. That's exactly the situation that's fixed by this PR

soulomoon · 2026-06-30T08:24:57Z

A propagation of leaked thread might capture a clean value that should be dirty in the new step.

Apologies, I don't think I understand. That's exactly the situation that's fixed by this PR

For the leaked thread, a computation can start right after the step is being bumped and before we dirty the necessary rule values.

crtschin · 2026-07-01T19:17:18Z

For the leaked thread, a computation can start right after the step is being bumped and before we dirty the necessary rule values.

Oh, I see what you mean! Yea, bumping the step and dirtying should ideally be atomic.

soulomoon · 2026-07-03T10:41:17Z

One more thing, there are hooks that would modify the shake state, should only be run only when the dirty rules are fully stop. We should really fix the leaking directly.

crtschin · 2026-07-03T16:10:45Z

One more thing, there are hooks that would modify the shake state, should only be run only when the dirty rules are fully stop. We should really directly fix the leaking.

I'm convinced, I'll close this PR as the solution's incomplete. I'll give a stab at fixing the leaking threads specifically.

Add edit flaky stale diagnostic test

ee9614b

fendor requested a review from soulomoon June 24, 2026 07:14

soulomoon reviewed Jun 24, 2026

View reviewed changes

Discard rule results from runs superseded by a restart

42382e0

crtschin changed the title ~~WIP: Discard rule results from runs superseded by Shake restarts~~ Discard rule results from runs superseded by Shake restarts Jun 24, 2026

crtschin marked this pull request as ready for review June 24, 2026 19:00

crtschin requested a review from wz1000 as a code owner June 24, 2026 19:00

crtschin requested a review from soulomoon June 24, 2026 19:00

crtschin added the status: needs review This PR is ready for review label Jun 24, 2026

crtschin force-pushed the crtschin/stale-diagnostics branch from d09584c to 42382e0 Compare June 24, 2026 19:03

crtschin closed this Jul 3, 2026

Uh oh!

Uh oh!

Conversation

crtschin commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

soulomoon Jun 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crtschin Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

soulomoon Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

crtschin Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

crtschin commented Jun 25, 2026

Uh oh!

soulomoon commented Jun 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crtschin commented Jun 25, 2026

Uh oh!

crtschin commented Jun 25, 2026

Uh oh!

soulomoon commented Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crtschin commented Jun 26, 2026

Uh oh!

soulomoon commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crtschin commented Jun 27, 2026

Uh oh!

soulomoon commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crtschin commented Jun 28, 2026

Uh oh!

soulomoon commented Jun 30, 2026

Uh oh!

crtschin commented Jul 1, 2026

Uh oh!

soulomoon commented Jul 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

crtschin commented Jul 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

crtschin commented Jun 23, 2026 •

edited

Loading

soulomoon Jun 24, 2026 •

edited

Loading

soulomoon Jun 25, 2026 •

edited

Loading

soulomoon commented Jun 25, 2026 •

edited

Loading

soulomoon commented Jun 26, 2026 •

edited

Loading

soulomoon commented Jun 27, 2026 •

edited

Loading

soulomoon commented Jun 28, 2026 •

edited

Loading

soulomoon commented Jul 3, 2026 •

edited

Loading