Discard rule results from runs superseded by Shake restarts #4988
Discard rule results from runs superseded by Shake restarts #4988crtschin wants to merge 2 commits into
Conversation
| A restart cancels the running build with an async exception, but the computation | ||
| it spawned can outlive it and finish under a later build. 'compute' stamps the | ||
| result with whatever the step is *now*, so a value read from stale inputs is | ||
| recorded as freshly built and dependents skip recomputing it: |
There was a problem hiding this comment.
I thought asyncWithCleanUp would cleanup all the chilren ? So the spawned one won't live after we stop the old session ? Where does it go wrong 🤔 should we fix it instead ?
There was a problem hiding this comment.
I thought asyncWithCleanUp would cleanup all the chilren ?
That's the intention, and it does.
But notice that incDatabase is called in a separate thread relative to all of the reads in compute, which means it can interleave and cause a rule's computation to assume the step of a latter rule build action. This is what causes the superceded diagnostic rule result to survive.
I don't think this is savable from the level of asyncWithCleanup. It's compute that writes the incorrect Clean with the new step value. A check has to occur in the same transaction that sets the status.
There was a problem hiding this comment.
By design, incDatabase should only be called after we cancelShakeSession and it should stop all the session's running threads. I think it is bug if any compute interleave incDatabase.
There was a problem hiding this comment.
Hmm that sounds reasonable, I'll try this again later today.
d09584c to
42382e0
Compare
|
Another data point for consideration my prefrontal cortex whipt up. If some rules survive past restarts attempts, due to (intended) laziness in shake database values. It's conceivable these thunks which capture parts of the state, never get evaluated or cleaned up. Which would also explain some memory usage/space leaks we've been observing. |
I believe shake database values should always be strict. It must be a bug if laziness happens. |
I misread/misinterpreted the |
|
What happens when a shake restart occurs and cancels a running The lazy value, which will insert threads into the list registered for cleanup by |
|
Well spotted @crtschin, it does look like we have an scope escape problem here. BTW, I was experimenting a new hls-graph implementation in #4927, we can have some finer control over those threads by centralize the management of them. Then have have some more goodies: do not need to kill a non-dirty running rule during restart etc. . Would you like to take a look ? |
I'll take a look 👍🏼 What do you think about the solution in this PR? I think it's still viable to merge this to solve #4985 in the short term. There's clearly other issues like unforced thunks containing IO-full async values and those being orphaned, which your rewrite tackles. But that doesn't look ready yet until then. |
|
For running rule that is dirty, we would stop the rule thread and mark it as dirty. What has the problem is the two consective restart cases, for the first restart, it is not dirty, so it leaked the running AIO in the running state. So, the second force is fine even if we have a different step ? What makes it problematic is dirty state being incorrectly overwrite ? |
I think the second force is fine, the guard prevents any outdated value from being stored as clean. At worst I think some redundant computations may occur, but that's preferable to invalid results being persisted. The issue is indeed the newly dirty state being marked clean by rule results from previous steps.
I don't think only checking that would be enough, we'd still need to check the step numbers to ensure that the correct step's results does the
I was curious and sketched this a little while previously investigating. Instead of lazy I do think this might be a good follow-up to continue on with though. The fact that thunks containing threads can just keep existing untouched while invalidated by shake restarts does not sound ideal. |
|
The propagation of leaked thread might actually create more problems even if the step is correct, e.g. if the step is being bumped and we are in the middle of marking dirty states. A propagation of leaked thread might capture a |
Apologies, I don't think I understand. That's exactly the situation that's fixed by this PR |
For the leaked thread, a computation can start right after the step is being bumped and before we dirty the necessary rule values. |
Oh, I see what you mean! Yea, bumping the step and dirtying should ideally be atomic. |
|
One more thing, there are hooks that would modify the shake state, should only be run only when the dirty rules are fully stop. We should really fix the leaking directly. |
I'm convinced, I'll close this PR as the solution's incomplete. I'll give a stab at fixing the leaking threads specifically. |
Closes #4985.
When any rule is (re)ran, do to being invalidated (in the above issue, due to
documentDidChangenotifications from typing), we track in our shake setup which shake session rule and rule results belong to. Thestepbookkeeping is an implementation detail of our shake setup to track whether keys are dirty.What went wrong here is that a rule result being put in the database for a rule build demand, can interleave with the
stepincrementing done on shake restarts. This has the consequence that a shake rule may write it's result under a newstepvalue, or leave a now outdated result.This PR adds a guard against that by checking that the step value is that same as when the rule started, implying no restart occurred. If one did, we should not overwrite it with the old result, as a new build may have raced and written a new one.
I added the (in the previous setup) flaky test, and reran it multiple times.