2.1 backport logrecovery #6395
Open
amcdonaldccri wants to merge 3 commits into
added 3 commits
May 26, 2026 06:19
…ata apache#4873 This commit makes two major changes. First, it changes log recovery to use block caches. Second, it checks whether a tablet has any data in walogs before acquiring the recovery lock. Together, these two changes significantly speed up loading tablets that have no data in walogs. The changes introduce an extra opening of the walogs to see whether the recovery lock needs to be acquired. Using the block caches for this extra opening should avoid any extra cost. The block caches also help when many tablets with the same walogs are assigned to a tablet server. A simple test showed an 8x speedup in tablet load times. Any time a tablet has an unclean shutdown, it will have the walogs of the dead tserver assigned to it even if it had no data in those walogs. These changes make loading tablets in that situation much faster. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
In apache#4873 a check was added to inspect walogs during tablet load to see whether they had any data for the tablet. This check happens before volume replacement, which also runs during tablet load. Therefore, if volume replacement is needed for the walogs, the check fails because it cannot find the files, and the tablet fails to load. To fix this problem, the new check was modified to switch volumes if needed before it runs. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
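The ordering problem described in this commit message (rewrite the walog path to its replacement volume first, then look for the file) can be illustrated with a minimal sketch. This is not the code from the PR: `WalogPathResolver`, `switchVolume`, and the example URIs are hypothetical names chosen for illustration, and the prefix-to-prefix map only stands in for whatever volume replacement configuration the cluster uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Accumulo: rewrites a walog path onto its
// replacement volume so a later "does this walog have data?" check opens the
// file where it actually lives.
final class WalogPathResolver {

  // Returns the path with the first matching old-volume prefix replaced.
  // A prefix only matches on a path-segment boundary, so ".../accumulo"
  // does not match ".../accumulo2".
  static String switchVolume(String path, Map<String, String> replacements) {
    for (Map.Entry<String, String> r : replacements.entrySet()) {
      String from = stripTrailingSlash(r.getKey());
      String to = stripTrailingSlash(r.getValue());
      if (path.equals(from)) {
        return to;
      }
      if (path.startsWith(from + "/")) {
        return to + path.substring(from.length());
      }
    }
    return path; // no replacement configured for this volume
  }

  private static String stripTrailingSlash(String s) {
    return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
  }

  public static void main(String[] args) {
    Map<String, String> replacements = new LinkedHashMap<>();
    replacements.put("hdfs://nn1/accumulo", "hdfs://nn2/accumulo");
    // The data check should be given the switched path, not the original one.
    System.out.println(switchVolume("hdfs://nn1/accumulo/wal/ts1/abc", replacements));
  }
}
```

The point of the sketch is only the order of operations: a check that opens the original path before the switch would fail to find the file, which is the failure this commit fixes.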
… log recovery. (apache#4874) The log recovery code would list the sorted walog files multiple times during recovery. These changes modify the code to list the files only once. The listing is also cached for a short period of time to improve the case of multiple tablets referencing the same walogs. Along with apache#4873, this should result in much less traffic to the namenode when an entire Accumulo cluster shuts down and needs to recover. {"fundingSource": "41201", "team": "FED.ICGSA.OPS.MOE", "fshGit": "dummy-lo", "fshDocker": "sha256:20cf0045"}
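The idea of caching a directory listing for a short period, so that many tablets referencing the same walogs trigger only one listing call, can be sketched as follows. This is an illustrative sketch using only the JDK, not the implementation in the PR: `ShortLivedListingCache`, the `lister` function, and the TTL value are all assumptions, and the injected clock exists only to make the expiry behavior easy to exercise.

```java
import java.util.List;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical helper, not part of Accumulo: remembers the result of an
// expensive listing call (for example a namenode directory listing) for a
// short time so repeated lookups of the same directory reuse it.
final class ShortLivedListingCache {

  private static final class Entry {
    final List<String> files;
    final long loadedAtNanos;

    Entry(List<String> files, long loadedAtNanos) {
      this.files = files;
      this.loadedAtNanos = loadedAtNanos;
    }
  }

  private final ConcurrentHashMap<String, Entry> cache = new ConcurrentHashMap<>();
  private final Function<String, List<String>> lister; // the expensive call
  private final long ttlNanos;
  private final LongSupplier clock; // e.g. System::nanoTime

  ShortLivedListingCache(Function<String, List<String>> lister, long ttlNanos,
      LongSupplier clock) {
    this.lister = lister;
    this.ttlNanos = ttlNanos;
    this.clock = clock;
  }

  // Returns the cached listing if it is younger than the TTL, otherwise
  // lists the directory again and stores the fresh result.
  List<String> list(String dir) {
    long now = clock.getAsLong();
    Entry e = cache.compute(dir, (k, old) -> {
      if (old != null && now - old.loadedAtNanos < ttlNanos) {
        return old;
      }
      return new Entry(List.copyOf(lister.apply(k)), now);
    });
    return e.files;
  }

  public static void main(String[] args) {
    AtomicInteger calls = new AtomicInteger();
    ShortLivedListingCache cache = new ShortLivedListingCache(dir -> {
      calls.incrementAndGet();
      return List.of(dir + "/part-r-00000");
    }, 3_000_000_000L, System::nanoTime);

    // Two tablets asking about the same sorted walog directory.
    cache.list("/recovery/walog-1");
    cache.list("/recovery/walog-1");
    System.out.println("listing calls: " + calls.get());
  }
}
```

A short TTL keeps the entries from going stale for long while still absorbing the burst of identical listings that occurs when many tablets recover at once.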
909d002 to
8333a39
Hi
I saw issue #4887 and thought I could help.
With some help from AI, I got it to pass the unit tests.
These are the integration tests that got stuck or timed out:
org.apache.accumulo.test.fate.zookeeper.FateIT never returned, but I ran it again and it passed
org.apache.accumulo.test.tracing.ScanTracingIT timed out twice
org.apache.accumulo.test.functional.MetadataMaxFilesIT timed out twice
org.apache.accumulo.test.functional.TimeoutIT timed out twice
org.apache.accumulo.test.functional.TServerShutdownOptimizationsIT timed out twice
org.apache.accumulo.test.functional.KerberosIT timed out twice
org.apache.accumulo.test.shell.ShellServerIT never returned, but I ran it again and it passed twice
[ERROR] Tests run: 11, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 166.9 s <<< FAILURE! -- in org.apache.accumulo.test.functional.KerberosIT
[ERROR] org.apache.accumulo.test.functional.KerberosIT.testGetDelegationTokenDenied -- Time elapsed: 14.48 s <<< ERROR!
java.lang.IllegalStateException: org.apache.hadoop.security.KerberosAuthException: failure to login: using ticket cache file: FILE:/tmp/krb5cc_911602271_DwR0dF javax.security.auth.login.LoginException: java.lang.IllegalArgumentException: Illegal principal name ajmcdonald@CCRI.COM: org.apache.hadoop.security.authentication.util.KerberosName$NoMatchingRule: No rules applied to ajmcdonald@CCRI.COM
I would like to deploy it to one of our dev environments and do some ingest testing, but I haven't gotten to it yet.