[Trainer,Dataset/Feature] [DRAFT] Streamable Dataset Cache implementation with lazy loading support#14599
KohakuBlueleaf wants to merge 7 commits into
…/ComfyUI into streamabledataset
cc @alexisrolland @rattus128
TL;DR:
Intro
This draft pull request proposes a new dataset cache format to replace the previous dataset cache, which pickles everything. The previous cache system does support sharding, so in practice it should not exhaust the user's system RAM, but it is still not best practice.
Therefore, this PR proposes a new cache format that separates the meta/header from the tensor data. The data (conditions and latents) is encoded as "pickled non-tensor data + data structure skeleton" + "safetensors of all tensor data".
Because safetensors natively supports key-based addressing, individual tensors can be read without loading the whole file. This design should therefore use less system RAM, have much lower IO overhead, and allow proper streaming reads of the dataset. Furthermore, I implemented a lazy loading system on top of this streamable dataset implementation: the latents and conditions in the dataset can be loaded from only the meta/header part, which carries the corresponding safetensors key info, and the tensors are only actually loaded (realized) when needed.
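To illustrate the lazy loading idea, here is a minimal sketch, not the code in this PR. `LazyTensor`, `load_from_shard` and the key names are hypothetical, and a plain dict stands in for a safetensors shard so the example runs without torch or safetensors installed. In real code the loader would presumably read a single key from the shard file instead.

```python
from typing import Any, Callable


class LazyTensor:
    """Hypothetical handle: stores only a key until the tensor is needed."""

    def __init__(self, key: str, loader: Callable[[str], Any]):
        self.key = key
        self._loader = loader
        self._value = None
        self.is_loaded = False

    def realize(self) -> Any:
        # Load on first access, then reuse the cached value.
        if not self.is_loaded:
            self._value = self._loader(self.key)
            self.is_loaded = True
        return self._value


# Assumption: this dict stands in for one safetensors shard on disk.
shard = {"0.latent.samples": [0.1, 0.2], "0.cond.0.0": [0.3]}
reads = []  # records which keys were actually read


def load_from_shard(key: str) -> Any:
    reads.append(key)
    return shard[key]


# Building the dataset only needs the keys from the meta/header part.
dataset = [LazyTensor(key, load_from_shard) for key in shard]
reads_before_realize = list(reads)

first = dataset[0].realize()   # triggers one read
again = dataset[0].realize()   # served from the cached value
```

Only the entry that is realized gets read, and the second `realize()` call does not hit the loader again.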
Implementation detail
Conditions and latents are both native Python container types, possibly nested, and most of their size comes from the tensor objects inside them (latent tensor, text embedding tensor...).
Instead of pickling the whole list of them, I replace every tensor in those containers with the key used to load it from the safetensors file, and put the tensor data directly into safetensors.
Sharding still exists because the safetensors files will be large.
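The skeleton/tensor split described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's actual code: `FakeTensor` stands in for `torch.Tensor`, the `store` dict stands in for the safetensors file, and `split_tensors`, `restore_tensors`, the `__tensor_ref__` marker and the dotted key scheme are all hypothetical names chosen for the example. It handles plain dicts, lists and tuples only.

```python
import pickle
from typing import Any, Callable, Dict

REF = "__tensor_ref__"  # hypothetical marker key used in the skeleton


class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without torch."""

    def __init__(self, values):
        self.values = list(values)


def split_tensors(obj: Any, prefix: str, store: Dict[str, Any]) -> Any:
    """Copy obj, moving every tensor into store and leaving a key behind."""
    if isinstance(obj, FakeTensor):
        store[prefix] = obj
        return {REF: prefix}
    if isinstance(obj, dict):
        return {k: split_tensors(v, f"{prefix}.{k}", store) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        items = [split_tensors(v, f"{prefix}.{i}", store) for i, v in enumerate(obj)]
        return type(obj)(items)
    return obj  # non-tensor data stays in the skeleton


def restore_tensors(obj: Any, load: Callable[[str], Any]) -> Any:
    """Inverse of split_tensors: swap every key marker for load(key)."""
    if isinstance(obj, dict):
        if set(obj) == {REF}:
            return load(obj[REF])
        return {k: restore_tensors(v, load) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(restore_tensors(v, load) for v in obj)
    return obj


sample = {
    "latent": {"samples": FakeTensor([1, 2])},
    "cond": [[FakeTensor([3]), {"pooled_output": FakeTensor([4]), "note": "caption"}]],
}
store: Dict[str, Any] = {}                   # would be written as safetensors
skeleton = split_tensors(sample, "0", store)
header = pickle.dumps(skeleton)              # small: contains no tensor data
restored = restore_tensors(pickle.loads(header), store.__getitem__)
```

The pickled header stays small because it holds only the container structure, the non-tensor values and the keys. Passing a lazy loader instead of `store.__getitem__` would defer the actual tensor reads.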
Why draft now