[Trainer,Dataset/Feature] [DRAFT] Streamable Dataset Cache implementation with lazy loading support#14599

Draft
KohakuBlueleaf wants to merge 7 commits into Comfy-Org:master from KohakuBlueleaf:streamabledataset

@KohakuBlueleaf

Contributor

TL;DR:

  1. This PR proposes a streaming-friendly dataset cache format.
  2. This PR proposes a lazy loading mechanism built on (1).
  3. This PR modifies the nodes_train implementation based on (1) and (2).

Intro

This draft pull request proposes a new dataset cache format to replace the previous dataset cache, which pickles everything. The previous cache system does have features like sharding, so in practice it should not blow up the user's system RAM, but it is still not best practice.

Therefore, this PR proposes a new cache format that separates the meta/header from the tensor data. The data (conditions and latents) is encoded as "pickled non-tensor data + data structure skeleton" + "safetensors of all tensor data".

Since safetensors natively supports streaming and key-based addressing, this design can achieve lower system RAM usage, much lower IO overhead, and proper streaming of dataset reads. Furthermore, I implemented a lazy loading system on top of this streamable dataset implementation: the latents and conditions in the dataset can be loaded from only the meta/header part, which carries the corresponding safetensors key info, and the tensors are only actually loaded/realized when needed.
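To make the lazy loading idea concrete, here is a minimal sketch. It is not the code in this PR: `LazyTensor`, `realize`, and the `loader` callback are hypothetical names, and an in-memory dict stands in for the safetensors shard so the example runs without extra dependencies. With the real library, the loader would probably open the shard with `safetensors.safe_open(path, framework="pt")` and call `get_tensor(key)`.

```python
from typing import Any, Callable


class LazyTensor:
    """Hypothetical handle: remembers where a tensor lives, loads it on demand."""

    def __init__(self, shard: str, key: str, loader: Callable[[str, str], Any]):
        self.shard = shard      # which shard file holds the tensor
        self.key = key          # key of the tensor inside that shard
        self._loader = loader   # callback that actually reads the tensor
        self._value = None
        self._loaded = False

    def realize(self) -> Any:
        """Load the tensor on first call, then reuse the cached value."""
        if not self._loaded:
            self._value = self._loader(self.shard, self.key)
            self._loaded = True
        return self._value


# Stand-in storage: one "shard" mapping keys to fake tensor payloads.
fake_storage = {"shard_0.safetensors": {"t.0": [0.1, 0.2, 0.3]}}
calls = []  # records every real read, to show that loading is deferred


def fake_loader(shard: str, key: str) -> Any:
    calls.append((shard, key))
    return fake_storage[shard][key]


lazy = LazyTensor("shard_0.safetensors", "t.0", fake_loader)
reads_before = len(calls)   # nothing has been read yet
value = lazy.realize()      # first access triggers the read
value_again = lazy.realize()  # second access hits the cached value
```

Building the dataset only creates handles like `lazy`; the read happens at the first `realize()` call, so tensor payloads that are never touched never reach system RAM.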

Implementation detail

Conditions and latents are all native Python container types, possibly nested, and most of their size comes from the tensor objects inside them (latent tensors, text embedding tensors, ...).

Instead of pickling the whole list of them, I replace every tensor in those containers with a key used to load it from a safetensors file, and put the tensor data into safetensors directly.
Sharding still exists because the safetensors files will be large.
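The skeleton/tensor split described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `FakeTensor` stands in for `torch.Tensor` so the snippet runs without PyTorch, and `TensorKey`, `split_tensors`, `restore_tensors`, and the `t.N` key naming are all hypothetical. In a real cache, the skeleton would be pickled and the collected tensors written to a safetensors file.

```python
from dataclasses import dataclass
from typing import Any


class FakeTensor:
    """Stand-in for torch.Tensor (assumption, keeps the sketch dependency-free)."""

    def __init__(self, values):
        self.values = list(values)


@dataclass(frozen=True)
class TensorKey:
    """Placeholder left in the skeleton where a tensor used to be."""
    key: str


def split_tensors(obj: Any, tensors: dict, prefix: str = "t") -> Any:
    """Return a tensor-free copy of obj; every tensor found is moved into `tensors`."""
    if isinstance(obj, FakeTensor):
        key = f"{prefix}.{len(tensors)}"
        tensors[key] = obj
        return TensorKey(key)
    if isinstance(obj, dict):
        return {k: split_tensors(v, tensors, prefix) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        # Plain lists/tuples only; namedtuples would need separate handling.
        return type(obj)(split_tensors(v, tensors, prefix) for v in obj)
    return obj  # non-tensor leaf (str, int, None, ...) stays in the skeleton


def restore_tensors(skeleton: Any, tensors: dict) -> Any:
    """Inverse of split_tensors: swap each TensorKey back for its tensor."""
    if isinstance(skeleton, TensorKey):
        return tensors[skeleton.key]
    if isinstance(skeleton, dict):
        return {k: restore_tensors(v, tensors) for k, v in skeleton.items()}
    if isinstance(skeleton, (list, tuple)):
        return type(skeleton)(restore_tensors(v, tensors) for v in skeleton)
    return skeleton


# One dataset entry with a latent dict and a nested conditioning list.
sample = {
    "latent": {"samples": FakeTensor([0.1, 0.2])},
    "cond": [[FakeTensor([1.0]), {"pooled_output": FakeTensor([2.0]), "note": "caption"}]],
}
tensors: dict = {}
skeleton = split_tensors(sample, tensors)      # small, tensor-free structure
restored = restore_tensors(skeleton, tensors)  # same structure with tensors back
```

The skeleton keeps every non-tensor value and the container layout, so it stays small no matter how large the tensors are, and each tensor can be fetched by key without touching the others.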

Why draft now

  1. The exact scheme of the streamable dataset may need some adjustment.
  2. The lazy loading mechanism may not be best practice here
    • or we may just not need it

@KohakuBlueleaf

Contributor Author

cc @alexisrolland @rattus128
Suggestions/comments needed.
