[Trainer,Dataset/Feature] [DRAFT] Streamable Dataset Cache implementation with lazy loading support by KohakuBlueleaf · Pull Request #14599 · Comfy-Org/ComfyUI

KohakuBlueleaf · 2026-06-23T13:42:32Z

TL;DR:

1. This PR proposes a streaming-friendly dataset cache format.
2. This PR proposes a lazy loading mechanism built on 1.
3. This PR modifies the nodes_train implementation based on 1 and 2.

Intro

This draft pull request proposes a new dataset cache format intended to replace the previous dataset cache, which pickles everything. The previous cache system does have features such as sharding, so in normal usage it should not blow up the user's system RAM, but it is still not best practice.

Therefore, this PR proposes a new cache format that separates the meta/header from the tensor data. The data (conditions and latents) is encoded as "pickled non-tensor data + data structure skeleton" + "safetensors file of all tensor data".
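A minimal sketch of the split described above, not the PR's actual code. The names `split_tensors`, the `is_tensor` predicate, and the `{"__tensor__": key}` placeholder are all hypothetical. A real implementation would presumably pass `torch.is_tensor` as the predicate and write the flat dict with `safetensors.torch.save_file`; here a stand-in class plays the tensor so the example runs with the standard library alone.

```python
import pickle
from typing import Any, Callable


def split_tensors(obj: Any, is_tensor: Callable[[Any], bool], tensors: dict) -> Any:
    """Return a copy of obj with every tensor replaced by a key placeholder.

    Extracted tensors are collected into the flat `tensors` dict, which is
    the part that would go into a safetensors file.
    """
    if is_tensor(obj):
        key = f"t.{len(tensors)}"
        tensors[key] = obj
        return {"__tensor__": key}  # hypothetical placeholder format
    if isinstance(obj, dict):
        return {k: split_tensors(v, is_tensor, tensors) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(split_tensors(v, is_tensor, tensors) for v in obj)
    return obj  # plain non-tensor data stays in the skeleton


class FakeTensor:
    """Stand-in for torch.Tensor, used only for this demo."""

    def __init__(self, name: str):
        self.name = name


# One conditioning entry in a nested [tensor, {options}] shape (assumed layout).
sample = [
    FakeTensor("text_embedding"),
    {"pooled_output": FakeTensor("pooled"), "prompt": "a cat"},
]

tensors: dict = {}
skeleton = split_tensors(sample, lambda o: isinstance(o, FakeTensor), tensors)
header_bytes = pickle.dumps(skeleton)  # small: contains no tensor data
```

The skeleton keeps the original nesting and all non-tensor values, so the pickled header stays small no matter how large the tensors are.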

Safetensors natively supports streaming and key-based addressing, so this design can achieve lower system RAM usage, much lower IO overhead, and proper streaming when reading the dataset. Furthermore, I implemented a lazy loading system on top of this streamable dataset implementation: the latents and conditions in the dataset can be loaded from only the meta/header part plus the corresponding safetensors key info, and the tensors are actually loaded (realized) only when needed.
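The lazy loading idea can be sketched roughly as follows. `LazyTensor`, `realize`, and `load_from_shard` are hypothetical names, not the PR's API. The loader here reads from an in-memory dict so the example has no dependencies; with safetensors it could instead wrap `safe_open(path, framework="pt")` and call `get_tensor(key)`.

```python
from typing import Any, Callable


class LazyTensor:
    """Handle that knows a tensor's key but loads the data only on demand."""

    def __init__(self, key: str, loader: Callable[[str], Any]):
        self.key = key
        self._loader = loader
        self._value = None
        self._loaded = False

    def realize(self) -> Any:
        """Load the tensor on first call, then reuse the cached value."""
        if not self._loaded:
            self._value = self._loader(self.key)
            self._loaded = True
        return self._value


# Stand-in for a safetensors shard: maps key -> tensor data.
fake_shard = {"latent.0": [0.1, 0.2], "latent.1": [0.3, 0.4]}
reads = []


def load_from_shard(key: str):
    reads.append(key)  # record every real read
    return fake_shard[key]


# Building the dataset touches only keys, never tensor data.
dataset = [LazyTensor(k, load_from_shard) for k in fake_shard]
first = dataset[0].realize()  # triggers the only read
again = dataset[0].realize()  # served from the cached value
```

Whether realized tensors should stay cached or be dropped after each training step is a design choice this sketch leaves open; keeping them trades RAM for fewer reads.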

Implementation detail

Conditions and latents are all Python native container types, possibly nested, and their size comes mainly from the tensor objects inside them (latent tensor, text embedding tensor, ...).

Instead of pickling the whole list of them, I replace every tensor in those containers with the key used to load it from the safetensors file, and put the tensor data into safetensors directly.
Sharding still exists because a single safetensors file would be large.
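One way the sharding could work, as a sketch under assumptions: tensors are packed greedily into shards up to a byte budget, and the header stores a key-to-file index so a reader opens only the shard it needs. `plan_shards`, the file naming, and the byte sizes are hypothetical and not taken from the PR.

```python
def plan_shards(sizes: dict[str, int], max_shard_bytes: int) -> dict[str, str]:
    """Greedily assign tensor keys to shard files under a size budget.

    A tensor larger than the budget still gets a shard of its own.
    """
    index: dict[str, str] = {}
    shard_id, used = 0, 0
    for key, nbytes in sizes.items():
        # Start a new shard when this tensor would overflow a non-empty one.
        if used > 0 and used + nbytes > max_shard_bytes:
            shard_id += 1
            used = 0
        index[key] = f"shard_{shard_id:05d}.safetensors"
        used += nbytes
    return index


# Hypothetical tensor sizes in bytes.
sizes = {"latent.0": 400, "latent.1": 400, "cond.0": 300, "cond.1": 900}
index = plan_shards(sizes, max_shard_bytes=1000)
```

With this index stored in the header, resolving a key means one dictionary lookup followed by a read from a single shard.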

Why draft now

- The exact scheme of the streamable dataset may need some adjustment
- The mechanism of lazy loading may not be best practice here
  - or we just don't need it


KohakuBlueleaf · 2026-06-25T00:43:43Z

cc @alexisrolland @rattus128
Need suggestions/comments

KohakuBlueleaf added 7 commits May 24, 2026 17:55

Add new dataset impl designed for streaming

5102760

Merge branch 'master' into streamabledataset

4af1600

Add new dataset impl designed for streaming

fc4e64e

Merge branch 'streamabledataset' of https://github.com/KohakuBlueleaf…/ComfyUI into streamabledataset

c21bd24

Lazy loading implementation of new dataset cache format

19c81ab

The utilization of new lazy/stream format of dataset

27bebb3

avoid python import module duplication

e593318

KohakuBlueleaf wants to merge 7 commits into Comfy-Org:master from KohakuBlueleaf:streamabledataset
