Fix five pipeline-correctness bugs by evasnow1992 · Pull Request #19 · NVIDIA-BioNeMo/KERMT

evasnow1992 · 2026-07-02T02:55:42Z

Summary

Five correctness fixes found during a focused review of the training, finetuning, and inference pipelines. Every fix was verified in-container (torch 2.9.1) before committing. No refactors, no behavior changes beyond the bug being corrected.

Fixes (one per commit)

fix(predict): filter invalid SMILES before featurization task/predict.py built valid_indices as list(range(len(test_data))), i.e. it never filtered — the None-insertion/realignment code below it was dead. Any unparseable or empty SMILES (or a zero-heavy-atom molecule) in a prediction CSV would crash featurization instead of being skipped. Now validity is checked with RDKit (MolFromSmiles, ≥1 heavy atom) exactly as the training path does, so invalid rows are dropped and predictions align to the surviving molecules.
fix(finetune): zero MTLLoss.log_sigma grad each step (--use_mtl_loss) MTLLoss.log_sigma is a standalone nn.Parameter appended directly to the optimizer (param_groups[1]); it is not a submodule of model, so the per-batch model.zero_grad() never cleared its gradient. The gradient accumulated across every batch, corrupting the learned task-uncertainty weights. Added mtl_loss.zero_grad() alongside model.zero_grad(). Verified with the real MTLLoss: log_sigma.grad went 1→2→3 across steps before the fix, constant per-step after.
fix(eval): use BCE on probabilities for classification eval loss In eval the model applies sigmoid to classification outputs, so the eval predictions are already probabilities. The eval-loss path then passed them to BCEWithLogitsLoss, which sigmoids a second time — a double sigmoid that produced a wrong loss (off by up to ~7x on confident predictions). Now the classification branch uses plain BCE on the probabilities; regression is untouched. Verified the corrected loss matches BCEWithLogits on the raw logits to within 1e-7.
fix(eval): keep per-task alignment when a task has no valid targets In evaluate_predictions, a task with zero valid targets hit continue without appending to results, so results came out shorter than num_tasks and every later task's metric silently shifted down one index. Now appends float('nan') before continue, matching the all-0s/all-1s branch directly above it.
fix(finetune): don't double-wrap DataLoader for ensemble_size > 1 The training loop assigned the wrapped loader back to train_data, the same name holding the underlying dataset. The first ensemble model was fine, but the second iteration then did DataLoader(DataLoader(...)), which crashes with "'DataLoader' object is not subscriptable". Use a separate train_loader local and keep train_data as the dataset. Verified with a 2-iteration repro (old crashes on iter 1, fixed runs both).

Behavior impact (what changes)

Predict (Docker container build error #1): CSVs with invalid SMILES now run to completion instead of crashing; output rows correspond to the valid molecules only.
Finetune with --use_mtl_loss (make cuik-molmaker optional #2): the multi-task uncertainty weights now train correctly. Any prior --use_mtl_loss run had corrupted task weighting. --use_mtl_loss is off by default.
Classification eval/val loss value (Molecules clustering code (PCA and K-means) for training datasets: #3): now correct. Metric-based (AUC) selection and predictions are unchanged; loss-based checkpoint selection and HPO for classification now see the correct loss.
Multi-task eval with partially-labeled splits (Fix TypeError when dynamic_depth="truncnorm" by using rvs() instead of rvs(1) #4): per-task metrics are now correctly aligned. Single-task and fully-labeled multi-task runs unaffected.
Ensemble finetune (Disallow inconsistent usage of finetune and predict workflows #5): --ensemble_size > 1 now works; it was previously broken and could not have completed.

make_predictions passed skip_invalid_smiles=False and then set valid_indices to the full range with no validity check, so any empty/unparseable SMILES reached MolGraph (which raises ValueError) and aborted the entire prediction run; the None-insertion in write_prediction was consequently dead code. Compute valid_indices with the same criterion as utils.filter_invalid_smiles (non-empty, RDKit-parseable, >=1 heavy atom). Invalid rows are dropped before featurization and the written predictions realign to the original input rows via the existing None-insertion. Signed-off-by: Eva Xue <evax@nvidia.com>

MTLLoss.log_sigma is a standalone nn.Parameter registered directly in the optimizer (param_groups[1]), not a submodule of `model`, so the per-batch `model.zero_grad()` never cleared its gradient. It accumulated across every batch, corrupting the learned task-uncertainty weights. Add `mtl_loss.zero_grad()` next to `model.zero_grad()`. Only affects finetuning with --use_mtl_loss (off by default). Verified with the real MTLLoss: log_sigma.grad went 1->2->3 across steps before, constant per-step after. Signed-off-by: Eva Xue <evax@nvidia.com>

In eval the model applies sigmoid to classification outputs, so batch_preds are already probabilities. The eval-loss path then fed them to BCEWithLogitsLoss, which sigmoids a second time, producing a wrong loss (off by up to ~7x on confident predictions). Branch on dataset_type and use plain BCE on the probabilities for classification; regression is unchanged. Affects only the reported/aggregated classification eval+val loss value, not predictions or metric-based (AUC) selection. Loss-based checkpoint selection and HPO for classification now see the correct loss. Verified: corrected loss matches BCEWithLogits on the raw logits to within 1e-7. Signed-off-by: Eva Xue <evax@nvidia.com>

In evaluate_predictions, a task with zero valid targets hit `continue` without appending to `results`, so `results` came out shorter than num_tasks and every later task's metric shifted down one index. Append float('nan') before continue, matching the all-0s/all-1s branch just above. Only affects multi-task datasets where at least one task has no labeled rows in an eval split; single-task and fully-labeled multi-task runs are unaffected. Per-task metrics for such runs were mislabeled/misaligned before. Signed-off-by: Eva Xue <evax@nvidia.com>

The training loop assigned the wrapped loader back to `train_data`, the same name holding the underlying dataset. On the first ensemble model this was fine, but the second iteration then did DataLoader(DataLoader(...)), which crashes with "'DataLoader' object is not subscriptable". Use a separate `train_loader` local and keep `train_data` as the dataset. Fixes --ensemble_size > 1 for finetuning, which was previously broken (any prior multi-model ensemble finetune could not have completed). Single-model (default ensemble_size=1) runs are unaffected. Verified with a 2-iteration DataLoader repro: old crashes on iter 1, fixed version runs both. Signed-off-by: Eva Xue <evax@nvidia.com>

evasnow1992 added 5 commits July 1, 2026 15:53

evasnow1992 requested a review from sveccham July 2, 2026 02:55

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix five pipeline-correctness bugs#19

Fix five pipeline-correctness bugs#19
evasnow1992 wants to merge 5 commits into
NVIDIA-BioNeMo:mainfrom
evasnow1992:evax/pipeline-correctness-fixes

evasnow1992 commented Jul 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

evasnow1992 commented Jul 2, 2026

Summary

Fixes (one per commit)

Behavior impact (what changes)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant