Fix five pipeline-correctness bugs#19
Open
evasnow1992 wants to merge 5 commits into
Open
Conversation
make_predictions passed skip_invalid_smiles=False and then set valid_indices to the full range with no validity check, so any empty/unparseable SMILES reached MolGraph (which raises ValueError) and aborted the entire prediction run; the None-insertion in write_prediction was consequently dead code. Compute valid_indices with the same criterion as utils.filter_invalid_smiles (non-empty, RDKit-parseable, >=1 heavy atom). Invalid rows are dropped before featurization and the written predictions realign to the original input rows via the existing None-insertion. Signed-off-by: Eva Xue <evax@nvidia.com>
MTLLoss.log_sigma is a standalone nn.Parameter registered directly in the optimizer (param_groups[1]), not a submodule of `model`, so the per-batch `model.zero_grad()` never cleared its gradient. It accumulated across every batch, corrupting the learned task-uncertainty weights. Add `mtl_loss.zero_grad()` next to `model.zero_grad()`. Only affects finetuning with --use_mtl_loss (off by default). Verified with the real MTLLoss: log_sigma.grad went 1->2->3 across steps before, constant per-step after. Signed-off-by: Eva Xue <evax@nvidia.com>
In eval the model applies sigmoid to classification outputs, so batch_preds are already probabilities. The eval-loss path then fed them to BCEWithLogitsLoss, which sigmoids a second time, producing a wrong loss (off by up to ~7x on confident predictions). Branch on dataset_type and use plain BCE on the probabilities for classification; regression is unchanged. Affects only the reported/aggregated classification eval+val loss value, not predictions or metric-based (AUC) selection. Loss-based checkpoint selection and HPO for classification now see the correct loss. Verified: corrected loss matches BCEWithLogits on the raw logits to within 1e-7. Signed-off-by: Eva Xue <evax@nvidia.com>
In evaluate_predictions, a task with zero valid targets hit `continue`
without appending to `results`, so `results` came out shorter than
num_tasks and every later task's metric shifted down one index. Append
float('nan') before continue, matching the all-0s/all-1s branch just above.
Only affects multi-task datasets where at least one task has no labeled
rows in an eval split; single-task and fully-labeled multi-task runs are
unaffected. Per-task metrics for such runs were mislabeled/misaligned before.
Signed-off-by: Eva Xue <evax@nvidia.com>
The training loop assigned the wrapped loader back to `train_data`, the same name holding the underlying dataset. On the first ensemble model this was fine, but the second iteration then did DataLoader(DataLoader(...)), which crashes with "'DataLoader' object is not subscriptable". Use a separate `train_loader` local and keep `train_data` as the dataset. Fixes --ensemble_size > 1 for finetuning, which was previously broken (any prior multi-model ensemble finetune could not have completed). Single-model (default ensemble_size=1) runs are unaffected. Verified with a 2-iteration DataLoader repro: old crashes on iter 1, fixed version runs both. Signed-off-by: Eva Xue <evax@nvidia.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Five correctness fixes found during a focused review of the training, finetuning, and inference pipelines. Every fix was verified in-container (torch 2.9.1) before committing. No refactors, no behavior changes beyond the bug being corrected.
Fixes (one per commit)
fix(predict): filter invalid SMILES before featurization
task/predict.pybuiltvalid_indicesaslist(range(len(test_data))), i.e. it never filtered — the None-insertion/realignment code below it was dead. Any unparseable or empty SMILES (or a zero-heavy-atom molecule) in a prediction CSV would crash featurization instead of being skipped. Now validity is checked with RDKit (MolFromSmiles, ≥1 heavy atom) exactly as the training path does, so invalid rows are dropped and predictions align to the surviving molecules.fix(finetune): zero MTLLoss.log_sigma grad each step (--use_mtl_loss)
MTLLoss.log_sigmais a standalone nn.Parameter appended directly to the optimizer (param_groups[1]); it is not a submodule ofmodel, so the per-batchmodel.zero_grad()never cleared its gradient. The gradient accumulated across every batch, corrupting the learned task-uncertainty weights. Addedmtl_loss.zero_grad()alongsidemodel.zero_grad(). Verified with the real MTLLoss: log_sigma.grad went 1→2→3 across steps before the fix, constant per-step after.fix(eval): use BCE on probabilities for classification eval loss In eval the model applies sigmoid to classification outputs, so the eval predictions are already probabilities. The eval-loss path then passed them to BCEWithLogitsLoss, which sigmoids a second time — a double sigmoid that produced a wrong loss (off by up to ~7x on confident predictions). Now the classification branch uses plain BCE on the probabilities; regression is untouched. Verified the corrected loss matches BCEWithLogits on the raw logits to within 1e-7.
fix(eval): keep per-task alignment when a task has no valid targets In
evaluate_predictions, a task with zero valid targets hitcontinuewithout appending toresults, soresultscame out shorter than num_tasks and every later task's metric silently shifted down one index. Now appends float('nan') before continue, matching the all-0s/all-1s branch directly above it.fix(finetune): don't double-wrap DataLoader for ensemble_size > 1 The training loop assigned the wrapped loader back to
train_data, the same name holding the underlying dataset. The first ensemble model was fine, but the second iteration then did DataLoader(DataLoader(...)), which crashes with "'DataLoader' object is not subscriptable". Use a separatetrain_loaderlocal and keeptrain_dataas the dataset. Verified with a 2-iteration repro (old crashes on iter 1, fixed runs both).Behavior impact (what changes)