Fix --limit subject filter dropped for chunked tables in sqlite import.py #1993

Open
Chessing234 wants to merge 1 commit into MIT-LCP:main from Chessing234:fix/sqlite-import-chunk-subjects-filter
Conversation

@Chessing234

Bug

mimic-iv/buildmimic/sqlite/import.py's --limit N option is silently ignored for any table large enough to trigger chunked reading (chartevents, labevents, emar, …). The resulting SQLite database contains every row of those tables, defeating the purpose of --limit.

https://github.com/MIT-LCP/mimic-code/blob/5706978/mimic-iv/buildmimic/sqlite/import.py#L160-L170

if os.path.getsize(f) < THRESHOLD_SIZE:
    df = pd.read_csv(f, dtype=mimic_dtypes)
    df = process_dataframe(df, subjects=subjects)
    df.to_sql(tablename, connection, index=False)
    row_counts[tablename] += len(df)
else:
    # If the file is too large, let's do the work in chunks
    for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False, dtype=mimic_dtypes):
        chunk = process_dataframe(chunk)
        chunk.to_sql(tablename, connection, if_exists="append", index=False)

The small-file branch passes subjects=subjects to process_dataframe, but the chunked branch calls process_dataframe(chunk) with no subjects argument. process_dataframe's subjects parameter defaults to None, and its df.loc[df['subject_id'].isin(subjects)] filter is skipped entirely when subjects is None, so every chunk is written to the database unfiltered.
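The default-argument behavior can be seen in a minimal sketch. This is a simplified stand-in for process_dataframe, assuming (as described above) that the real function only applies the isin filter when subjects is non-None; the DataFrame contents are illustrative only:

```python
import pandas as pd

# Simplified stand-in for process_dataframe in import.py; the real
# function also handles dtype cleanup. The guard on subjects mirrors
# the behavior described above: None means "no filtering".
def process_dataframe(df, subjects=None):
    if subjects is not None:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

df = pd.DataFrame({"subject_id": [1, 2, 3], "value": [10, 20, 30]})

# Called the way the chunked branch calls it today: nothing is filtered.
assert len(process_dataframe(df)) == 3

# Called the way the small-file branch calls it: rows are trimmed.
assert len(process_dataframe(df, subjects=[1, 2])) == 2
```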

Root cause

Missing subjects=subjects argument in the chunked call, so chunked tables skip the subject_id filter entirely.

Fix

Pass subjects=subjects in the chunked branch, matching the non-chunked branch so the --limit N filter is applied consistently.
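A self-contained sketch of the corrected chunked branch is below. The only change relative to the quoted code is passing subjects=subjects through to process_dataframe; the CSV data, in-memory database, table name, and chunk size here are illustrative stand-ins for the real f, connection, tablename, and CHUNKSIZE:

```python
import io
import sqlite3

import pandas as pd

CHUNKSIZE = 2  # illustrative; import.py uses its own constant

# Simplified stand-in for process_dataframe (see above).
def process_dataframe(df, subjects=None):
    if subjects is not None:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

# Stand-ins for the real CSV file and subject list derived from --limit N.
csv_data = io.StringIO("subject_id,value\n1,10\n2,20\n3,30\n4,40\n")
subjects = [1, 3]
connection = sqlite3.connect(":memory:")

for chunk in pd.read_csv(csv_data, chunksize=CHUNKSIZE):
    # The fix: pass subjects through, matching the small-file branch.
    chunk = process_dataframe(chunk, subjects=subjects)
    chunk.to_sql("events", connection, if_exists="append", index=False)

# Only rows for the limited subjects reach the database.
rows = connection.execute(
    "SELECT subject_id FROM events ORDER BY subject_id"
).fetchall()
assert rows == [(1,), (3,)]
```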

main() passes subjects=subjects to process_dataframe for small tables
(read_csv in one shot), but the chunked branch for large tables calls
process_dataframe(chunk) with no subjects argument. The default
subjects=None disables the 'df.loc[df["subject_id"].isin(subjects)]'
filter inside process_dataframe, so --limit N correctly trims small
tables (admissions, patients, etc.) but silently imports every row of
any table large enough to trigger the chunked path (chartevents,
labevents, emar, etc.).

Pass subjects=subjects to process_dataframe in the chunked branch so the
subject filter is applied consistently.