Fix --limit subject filter dropped for chunked tables in sqlite import.py #1993

Open
Chessing234 wants to merge 1 commit into MIT-LCP:main from Chessing234:fix/sqlite-import-chunk-subjects-filter
Conversation

@Chessing234

Bug

mimic-iv/buildmimic/sqlite/import.py's --limit N option is silently ignored for any table large enough to trigger chunked reading (chartevents, labevents, emar, …). The resulting SQLite database contains every row of those tables, defeating the purpose of --limit.

https://github.com/MIT-LCP/mimic-code/blob/5706978/mimic-iv/buildmimic/sqlite/import.py#L160-L170

if os.path.getsize(f) < THRESHOLD_SIZE:
    df = pd.read_csv(f, dtype=mimic_dtypes)
    df = process_dataframe(df, subjects=subjects)
    df.to_sql(tablename, connection, index=False)
    row_counts[tablename] += len(df)
else:
    # If the file is too large, let's do the work in chunks
    for chunk in pd.read_csv(f, chunksize=CHUNKSIZE, low_memory=False, dtype=mimic_dtypes):
        chunk = process_dataframe(chunk)
        chunk.to_sql(tablename, connection, if_exists="append", index=False)

The small-file branch passes subjects=subjects to process_dataframe, but the chunked branch calls process_dataframe(chunk) with no subjects argument. process_dataframe's subjects parameter defaults to None, and its df.loc[df['subject_id'].isin(subjects)] filter is skipped entirely when subjects is None, so every chunk is written to the database unfiltered.
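The default-argument behavior can be seen in a minimal sketch. This is a simplified stand-in for process_dataframe, assuming (as described above) that the real function only applies the isin filter when subjects is non-None; the DataFrame contents are illustrative only:

```python
import pandas as pd

# Simplified stand-in for process_dataframe in import.py; the real
# function also handles dtype cleanup. The guard on subjects mirrors
# the behavior described above: None means "no filtering".
def process_dataframe(df, subjects=None):
    if subjects is not None:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

df = pd.DataFrame({"subject_id": [1, 2, 3], "value": [10, 20, 30]})

# Called the way the chunked branch calls it today: nothing is filtered.
assert len(process_dataframe(df)) == 3

# Called the way the small-file branch calls it: rows are trimmed.
assert len(process_dataframe(df, subjects=[1, 2])) == 2
```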

Root cause

Missing subjects=subjects argument in the chunked call, so chunked tables skip the subject_id filter entirely.

Fix

Pass subjects=subjects in the chunked branch, matching the non-chunked branch so the --limit N filter is applied consistently.
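A self-contained sketch of the corrected chunked branch is below. The only change relative to the quoted code is passing subjects=subjects through to process_dataframe; the CSV data, in-memory database, table name, and chunk size here are illustrative stand-ins for the real f, connection, tablename, and CHUNKSIZE:

```python
import io
import sqlite3

import pandas as pd

CHUNKSIZE = 2  # illustrative; import.py uses its own constant

# Simplified stand-in for process_dataframe (see above).
def process_dataframe(df, subjects=None):
    if subjects is not None:
        df = df.loc[df["subject_id"].isin(subjects)]
    return df

# Stand-ins for the real CSV file and subject list derived from --limit N.
csv_data = io.StringIO("subject_id,value\n1,10\n2,20\n3,30\n4,40\n")
subjects = [1, 3]
connection = sqlite3.connect(":memory:")

for chunk in pd.read_csv(csv_data, chunksize=CHUNKSIZE):
    # The fix: pass subjects through, matching the small-file branch.
    chunk = process_dataframe(chunk, subjects=subjects)
    chunk.to_sql("events", connection, if_exists="append", index=False)

# Only rows for the limited subjects reach the database.
rows = connection.execute(
    "SELECT subject_id FROM events ORDER BY subject_id"
).fetchall()
assert rows == [(1,), (3,)]
```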

main() passes subjects=subjects to process_dataframe for small tables
(read_csv in one shot), but the chunked branch for large tables calls
process_dataframe(chunk) with no subjects argument. The default
subjects=None disables the 'df.loc[df["subject_id"].isin(subjects)]'
filter inside process_dataframe, so --limit N correctly trims small
tables (admissions, patients, etc.) but silently imports every row of
any table large enough to trigger the chunked path (chartevents,
labevents, emar, etc.).

Pass subjects=subjects to process_dataframe in the chunked branch so the
subject filter is applied consistently.