Links & Self-Guided Review

Polars User Guide – official docs with eager + lazy API examples
Apache Arrow Columnar Format – why columnar memory layouts matter
Parquet Fundamentals – format internals and predicate pushdown
Real Python: Working With Large CSVs – diagnosing MemoryError
VS Code: Python performance tips – environment setup + profiling
ASCII Clouds – not all visualizations have to mean something

Why Memory Limits Sneak Up On Us

Dataset vs laptop memory

Chart shows estimated in-memory size; raw on-disk sizes are in the table below.

Health datasets outgrow laptop RAM quickly: a handful of CSVs with vitals, labs, and encounters can exceed 16 GB once loaded. Attempting to “just read the file” leads to thrashing, heavy swap usage, and eventually a Python MemoryError that interrupts the workflow.
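To see why loaded data is larger than the file on disk, here is a minimal sketch using only the standard library. It is not a pandas measurement: it parses a small synthetic CSV (the column names are made up for illustration) into Python objects and compares the traced memory against the raw text size. The exact ratio will differ for pandas and depends on dtypes, but the per-value overhead is the same effect.

```python
import csv
import io
import tracemalloc

# Build a small synthetic vitals CSV in memory (hypothetical columns).
rows = 20_000
raw_text = "patient_id,heart_rate,unit\n" + "".join(
    f"{i},{60 + i % 40},bpm\n" for i in range(rows)
)
raw_bytes = len(raw_text.encode("utf-8"))

# Trace allocations while parsing every row into a dict of strings.
tracemalloc.start()
records = list(csv.DictReader(io.StringIO(raw_text)))
loaded_bytes, _peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

inflation = loaded_bytes / raw_bytes
print(
    f"raw: {raw_bytes / 1e6:.2f} MB, "
    f"loaded: {loaded_bytes / 1e6:.2f} MB, "
    f"inflation: {inflation:.1f}x"
)
```

Every parsed value becomes its own Python object with its own header, so a few bytes of text on disk turn into many more bytes in memory.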

Laptop specs vs dataset footprints

Dataset	Typical raw size	In-memory pandas size	Fits on 16 GB laptop?
Intake forms (CSV)	250 MB	~1.2 GB (due to dtype inflation)	✅
Longitudinal vitals (CSV)	6 GB	~14 GB	⚠️ borderline
EHR encounter log (CSV)	18 GB	~42 GB	❌
Imaging metadata (Parquet)	9 GB	~9 GB	⚠️ if other apps closed
Claims archive (partitioned Parquet)	120 GB	streamed	✅ (with streaming)
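The verdict column above can be reproduced with simple arithmetic. The sketch below is only an illustration: the function name `laptop_verdict` and the 50% "comfortable" threshold are assumptions chosen so the output matches the table, not a rule from pandas or any other library. The streamed claims archive is left out because it is never fully loaded.

```python
def laptop_verdict(
    in_memory_gb: float,
    ram_gb: float = 16.0,
    comfortable_fraction: float = 0.5,  # assumed threshold, not an official rule
) -> str:
    """Classify a dataset by its estimated in-memory size."""
    if in_memory_gb > ram_gb:
        return "does not fit"
    if in_memory_gb > ram_gb * comfortable_fraction:
        return "borderline"
    return "fits"


# Estimated in-memory sizes (GB) taken from the table above.
estimates_gb = {
    "Intake forms (CSV)": 1.2,
    "Longitudinal vitals (CSV)": 14,
    "EHR encounter log (CSV)": 42,
    "Imaging metadata (Parquet)": 9,
}

for name, size_gb in estimates_gb.items():
    print(f"{name}: {laptop_verdict(size_gb)}")
```

The point of the headroom threshold is that the operating system, the browser, and any intermediate copies made during processing also need RAM, so a dataset that nominally fits can still cause swapping.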

Warning signs you are hitting RAM limits

top or Activity Monitor shows Python ballooning toward total RAM
Fans spin, everything slows, disk swap spikes
The OS kills the kernel or terminal process; you see MemoryError or Killed: 9 messages
Notebook kernel restarts when running seemingly “simple” cells
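The first warning sign can also be checked from inside Python rather than from top or Activity Monitor. This is a minimal sketch that assumes a Unix-like system (the standard-library `resource` module is not available on Windows); the helper name `peak_rss_mb` is made up for illustration.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


before = peak_rss_mb()
# Stand-in for a cell that loads a large dataset
# (a list of 5 million references, roughly 40 MB on a 64-bit build).
data = [0] * 5_000_000
after = peak_rss_mb()

print(f"peak RSS before: {before:.0f} MB, after: {after:.0f} MB")
```

Printing this value after a suspicious cell shows how close the process has come to total RAM before the OS steps in.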