Pandas HDF5 Format Support – Store and Query Large Datasets Efficiently
Introduction – Why Use HDF5 with Pandas?
HDF5 (Hierarchical Data Format v5) is a binary file format designed to store and organize large numerical datasets. It’s ideal for high-performance read/write operations, supports compression, and can store multiple DataFrames in one file. Pandas provides built-in functions to read/write HDF5 files via the PyTables library.
In this guide, you’ll learn:
- How to use to_hdf() and read_hdf() in Pandas
- Work with multiple DataFrames in a single HDF5 file
- Perform fast queries using data selectors
- Optimize HDF5 usage with compression and format options
1. Save a DataFrame to HDF5 Using to_hdf()
import pandas as pd
df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})
df.to_hdf('data.h5', key='students', mode='w')
key acts like a name for the DataFrame inside the HDF5 file.
2. Load HDF5 Data Using read_hdf()
df2 = pd.read_hdf('data.h5', key='students')
print(df2)
Output:
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     78
Pass the same key that was used when saving. The key can be omitted only when the file holds a single pandas object.
3. File Modes
| Mode | Description |
|---|---|
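| 'w' | Write: create a new file, overwriting any existing file with the same name |
| 'a' | Append: open the file for reading and writing, creating it if needed (default for to_hdf()) |
| 'r' | Read-only (default for read_hdf()) |
| 'r+' | Read/write: like 'a', but the file must already exist |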
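The difference between 'w' and 'a' is easiest to see by listing the keys after each write. The following is a minimal sketch, assuming PyTables is installed; the file name modes_demo.h5 and the DataFrame names are hypothetical and chosen only for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file name inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

first = pd.DataFrame({"x": [1, 2, 3]})
second = pd.DataFrame({"y": [4.0, 5.0]})

# mode='w' starts a fresh file, discarding anything already there.
first.to_hdf(path, key="first", mode="w")

# mode='a' keeps the existing contents and adds a second key.
second.to_hdf(path, key="second", mode="a")

with pd.HDFStore(path, mode="r") as store:
    keys_after_append = sorted(store.keys())

# mode='w' again: the file is recreated, so only the new key survives.
second.to_hdf(path, key="second", mode="w")

with pd.HDFStore(path, mode="r") as store:
    keys_after_overwrite = sorted(store.keys())

print(keys_after_append)
print(keys_after_overwrite)
```

Opening the store with mode='r' for the inspection step avoids any chance of modifying the file while checking its contents.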
4. Save Multiple DataFrames in One HDF5 File
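# df1 and df2 are two existing DataFrames
df1.to_hdf('store.h5', key='students')  # default mode='a' creates the file if needed
df2.to_hdf('store.h5', key='grades', mode='a')  # append mode keeps 'students' in place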
Use different keys to store multiple tables in one file.
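A self-contained version of the idea above might look like the sketch below. The students and grades DataFrames, their contents, and the temporary file path are assumptions made for illustration; only to_hdf() and read_hdf() come from the text.

```python
import os
import tempfile

import pandas as pd

# Hypothetical path in a temporary directory so the example leaves no files behind
# in the working directory.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

# Illustrative data, not taken from a real dataset.
students = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 78]})
grades = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Grade": ["B", "A", "C"]})

# The first write creates the file; the second adds another key to it.
students.to_hdf(path, key="students", mode="w")
grades.to_hdf(path, key="grades", mode="a")

# Each DataFrame is read back independently by its key.
students_back = pd.read_hdf(path, key="students")
grades_back = pd.read_hdf(path, key="grades")

print(students_back)
print(grades_back)
```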
5. Queryable (Table) Format for Efficient Access
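df.to_hdf('data.h5', key='students', format='table', data_columns=['Score'], mode='w')
The table format is required for row-wise querying and filtering using selectors. Only the index and the columns listed in data_columns (here, Score) can be used in a where condition.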
6. Query HDF5 Data with Conditions
df_filtered = pd.read_hdf('data.h5', key='students', where='Score > 80')
print(df_filtered)
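Filtering is applied while reading, so only matching rows are loaded into memory, even with millions of rows. It works only with format='table', and Score must have been stored as a data column (data_columns=['Score'] in to_hdf()).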
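Putting the write and the query together, a minimal end-to-end sketch could look like this. It assumes PyTables is installed and uses a hypothetical temporary file; the data_columns argument is what makes Score usable in a where condition, while the index is queryable without it.

```python
import os
import tempfile

import pandas as pd

# Hypothetical path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "data.h5")

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78],
})

# format='table' makes the data queryable; data_columns marks Score
# as a column that may appear in a where condition.
df.to_hdf(path, key="students", mode="w", format="table", data_columns=["Score"])

# Filter on a data column.
high_scores = pd.read_hdf(path, key="students", where="Score > 80")

# Filter on the index, which is always queryable in table format.
first_two = pd.read_hdf(path, key="students", where="index < 2")

print(high_scores)
print(first_two)
```

Declaring every column with data_columns=True is also possible, but it tends to cost extra write time and file size, so listing only the columns that will be filtered on is usually the more economical choice.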
7. Enable Compression for Smaller File Sizes
df.to_hdf('data_compressed.h5', key='students', complevel=9, complib='zlib')
complevel ranges from 0 (no compression) to 9 (maximum compression). Supported complib values: 'zlib' (the default), 'lzo', 'bzip2', 'blosc'.
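To see what compression buys, the same DataFrame can be written twice and the file sizes compared. This is a rough sketch under stated assumptions: the data is synthetic and deliberately repetitive, so the savings shown here will be larger than on typical real data, and the file and column names are hypothetical.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "data_plain.h5")      # hypothetical file names
packed_path = os.path.join(tmp_dir, "data_compressed.h5")

# Synthetic, highly repetitive data: 100,000 rows.
df = pd.DataFrame({
    "sensor_id": np.repeat(np.arange(100), 1000),
    "reading": np.zeros(100_000),
})

# Same data, once without and once with zlib compression.
df.to_hdf(plain_path, key="readings", mode="w")
df.to_hdf(packed_path, key="readings", mode="w", complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)

# Compression is transparent on read: no extra arguments are needed.
restored = pd.read_hdf(packed_path, key="readings")

print(f"uncompressed: {plain_size} bytes, compressed: {packed_size} bytes")
```

Higher complevel values trade write speed for file size, so it is worth measuring on a sample of the actual data before settling on a level.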
8. Install Required Library
pip install tables
Pandas HDF5 support depends on the PyTables backend, which is published on PyPI under the name tables.
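Before calling to_hdf() or read_hdf(), it can be useful to confirm that the backend is importable. The check below is a small sketch using only the standard library; the variable name has_pytables is hypothetical.

```python
import importlib.util

# PyTables is imported under the module name "tables".
has_pytables = importlib.util.find_spec("tables") is not None

if has_pytables:
    import tables
    print("PyTables version:", tables.__version__)
else:
    print("PyTables is missing; install it with: pip install tables")
```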
Summary – Recap & Next Steps
HDF5 is a robust choice for working with big tabular data. With Pandas, you get fast reads/writes, compression, multiple table storage, and SQL-like querying through where conditions, which makes it a good fit for scientific computing and machine learning pipelines.
Key Takeaways:
- Use to_hdf() / read_hdf() for high-performance disk storage
- Store multiple tables with unique keys
- Enable format='table' for fast queries using where
- Compress data with complevel and complib
Real-world relevance: Used in finance, sensor data, genomics, simulations, and batch ML pipelines.
FAQs – HDF5 with Pandas
What is the default format in to_hdf()?
'fixed' – faster to write but not queryable. Use 'table' for filtering.
Can I overwrite a key in an HDF5 file?
Yes. Use:
df.to_hdf('file.h5', key='mydata', mode='a')
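With the default append=False, the data stored under the key is replaced if the key already exists. Other keys in the file are left untouched.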
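The difference between replacing a key and adding rows to it can be sketched as follows. This assumes PyTables is installed; the file path and DataFrame names are hypothetical. Adding rows with append=True requires format='table'.

```python
import os
import tempfile

import pandas as pd

# Hypothetical path in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "file.h5")

old = pd.DataFrame({"v": [1, 2]})
new = pd.DataFrame({"v": [10, 20, 30]})
extra = pd.DataFrame({"v": [40, 50]})

old.to_hdf(path, key="mydata", mode="w", format="table")

# Same key, default append=False: the stored data is replaced.
new.to_hdf(path, key="mydata", mode="a", format="table")
replaced = pd.read_hdf(path, key="mydata")

# Same key, append=True: the rows are added to the existing table.
extra.to_hdf(path, key="mydata", mode="a", format="table", append=True)
appended = pd.read_hdf(path, key="mydata")

print(list(replaced["v"]))
print(list(appended["v"]))
```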
What makes HDF5 better than CSV or Excel?
It’s faster, binary, compressible, and supports multiple datasets and queries.
How to list all keys in an HDF5 file?
with pd.HDFStore('data.h5') as store:
    print(store.keys())
Can I use HDF5 for non-numeric data?
Yes, but it’s best for numerical, structured tabular data.
