🧳 Pandas HDF5 Format Support – Store and Query Large Datasets Efficiently
🧲 Introduction – Why Use HDF5 with Pandas?
HDF5 (Hierarchical Data Format v5) is a binary file format designed to store and organize large numerical datasets. It’s ideal for high-performance read/write operations, supports compression, and can store multiple DataFrames in one file. Pandas provides built-in functions to read/write HDF5 files via the PyTables library.
🎯 In this guide, you’ll learn:
- How to use to_hdf() and read_hdf() in Pandas
- Work with multiple DataFrames in a single HDF5 file
- Perform fast queries using data selectors
- Optimize HDF5 usage with compression and format options
📥 1. Save a DataFrame to HDF5 Using to_hdf()
import pandas as pd
df = pd.DataFrame({
'Name': ['Alice', 'Bob', 'Charlie'],
'Score': [85, 90, 78]
})
df.to_hdf('data.h5', key='students', mode='w')
✅ key acts like a name for the DataFrame inside the HDF5 file.
📤 2. Load HDF5 Data Using read_hdf()
df2 = pd.read_hdf('data.h5', key='students')
print(df2)
👉 Output:
      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     78
✅ Must use the same key that was used when saving.
🧠 3. File Modes
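Mode | Description |
---|---|
'w' | Write: create a new file, overwriting any existing file at that path |
'a' | Append (default for to_hdf()): open for reading and writing, creating the file if it is missing |
'r' | Read-only (default for read_hdf()) |
'r+' | Read/write without overwriting; the file must already exist |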
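To make the difference between 'w' and 'a' concrete, here is a minimal sketch. The frames, column names, and file name are hypothetical and chosen only for illustration; the file is written to a temporary directory so nothing on disk is touched.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample frames, used only for illustration.
students = pd.DataFrame({"StudentId": [1, 2], "Score": [85, 90]})
grades = pd.DataFrame({"StudentId": [1, 2], "GradePoints": [3.0, 4.0]})

path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

# mode='w' creates the file, discarding anything already stored at that path.
students.to_hdf(path, key="students", mode="w")

# mode='a' keeps the existing keys and adds the new one.
grades.to_hdf(path, key="grades", mode="a")
with pd.HDFStore(path, mode="r") as store:
    keys_after_append = sorted(store.keys())

# Writing with mode='w' again wipes the file, so only the new key remains.
grades.to_hdf(path, key="grades", mode="w")
with pd.HDFStore(path, mode="r") as store:
    keys_after_overwrite = sorted(store.keys())

print(keys_after_append)
print(keys_after_overwrite)
```

Keys are reported with a leading slash because HDF5 organizes them as paths inside the file.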
📦 4. Save Multiple DataFrames in One HDF5 File
# df1 and df2 stand for any two DataFrames you want to keep in the same file
df1.to_hdf('store.h5', key='students')
df2.to_hdf('store.h5', key='grades', mode='a') # append mode keeps 'students'
✅ Use different keys to store multiple tables in one file.
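The same result can be reached through pd.HDFStore, which keeps the file open while several frames are written. The sketch below is one possible way to do it; the frame contents and the temporary file path are assumptions made for the example.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frames standing in for df1 and df2.
students = pd.DataFrame({"StudentId": [1, 2, 3], "Score": [85, 90, 78]})
grades = pd.DataFrame({"StudentId": [1, 2, 3], "GradePoints": [3.0, 4.0, 2.0]})

path = os.path.join(tempfile.mkdtemp(), "store.h5")

# Open the file once and write each frame under its own key.
with pd.HDFStore(path, mode="w") as store:
    store.put("students", students)
    store.put("grades", grades)
    stored_keys = sorted(store.keys())

# Each frame can be read back independently by key.
students_back = pd.read_hdf(path, key="students")
grades_back = pd.read_hdf(path, key="grades")

print(stored_keys)
print(students_back)
print(grades_back)
```

Opening the store once avoids reopening the file for every write, which can matter when many frames go into the same file.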
🔍 5. Queryable (Table) Format for Efficient Access
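df.to_hdf('data.h5', key='students', format='table', data_columns=['Score'], mode='w')
✅ Required for row-wise querying and filtering using selectors. Only the index and the columns listed in data_columns can be used in a where condition; pass data_columns=True to make every column queryable.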
🎯 6. Query HDF5 Data with Conditions
df_filtered = pd.read_hdf('data.h5', key='students', where='Score > 80')
print(df_filtered)
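✅ Fast filtering, even with millions of rows, because the condition is evaluated on disk and only matching rows are loaded. This only works with format='table', and the filtered column must have been saved as a data column (data_columns=['Score'] or data_columns=True).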
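A minimal end-to-end sketch of saving and querying follows. The sample data and file path are hypothetical; the point is that Score is declared in data_columns when writing, which is what allows it to appear in where.

```python
import os
import tempfile

import pandas as pd

# Hypothetical sample data, used only for illustration.
scores = pd.DataFrame({
    "StudentId": [1, 2, 3, 4, 5, 6],
    "Score": [85, 90, 78, 62, 95, 80],
})

path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

# format='table' makes the data queryable; data_columns lists the
# columns that may be referenced in a where condition.
scores.to_hdf(path, key="students", format="table",
              data_columns=["Score"], mode="w")

# Single condition on a data column.
high = pd.read_hdf(path, key="students", where="Score > 80")

# Conditions can be combined with & and |.
middle = pd.read_hdf(path, key="students",
                     where="(Score >= 78) & (Score < 90)")

# The index is always queryable in table format.
first_rows = pd.read_hdf(path, key="students", where="index < 3")

print(high)
print(middle)
print(first_rows)
```

Declaring only the columns you actually filter on keeps the file smaller than data_columns=True, since each data column is stored and indexed separately.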
🗜️ 7. Enable Compression for Smaller File Sizes
df.to_hdf('data_compressed.h5', key='students', complevel=9, complib='zlib')
✅ Supported complib: 'zlib', 'bzip2', 'lzo', 'blosc'.
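To see what compression buys you, compare file sizes with and without it. The sketch below uses a deliberately repetitive, made-up dataset, so the saving is larger than you should expect on real data; the names and paths are assumptions for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical, highly repetitive data (compresses unusually well).
readings = pd.DataFrame({
    "SensorId": np.repeat(np.arange(100), 1000),
    "Value": np.zeros(100_000),
})

workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.h5")
packed_path = os.path.join(workdir, "packed.h5")

# Same data, written once without and once with zlib compression.
readings.to_hdf(plain_path, key="readings", format="table", mode="w")
readings.to_hdf(packed_path, key="readings", format="table", mode="w",
                complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(f"uncompressed: {plain_size} bytes")
print(f"compressed:   {packed_size} bytes")

# Compression is transparent on read: no extra arguments are needed.
roundtrip = pd.read_hdf(packed_path, key="readings")
print(len(roundtrip))
```

Higher complevel values trade write speed for size, so it is worth measuring both on your own data before settling on 9.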
⚠️ 8. Install Required Library
pip install tables
✅ Pandas HDF5 support depends on PyTables backend.
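If you are not sure whether PyTables is present in the current environment, a quick check like the one below can help. The helper name pytables_available is made up for this example; it relies only on the standard library.

```python
import importlib.util


def pytables_available() -> bool:
    """Return True if the 'tables' package (PyTables) can be found."""
    return importlib.util.find_spec("tables") is not None


if pytables_available():
    print("PyTables found: to_hdf() and read_hdf() should be usable.")
else:
    print("PyTables missing: install it with 'pip install tables'.")
```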
📌 Summary – Recap & Next Steps
HDF5 is a robust choice for working with big tabular data. With Pandas, you get fast reads/writes, compression, multiple table storage, and where-based row filtering on disk, which makes it a good fit for scientific computing and machine learning pipelines.
🔍 Key Takeaways:
- Use to_hdf() / read_hdf() for high-performance disk storage
- Store multiple tables with unique keys
- Enable format='table' for fast queries using where
- Compress data with complevel and complib
⚙️ Real-world relevance: Used in finance, sensor data, genomics, simulations, and batch ML pipelines.
❓ FAQs – HDF5 with Pandas
❓ What is the default format in to_hdf()?
✅ 'fixed' – faster to write but not queryable. Use 'table' for filtering.
❓ Can I overwrite a key in an HDF5 file?
✅ Yes. Use:
df.to_hdf('file.h5', key='mydata', mode='a')
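It will replace the key if it exists, as long as append=True is not passed. With format='table' and append=True, the new rows are added to the existing table instead.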
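The difference between replacing and appending can be sketched as follows. The frames, keys, and file path are hypothetical and exist only to show the two behaviours side by side.

```python
import os
import tempfile

import pandas as pd

# Hypothetical frames, used only for illustration.
first = pd.DataFrame({"StudentId": [1, 2], "Score": [85, 90]})
second = pd.DataFrame({"StudentId": [3], "Score": [78]})

path = os.path.join(tempfile.mkdtemp(), "file.h5")

# Writing twice to the same key without append=True replaces the data.
first.to_hdf(path, key="mydata", mode="w")
second.to_hdf(path, key="mydata", mode="a")
replaced = pd.read_hdf(path, key="mydata")

# With format='table' and append=True, rows are added instead.
first.to_hdf(path, key="log", mode="a", format="table")
second.to_hdf(path, key="log", mode="a", format="table", append=True)
appended = pd.read_hdf(path, key="log")

print(len(replaced))
print(len(appended))
```

Appending keeps each frame's original index values, so the combined table may contain repeated index labels.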
❓ What makes HDF5 better than CSV or Excel?
✅ It’s faster, binary, compressible, and supports multiple datasets and queries.
❓ How to list all keys in an HDF5 file?
with pd.HDFStore('data.h5', mode='r') as store:
    print(store.keys())
❓ Can I use HDF5 for non-numeric data?
✅ Yes, but it’s best for numerical, structured tabular data.