🧳 Pandas HDF5 Format Support – Store and Query Large Datasets Efficiently

🧲 Introduction – Why Use HDF5 with Pandas?

HDF5 (Hierarchical Data Format version 5) is a binary file format designed to store and organize large numerical datasets. It supports fast reads and writes, optional compression, and multiple DataFrames in one file. Pandas reads and writes HDF5 files through the PyTables library, which is installed separately.

🎯 In this guide, you’ll learn:

How to use to_hdf() and read_hdf() in Pandas
How to store multiple DataFrames in a single HDF5 file
How to run fast queries with where selectors
How to tune HDF5 usage with compression and format options

📥 1. Save a DataFrame to HDF5 Using `to_hdf()`

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})

df.to_hdf('data.h5', key='students', mode='w')

✅ key acts like a name for the DataFrame inside the HDF5 file.

📤 2. Load HDF5 Data Using `read_hdf()`

df2 = pd.read_hdf('data.h5', key='students')
print(df2)

👉 Output:

     Name  Score
0   Alice     85
1     Bob     90
2  Charlie     78

✅ Use the same key that was used when saving. The key can be omitted only when the file contains a single pandas object.

🧠 3. File Modes

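Mode	Description
`'w'`	Write: create a new file, deleting any existing file with the same name
`'a'`	Append (default for `to_hdf()`): open an existing file for reading and writing, or create it if it is missing
`'r'`	Read-only (for `read_hdf()` and `HDFStore`; not valid for `to_hdf()`)
`'r+'`	Like `'a'`, but the file must already exist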
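
To see the difference between `'w'` and `'a'` in practice, here is a minimal sketch. The file name `modes_demo.h5`, the key names, and the toy DataFrames are made up for illustration, and the file is written to a temporary directory.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

first = pd.DataFrame({"x": [1, 2]})
second = pd.DataFrame({"y": [3.0, 4.0]})

# mode='w' starts from an empty file.
first.to_hdf(path, key="first", mode="w")

# mode='a' keeps the existing key and adds the new one.
second.to_hdf(path, key="second", mode="a")
with pd.HDFStore(path, mode="r") as store:
    keys_after_append = sorted(store.keys())

# mode='w' again discards everything written so far.
second.to_hdf(path, key="second", mode="w")
with pd.HDFStore(path, mode="r") as store:
    keys_after_overwrite = sorted(store.keys())

print(keys_after_append)     # both keys are present
print(keys_after_overwrite)  # only the key from the last write remains
```

Because `'w'` silently removes every key already in the file, `'a'` is usually the safer choice when a file holds more than one dataset.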

📦 4. Save Multiple DataFrames in One HDF5 File

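# df1 and df2 are any two existing DataFrames
df1.to_hdf('store.h5', key='students')          # default mode='a' creates the file if needed
df2.to_hdf('store.h5', key='grades', mode='a')  # adds a second key to the same file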

✅ Use different keys to store multiple tables in one file.
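
A self-contained version of this pattern might look like the sketch below. The `students` and `grades` DataFrames and the file name `store_demo.h5` are illustrative assumptions, not data from a real project.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "store_demo.h5")

students = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
grades = pd.DataFrame({"Name": ["Alice", "Bob"], "Grade": ["B", "A"]})

# Each DataFrame gets its own key inside the same file.
students.to_hdf(path, key="students", mode="w")
grades.to_hdf(path, key="grades", mode="a")

# Each key is read back independently of the other.
restored_students = pd.read_hdf(path, key="students")
restored_grades = pd.read_hdf(path, key="grades")

print(restored_students)
print(restored_grades)
```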

🔍 5. Queryable (Table) Format for Efficient Access

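df.to_hdf('data.h5', key='students', format='table', data_columns=['Score'], mode='w')

✅ format='table' is required for row-wise querying with where. data_columns lists the columns that can be filtered on; without it, only the index is queryable.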

🎯 6. Query HDF5 Data with Conditions

df_filtered = pd.read_hdf('data.h5', key='students', where='Score > 80')
print(df_filtered)

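✅ The filter is applied while reading, so only matching rows are loaded into memory. This works only with format='table', and the filtered column must have been saved as a data column (data_columns=['Score'] or data_columns=True).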
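
The full write-then-query round trip can be sketched as follows. It assumes the same toy student data used earlier and a temporary file named `query_demo.h5` (a made-up name). The second query shows that index conditions and the `columns` argument of `read_hdf()` can be combined.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78],
})

# Table format plus data_columns makes 'Score' usable in a where clause.
df.to_hdf(path, key="students", format="table",
          data_columns=["Score"], mode="w")

# Filter on a data column.
high_scores = pd.read_hdf(path, key="students", where="Score > 80")

# Filter on the index and load a single column.
names_only = pd.read_hdf(path, key="students",
                         where="index >= 1", columns=["Name"])

print(high_scores)
print(names_only)
```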

🗜️ 7. Enable Compression for Smaller File Sizes

df.to_hdf('data_compressed.h5', key='students', complevel=9, complib='zlib')

✅ Supported complib values include 'zlib', 'bzip2', 'lzo', and 'blosc'. complevel ranges from 0 (no compression) to 9 (strongest compression).
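
One way to check what compression buys you is to write the same data twice and compare file sizes. The sketch below uses deliberately repetitive, made-up data (`sensor` and `flag` are hypothetical column names), so the savings will look better than on typical real-world data.

```python
import os
import tempfile

import numpy as np
import pandas as pd

workdir = tempfile.mkdtemp()
plain_path = os.path.join(workdir, "plain.h5")            # hypothetical names
compressed_path = os.path.join(workdir, "compressed.h5")

# Highly repetitive values, chosen so the effect of compression is obvious.
df = pd.DataFrame({
    "sensor": np.zeros(200_000),
    "flag": np.ones(200_000, dtype="int64"),
})

df.to_hdf(plain_path, key="readings", format="table", mode="w")
df.to_hdf(compressed_path, key="readings", format="table", mode="w",
          complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
compressed_size = os.path.getsize(compressed_path)
print(f"uncompressed: {plain_size} bytes, zlib level 9: {compressed_size} bytes")

# Compression is transparent when reading: no extra arguments are needed.
restored = pd.read_hdf(compressed_path, key="readings")
```

Higher compression levels trade write speed for smaller files, so it is worth measuring both on your own data before settling on a setting.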

⚠️ 8. Install Required Library

pip install tables

✅ Pandas HDF5 support depends on the PyTables backend, which is published on PyPI under the name tables.
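
If you are unsure whether PyTables is available in the current environment, a quick check like this sketch can help before calling to_hdf() or read_hdf(). The variable name `has_pytables` is just an illustrative choice.

```python
import importlib.util

# PyTables is imported under the module name 'tables'.
has_pytables = importlib.util.find_spec("tables") is not None

if has_pytables:
    import tables
    print("PyTables version:", tables.__version__)
else:
    print("PyTables is missing; install it with: pip install tables")
```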

📌 Summary – Recap & Next Steps

HDF5 is a solid choice for large tabular datasets that need to be read back quickly. With Pandas, you get fast reads and writes, compression, multiple tables per file, and where-based row filtering, which suits scientific computing and machine learning pipelines.

🔍 Key Takeaways:

Use to_hdf() / read_hdf() for high-performance disk storage
Store multiple tables with unique keys
Enable format='table' (with data_columns for column filters) to run queries using where
Compress data with complevel and complib

⚙️ Real-world relevance: Used in finance, sensor data, genomics, simulations, and batch ML pipelines.

❓ FAQs – HDF5 with Pandas

❓ What is the default format in to_hdf()?
✅ 'fixed' – faster to write but not queryable. Use 'table' for filtering.
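
The difference shows up as soon as a where clause is used. In the sketch below, the key names `fixed_copy` and `table_copy` and the file name are hypothetical. The `TypeError` caught for the fixed-format read is the behavior assumed here, so check it against your installed pandas version.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "format_demo.h5")

df = pd.DataFrame({"Score": [85, 90, 78]})

df.to_hdf(path, key="fixed_copy", mode="w")                  # default 'fixed'
df.to_hdf(path, key="table_copy", format="table", mode="a")  # queryable

# A fixed-format dataset must be read in its entirety.
try:
    pd.read_hdf(path, key="fixed_copy", where="index > 0")
    fixed_query_failed = False
except TypeError as exc:  # assumed exception type for this case
    fixed_query_failed = True
    print("fixed format rejected the query:", exc)

# The table-format copy accepts the same index condition.
table_rows = pd.read_hdf(path, key="table_copy", where="index > 0")
print(table_rows)
```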

❓ Can I overwrite a key in an HDF5 file?
✅ Yes. Use:

df.to_hdf('file.h5', key='mydata', mode='a')

Writing to an existing key replaces its contents, as long as append=True is not passed. Other keys in the file are left untouched.
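
Here is a minimal sketch contrasting a replace with an append on the same key. The DataFrames and file name are made up, and numeric-only data is used to keep the example simple. Appending rows requires format='table'.

```python
import os
import tempfile

import pandas as pd

# Hypothetical file inside a throwaway directory.
path = os.path.join(tempfile.mkdtemp(), "overwrite_demo.h5")

original = pd.DataFrame({"Score": [85]})
updated = pd.DataFrame({"Score": [90]})

original.to_hdf(path, key="mydata", format="table", mode="w")

# Same key, no append flag: the stored data is replaced.
updated.to_hdf(path, key="mydata", format="table", mode="a")
after_replace = pd.read_hdf(path, key="mydata")

# Same key with append=True: rows are added to the existing table.
original.to_hdf(path, key="mydata", format="table", mode="a", append=True)
after_append = pd.read_hdf(path, key="mydata")

print(after_replace)
print(after_append)
```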

❓ What makes HDF5 better than CSV or Excel?
✅ It’s faster, binary, compressible, and supports multiple datasets and queries.

❓ How to list all keys in an HDF5 file?

# mode='r' opens the file read-only, so a missing file is not created by accident
with pd.HDFStore('data.h5', mode='r') as store:
    print(store.keys())

❓ Can I use HDF5 for non-numeric data?
✅ Yes, but it’s best for numerical, structured tabular data.
