
🧳 Pandas HDF5 Format Support – Store and Query Large Datasets Efficiently


🧲 Introduction – Why Use HDF5 with Pandas?

HDF5 (Hierarchical Data Format v5) is a binary file format designed to store and organize large numerical datasets. It’s ideal for high-performance read/write operations, supports compression, and can store multiple DataFrames in one file. Pandas provides built-in functions to read/write HDF5 files via the PyTables library.

🎯 In this guide, you’ll learn:

  • How to use to_hdf() and read_hdf() in Pandas
  • How to work with multiple DataFrames in a single HDF5 file
  • How to run fast queries using data selectors
  • How to optimize HDF5 usage with compression and format options

📥 1. Save a DataFrame to HDF5 Using to_hdf()

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})

df.to_hdf('data.h5', key='students', mode='w')

key is the name under which the DataFrame is stored inside the HDF5 file. mode='w' creates the file and overwrites any existing file at that path.


📤 2. Load HDF5 Data Using read_hdf()

df2 = pd.read_hdf('data.h5', key='students')
print(df2)

👉 Output:

     Name  Score
0   Alice     85
1     Bob     90
2  Charlie     78

✅ Use the same key that was used when saving; an unknown key raises a KeyError.
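The save and load steps can be combined into one round trip. The following is a minimal sketch, assuming PyTables is installed; the helper name roundtrip, the temporary path, and the key no_such_key are made up for illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 78]})


def roundtrip(frame, key="students"):
    """Write `frame` to a temporary HDF5 file, then read it back by key."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.h5")  # throwaway path for this sketch
        frame.to_hdf(path, key=key, mode="w")
        restored = pd.read_hdf(path, key=key)
        try:
            pd.read_hdf(path, key="no_such_key")  # a key that was never written
            missing_key_raises = False
        except KeyError:
            missing_key_raises = True
    return restored, missing_key_raises


restored, missing_key_raises = roundtrip(df)
print(restored)
print("KeyError for unknown key:", missing_key_raises)
```

Using a temporary directory keeps the sketch from leaving files behind; in real code you would pass your own file path.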


🧠 3. File Modes

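Mode    Description
'w'     Write: create a new file, overwriting any existing file
'a'     Append: open for reading and writing, creating the file if needed (default for to_hdf())
'r'     Read-only (default for read_hdf())
'r+'    Read/write, like 'a', but the file must already exist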
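The practical difference between 'w' and 'a' shows up when the file already holds another key. Here is a small sketch, assuming PyTables is installed; the helper keys_after and the keys first and second are hypothetical names used only for this illustration.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})


def keys_after(mode):
    """Write key 'first', then write key 'second' using `mode`; return the keys left."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.h5")  # throwaway path for this sketch
        df.to_hdf(path, key="first", mode="w")
        df.to_hdf(path, key="second", mode=mode)
        with pd.HDFStore(path, mode="r") as store:
            return sorted(store.keys())


print(keys_after("w"))  # the second write recreated the file, so 'first' is gone
print(keys_after("a"))  # the second write kept the existing key
```

Note that HDFStore.keys() reports keys with a leading slash, for example '/first'.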

📦 4. Save Multiple DataFrames in One HDF5 File

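# df1 and df2 are any two DataFrames
df1.to_hdf('store.h5', key='students')          # default mode='a': creates the file if needed
df2.to_hdf('store.h5', key='grades', mode='a')  # append mode: adds a second key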

✅ Use different keys to store multiple tables in one file.
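An alternative to repeated to_hdf() calls is to open the file once with pd.HDFStore and put each DataFrame under its own key. This is a sketch under the assumption that PyTables is installed; the helper store_and_reload and the sample data are invented for illustration.

```python
import os
import tempfile

import pandas as pd

students = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
grades = pd.DataFrame({"Score": [85, 90], "Grade": ["B", "A"]})


def store_and_reload(frames):
    """Store each DataFrame under its own key, then load them all back."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.h5")  # throwaway path for this sketch
        with pd.HDFStore(path, mode="w") as store:
            for key, frame in frames.items():
                store.put(key, frame)
        with pd.HDFStore(path, mode="r") as store:
            return {key: store[key] for key in store.keys()}


loaded = store_and_reload({"students": students, "grades": grades})
print(sorted(loaded))
```

Opening the store once avoids reopening the file for every DataFrame, which may matter when you write many keys.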


🔍 5. Queryable (Table) Format for Efficient Access

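df.to_hdf('data.h5', key='students', format='table', data_columns=True, mode='w')

✅ format='table' is required for row-wise querying and filtering using selectors. data_columns=True makes every column queryable; without it, only the index can be used in a where condition.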


🎯 6. Query HDF5 Data with Conditions

df_filtered = pd.read_hdf('data.h5', key='students', where='Score > 80')
print(df_filtered)

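✅ The condition is applied while reading, so only matching rows are loaded, which keeps filtering fast even with millions of rows. This works only with format='table', and the column in the condition (here Score) must have been saved as a data column, for example with data_columns=['Score'] or data_columns=True.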
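Putting the table format and the query together, the sketch below writes the sample data with Score declared as a data column and then filters on it. It assumes PyTables is installed; the helper query_scores and the temporary path are hypothetical.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Score": [85, 90, 78]})


def query_scores(frame, threshold):
    """Save `frame` in table format and return the rows with Score above `threshold`."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.h5")  # throwaway path for this sketch
        frame.to_hdf(
            path,
            key="students",
            format="table",
            data_columns=["Score"],  # makes Score usable in `where`
            mode="w",
        )
        return pd.read_hdf(path, key="students", where=f"Score > {threshold}")


df_filtered = query_scores(df, 80)
print(df_filtered)
```

Listing only the columns you filter on in data_columns, rather than passing True, is a reasonable default when the table is wide, since each data column is stored and indexed separately.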


🗜️ 7. Enable Compression for Smaller File Sizes

df.to_hdf('data_compressed.h5', key='students', complevel=9, complib='zlib')

✅ Supported complib values: 'zlib', 'bzip2', 'lzo', 'blosc'. complevel ranges from 0 (no compression) to 9 (maximum compression).
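To see what compression buys you, compare file sizes for the same data with and without it. The sketch below assumes PyTables is installed and uses deliberately repetitive data (a column of zeros), so the savings are larger than you should expect from real data; the helper hdf_size is a made-up name.

```python
import os
import tempfile

import numpy as np
import pandas as pd

frame = pd.DataFrame({"value": np.zeros(200_000)})  # highly compressible on purpose


def hdf_size(data, **to_hdf_kwargs):
    """Write `data` to a temporary HDF5 file and return the file size in bytes."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "data.h5")  # throwaway path for this sketch
        data.to_hdf(path, key="data", mode="w", format="table", **to_hdf_kwargs)
        return os.path.getsize(path)


plain_size = hdf_size(frame)
compressed_size = hdf_size(frame, complevel=9, complib="zlib")
print(f"uncompressed: {plain_size} bytes, compressed: {compressed_size} bytes")
```

Higher complevel values generally trade write speed for smaller files, so it is worth measuring both on a sample of your own data.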


⚠️ 8. Install Required Library

pip install tables

✅ Pandas HDF5 support depends on the PyTables backend, which is distributed on PyPI under the package name tables.
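If you are not sure whether PyTables is present in the current environment, a quick check like the following can help before calling to_hdf() or read_hdf(). This is only a sketch; the function name pytables_available is hypothetical.

```python
import importlib.util


def pytables_available():
    """Return True if the 'tables' package (PyTables) can be imported."""
    return importlib.util.find_spec("tables") is not None


if pytables_available():
    print("PyTables found: to_hdf() and read_hdf() should work.")
else:
    print("PyTables missing: run 'pip install tables' first.")
```

Without PyTables, Pandas raises an ImportError as soon as you try to open an HDF5 file.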


📌 Summary – Recap & Next Steps

HDF5 is a robust choice for working with large tabular datasets. With Pandas, you get fast reads and writes, compression, multiple tables in one file, and SQL-like row filtering, which makes it a good fit for scientific computing and machine learning pipelines.

🔍 Key Takeaways:

  • Use to_hdf() / read_hdf() for high-performance disk storage
  • Store multiple tables with unique keys
  • Enable format='table' and set data_columns for fast queries using where
  • Compress data with complevel and complib

⚙️ Real-world relevance: Used in finance, sensor data, genomics, simulations, and batch ML pipelines.


❓ FAQs – HDF5 with Pandas

❓ What is the default format in to_hdf()?
'fixed' – faster to write but not queryable. Use 'table' for filtering.

❓ Can I overwrite a key in an HDF5 file?
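✅ Yes. Use:

df.to_hdf('file.h5', key='mydata', mode='a')

With the default append=False, this replaces the key if it exists and leaves the other keys in the file untouched. To add rows to an existing key instead, save with format='table' and append=True.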
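The difference between replacing a key and appending to it can be checked by counting rows. The sketch below assumes PyTables is installed; the helper row_count_after and the sample frames are invented for illustration.

```python
import os
import tempfile

import pandas as pd

first = pd.DataFrame({"value": [1, 2, 3]})
second = pd.DataFrame({"value": [4, 5]})


def row_count_after(append):
    """Write `first`, then write `second` under the same key; return the final row count."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "file.h5")  # throwaway path for this sketch
        first.to_hdf(path, key="mydata", mode="w", format="table")
        second.to_hdf(path, key="mydata", mode="a", format="table", append=append)
        return len(pd.read_hdf(path, key="mydata"))


print(row_count_after(append=False))  # key replaced: only the rows of `second`
print(row_count_after(append=True))   # rows of `second` added after those of `first`
```

Appending requires format='table'; the 'fixed' format does not support adding rows to an existing key.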

❓ What makes HDF5 better than CSV or Excel?
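✅ It is a binary format, so large numeric tables are typically faster to read and write than with CSV or Excel. It also supports compression, multiple datasets in one file, and row-level queries.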

❓ How to list all keys in an HDF5 file?

with pd.HDFStore('data.h5') as store:
    print(store.keys())

❓ Can I use HDF5 for non-numeric data?
✅ Yes, but it’s best for numerical, structured tabular data.

