
Pandas HDF5 Format Support – Store and Query Large Datasets Efficiently


Introduction – Why Use HDF5 with Pandas?

HDF5 (Hierarchical Data Format v5) is a binary file format designed to store and organize large numerical datasets. It’s ideal for high-performance read/write operations, supports compression, and can store multiple DataFrames in one file. Pandas provides built-in functions to read/write HDF5 files via the PyTables library.

In this guide, you’ll learn:

  • How to use to_hdf() and read_hdf() in Pandas
  • How to work with multiple DataFrames in a single HDF5 file
  • How to run fast queries using where selectors
  • How to optimize HDF5 usage with compression and format options

1. Save a DataFrame to HDF5 Using to_hdf()

import pandas as pd

df = pd.DataFrame({
    'Name': ['Alice', 'Bob', 'Charlie'],
    'Score': [85, 90, 78]
})

df.to_hdf('data.h5', key='students', mode='w')

key acts like a name for the DataFrame inside the HDF5 file. mode='w' creates the file and replaces any existing file with the same name.


2. Load HDF5 Data Using read_hdf()

df2 = pd.read_hdf('data.h5', key='students')
print(df2)

Output:

      Name  Score
0    Alice     85
1      Bob     90
2  Charlie     78

Use the same key that was used when saving. The key can be omitted only when the file contains a single dataset.


3. File Modes

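Mode    Description
'w'     Write: create a new file, replacing any existing file with the same name
'a'     Append: open the file for reading and writing, creating it if needed (default for to_hdf())
'r'     Read-only (default for read_hdf())
'r+'    Read/write: like 'a', but the file must already exist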
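
The difference between 'w' and 'a' is easiest to see by listing the keys after each write. The following is a minimal sketch, assuming PyTables is installed; the temporary path and the DataFrame names first and second are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# Throwaway file in a temporary directory (illustrative path only).
path = os.path.join(tempfile.mkdtemp(), "modes_demo.h5")

first = pd.DataFrame({"x": [1, 2, 3]})
second = pd.DataFrame({"y": [10.0, 20.0]})

# mode='w' starts a fresh file that contains only the key 'first'.
first.to_hdf(path, key="first", mode="w")

# mode='a' keeps what is already in the file and adds a second key.
second.to_hdf(path, key="second", mode="a")
with pd.HDFStore(path, mode="r") as store:
    keys_after_append = sorted(store.keys())

# mode='w' again discards the whole file, so 'first' is gone.
second.to_hdf(path, key="second", mode="w")
with pd.HDFStore(path, mode="r") as store:
    keys_after_overwrite = sorted(store.keys())

print(keys_after_append)
print(keys_after_overwrite)
```

Opening the store with mode='r' for the inspection step avoids any accidental modification of the file.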

4. Save Multiple DataFrames in One HDF5 File

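# df1 and df2 stand for any two DataFrames you want to keep in the same file
df1.to_hdf('store.h5', key='students')          # default mode='a' creates the file if needed
df2.to_hdf('store.h5', key='grades', mode='a')  # append mode: adds a second key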

Use different keys to store multiple tables in one file.
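
As a self-contained illustration of storing two tables side by side, here is a small sketch. The students and grades DataFrames and the temporary path are assumptions made for the example, not part of the original snippet.

```python
import os
import tempfile

import pandas as pd

# Illustrative temporary file so the example leaves nothing behind in the
# working directory.
path = os.path.join(tempfile.mkdtemp(), "store.h5")

students = pd.DataFrame({"Name": ["Alice", "Bob"], "Score": [85, 90]})
grades = pd.DataFrame({"Name": ["Alice", "Bob"], "Grade": ["B", "A"]})

# The first write creates the file; the second adds another key to it.
students.to_hdf(path, key="students", mode="w")
grades.to_hdf(path, key="grades", mode="a")

# Each DataFrame is read back independently through its own key.
students_back = pd.read_hdf(path, key="students")
grades_back = pd.read_hdf(path, key="grades")

print(students_back)
print(grades_back)
```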


5. Queryable (Table) Format for Efficient Access

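df.to_hdf('data.h5', key='students', format='table', data_columns=['Score'], mode='w')

Table format is required for row-wise querying and filtering using selectors. Any column you want to filter on (here Score) must also be listed in data_columns; without it, only the index can be used in a where expression.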


6. Query HDF5 Data with Conditions

df_filtered = pd.read_hdf('data.h5', key='students', where='Score > 80')
print(df_filtered)

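The filter is evaluated on disk, so only matching rows are loaded, which keeps queries fast even with millions of rows. This only works with format='table', and the column named in where must have been written as a data column (data_columns=...).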
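
A complete write-then-query round trip might look like the sketch below. It assumes PyTables is installed and uses a temporary file; the variable names high_scores and later_rows are illustrative. The second query combines a data column condition with a condition on the index and restricts the returned columns.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "query_demo.h5")

df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],
    "Score": [85, 90, 78],
})

# format='table' makes the dataset queryable; data_columns makes 'Score'
# usable inside a where expression.
df.to_hdf(path, key="students", mode="w", format="table",
          data_columns=["Score"])

# Only rows with Score above 80 are read from disk.
high_scores = pd.read_hdf(path, key="students", where="Score > 80")

# Conditions can be combined, and 'columns' limits what is returned.
later_rows = pd.read_hdf(
    path,
    key="students",
    where="(Score > 80) & (index >= 1)",
    columns=["Name"],
)

print(high_scores)
print(later_rows)
```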


7. Enable Compression for Smaller File Sizes

df.to_hdf('data_compressed.h5', key='students', complevel=9, complib='zlib')

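Supported complib values: 'zlib' (the default), 'bzip2', 'lzo', 'blosc'. complevel ranges from 0 (no compression) to 9 (strongest compression, slowest writes).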
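
To check what compression buys you, compare file sizes with and without it. This is a rough sketch using deliberately repetitive synthetic data, which compresses far better than typical real-world data; the column names and paths are invented for the example.

```python
import os
import tempfile

import numpy as np
import pandas as pd

tmp_dir = tempfile.mkdtemp()
plain_path = os.path.join(tmp_dir, "plain.h5")
packed_path = os.path.join(tmp_dir, "packed.h5")

# Highly repetitive values: a best case for any compressor.
df = pd.DataFrame({
    "sensor": np.zeros(200_000),
    "reading": np.arange(200_000) % 10,
})

# Same data written twice: once uncompressed, once with zlib at level 9.
df.to_hdf(plain_path, key="data", mode="w")
df.to_hdf(packed_path, key="data", mode="w", complevel=9, complib="zlib")

plain_size = os.path.getsize(plain_path)
packed_size = os.path.getsize(packed_path)
print(f"uncompressed: {plain_size} bytes, compressed: {packed_size} bytes")

# Compression is transparent on read: no extra arguments are needed.
restored = pd.read_hdf(packed_path, key="data")
```

Higher compression levels trade write speed for file size, so it is worth measuring both on your own data before settling on a setting.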


8. Install Required Library

pip install tables

Pandas HDF5 support depends on the PyTables backend, which is published on PyPI under the name tables.


Summary – Recap & Next Steps

HDF5 is a robust choice for working with big tabular data. With Pandas, you get fast reads/writes, compression, multiple table storage, and SQL-like querying, which makes it a good fit for scientific computing and machine learning pipelines.

Key Takeaways:

  • Use to_hdf() / read_hdf() for high-performance disk storage
  • Store multiple tables with unique keys
  • Enable format='table' for fast queries using where
  • Compress data with complevel and complib

Real-world relevance: Used in finance, sensor data, genomics, simulations, and batch ML pipelines.


FAQs – HDF5 with Pandas

What is the default format in to_hdf()?
'fixed' – faster to write but not queryable. Use 'table' for filtering.

Can I overwrite a key in an HDF5 file?
Yes. Use:

df.to_hdf('file.h5', key='mydata', mode='a')

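With the default append=False, this replaces the key if it exists and leaves the other keys in the file untouched. To add rows to an existing dataset instead, write with format='table' and append=True.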
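
The sketch below contrasts replacing a key with appending rows to it. It assumes PyTables is installed; df_v1, df_v2 and the temporary path are illustrative names only.

```python
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), "overwrite_demo.h5")

df_v1 = pd.DataFrame({"value": [1, 2]})
df_v2 = pd.DataFrame({"value": [3]})

# Initial write in table format so that appending is possible later.
df_v1.to_hdf(path, key="mydata", mode="w", format="table")

# Same key, default append=False: the stored data is replaced.
df_v2.to_hdf(path, key="mydata", mode="a", format="table")
replaced = pd.read_hdf(path, key="mydata")

# Same key, append=True: the new rows are added after the existing ones.
df_v1.to_hdf(path, key="mydata", mode="a", format="table", append=True)
appended = pd.read_hdf(path, key="mydata")

print(replaced["value"].tolist())
print(appended["value"].tolist())
```

Note that appended rows keep their original index labels, so the index of the combined dataset is not necessarily unique.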

What makes HDF5 better than CSV or Excel?
It’s faster, binary, compressible, and supports multiple datasets and queries.

How to list all keys in an HDF5 file?

with pd.HDFStore('data.h5') as store:
    print(store.keys())

Can I use HDF5 for non-numeric data?
Yes, but it’s best for numerical, structured tabular data.

