Skip to content

aalhadsawane/gpuIO

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 

History

14 Commits
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Automater for running Experiments

Just follow this readme and run the code snippets, everything is automated.

ALL data including raw CSV, graphs, decision tree PDFs, rules of the tree and node info for each experiment run are stored in the CSV folder of the repo.

0. directory structure

This code assumes a user called gpuio is present in the/home directory.

cd /home/gpuio/
git clone https://github.com/aalhadsawane/gpuIO.git
cd gpuIO

this folder (gpuIO) is recommended workspace folder

1. install hdf5 with vfd_gds support

./build_hdf5_with_vfd_gds.sh

Build Instructions:

  • This script builds HDF5 with MPI and Direct VFD (vfd_gds) support
  • Installation location: /home/gpuio/gpuIO/hdf5_install (within workspace)
  • Builds from source in /home/gpuio/hdf5_build/hdf5
  • Automatically sets up environment variables in ~/.bashrc

Key HDF5 build flags used:

cmake -DCMAKE_INSTALL_PREFIX=$HDF5_HOME \
      -DCMAKE_C_COMPILER=mpicc \           # Use MPI compiler
      -DHDF5_ENABLE_PARALLEL=ON \          # Enable MPI parallel I/O
      -DHDF5_ENABLE_THREADSAFE=ON \        # Enable thread safety
      -DHDF5_ENABLE_DIRECT_VFD=ON \        # Enable Direct VFD (vfd_gds)
      -DHDF5_ENABLE_Z_LIB_SUPPORT=ON \     # Enable compression
      -DHDF5_ENABLE_SZIP_SUPPORT=ON \      # Enable SZIP compression
      -DALLOW_UNSUPPORTED=ON \             # Allow experimental features
      ..

Installation location: /home/gpuio/gpuIO/hdf5_install (within workspace)

2. install h5bench - the benchmark suite (with CUDA support)

./build_h5bench.sh

Build Instructions:

  • This script builds a custom version of h5bench with CUDA memory allocation support
  • Uses the HDF5 installation from step 1 (/home/gpuio/gpuIO/hdf5_install)
  • Builds in benchmarks/h5bench/build_cuda/ directory
  • Links against CUDA runtime libraries for GPU memory allocation
  • All executables are built with MPI and CUDA support

This builds a custom version of h5bench with CUDA memory allocation support for GPU Direct Storage benchmarking.

What's different in our custom h5bench:

πŸ”§ Core Modifications Made

1. Memory Allocation Changes

  • Replaced all malloc()/calloc() with cudaMalloc() in dataset buffer allocation
  • Replaced all free() with cudaFree() for GPU memory deallocation
  • Added cudaMemcpy() for data initialization (CPUβ†’GPU transfer)
  • Files modified:
    • commons/h5bench_util.c: Core memory allocation functions
    • h5bench_patterns/h5bench_write.c: Write pattern data preparation
    • h5bench_patterns/h5bench_read.c: Read pattern memory allocation
    • h5bench_patterns/h5bench_append.c: Append pattern memory allocation

2. Build System Changes

  • Updated CMakeLists.txt to include CUDA support:
    • Added find_package(CUDA REQUIRED)
    • Linked CUDA libraries to all executables
    • Added CUDA include directories
    • Fixed HDF5 include/library paths for proper linking

3. Read Operations Enhancement

  • Read operations now work with GPU memory (not just writes)
  • HDF5 reads directly into GPU VRAM when using vfd_gds
  • Data remains on GPU for processing after read operations
  • Both read and write patterns use the same GPU memory allocation strategy

4. vfd_gds Integration

  • Added programmatic vfd_gds configuration in h5bench_util.c
  • Automatic detection of HDF5_DRIVER=gds environment variable
  • Graceful fallback to traditional HDF5 driver when vfd_gds unavailable
  • Runtime switching between GPU Direct Storage and traditional paths

5. Data Path Comparison

  • Traditional Path: SSD β†’ CPU DRAM β†’ GPU VRAM (via cudaMemcpy)
  • Direct Path: SSD β†’ GPU VRAM (via vfd_gds + GPU memory allocation)
  • Automatic benchmarking of both paths for performance comparison

πŸ“ Code Change Examples

Before (Stock h5bench):

// CPU memory allocation
void *buf = malloc(size);
// ... use buf for HDF5 operations ...
free(buf);

After (Custom h5bench):

// GPU memory allocation
void *buf;
cudaError_t err = cudaMalloc(&buf, size);
if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    exit(1);
}
// ... use buf for HDF5 operations (now in GPU VRAM) ...
cudaFree(buf);

Data Initialization Pattern:

// Allocate and initialize on CPU first
float *data_cpu = malloc(particle_cnt * sizeof(float));
// ... initialize data_cpu with random values ...

// Allocate GPU memory
float *data_gpu;
cudaMalloc(&data_gpu, particle_cnt * sizeof(float));

// Copy from CPU to GPU
cudaMemcpy(data_gpu, data_cpu, particle_cnt * sizeof(float), cudaMemcpyHostToDevice);

// Free CPU memory
free(data_cpu);

// Use data_gpu for HDF5 operations

πŸ” Key Technical Details

  • Memory Alignment: cudaMalloc() provides 4KB alignment required for GDS I/O
  • Error Handling: All CUDA operations include proper error checking
  • Memory Management: CPU structs still use malloc(), GPU data arrays use cudaMalloc()
  • Compatibility: Falls back gracefully when vfd_gds is not available
  • Performance: Enables true zero-copy SSDβ†’GPU transfers when using vfd_gds

H5bench Data Pattern Numbering

The benchmark uses specific pattern numbers that determine which memory preparation function is called:

typedef enum write_pattern {
    WRITE_PATTERN_INVALID,     // 0
    CONTIG_CONTIG_1D,          // 1 - calls prepare_data_contig_1D()
    CONTIG_COMPOUND_1D,        // 2
    COMPOUND_CONTIG_1D,        // 3
    COMPOUND_COMPOUND_1D,      // 4
    CONTIG_CONTIG_STRIDED_1D,  // 5
    CONTIG_CONTIG_2D,          // 6
    CONTIG_COMPOUND_2D,        // 7
    COMPOUND_CONTIG_2D,        // 8
    COMPOUND_COMPOUND_2D,      // 9
    CONTIG_CONTIG_3D,          // 10 - calls prepare_data_contig_3D()
} write_pattern;

Important: The benchmark configuration MEM_PATTERN=CONTIG and FILE_PATTERN=CONTIG with NUM_DIMS=3 maps to pattern 10 (CONTIG_CONTIG_3D), not pattern 1 (CONTIG_CONTIG_1D). This is why we modified both prepare_data_contig_1D() and prepare_data_contig_3D() functions.

GPU Memory Verification

True verification of GPU usage is through nvidia-smi dmon (not snapshots):

# Real-time GPU monitoring (better than nvidia-smi snapshots)
nvidia-smi dmon -s pucvmet -c 100

# Or use our monitoring script
./monitor_gpu.sh -d 3600 -i 5

Expected output when GPU memory is allocated:

# gpu   pwr gtemp mtemp    sm   mem   enc   dec  mclk  pclk
# Idx     W     C     C     %     %     %     %   MHz   MHz
    0    22    42     -     0    42     0     0  6001   300

The mem column shows GPU memory utilization percentage, and pwr shows increased power consumption.

πŸ“ Detailed File Modifications

CMakeLists.txt

  • Added find_package(CUDA REQUIRED) for CUDA detection
  • Linked ${CUDA_LIBRARIES} to all executable targets
  • Added ${CUDA_INCLUDE_DIRS} for CUDA headers
  • Fixed HDF5 include/library paths for proper compilation

commons/h5bench_util.c

  • prepare_contig_memory(): Replaced malloc() with cudaMalloc() for all data arrays
  • prepare_contig_memory_multi_dim(): Same GPU memory allocation pattern
  • free_contig_memory(): Replaced free() with cudaFree() for GPU arrays
  • configure_vfd_gds_fapl(): New function for programmatic vfd_gds setup

h5bench_patterns/h5bench_write.c

  • prepare_data_interleaved(): CPU allocation β†’ initialization β†’ GPU copy β†’ CPU free
  • prepare_data_contig_1D(): Multiple GPU arrays with individual CPUβ†’GPU transfers
  • data_free(): Updated to use cudaFree() for GPU-allocated data

h5bench_patterns/h5bench_read.c

  • Added CUDA headers for GPU memory support
  • Read operations now work with GPU-allocated buffers

h5bench_patterns/h5bench_append.c

  • Added CUDA headers for consistency
  • Append operations support GPU memory allocation

commons/h5bench_util.h

  • Added declaration for configure_vfd_gds_fapl() function

βœ… Build Status

Custom h5bench with CUDA support has been successfully built!

Available Executables:

  • h5bench_write - Main write benchmark with CUDA memory allocation
  • h5bench_read - Read benchmark with GPU memory support
  • h5bench_append - Append benchmark with CUDA support
  • h5bench_overwrite - Overwrite benchmark with GPU memory
  • h5bench_write_unlimited - Unlimited write benchmark
  • h5bench_write_var_normal_dist - Variable normal distribution write

Build Requirements Met:

  • βœ… MPI-enabled HDF5 installed (libhdf5-openmpi-dev)
  • βœ… CUDA runtime and libraries linked
  • βœ… All memory allocation functions converted to GPU memory
  • βœ… MPI headers added to all pattern files
  • βœ… Build configuration updated for OpenMPI HDF5

βœ… Final Build Status:

The custom h5bench with CUDA support has been successfully built! All executables are working correctly with proper MPI support.

Key achievements:

  • βœ… Full MPI I/O support: Using custom-built HDF5 with MPI enabled
  • βœ… CUDA memory allocation: All dataset buffers use GPU memory (cudaMalloc)
  • βœ… GPU Direct Storage ready: vfd_gds plugin integration prepared
  • βœ… All executables functional: Tested and working correctly

Remaining warnings (non-critical):

  • Format string mismatches (cosmetic, doesn't affect functionality)
  • Volatile qualifier warnings (non-critical, doesn't affect functionality)

The build is production-ready for GPU Direct Storage benchmarking!

πŸš€ What This Enables

Performance Benefits

  • Zero-copy transfers: Direct SSDβ†’GPU VRAM when using vfd_gds
  • Reduced memory bandwidth: Eliminates CPU DRAM bottleneck
  • Lower latency: Fewer memory copies in the data path
  • Higher throughput: GPU memory bandwidth utilization

Benchmarking Capabilities

  • Direct comparison: Traditional vs GPU Direct Storage paths
  • Real-world scenarios: Actual HDF5 workload performance testing
  • Scalability analysis: Multi-threaded GPU I/O performance
  • Bottleneck identification: CPU vs GPU memory transfer analysis

Research Applications

  • GPU Direct Storage evaluation: Performance characteristics of vfd_gds
  • HDF5 optimization: Memory allocation strategy impact
  • I/O pattern analysis: Different access patterns with GPU memory
  • System tuning: Optimal configuration for GPU-accelerated HDF5

NOTE: What has been done till now (Step 0 to 2) needs to be done on a node only once as setup.

The steps ahead will be reused every experiment.

3. Set experiment parameters

Edit benchmark_config.conf in source dir of repo.

Set number of IO threads, dataset sizes, block sizes and modes (GPU/CPU) there.

Available modes:

  • GPU: Uses GPU Direct Storage (vfd_gds) for direct SSD β†’ GPU VRAM transfers
  • CPU: Uses traditional path (SSD β†’ CPU DRAM β†’ GPU VRAM via cudaMemcpy)

NOTE: These IO threads, are the hardware threads not virtual and for colvas they max out to 20. Do not set number of io threads more than 20.

4. Run and get h5bench data in csv

For WRITE benchmarks:

nohup ./run_h5bench_write.sh > h5bench_write_output.log 2>&1 &

For READ benchmarks:

nohup ./run_h5bench_read.sh > h5bench_read_output.log 2>&1 &

With GPU Monitoring (recommended):

# Run with GPU VRAM monitoring to confirm CUDA memory allocation
./run_benchmark_with_monitoring.sh run_h5bench_write.sh
./run_benchmark_with_monitoring.sh run_h5bench_read.sh

Manual GPU monitoring:

# Show current GPU status
./monitor_gpu.sh -s

# Monitor GPU during benchmarks (optimized for long runs)
./monitor_gpu.sh                    # Monitor for 2 hours (10s intervals)
./monitor_gpu.sh -d 3600           # Monitor for 1 hour
./monitor_gpu.sh -i 5              # Sample every 5 seconds

# Analyze existing monitoring data
./monitor_gpu.sh -a                # Analyze gpu_demon.log
./monitor_gpu.sh -a custom.log     # Analyze custom log file

What the monitoring tracks:

  • GPU Memory: VRAM usage (confirms CUDA memory allocation)
  • GPU Utilization: SM%, Memory bandwidth usage
  • GPU Clocks: Graphics, Memory, SM clock speeds
  • Temperature & Power: Thermal and power consumption
  • CPU Usage: Overall CPU utilization and load average
  • Processes: Running processes on GPU (PIDs, memory usage)
  • Timestamps: Precise timing for correlation with benchmarks

The scripts automatically:

  • Run both GPU and CPU modes as configured in benchmark_config.conf
  • Switch between vfd_gds (GPU Direct Storage) and traditional paths
  • Compare performance between direct SSDβ†’GPU vs SSDβ†’CPUβ†’GPU transfers

The output (raw_output.csv) is stored in a new folder /home/gpuio/gpuIO/CSV/Run*/

where Run* indicates latest Run folder (made automatically)

A txt file called nodename.txt inside the Run folder is also made which stores the hostname (ex colva2).

u can check the h5bench_*_output.log file to check where the experiment has reached. (Look for Benchmark progress).

To kill the experiment midway: ps aux | grep run_h5bench and kill -9 the PID. You can also send a SIGINT (ctrl+c) signal with kill -2 to skip a single benchmark that might be stuck.

5. Plot the csv

Create a venv with

virtualenv .venv
source .venv/bin/activate
pip install -r requirements.txt

My plotting script is given as plot.py in source dir of repo. Please make a copy of it if u wanna edit it; dont change this one.

Usage:

sudo apt-get install graphviz
python plot.py

It will take the input csv as CSV/Run*/raw_output.csv which is the latest output CSV made by the run_h5bench.sh script.

It will store all PNG images in CSV/Run*/graphs where Run* indicates latest Run.

It will also make a decision tree as a PDF and a txt file defining rules of the decision tree in CSV/RUN*,

###$ Custom csv path and output path:

plot.py takes 2 arguments:

--data-path : default it will use the raw_output.csv in latest run folder CSV/Run*

--output-dir : default it will make a graphs dir in the latest run folder CSV/Run*

example of using custom output directory for graphs:

python plot.py --output-dir /home/gpuio/gpuIO/custom_graphs

6. Push to github

If u make any large file, please make sure to add them to .gitignore to avoid committing them to the repository.

Push after every succesful run.

7. Handling Anomalies in Benchmark Results

Sometimes individual benchmark configurations may fail or produce anomalous results (blank entries or 0 values) in the raw_output.csv file. writing a script to handle that, should be ready soon.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors