[Issue]: using hipMemcpy leading to a SIGSEGV crash

### Problem Description

When executing a program compiled with `hipcc` it crashes after attempting to call hipMemcpy

```bash
$ hipcc test.cu
$ AMD_LOG_LEVEL=7  ./a.out
Max size: 4096
:3:rocdevice.cpp            :468 : 0865319405 us: [pid:3017  tid:0x7782cc1ff180] Initializing HSA stack.
:3:rocdevice.cpp            :554 : 0865326945 us: [pid:3017  tid:0x7782cc1ff180] Enumerated GPU agents = 1
:3:rocdevice.cpp            :232 : 0865326984 us: [pid:3017  tid:0x7782cc1ff180] Numa selects cpu agent[0]=0x55e59ce8df10(fine=0x55e59ce89ac0,coarse=0x55e59ce89fe0) for gpu agent=0x55e59ce8a880 CPU<->GPU XGMI=0
:3:rocsettings.cpp          :290 : 0865326992 us: [pid:3017  tid:0x7782cc1ff180] Using dev kernel arg wa = 0
:3:comgrctx.cpp             :33  : 0865326997 us: [pid:3017  tid:0x7782cc1ff180] Loading COMGR library.
:3:comgrctx.cpp             :126 : 0865327028 us: [pid:3017  tid:0x7782cc1ff180] Loaded COMGR library version 2.8.
:3:rocdevice.cpp            :1809: 0865327202 us: [pid:3017  tid:0x7782cc1ff180] Gfx Major/Minor/Stepping: 8/0/3
:3:rocdevice.cpp            :1811: 0865327209 us: [pid:3017  tid:0x7782cc1ff180] HMM support: 0, XNACK: 0, Direct host access: 0
:3:rocdevice.cpp            :1813: 0865327214 us: [pid:3017  tid:0x7782cc1ff180] Max SDMA Read Mask: 0x3, Max SDMA Write Mask: 0x3
:4:rocdevice.cpp            :2221: 0865327682 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x5003ff000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865328294 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x5006ff000, size 0x101000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865329215 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x500c00000, size 0x400000, numa_node = 0
:4:rocdevice.cpp            :2221: 0865329377 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa host memory 0x204000, size 0x38, numa_node = 0
:4:runtime.cpp              :85  : 0865329459 us: [pid:3017  tid:0x7782cc1ff180] init
:3:hip_context.cpp          :49  : 0865329465 us: [pid:3017  tid:0x7782cc1ff180] Direct Dispatch: 1
:3:hip_event.cpp            :333 : 0865329614 us: [pid:3017  tid:0x7782cc1ff180]  hipEventCreate ( 0x7fff7f99f7e0 )
:3:hip_event.cpp            :339 : 0865329630 us: [pid:3017  tid:0x7782cc1ff180] hipEventCreate: Returned hipSuccess : event:0x55e59cebf370
:3:hip_event.cpp            :333 : 0865329636 us: [pid:3017  tid:0x7782cc1ff180]  hipEventCreate ( 0x7fff7f99f7d8 )
:3:hip_event.cpp            :339 : 0865329642 us: [pid:3017  tid:0x7782cc1ff180] hipEventCreate: Returned hipSuccess : event:0x55e59cebf640
:3:hip_memory.cpp           :615 : 0867968592 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f768, 67108864 )
:4:rocdevice.cpp            :2379: 0867969348 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x501400000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867969356 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xfc000000
:3:hip_memory.cpp           :617 : 0867969368 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x501400000: duration: 776 us
:3:hip_memory.cpp           :615 : 0867969374 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f760, 67108864 )
:4:rocdevice.cpp            :2379: 0867969806 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x505800000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867969814 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf8000000
:3:hip_memory.cpp           :617 : 0867969818 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x505800000: duration: 444 us
:3:hip_memory.cpp           :615 : 0867969826 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f710, 67108864 )
:4:rocdevice.cpp            :2379: 0867970266 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x509c00000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867970274 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf4000000
:3:hip_memory.cpp           :617 : 0867970279 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x509c00000: duration: 453 us
:3:hip_memory.cpp           :615 : 0867970285 us: [pid:3017  tid:0x7782cc1ff180]  hipMalloc ( 0x7fff7f99f7d0, 67108864 )
:4:rocdevice.cpp            :2379: 0867970647 us: [pid:3017  tid:0x7782cc1ff180] Allocate hsa device memory 0x50e000000, size 0x4000000
:3:rocdevice.cpp            :2418: 0867970654 us: [pid:3017  tid:0x7782cc1ff180] Device=0x55e59ceb7330, freeMem_ = 0xf0000000
:3:hip_memory.cpp           :617 : 0867970658 us: [pid:3017  tid:0x7782cc1ff180] hipMalloc: Returned hipSuccess : 0x50e000000: duration: 373 us
:3:hip_memory.cpp           :690 : 0867970676 us: [pid:3017  tid:0x7782cc1ff180]  hipMemcpy ( 0x501400000, 0x7782affff010, 67108864, hipMemcpyHostToDevice )
:3:rocdevice.cpp            :3026: 0867970693 us: [pid:3017  tid:0x7782cc1ff180] Number of allocated hardware queues with low priority: 0, with normal priority: 0, with high priority: 0, maximum per priority is: 4
:4:command.cpp              :347 : 0868002516 us: [pid:3017  tid:0x7782cc1ff180] Command (CopyHostToDevice) enqueued: 0x55e59cec0ef0
fish: Job 1, 'AMD_LOG_LEVEL=7  ./a.out' terminated by signal SIGSEGV (Address boundary error)

$ sudo dmesg | tail
[  867.972499] a.out[3017]: segfault at 18 ip 00007782ccbca89d sp 00007fff7f99ef40 error 4 in libamdhip64.so.6.2.41134-65d174c3e[3ca89d,7782cc824000+430000] likely on CPU 10 (core 5, socket 0)
[  867.972514] Code: 00 31 f6 48 83 c5 08 e8 31 04 00 00 49 39 ec 75 ec 48 8b 83 48 01 00 00 48 8b a8 a0 01 00 00 48 8b 05 f7 64 1c 00 64 48 8b 10 <48> 8b 45 18 4c 8d 65 18 a8 01 0f 85 9b 03 00 00 48 89 c1 48 83 c9
[ 1561.179983] a.out[3682]: segfault at 18 ip 00007a57651b0f3d sp 00007ffce2aba720 error 4 in libamdhip64.so.6.1.40093-bd86f1708[3b0f3d,7a5764e23000+416000] likely on CPU 11 (core 6, socket 0)
[ 1561.179998] Code: 00 31 f6 48 83 c5 08 e8 61 04 00 00 49 39 ec 75 ec 48 8b 83 48 01 00 00 48 8b a8 a0 01 00 00 48 8b 05 67 de 1b 00 64 48 8b 10 <48> 8b 45 18 4c 8d 65 18 a8 01 0f 85 9b 03 00 00 48 89 c1 48 83 c9

```

The same program when run on `gfx90c` runs successfully.

I attempted to try to use the patch at [clr:pull#97](https://github.com/ROCm/clr/pull/97) but it still doesnt work.

I know `gfx803` is long deprecated but thats what I have at the moment, also I have successfully used the same device earlier(3-4 months) to run similar code which compiled and worked right away then.   


This is the program I was running, it fails at `line:78` at `hipMemcpy`  
```c
#include <hip/hip_runtime.h>
#include <stdio.h>
#include <stdlib.h>

#define CEIL_DIV(a, b) (((a) + (b) - 1) / (b))

#define hipCheck(ans) { hipAssert((ans), __FILE__, __LINE__); }
inline void hipAssert(hipError_t code, const char *file, int line, bool abort = true) {
    if (code != hipSuccess) {
        printf("HIP error: %s in file %s at line %d\n", hipGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

__global__ void sgemm_naive(int M, int N, int K, float alpha, const float *A,
                            const float *B, float beta, float *C) {
  const uint x = blockIdx.x * blockDim.x + threadIdx.x;
  const uint y = blockIdx.y * blockDim.y + threadIdx.y;

  // if statement is necessary to make things work under tile quantization
  if (x < M && y < N) {
    float tmp = 0.0;
    for (int i = 0; i < K; ++i) {
      tmp += A[x * K + i] * B[i * N + y];
    }
    // C = α*(A@B)+β*C
    C[x * N + y] = alpha * tmp + beta * C[x * N + y];
  }
}
void run_sgemm_naive(int M, int N, int K, float alpha, float *A, float *B,
                     float beta, float *C) {
  dim3 gridDim(CEIL_DIV(M, 32), CEIL_DIV(N, 32));
  dim3 blockDim(32, 32);
  sgemm_naive<<<gridDim, blockDim>>>(M, N, K, alpha, A, B, beta, C);
}

void randomize_matrix(float *mat, int N) {
	for (int i = 0; i < N; i++) {
		float tmp = (float)(rand() % 5) + 0.01 * (rand() % 5);
		tmp = (rand() % 2 == 0) ? tmp : tmp * (-1.);
		mat[i] = tmp;
	}
}

int main() {
	int N = 6;
        int SIZE[] = {128, 256, 512, 1024, 2048, 4096};

	long m, n, k, max_size;
	max_size = SIZE[N - 1];
	printf("Max size: %ld\n", max_size);

	float alpha = 0.5, beta = 3.0; // GEMM input parameters, C=α*AB+β*C

	float *A = nullptr, *B = nullptr, *C = nullptr, *C_ref = nullptr; // host matrices
	float *dA = nullptr, *dB = nullptr, *dC = nullptr, *dC_ref = nullptr; // device matrices

	float elapsed_time;
	hipEvent_t beg, end;
	hipCheck(hipEventCreate(&beg));
	hipCheck(hipEventCreate(&end));

	A = (float *)malloc(sizeof(float) * max_size * max_size);
	B = (float *)malloc(sizeof(float) * max_size * max_size);
	C = (float *)malloc(sizeof(float) * max_size * max_size);
	C_ref = (float *)malloc(sizeof(float) * max_size * max_size);


	randomize_matrix(A, max_size * max_size);
	randomize_matrix(B, max_size * max_size);
	randomize_matrix(C, max_size * max_size);

	hipCheck(hipMalloc(&dA, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dB, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dC, sizeof(float) * max_size * max_size));
	hipCheck(hipMalloc(&dC_ref, sizeof(float) * max_size * max_size));

	hipCheck(hipMemcpy(dA, A, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dB, B, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dC, C, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
	hipCheck(hipMemcpy(dC_ref, C_ref, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));

	int repeat_times = 5;
	for (int s=0; s<N; s++) {
		int size = SIZE[s];
		m = n = k = size;
		
		printf("dimensions(m=n=k) %ld, alpha: %f, beta: %f\n", m, alpha, beta);
		
		hipCheck(hipEventRecord(beg));

		for (int j = 0; j < repeat_times; j++) {
			// We don't reset dC between runs to save time
			run_sgemm_naive(m, n, k, alpha, dA, dB, beta, dC);
		}

		hipCheck(hipEventRecord(end));
		hipCheck(hipEventSynchronize(beg));
		hipCheck(hipEventSynchronize(end));
		hipCheck(hipEventElapsedTime(&elapsed_time, beg, end));
		elapsed_time /= 1000.; // Convert to seconds

		long flops = 2 * m * n * k;
		printf(
        "Average elapsed time: (%7.6f) s, performance: (%7.1f) GFLOPS. size: "
        "(%ld).\n",
        elapsed_time / repeat_times,
        (repeat_times * flops * 1e-9) / elapsed_time, m);
	    fflush(stdout);

		hipCheck(hipMemcpy(dC, dC_ref, sizeof(float) * m * n, hipMemcpyDeviceToDevice));

	}

	free(A);
	free(B);
	free(C);
	free(C_ref);
	hipCheck(hipFree(dA));
	hipCheck(hipFree(dB));
	hipCheck(hipFree(dC));
	hipCheck(hipFree(dC_ref));

    return 0;
}
```

### Operating System

Arch Linux

### CPU

AMD Ryzen 5 1600 Six-Core Processor

### GPU

AMD Radeon RX 570 Series

### ROCm Version

ROCm 6.1.0

### ROCm Component

clr

### Steps to Reproduce

1. Compile a program with hipMemcpy using `hipcc`
```
	hipCheck(hipMemcpy(dA, A, sizeof(float) * max_size * max_size, hipMemcpyHostToDevice));
```
2. Execute the program

### (Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded
=====================
HSA System Attributes
=====================
Runtime Version:         1.1
Runtime Ext Version:     1.6
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE
System Endianness:       LITTLE
Mwaitx:                  DISABLED
DMAbuf Support:          YES

==========
HSA Agents
==========
*******
Agent 1
*******
  Name:                    AMD Ryzen 5 1600 Six-Core Processor
  Uuid:                    CPU-XX
  Marketing Name:          AMD Ryzen 5 1600 Six-Core Processor
  Vendor Name:             CPU
  Feature:                 None specified
  Profile:                 FULL_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        0(0x0)
  Queue Min Size:          0(0x0)
  Queue Max Size:          0(0x0)
  Queue Type:              MULTI
  Node:                    0
  Device Type:             CPU
  Cache Info:
    L1:                      32768(0x8000) KB
  Chip ID:                 0(0x0)
  ASIC Revision:           0(0x0)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   3200
  BDFID:                   0
  Internal Node ID:        0
  Compute Unit:            12
  SIMDs per CU:            0
  Shader Engines:          0
  Shader Arrs. per Eng.:   0
  WatchPts on Addr. Ranges:1
  Memory Properties:
  Features:                None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: FINE GRAINED
      Size:                    16322532(0xf90fe4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 2
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    16322532(0xf90fe4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
    Pool 3
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    16322532(0xf90fe4) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:4KB
      Alloc Alignment:         4KB
      Accessible by all:       TRUE
  ISA Info:
*******
Agent 2
*******
  Name:                    gfx803
  Uuid:                    GPU-XX
  Marketing Name:          AMD Radeon RX 570 Series
  Vendor Name:             AMD
  Feature:                 KERNEL_DISPATCH
  Profile:                 BASE_PROFILE
  Float Round Mode:        NEAR
  Max Queue Number:        128(0x80)
  Queue Min Size:          64(0x40)
  Queue Max Size:          131072(0x20000)
  Queue Type:              MULTI
  Node:                    1
  Device Type:             GPU
  Cache Info:
    L1:                      16(0x10) KB
  Chip ID:                 26591(0x67df)
  ASIC Revision:           1(0x1)
  Cacheline Size:          64(0x40)
  Max Clock Freq. (MHz):   1244
  BDFID:                   10496
  Internal Node ID:        1
  Compute Unit:            32
  SIMDs per CU:            4
  Shader Engines:          4
  Shader Arrs. per Eng.:   1
  WatchPts on Addr. Ranges:4
  Coherent Host Access:    FALSE
  Memory Properties:
  Features:                KERNEL_DISPATCH
  Fast F16 Operation:      TRUE
  Wavefront Size:          64(0x40)
  Workgroup Max Size:      1024(0x400)
  Workgroup Max Size per Dimension:
    x                        1024(0x400)
    y                        1024(0x400)
    z                        1024(0x400)
  Max Waves Per CU:        40(0x28)
  Max Work-item Per CU:    2560(0xa00)
  Grid Max Size:           4294967295(0xffffffff)
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)
    y                        4294967295(0xffffffff)
    z                        4294967295(0xffffffff)
  Max fbarriers/Workgrp:   32
  Packet Processor uCode:: 730
  SDMA engine uCode::      58
  IOMMU Support::          None
  Pool Info:
    Pool 1
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED
      Size:                    4194304(0x400000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 2
      Segment:                 GLOBAL; FLAGS: EXTENDED FINE GRAINED
      Size:                    4194304(0x400000) KB
      Allocatable:             TRUE
      Alloc Granule:           4KB
      Alloc Recommended Granule:2048KB
      Alloc Alignment:         4KB
      Accessible by all:       FALSE
    Pool 3
      Segment:                 GROUP
      Size:                    64(0x40) KB
      Allocatable:             FALSE
      Alloc Granule:           0KB
      Alloc Recommended Granule:0KB
      Alloc Alignment:         0KB
      Accessible by all:       FALSE
  ISA Info:
    ISA 1
      Name:                    amdgcn-amd-amdhsa--gfx803
      Machine Models:          HSA_MACHINE_MODEL_LARGE
      Profiles:                HSA_PROFILE_BASE
      Default Rounding Mode:   NEAR
      Default Rounding Mode:   NEAR
      Fast f16:                TRUE
      Workgroup Max Size:      1024(0x400)
      Workgroup Max Size per Dimension:
        x                        1024(0x400)
        y                        1024(0x400)
        z                        1024(0x400)
      Grid Max Size:           4294967295(0xffffffff)
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)
        y                        4294967295(0xffffffff)
        z                        4294967295(0xffffffff)
      FBarrier Max Size:       32
*** Done ***

### Additional Information

_No response_

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[Issue]: using hipMemcpy leading to a SIGSEGV crash #4189

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

==========
HSA Agents

Additional Information

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Issue]: using hipMemcpy leading to a SIGSEGV crash #4189

Description

Problem Description

Operating System

CPU

GPU

ROCm Version

ROCm Component

Steps to Reproduce

(Optional for Linux users) Output of /opt/rocm/bin/rocminfo --support

ROCk module is loaded

HSA System Attributes

========== HSA Agents

Additional Information

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions

==========
HSA Agents