
Conversation

@chriselrod
Contributor

`VectorizationBase.cache_size` returns the size per core, so for shared caches, the number is much smaller.

My update therefore assumes that it is shared.

What is the desired behavior?
With this commit, it'll give a size much larger than the size of any actual cache on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.
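
To spell out the arithmetic, a minimal sketch with hypothetical Ryzen-like numbers (none of these figures are taken from the PR):

```julia
# Hypothetical Ryzen-like topology, for illustration only.
cores_per_ccx = 8
n_ccx         = 2
l3_per_ccx    = 32 * 2^20                   # one physical 32 MiB L3 per CCX

percore_llc = l3_per_ccx ÷ cores_per_ccx    # 4 MiB: the "size per core" figure
ncores      = cores_per_ccx * n_ccx         # 16 cores in total

# Assuming the last-level cache is shared by all cores, as the update does:
assumed_llc = percore_llc * ncores          # 64 MiB -- larger than any single
                                            # physical L3 (32 MiB), as noted above
```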

@carstenbauer
Member

carstenbauer commented May 22, 2024

> `VectorizationBase.cache_size` returns the size per core, so for shared caches, the number is much smaller.

Thanks for pointing this out. We should change the default then.

> My update therefore assumes that it is shared.

IIUC, it assumes that the (last-level) cache is shared among all cores. While this is typically true for consumer machines, on clusters it surely isn't. There, multiplying by the number of cores per CPU/socket is probably more appropriate. On the other hand, given that there are almost always only two sockets, being a factor of two off is perhaps fine.

Of course, we could probably use Hwloc to get more detailed information about the system.
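
A minimal sketch of that route, assuming Hwloc.jl's convenience helpers `num_physical_cores()` and `cachesize()` (the exact helper names, and whether the reported L3 figure is per cache instance or per core, are assumptions here, not something this thread confirms):

```julia
using Hwloc

# Assumed convenience helpers; see the caveat above.
ncores = Hwloc.num_physical_cores()   # physical core count
caches = Hwloc.cachesize()            # e.g. (L1 = ..., L2 = ..., L3 = ...) in bytes

@info "detected topology" ncores caches
# With this kind of information the benchmark could size its arrays against the
# machine's actual last-level cache(s) instead of relying on a per-core heuristic.
```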

> What is the desired behavior?
> With this commit, it'll give a size much larger than the size of any actual cache on many Ryzen systems, but I think that may be desirable; we don't want the arrays, divided among cores, to fit in the L3s local to the CCXs.

Yes, we don't. But at the other end of the spectrum, we also want to make sure that the arrays fit into memory. In this regard, note that there is also a (somewhat artificial) factor of 4 in https://github.com/JuliaPerf/STREAMBenchmark.jl/blob/master/src/benchmarks.jl#L4
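
For reference, the kind of default with a safety factor being discussed looks roughly like this (a sketch, not the actual line in benchmarks.jl; the helper name and the `llc_bytes` input are illustrative):

```julia
# Size the benchmark vector as a multiple of an estimated last-level cache size,
# so the working set cannot be cache-resident. The factor of 4 also multiplies
# the memory footprint, which is the "fit into memory" concern above.
default_vector_length(llc_bytes; factor = 4) = (factor * llc_bytes) ÷ sizeof(Float64)

# e.g. with a 32 MiB LLC estimate:
# default_vector_length(32 * 2^20) == 16_777_216 elements ≈ 128 MiB of Float64
```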

@carstenbauer added the enhancement (New feature or request) label on May 22, 2024
@chriselrod
Contributor Author

chriselrod commented May 22, 2024

I think it's unlikely that fitting into memory will be a problem.

FWIW, I currently get an extreme bandwidth value when multithreaded, because the array fits in the local caches.

I think you should use the number of cores times the last-level cache size per core, not the total last-level cache size.

For example, an AMD system with 8 L3s will be able to comfortably fit the vector into its L3 caches when running the benchmark multithreaded: each chiplet needs only 1/8th of the array, so that's all each L3 needs to store.

The problem would be less extreme than what I see on my desktop on the master branch (where the array currently fits in the CPU's L2 caches), but it would still give invalid results.
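
To make the chiplet argument concrete, a small worked example with hypothetical numbers (the 8-L3 layout is just the example from above):

```julia
# Hypothetical 8-CCX AMD system, numbers for illustration only.
n_l3    = 8                        # separate L3 caches, one per chiplet (CCX)
l3_each = 32 * 2^20                # 32 MiB per L3

# If the array is sized against a single "last cache" (one 32 MiB L3), then when
# the benchmark runs multithreaded each chiplet only touches 1/8 of it:
arraybytes   = l3_each             # 32 MiB
per_ccx_part = arraybytes ÷ n_l3   # 4 MiB -- fits comfortably in each 32 MiB L3,
                                   # so cache bandwidth is measured, not memory bandwidth.

# Sizing by cores * last-level cache per core gives the combined figure instead:
combined_llc = n_l3 * l3_each      # 256 MiB -- each chiplet's 1/8 share is then a
                                   # full L3, and the factor-of-4 margin pushes it well past.
```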
