- Change `IQ1_BN` and `IQ2_BN` to have per-row scales. That way we can handle Bitnet models with and without separate tensor scales (a minimal sketch of the per-row-scale idea follows below).
- Remove `IQ1_TN` and `IQ2_TN`. With the above change they are now redundant, and `IQ1_BN` and `IQ2_BN` are also faster, so there is no reason to keep them around.
- Change `build_bitnet()` to use the standard `llm_build_kv()` function for the self-attention portion. I was hoping this would also allow using FA, but nope, the Bitnet models have a strange head size of 100 that is not supported by the FA implementations (see the sketch below).

Everything works except - can you guess? - Metal. There is something wrong with the dot product kernels and I simply don't see what. I have to fix Metal before merging.
On CUDA (RTX-4080) we now get 368 t/s for TG-128 with the 3.3B Bitnet model (`IQ2_BN`). When I first added Bitnet support we were at ~320 t/s, so quite an improvement since then.

**Update**
I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. In the end it turned out that it is PR #98 that breaks the Metal back-end. So, this PR reverts #98.
@agray3 Do you have the ability to investigate why #98 breaks the Metal back-end?