@taozha2 taozha2 commented Aug 4, 2025

No description provided.

@taozha2 taozha2 marked this pull request as draft August 4, 2025 08:05
@taozha2 taozha2 marked this pull request as ready for review August 5, 2025 05:57

@jiyang1011 jiyang1011 left a comment

LGTM

#endif

template <class T, class = void>
struct ScaleType {

same. And this should be a type alias.

Why did you mark this as resolved? You didn't address it.

Author

As we discussed, I will mark this PR as draft because Peter's PR may cover my improvement. I will re-check this PR after Peter's PR is merged.
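For readers following the type-alias suggestion in this thread: the body of `ScaleType` is not shown here, so the following is only a sketch of one possible reading, not the PR's actual code. It assumes the struct is a detection trait with a nested `type`, and exposes it through an alias template so call sites do not need `typename ...::type`. The names `ScaleTypeImpl`, `scale_type`, and `PackedInt4` are hypothetical.

```cpp
#include <type_traits>

// Hypothetical primary template: fall back to T itself when T declares
// no nested scale type. (Assumed behavior; the real body is not shown.)
template <class T, class = void>
struct ScaleTypeImpl {
  using type = T;
};

// Hypothetical specialization: selected via SFINAE when T declares a
// nested `scale_type` (an assumed member name).
template <class T>
struct ScaleTypeImpl<T, std::void_t<typename T::scale_type>> {
  using type = typename T::scale_type;
};

// The alias template: users write ScaleType<T> and get the type directly.
template <class T>
using ScaleType = typename ScaleTypeImpl<T>::type;

// Hypothetical quantized element type used only for illustration.
struct PackedInt4 {
  using scale_type = float;
};

static_assert(std::is_same_v<ScaleType<int>, int>,
              "types without scale_type map to themselves");
static_assert(std::is_same_v<ScaleType<PackedInt4>, float>,
              "types with scale_type map to that nested type");
```

The struct stays as an implementation detail for the partial specialization, since alias templates themselves cannot be specialized.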

@rolandschulz rolandschulz requested a review from petercad August 12, 2025 17:40
@rolandschulz

Some of the comments should really go into #472. You should not upload PRs depending on other PRs and ask for review.

@petercad

@taozha2 Do you have some performance data we can look at?

@taozha2
Author

taozha2 commented Aug 13, 2025

> Some of the comments should really go into #472. You should not upload PRs depending on other PRs and ask for review.

OK, I have changed this PR to draft. I will rebase it and request review after #472 is merged.

@taozha2
Author

taozha2 commented Aug 13, 2025

> @taozha2 Do you have some performance data we can look at?

Yes. #472 enables the benchmark, which gives us the performance baseline. The current PR should show the performance improvement when you run the mixed data type benchmark. I can share some performance data after #472 is merged.

@taozha2 taozha2 marked this pull request as draft August 13, 2025 01:01
@taozha2 taozha2 marked this pull request as ready for review August 26, 2025 23:59
@taozha2
Author

taozha2 commented Aug 27, 2025

> @taozha2 Do you have some performance data we can look at?

> Yes. #472 enables the benchmark, which gives us the performance baseline. The current PR should show the performance improvement when you run the mixed data type benchmark. I can share some performance data after #472 is merged.

@rolandschulz @petercad
This is the performance with this PR:

PvcMixedPrecisionGemmFP16U4FP16F16FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time       0.137 ms        0.143 ms         5055 alpha=1 avg_runtime_ms=0.136794 avg_tflops=27.4727 avg_throughput=229.961 best_bandwidth=237.421 best_runtime_ms=0.132496 best_tflop=28.3638 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=691.551 worst_runtime_ms=0.19864 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4BF16BF16BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time      0.135 ms        0.141 ms         5139 alpha=1 avg_runtime_ms=0.134587 avg_tflops=27.9232 avg_throughput=233.732 best_bandwidth=242.173 best_runtime_ms=0.129896 best_tflop=28.9316 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=691.697 worst_runtime_ms=0.19448 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmFP16U4FP16S8FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time        0.766 ms        0.774 ms          913 alpha=1 avg_runtime_ms=0.766356 avg_tflops=4.90385 avg_throughput=41.0479 best_bandwidth=41.4178 best_runtime_ms=0.759512 best_tflop=4.94804 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=699.687 worst_runtime_ms=0.777816 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmFP16U4S8S8FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time          0.767 ms        0.775 ms          914 alpha=1 avg_runtime_ms=0.766603 avg_tflops=4.90227 avg_throughput=41.0347 best_bandwidth=41.4234 best_runtime_ms=0.759408 best_tflop=4.94872 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=700.678 worst_runtime_ms=0.777192 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4BF16S8BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time        0.841 ms        0.849 ms          833 alpha=1 avg_runtime_ms=0.840966 avg_tflops=4.46879 avg_throughput=37.4061 best_bandwidth=37.7432 best_runtime_ms=0.833456 best_tflop=4.50905 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=700.529 worst_runtime_ms=0.853112 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4S8S8BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time          0.841 ms        0.849 ms          832 alpha=1 avg_runtime_ms=0.841015 avg_tflops=4.46852 avg_throughput=37.4039 best_bandwidth=37.7244 best_runtime_ms=0.833872 best_tflop=4.5068 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=699.726 worst_runtime_ms=0.849992 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

And this is before:

PvcMixedPrecisionGemmFP16U4FP16F16FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time       0.496 ms        0.505 ms         2319 alpha=1 avg_runtime_ms=0.495244 avg_tflops=7.58837 avg_throughput=63.5187 best_bandwidth=107.184 best_runtime_ms=0.293488 best_tflop=12.8049 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=1.14936k worst_runtime_ms=1.58142 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4BF16BF16BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time      0.294 ms        0.301 ms         2342 alpha=1 avg_runtime_ms=0.293811 avg_tflops=12.7909 avg_throughput=107.066 best_bandwidth=108.843 best_runtime_ms=0.289016 best_tflop=13.0031 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=688.159 worst_runtime_ms=0.352352 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmFP16U4FP16S8FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time        0.924 ms        0.932 ms          756 alpha=1 avg_runtime_ms=0.924363 avg_tflops=4.06561 avg_throughput=34.0313 best_bandwidth=34.4582 best_runtime_ms=0.912912 best_tflop=4.1166 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=698.821 worst_runtime_ms=0.937768 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmFP16U4S8S8FP16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time          0.924 ms        0.932 ms          757 alpha=1 avg_runtime_ms=0.924363 avg_tflops=4.06561 avg_throughput=34.0313 best_bandwidth=34.2785 best_runtime_ms=0.917696 best_tflop=4.09514 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=699.752 worst_runtime_ms=0.94016 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4BF16S8BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time        0.998 ms         1.01 ms          700 alpha=1 avg_runtime_ms=0.998473 avg_tflops=3.76384 avg_throughput=31.5054 best_bandwidth=31.6959 best_runtime_ms=0.992472 best_tflop=3.7866 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=698.938 worst_runtime_ms=1.01161 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 

PvcMixedPrecisionGemmBF16U4S8S8BF16S4_RCR_1/mixed_dtype_int4/32x14336x4096x1/manual_time          0.991 ms        0.999 ms          706 alpha=1 avg_runtime_ms=0.991135 avg_tflops=3.79171 avg_throughput=31.7386 best_bandwidth=31.9571 best_runtime_ms=0.98436 best_tflop=3.81781 beta=0 k=4.096k l=1 m=32 n=14.336k total_runtime_ms=699.744 worst_runtime_ms=1.00017 layoutA=RowMajor layoutB=ColumnMajor layoutC=RowMajor 
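To make the two sets of numbers easier to compare, here is a small sketch that computes the speedup per benchmark from the `avg_runtime_ms` fields above. It also recomputes TFLOP/s assuming the usual `2*m*n*k` flop count for a GEMM. The helper names (`speedup`, `gemm_tflops`, `print_speedups`) are made up for this illustration and are not part of the benchmark code.

```cpp
#include <cstdio>

// Ratio of baseline runtime to new runtime; values above 1 mean the PR is faster.
constexpr double speedup(double before_ms, double after_ms) {
  return before_ms / after_ms;
}

// GEMM throughput in TFLOP/s, assuming 2*m*n*k floating-point operations.
constexpr double gemm_tflops(double m, double n, double k, double runtime_ms) {
  return 2.0 * m * n * k / (runtime_ms * 1e-3) / 1e12;
}

struct Row {
  const char* name;   // data-type portion of the benchmark name
  double before_ms;   // avg_runtime_ms without this PR
  double after_ms;    // avg_runtime_ms with this PR
};

// avg_runtime_ms values copied from the benchmark output above.
constexpr Row kRows[] = {
    {"FP16U4FP16F16FP16S4", 0.495244, 0.136794},
    {"BF16U4BF16BF16BF16S4", 0.293811, 0.134587},
    {"FP16U4FP16S8FP16S4", 0.924363, 0.766356},
    {"FP16U4S8S8FP16S4", 0.924363, 0.766603},
    {"BF16U4BF16S8BF16S4", 0.998473, 0.840966},
    {"BF16U4S8S8BF16S4", 0.991135, 0.841015},
};

// Prints one line per benchmark: name, speedup, and recomputed TFLOP/s
// for the 32x14336x4096 problem size used in every row.
inline void print_speedups() {
  for (const Row& r : kRows) {
    std::printf("%-22s %.2fx  %.2f TFLOP/s\n", r.name,
                speedup(r.before_ms, r.after_ms),
                gemm_tflops(32, 14336, 4096, r.after_ms));
  }
}
```

Calling `print_speedups()` from `main` gives roughly 3.6x and 2.2x for the F16 and BF16 scale rows, and about 1.2x for the four S8 rows. The FP16 baseline average looks skewed by an outlier (`worst_runtime_ms=1.58142`). Comparing `best_runtime_ms` instead (0.293488 vs 0.132496) gives about 2.2x for that row as well.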

@taozha2 taozha2 requested a review from rolandschulz August 27, 2025 05:13
@rolandschulz rolandschulz left a comment

Make sure all comments are addressed.

@taozha2 taozha2 marked this pull request as draft September 5, 2025 00:57
@taozha2 taozha2 changed the title from "improve mixed data type performance" to "[Draft PR, NOT review] improve mixed data type performance" Sep 5, 2025
@rolandschulz

Sorry, I forgot. Thanks for changing it to draft.
