In the RMSNorm forward pass's _set_cluster_n(), why are the N thresholds for picking a larger cluster_n smaller for 16-bit types than for 32-bit types?