Always end BN_mod_exp_mont_consttime with normal Montgomery reduction.

davidben · briansmith · commit 76e98c4351f0 · 2023-09-15T17:01:39.000-07:00
This partially fixes a bug where, on x86_64, BN_mod_exp_mont_consttime would sometimes return m, the modulus, when it should have returned zero. Thanks to Guido Vranken for reporting it. It is only a partial fix because the same bug also exists in the "rsaz" codepath. That will be fixed in the subsequent CL. (See the commented out test.) The bug only affects zero outputs (with non-zero inputs), so we believe it has no security impact on our cryptographic functions. BoringSSL calls BN_mod_exp_mont_consttime in the following cases: - RSA private key operations - Primality testing, raising the witness to the odd part of p-1 - DSA keygen and key import, pub = g^priv (mod p) - DSA signing, r = g^k (mod p) - DH keygen, pub = g^priv (mod p) - Diffie-Hellman, secret = peer^priv (mod p) It is not possible in the RSA private key operation, provided p and q are primes. If using CRT, we are working modulo a prime, so zero output with non-zero input is impossible. If not using CRT, we work mod n. While there are nilpotent values mod n, none of them hit zero by exponentiating. (Both p and q would need to divide the input, which means n divides the input.) In primality testing, this can only be hit when the input was composite. But as the rest of the loop cannot then hit 1, we'll correctly report it as composite anyway. DSA and DH work modulo a prime, where this case cannot happen. Analysis: This bug is the result of sloppiness with the looser bounds from "almost Montgomery multiplication", described in https://eprint.iacr.org/2011/239. Prior to upstream's ec9cc70f72454b8d4a84247c86159613cee83b81, I believe x86_64-mont5.pl implemented standard Montgomery reduction (the left half of figure 3 in the paper). Though it did not document this, ec9cc70f7245 changed it to implement the "almost" variant (the right half of the figure.) The difference is that, rather than subtracting if T >= m, it subtracts if T >= R. In code, it is the difference between something like our bn_reduce_once, vs. subtracting based only on T's carry bit. (Interestingly, the .Lmul_enter branch of bn_mul_mont_gather5 seems to still implement normal reduction, but the .Lmul4x_enter branch is an almost reduction.) That means none of the intermediate values here are bounded by m. They are only bounded by R. Accordingly, Figure 2 in the paper ends with step 10: REDUCE h modulo m. BN_mod_exp_mont_consttime is missing this step. The bn_from_montgomery call only implements step 9, AMM(h, 1). (x86_64-mont5.pl's bn_from_montgomery only implements an almost reduction.) The impact depends on how unreduced AMM(h, 1) can be. Remark 1 of the paper discusses this, but is ambiguous about the scope of its 2^(n-1) < m < 2^n precondition. The m+1 bound appears to be unconditional: Montgomery reduction ultimately adds some 0 <= Y < m*R to T, to get a multiple of R, and then divides by R. The output, pre-subtraction, is thus less than m + T/R. MM works because T < mR => T' < m + mR/R = 2m. A single subtraction of m if T' >= m gives T'' < m. AMM works because T < R^2 => T' < m + R^2/R = m + R. A single subtraction of m if T' >= R gives T'' < R. See also Lemma 1, Section 3 and Section 4 of the paper, though their formulation is more complicated to capture the word-by-word algorithm. It's ultimately the same adjustment to T. But in AMM(h, 1), T = h*1 = h < R, so AMM(h, 1) < m + R/R = m + 1. That is, AMM(h, 1) <= m. So the only case when AMM(h, 1) isn't fully reduced is if it outputs m. Thus, our limited impact. Indeed, Remark 1 mentions step 10 isn't necessary because m is a prime and the inputs are non-zero. But that doesn't apply here because BN_mod_exp_mont_consttime may be called elsewhere. Fix: To fix this, we could add the missing step 10, but a full division would not be constant-time. The analysis above says it could be a single subtraction, bn_reduce_once, but then we could integrate it into the subtraction already in plain Montgomery reduction, implemented by uppercase BN_from_montgomery. h*1 = h < R <= m*R, so we are within bounds. Thus, we delete lowercase bn_from_montgomery altogether, and have the mont5 path use the same BN_from_montgomery ending as the non-mont5 path. This only impacts the final step of the whole exponentiation and has no measurable perf impact. In doing so, add comments describing these looser bounds. This includes one subtlety that BN_mod_exp_mont_consttime actually mixes bn_mul_mont (MM) with bn_mul_mont_gather5/bn_power5 (AMM). But this is fine because MM is AMM-compatible; when passed AMM's looser inputs, it will still produce a correct looser output. Ideally we'd drop the "almost" reduction and stick to the more straightforward bounds. As this only impacts the final subtraction in each reduction, I would be surprised if it actually had a real performance impact. But this would involve deeper change to x86_64-mont5.pl, so I haven't tried this yet. I believe this is basically the same bug as golang/go#13907 from Go. Change-Id: I06f879777bb2ef181e9da7632ec858582e2afa38 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52825 Commit-Queue: David Benjamin <davidben@google.com> Reviewed-by: Adam Langley <agl@google.com>
diff --git a/build.rs b/build.rs
@@ -952,7 +952,6 @@ fn prefix_all_symbols(pp: char, prefix_prefix: &str, prefix: &str) -> String {
         "aes_nohw_set_encrypt_key",
         "aesni_gcm_decrypt",
         "aesni_gcm_encrypt",
-        "bn_from_montgomery",
         "bn_from_montgomery_in_place",
         "bn_gather5",
         "bn_mul_mont",
diff --git a/crypto/fipsmodule/bn/asm/x86_64-mont5.pl b/crypto/fipsmodule/bn/asm/x86_64-mont5.pl
@@ -2088,194 +2088,6 @@
 .size	__bn_post4x_internal,.-__bn_post4x_internal
 ___
 }
-{
-$code.=<<___;
-.globl	bn_from_montgomery
-.type	bn_from_montgomery,\@abi-omnipotent
-.align	32
-bn_from_montgomery:
-.cfi_startproc
-	testl	\$7,`($win64?"48(%rsp)":"%r9d")`
-	jz	bn_from_mont8x
-	xor	%eax,%eax
-	ret
-.cfi_endproc
-.size	bn_from_montgomery,.-bn_from_montgomery
-
-.type	bn_from_mont8x,\@function,6
-.align	32
-bn_from_mont8x:
-.cfi_startproc
-	.byte	0x67
-	mov	%rsp,%rax
-.cfi_def_cfa_register	%rax
-	push	%rbx
-.cfi_push	%rbx
-	push	%rbp
-.cfi_push	%rbp
-	push	%r12
-.cfi_push	%r12
-	push	%r13
-.cfi_push	%r13
-	push	%r14
-.cfi_push	%r14
-	push	%r15
-.cfi_push	%r15
-.Lfrom_prologue:
-
-	shl	\$3,${num}d		# convert $num to bytes
-	lea	($num,$num,2),%r10	# 3*$num in bytes
-	neg	$num
-	mov	($n0),$n0		# *n0
-
-	##############################################################
-	# Ensure that stack frame doesn't alias with $rptr+3*$num
-	# modulo 4096, which covers ret[num], am[num] and n[num]
-	# (see bn_exp.c). The stack is allocated to aligned with
-	# bn_power5's frame, and as bn_from_montgomery happens to be
-	# last operation, we use the opportunity to cleanse it.
-	#
-	lea	-320(%rsp,$num,2),%r11
-	mov	%rsp,%rbp
-	sub	$rptr,%r11
-	and	\$4095,%r11
-	cmp	%r11,%r10
-	jb	.Lfrom_sp_alt
-	sub	%r11,%rbp		# align with $aptr
-	lea	-320(%rbp,$num,2),%rbp	# future alloca(frame+2*$num*8+256)
-	jmp	.Lfrom_sp_done
-
-.align	32
-.Lfrom_sp_alt:
-	lea	4096-320(,$num,2),%r10
-	lea	-320(%rbp,$num,2),%rbp	# future alloca(frame+2*$num*8+256)
-	sub	%r10,%r11
-	mov	\$0,%r10
-	cmovc	%r10,%r11
-	sub	%r11,%rbp
-.Lfrom_sp_done:
-	and	\$-64,%rbp
-	mov	%rsp,%r11
-	sub	%rbp,%r11
-	and	\$-4096,%r11
-	lea	(%rbp,%r11),%rsp
-	mov	(%rsp),%r10
-	cmp	%rbp,%rsp
-	ja	.Lfrom_page_walk
-	jmp	.Lfrom_page_walk_done
-
-.Lfrom_page_walk:
-	lea	-4096(%rsp),%rsp
-	mov	(%rsp),%r10
-	cmp	%rbp,%rsp
-	ja	.Lfrom_page_walk
-.Lfrom_page_walk_done:
-
-	mov	$num,%r10
-	neg	$num
-
-	##############################################################
-	# Stack layout
-	#
-	# +0	saved $num, used in reduction section
-	# +8	&t[2*$num], used in reduction section
-	# +32	saved *n0
-	# +40	saved %rsp
-	# +48	t[2*$num]
-	#
-	mov	$n0,  32(%rsp)
-	mov	%rax, 40(%rsp)		# save original %rsp
-.cfi_cfa_expression	%rsp+40,deref,+8
-.Lfrom_body:
-	mov	$num,%r11
-	lea	48(%rsp),%rax
-	pxor	%xmm0,%xmm0
-	jmp	.Lmul_by_1
-
-.align	32
-.Lmul_by_1:
-	movdqu	($aptr),%xmm1
-	movdqu	16($aptr),%xmm2
-	movdqu	32($aptr),%xmm3
-	movdqa	%xmm0,(%rax,$num)
-	movdqu	48($aptr),%xmm4
-	movdqa	%xmm0,16(%rax,$num)
-	.byte	0x48,0x8d,0xb6,0x40,0x00,0x00,0x00	# lea	64($aptr),$aptr
-	movdqa	%xmm1,(%rax)
-	movdqa	%xmm0,32(%rax,$num)
-	movdqa	%xmm2,16(%rax)
-	movdqa	%xmm0,48(%rax,$num)
-	movdqa	%xmm3,32(%rax)
-	movdqa	%xmm4,48(%rax)
-	lea	64(%rax),%rax
-	sub	\$64,%r11
-	jnz	.Lmul_by_1
-
-	movq	$rptr,%xmm1
-	movq	$nptr,%xmm2
-	.byte	0x67
-	mov	$nptr,%rbp
-	movq	%r10, %xmm3		# -num
-___
-$code.=<<___ if ($addx);
-	leaq	OPENSSL_ia32cap_P(%rip),%r11
-	mov	8(%r11),%r11d
-	and	\$0x80108,%r11d
-	cmp	\$0x80108,%r11d		# check for AD*X+BMI2+BMI1
-	jne	.Lfrom_mont_nox
-
-	lea	(%rax,$num),$rptr
-	call	__bn_sqrx8x_reduction
-	call	__bn_postx4x_internal
-
-	pxor	%xmm0,%xmm0
-	lea	48(%rsp),%rax
-	jmp	.Lfrom_mont_zero
-
-.align	32
-.Lfrom_mont_nox:
-___
-$code.=<<___;
-	call	__bn_sqr8x_reduction
-	call	__bn_post4x_internal
-
-	pxor	%xmm0,%xmm0
-	lea	48(%rsp),%rax
-	jmp	.Lfrom_mont_zero
-
-.align	32
-.Lfrom_mont_zero:
-	mov	40(%rsp),%rsi		# restore %rsp
-.cfi_def_cfa	%rsi,8
-	movdqa	%xmm0,16*0(%rax)
-	movdqa	%xmm0,16*1(%rax)
-	movdqa	%xmm0,16*2(%rax)
-	movdqa	%xmm0,16*3(%rax)
-	lea	16*4(%rax),%rax
-	sub	\$32,$num
-	jnz	.Lfrom_mont_zero
-
-	mov	\$1,%rax
-	mov	-48(%rsi),%r15
-.cfi_restore	%r15
-	mov	-40(%rsi),%r14
-.cfi_restore	%r14
-	mov	-32(%rsi),%r13
-.cfi_restore	%r13
-	mov	-24(%rsi),%r12
-.cfi_restore	%r12
-	mov	-16(%rsi),%rbp
-.cfi_restore	%rbp
-	mov	-8(%rsi),%rbx
-.cfi_restore	%rbx
-	lea	(%rsi),%rsp
-.cfi_def_cfa_register	%rsp
-.Lfrom_epilogue:
-	ret
-.cfi_endproc
-.size	bn_from_mont8x,.-bn_from_mont8x
-___
-}
 }}}
 
 if ($addx) {{{
@@ -3864,10 +3676,6 @@
 	.rva	.LSEH_begin_bn_power5
 	.rva	.LSEH_end_bn_power5
 	.rva	.LSEH_info_bn_power5
-
-	.rva	.LSEH_begin_bn_from_mont8x
-	.rva	.LSEH_end_bn_from_mont8x
-	.rva	.LSEH_info_bn_from_mont8x
 ___
 $code.=<<___ if ($addx);
 	.rva	.LSEH_begin_bn_mulx4x_mont_gather5
@@ -3899,11 +3707,6 @@
 	.byte	9,0,0,0
 	.rva	mul_handler
 	.rva	.Lpower5_prologue,.Lpower5_body,.Lpower5_epilogue	# HandlerData[]
-.align	8
-.LSEH_info_bn_from_mont8x:
-	.byte	9,0,0,0
-	.rva	mul_handler
-	.rva	.Lfrom_prologue,.Lfrom_body,.Lfrom_epilogue		# HandlerData[]
 ___
 $code.=<<___ if ($addx);
 .align	8
diff --git a/src/arithmetic/bigint.rs b/src/arithmetic/bigint.rs
@@ -146,29 +146,28 @@ impl<M, E> Elem<M, E> {
     }
 }
 
-impl<M, E: ReductionEncoding> Elem<M, E> {
-    fn decode_once(self, m: &Modulus<M>) -> Elem<M, <E as ReductionEncoding>::Output> {
-        // A multiplication isn't required since we're multiplying by the
-        // unencoded value one (1); only a Montgomery reduction is needed.
-        // However the only non-multiplication Montgomery reduction function we
-        // have requires the input to be large, so we avoid using it here.
-        let mut limbs = self.limbs;
-        let num_limbs = m.width().num_limbs;
-        let mut one = [0; MODULUS_MAX_LIMBS];
-        one[0] = 1;
-        let one = &one[..num_limbs]; // assert!(num_limbs <= MODULUS_MAX_LIMBS);
-        limbs_mont_mul(&mut limbs, one, m.limbs(), m.n0(), m.cpu_features());
-        Elem {
-            limbs,
-            encoding: PhantomData,
-        }
+/// Does a Montgomery reduction on `limbs` assuming they are Montgomery-encoded ('R') and assuming
+/// they are the same size as `m`, but perhaps not reduced mod `m`. The result will be
+/// fully reduced mod `m`.
+fn from_montgomery_amm<M>(limbs: BoxedLimbs<M>, m: &Modulus<M>) -> Elem<M, Unencoded> {
+    debug_assert_eq!(limbs.len(), m.limbs().len());
+
+    let mut limbs = limbs;
+    let num_limbs = m.width().num_limbs;
+    let mut one = [0; MODULUS_MAX_LIMBS];
+    one[0] = 1;
+    let one = &one[..num_limbs]; // assert!(num_limbs <= MODULUS_MAX_LIMBS);
+    limbs_mont_mul(&mut limbs, one, m.limbs(), m.n0(), m.cpu_features());
+    Elem {
+        limbs,
+        encoding: PhantomData,
     }
 }
 
 impl<M> Elem<M, R> {
     #[inline]
     pub fn into_unencoded(self, m: &Modulus<M>) -> Elem<M, Unencoded> {
-        self.decode_once(m)
+        from_montgomery_amm(self.limbs, m)
     }
 }
 
@@ -623,7 +622,13 @@ pub fn elem_exp_consttime<M>(
         limbs_mont_square(acc, m, n0, cpu_features);
     }
 
-    fn gather_mul_base(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
+    fn gather_mul_base_amm(
+        table: &[Limb],
+        state: &mut [Limb],
+        n0: &N0,
+        i: Window,
+        num_limbs: usize,
+    ) {
         prefixed_extern! {
             fn bn_mul_mont_gather5(
                 rp: *mut Limb,
@@ -648,7 +653,7 @@ pub fn elem_exp_consttime<M>(
         }
     }
 
-    fn power(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
+    fn power_amm(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
         prefixed_extern! {
             fn bn_power5(
                 r: *mut Limb,
@@ -690,7 +695,7 @@ pub fn elem_exp_consttime<M>(
             // TODO: Optimize this to avoid gathering
             gather_square(table, state, m.n0(), i / 2, num_limbs, cpu_features);
         } else {
-            gather_mul_base(table, state, m.n0(), i - 1, num_limbs)
+            gather_mul_base_amm(table, state, m.n0(), i - 1, num_limbs)
         };
         scatter(table, state, i, num_limbs);
     }
@@ -702,37 +707,15 @@ pub fn elem_exp_consttime<M>(
             state
         },
         |state, window| {
-            power(table, state, m.n0(), window, num_limbs);
+            power_amm(table, state, m.n0(), window, num_limbs);
             state
         },
     );
 
-    prefixed_extern! {
-        fn bn_from_montgomery(
-            r: *mut Limb,
-            a: *const Limb,
-            not_used: *const Limb,
-            n: *const Limb,
-            n0: &N0,
-            num: c::size_t,
-        ) -> bssl::Result;
-    }
-    Result::from(unsafe {
-        bn_from_montgomery(
-            entry_mut(state, ACC, num_limbs).as_mut_ptr(),
-            entry(state, ACC, num_limbs).as_ptr(),
-            core::ptr::null(),
-            entry(state, M, num_limbs).as_ptr(),
-            m.n0(),
-            num_limbs,
-        )
-    })?;
-    let mut r = Elem {
-        limbs: base.limbs,
-        encoding: PhantomData,
-    };
-    r.limbs.copy_from_slice(entry(state, ACC, num_limbs));
-    Ok(r)
+    let mut r_amm = base.limbs;
+    r_amm.copy_from_slice(entry(state, ACC, num_limbs));
+
+    Ok(from_montgomery_amm(r_amm, m))
 }
 
 /// Verified a == b**-1 (mod m), i.e. a**-1 == b (mod m).
diff --git a/src/arithmetic/bigint_elem_exp_consttime_tests.txt b/src/arithmetic/bigint_elem_exp_consttime_tests.txt