Skip to content

Commit 76e98c4

Browse files
davidbenbriansmith
authored andcommitted
Always end BN_mod_exp_mont_consttime with normal Montgomery reduction.
This partially fixes a bug where, on x86_64, BN_mod_exp_mont_consttime would sometimes return m, the modulus, when it should have returned zero. Thanks to Guido Vranken for reporting it. It is only a partial fix because the same bug also exists in the "rsaz" codepath. That will be fixed in the subsequent CL. (See the commented out test.) The bug only affects zero outputs (with non-zero inputs), so we believe it has no security impact on our cryptographic functions. BoringSSL calls BN_mod_exp_mont_consttime in the following cases: - RSA private key operations - Primality testing, raising the witness to the odd part of p-1 - DSA keygen and key import, pub = g^priv (mod p) - DSA signing, r = g^k (mod p) - DH keygen, pub = g^priv (mod p) - Diffie-Hellman, secret = peer^priv (mod p) It is not possible in the RSA private key operation, provided p and q are primes. If using CRT, we are working modulo a prime, so zero output with non-zero input is impossible. If not using CRT, we work mod n. While there are nilpotent values mod n, none of them hit zero by exponentiating. (Both p and q would need to divide the input, which means n divides the input.) In primality testing, this can only be hit when the input was composite. But as the rest of the loop cannot then hit 1, we'll correctly report it as composite anyway. DSA and DH work modulo a prime, where this case cannot happen. Analysis: This bug is the result of sloppiness with the looser bounds from "almost Montgomery multiplication", described in https://eprint.iacr.org/2011/239. Prior to upstream's ec9cc70f72454b8d4a84247c86159613cee83b81, I believe x86_64-mont5.pl implemented standard Montgomery reduction (the left half of figure 3 in the paper). Though it did not document this, ec9cc70f7245 changed it to implement the "almost" variant (the right half of the figure.) The difference is that, rather than subtracting if T >= m, it subtracts if T >= R. In code, it is the difference between something like our bn_reduce_once, vs. subtracting based only on T's carry bit. (Interestingly, the .Lmul_enter branch of bn_mul_mont_gather5 seems to still implement normal reduction, but the .Lmul4x_enter branch is an almost reduction.) That means none of the intermediate values here are bounded by m. They are only bounded by R. Accordingly, Figure 2 in the paper ends with step 10: REDUCE h modulo m. BN_mod_exp_mont_consttime is missing this step. The bn_from_montgomery call only implements step 9, AMM(h, 1). (x86_64-mont5.pl's bn_from_montgomery only implements an almost reduction.) The impact depends on how unreduced AMM(h, 1) can be. Remark 1 of the paper discusses this, but is ambiguous about the scope of its 2^(n-1) < m < 2^n precondition. The m+1 bound appears to be unconditional: Montgomery reduction ultimately adds some 0 <= Y < m*R to T, to get a multiple of R, and then divides by R. The output, pre-subtraction, is thus less than m + T/R. MM works because T < mR => T' < m + mR/R = 2m. A single subtraction of m if T' >= m gives T'' < m. AMM works because T < R^2 => T' < m + R^2/R = m + R. A single subtraction of m if T' >= R gives T'' < R. See also Lemma 1, Section 3 and Section 4 of the paper, though their formulation is more complicated to capture the word-by-word algorithm. It's ultimately the same adjustment to T. But in AMM(h, 1), T = h*1 = h < R, so AMM(h, 1) < m + R/R = m + 1. That is, AMM(h, 1) <= m. So the only case when AMM(h, 1) isn't fully reduced is if it outputs m. Thus, our limited impact. Indeed, Remark 1 mentions step 10 isn't necessary because m is a prime and the inputs are non-zero. But that doesn't apply here because BN_mod_exp_mont_consttime may be called elsewhere. Fix: To fix this, we could add the missing step 10, but a full division would not be constant-time. The analysis above says it could be a single subtraction, bn_reduce_once, but then we could integrate it into the subtraction already in plain Montgomery reduction, implemented by uppercase BN_from_montgomery. h*1 = h < R <= m*R, so we are within bounds. Thus, we delete lowercase bn_from_montgomery altogether, and have the mont5 path use the same BN_from_montgomery ending as the non-mont5 path. This only impacts the final step of the whole exponentiation and has no measurable perf impact. In doing so, add comments describing these looser bounds. This includes one subtlety that BN_mod_exp_mont_consttime actually mixes bn_mul_mont (MM) with bn_mul_mont_gather5/bn_power5 (AMM). But this is fine because MM is AMM-compatible; when passed AMM's looser inputs, it will still produce a correct looser output. Ideally we'd drop the "almost" reduction and stick to the more straightforward bounds. As this only impacts the final subtraction in each reduction, I would be surprised if it actually had a real performance impact. But this would involve deeper change to x86_64-mont5.pl, so I haven't tried this yet. I believe this is basically the same bug as golang/go#13907 from Go. Change-Id: I06f879777bb2ef181e9da7632ec858582e2afa38 Reviewed-on: https://boringssl-review.googlesource.com/c/boringssl/+/52825 Commit-Queue: David Benjamin <[email protected]> Reviewed-by: Adam Langley <[email protected]>
1 parent fd3f3d5 commit 76e98c4

File tree

4 files changed

+104
-245
lines changed

4 files changed

+104
-245
lines changed

build.rs

Lines changed: 0 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -952,7 +952,6 @@ fn prefix_all_symbols(pp: char, prefix_prefix: &str, prefix: &str) -> String {
952952
"aes_nohw_set_encrypt_key",
953953
"aesni_gcm_decrypt",
954954
"aesni_gcm_encrypt",
955-
"bn_from_montgomery",
956955
"bn_from_montgomery_in_place",
957956
"bn_gather5",
958957
"bn_mul_mont",

crypto/fipsmodule/bn/asm/x86_64-mont5.pl

Lines changed: 0 additions & 197 deletions
Original file line numberDiff line numberDiff line change
@@ -2088,194 +2088,6 @@
20882088
.size __bn_post4x_internal,.-__bn_post4x_internal
20892089
___
20902090
}
2091-
{
2092-
$code.=<<___;
2093-
.globl bn_from_montgomery
2094-
.type bn_from_montgomery,\@abi-omnipotent
2095-
.align 32
2096-
bn_from_montgomery:
2097-
.cfi_startproc
2098-
testl \$7,`($win64?"48(%rsp)":"%r9d")`
2099-
jz bn_from_mont8x
2100-
xor %eax,%eax
2101-
ret
2102-
.cfi_endproc
2103-
.size bn_from_montgomery,.-bn_from_montgomery
2104-
2105-
.type bn_from_mont8x,\@function,6
2106-
.align 32
2107-
bn_from_mont8x:
2108-
.cfi_startproc
2109-
.byte 0x67
2110-
mov %rsp,%rax
2111-
.cfi_def_cfa_register %rax
2112-
push %rbx
2113-
.cfi_push %rbx
2114-
push %rbp
2115-
.cfi_push %rbp
2116-
push %r12
2117-
.cfi_push %r12
2118-
push %r13
2119-
.cfi_push %r13
2120-
push %r14
2121-
.cfi_push %r14
2122-
push %r15
2123-
.cfi_push %r15
2124-
.Lfrom_prologue:
2125-
2126-
shl \$3,${num}d # convert $num to bytes
2127-
lea ($num,$num,2),%r10 # 3*$num in bytes
2128-
neg $num
2129-
mov ($n0),$n0 # *n0
2130-
2131-
##############################################################
2132-
# Ensure that stack frame doesn't alias with $rptr+3*$num
2133-
# modulo 4096, which covers ret[num], am[num] and n[num]
2134-
# (see bn_exp.c). The stack is allocated to aligned with
2135-
# bn_power5's frame, and as bn_from_montgomery happens to be
2136-
# last operation, we use the opportunity to cleanse it.
2137-
#
2138-
lea -320(%rsp,$num,2),%r11
2139-
mov %rsp,%rbp
2140-
sub $rptr,%r11
2141-
and \$4095,%r11
2142-
cmp %r11,%r10
2143-
jb .Lfrom_sp_alt
2144-
sub %r11,%rbp # align with $aptr
2145-
lea -320(%rbp,$num,2),%rbp # future alloca(frame+2*$num*8+256)
2146-
jmp .Lfrom_sp_done
2147-
2148-
.align 32
2149-
.Lfrom_sp_alt:
2150-
lea 4096-320(,$num,2),%r10
2151-
lea -320(%rbp,$num,2),%rbp # future alloca(frame+2*$num*8+256)
2152-
sub %r10,%r11
2153-
mov \$0,%r10
2154-
cmovc %r10,%r11
2155-
sub %r11,%rbp
2156-
.Lfrom_sp_done:
2157-
and \$-64,%rbp
2158-
mov %rsp,%r11
2159-
sub %rbp,%r11
2160-
and \$-4096,%r11
2161-
lea (%rbp,%r11),%rsp
2162-
mov (%rsp),%r10
2163-
cmp %rbp,%rsp
2164-
ja .Lfrom_page_walk
2165-
jmp .Lfrom_page_walk_done
2166-
2167-
.Lfrom_page_walk:
2168-
lea -4096(%rsp),%rsp
2169-
mov (%rsp),%r10
2170-
cmp %rbp,%rsp
2171-
ja .Lfrom_page_walk
2172-
.Lfrom_page_walk_done:
2173-
2174-
mov $num,%r10
2175-
neg $num
2176-
2177-
##############################################################
2178-
# Stack layout
2179-
#
2180-
# +0 saved $num, used in reduction section
2181-
# +8 &t[2*$num], used in reduction section
2182-
# +32 saved *n0
2183-
# +40 saved %rsp
2184-
# +48 t[2*$num]
2185-
#
2186-
mov $n0, 32(%rsp)
2187-
mov %rax, 40(%rsp) # save original %rsp
2188-
.cfi_cfa_expression %rsp+40,deref,+8
2189-
.Lfrom_body:
2190-
mov $num,%r11
2191-
lea 48(%rsp),%rax
2192-
pxor %xmm0,%xmm0
2193-
jmp .Lmul_by_1
2194-
2195-
.align 32
2196-
.Lmul_by_1:
2197-
movdqu ($aptr),%xmm1
2198-
movdqu 16($aptr),%xmm2
2199-
movdqu 32($aptr),%xmm3
2200-
movdqa %xmm0,(%rax,$num)
2201-
movdqu 48($aptr),%xmm4
2202-
movdqa %xmm0,16(%rax,$num)
2203-
.byte 0x48,0x8d,0xb6,0x40,0x00,0x00,0x00 # lea 64($aptr),$aptr
2204-
movdqa %xmm1,(%rax)
2205-
movdqa %xmm0,32(%rax,$num)
2206-
movdqa %xmm2,16(%rax)
2207-
movdqa %xmm0,48(%rax,$num)
2208-
movdqa %xmm3,32(%rax)
2209-
movdqa %xmm4,48(%rax)
2210-
lea 64(%rax),%rax
2211-
sub \$64,%r11
2212-
jnz .Lmul_by_1
2213-
2214-
movq $rptr,%xmm1
2215-
movq $nptr,%xmm2
2216-
.byte 0x67
2217-
mov $nptr,%rbp
2218-
movq %r10, %xmm3 # -num
2219-
___
2220-
$code.=<<___ if ($addx);
2221-
leaq OPENSSL_ia32cap_P(%rip),%r11
2222-
mov 8(%r11),%r11d
2223-
and \$0x80108,%r11d
2224-
cmp \$0x80108,%r11d # check for AD*X+BMI2+BMI1
2225-
jne .Lfrom_mont_nox
2226-
2227-
lea (%rax,$num),$rptr
2228-
call __bn_sqrx8x_reduction
2229-
call __bn_postx4x_internal
2230-
2231-
pxor %xmm0,%xmm0
2232-
lea 48(%rsp),%rax
2233-
jmp .Lfrom_mont_zero
2234-
2235-
.align 32
2236-
.Lfrom_mont_nox:
2237-
___
2238-
$code.=<<___;
2239-
call __bn_sqr8x_reduction
2240-
call __bn_post4x_internal
2241-
2242-
pxor %xmm0,%xmm0
2243-
lea 48(%rsp),%rax
2244-
jmp .Lfrom_mont_zero
2245-
2246-
.align 32
2247-
.Lfrom_mont_zero:
2248-
mov 40(%rsp),%rsi # restore %rsp
2249-
.cfi_def_cfa %rsi,8
2250-
movdqa %xmm0,16*0(%rax)
2251-
movdqa %xmm0,16*1(%rax)
2252-
movdqa %xmm0,16*2(%rax)
2253-
movdqa %xmm0,16*3(%rax)
2254-
lea 16*4(%rax),%rax
2255-
sub \$32,$num
2256-
jnz .Lfrom_mont_zero
2257-
2258-
mov \$1,%rax
2259-
mov -48(%rsi),%r15
2260-
.cfi_restore %r15
2261-
mov -40(%rsi),%r14
2262-
.cfi_restore %r14
2263-
mov -32(%rsi),%r13
2264-
.cfi_restore %r13
2265-
mov -24(%rsi),%r12
2266-
.cfi_restore %r12
2267-
mov -16(%rsi),%rbp
2268-
.cfi_restore %rbp
2269-
mov -8(%rsi),%rbx
2270-
.cfi_restore %rbx
2271-
lea (%rsi),%rsp
2272-
.cfi_def_cfa_register %rsp
2273-
.Lfrom_epilogue:
2274-
ret
2275-
.cfi_endproc
2276-
.size bn_from_mont8x,.-bn_from_mont8x
2277-
___
2278-
}
22792091
}}}
22802092

22812093
if ($addx) {{{
@@ -3864,10 +3676,6 @@
38643676
.rva .LSEH_begin_bn_power5
38653677
.rva .LSEH_end_bn_power5
38663678
.rva .LSEH_info_bn_power5
3867-
3868-
.rva .LSEH_begin_bn_from_mont8x
3869-
.rva .LSEH_end_bn_from_mont8x
3870-
.rva .LSEH_info_bn_from_mont8x
38713679
___
38723680
$code.=<<___ if ($addx);
38733681
.rva .LSEH_begin_bn_mulx4x_mont_gather5
@@ -3899,11 +3707,6 @@
38993707
.byte 9,0,0,0
39003708
.rva mul_handler
39013709
.rva .Lpower5_prologue,.Lpower5_body,.Lpower5_epilogue # HandlerData[]
3902-
.align 8
3903-
.LSEH_info_bn_from_mont8x:
3904-
.byte 9,0,0,0
3905-
.rva mul_handler
3906-
.rva .Lfrom_prologue,.Lfrom_body,.Lfrom_epilogue # HandlerData[]
39073710
___
39083711
$code.=<<___ if ($addx);
39093712
.align 8

src/arithmetic/bigint.rs

Lines changed: 30 additions & 47 deletions
Original file line numberDiff line numberDiff line change
@@ -146,29 +146,28 @@ impl<M, E> Elem<M, E> {
146146
}
147147
}
148148

149-
impl<M, E: ReductionEncoding> Elem<M, E> {
150-
fn decode_once(self, m: &Modulus<M>) -> Elem<M, <E as ReductionEncoding>::Output> {
151-
// A multiplication isn't required since we're multiplying by the
152-
// unencoded value one (1); only a Montgomery reduction is needed.
153-
// However the only non-multiplication Montgomery reduction function we
154-
// have requires the input to be large, so we avoid using it here.
155-
let mut limbs = self.limbs;
156-
let num_limbs = m.width().num_limbs;
157-
let mut one = [0; MODULUS_MAX_LIMBS];
158-
one[0] = 1;
159-
let one = &one[..num_limbs]; // assert!(num_limbs <= MODULUS_MAX_LIMBS);
160-
limbs_mont_mul(&mut limbs, one, m.limbs(), m.n0(), m.cpu_features());
161-
Elem {
162-
limbs,
163-
encoding: PhantomData,
164-
}
149+
/// Does a Montgomery reduction on `limbs` assuming they are Montgomery-encoded ('R') and assuming
150+
/// they are the same size as `m`, but perhaps not reduced mod `m`. The result will be
151+
/// fully reduced mod `m`.
152+
fn from_montgomery_amm<M>(limbs: BoxedLimbs<M>, m: &Modulus<M>) -> Elem<M, Unencoded> {
153+
debug_assert_eq!(limbs.len(), m.limbs().len());
154+
155+
let mut limbs = limbs;
156+
let num_limbs = m.width().num_limbs;
157+
let mut one = [0; MODULUS_MAX_LIMBS];
158+
one[0] = 1;
159+
let one = &one[..num_limbs]; // assert!(num_limbs <= MODULUS_MAX_LIMBS);
160+
limbs_mont_mul(&mut limbs, one, m.limbs(), m.n0(), m.cpu_features());
161+
Elem {
162+
limbs,
163+
encoding: PhantomData,
165164
}
166165
}
167166

168167
impl<M> Elem<M, R> {
169168
#[inline]
170169
pub fn into_unencoded(self, m: &Modulus<M>) -> Elem<M, Unencoded> {
171-
self.decode_once(m)
170+
from_montgomery_amm(self.limbs, m)
172171
}
173172
}
174173

@@ -623,7 +622,13 @@ pub fn elem_exp_consttime<M>(
623622
limbs_mont_square(acc, m, n0, cpu_features);
624623
}
625624

626-
fn gather_mul_base(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
625+
fn gather_mul_base_amm(
626+
table: &[Limb],
627+
state: &mut [Limb],
628+
n0: &N0,
629+
i: Window,
630+
num_limbs: usize,
631+
) {
627632
prefixed_extern! {
628633
fn bn_mul_mont_gather5(
629634
rp: *mut Limb,
@@ -648,7 +653,7 @@ pub fn elem_exp_consttime<M>(
648653
}
649654
}
650655

651-
fn power(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
656+
fn power_amm(table: &[Limb], state: &mut [Limb], n0: &N0, i: Window, num_limbs: usize) {
652657
prefixed_extern! {
653658
fn bn_power5(
654659
r: *mut Limb,
@@ -690,7 +695,7 @@ pub fn elem_exp_consttime<M>(
690695
// TODO: Optimize this to avoid gathering
691696
gather_square(table, state, m.n0(), i / 2, num_limbs, cpu_features);
692697
} else {
693-
gather_mul_base(table, state, m.n0(), i - 1, num_limbs)
698+
gather_mul_base_amm(table, state, m.n0(), i - 1, num_limbs)
694699
};
695700
scatter(table, state, i, num_limbs);
696701
}
@@ -702,37 +707,15 @@ pub fn elem_exp_consttime<M>(
702707
state
703708
},
704709
|state, window| {
705-
power(table, state, m.n0(), window, num_limbs);
710+
power_amm(table, state, m.n0(), window, num_limbs);
706711
state
707712
},
708713
);
709714

710-
prefixed_extern! {
711-
fn bn_from_montgomery(
712-
r: *mut Limb,
713-
a: *const Limb,
714-
not_used: *const Limb,
715-
n: *const Limb,
716-
n0: &N0,
717-
num: c::size_t,
718-
) -> bssl::Result;
719-
}
720-
Result::from(unsafe {
721-
bn_from_montgomery(
722-
entry_mut(state, ACC, num_limbs).as_mut_ptr(),
723-
entry(state, ACC, num_limbs).as_ptr(),
724-
core::ptr::null(),
725-
entry(state, M, num_limbs).as_ptr(),
726-
m.n0(),
727-
num_limbs,
728-
)
729-
})?;
730-
let mut r = Elem {
731-
limbs: base.limbs,
732-
encoding: PhantomData,
733-
};
734-
r.limbs.copy_from_slice(entry(state, ACC, num_limbs));
735-
Ok(r)
715+
let mut r_amm = base.limbs;
716+
r_amm.copy_from_slice(entry(state, ACC, num_limbs));
717+
718+
Ok(from_montgomery_amm(r_amm, m))
736719
}
737720

738721
/// Verified a == b**-1 (mod m), i.e. a**-1 == b (mod m).

0 commit comments

Comments
 (0)