Description
Background
Ceno relies heavily on LogUp for range checks, e.g. 16-bit range checks.
See circuit statistics here: #585 (comment)
Most lookup operations come from 16-bit range checks. Take the add opcode as an example: its 9 lookups all come from
- rs1, rs2, rd offline memory check, asserting timestamp < global timestamp => (3 * 2) = 6 of 16-bit range check
- rd split into 2 of 16-bit range check
- program fetch: 1 lookup
So in total there are 6 + 2 + 1 = 9 lookups.
If we do 32-bit range checks instead:
- rs1, rs2, rd offline memory check timestamp check => 3 lookups
- rd range: 1 lookup
- program fetch: 1 lookup
Overall that is 5 lookups, which fits within the 2^3 = 8 boundary (versus 2^4 = 16 for 9 lookups). The tower sumcheck leaf layer size is cut in half, so the expected latency should roughly halve as well.
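The lookup tally above can be sketched as a quick sanity check (the per-category counts simply mirror the breakdown in this issue; the padding rule is the usual power-of-two tower leaf layer):

```python
# Hedged tally of per-opcode lookups for riscv add, following the counts above.
lookups_16bit = {
    "mem ts check (rs1, rs2, rd)": 3 * 2,  # each ts comparison splits into two 16-bit limbs
    "rd value range": 2,                   # 32-bit rd split into two 16-bit limbs
    "program fetch": 1,
}
lookups_32bit = {
    "mem ts check (rs1, rs2, rd)": 3,      # one 32-bit check per timestamp comparison
    "rd value range": 1,
    "program fetch": 1,
}

total_16 = sum(lookups_16bit.values())
total_32 = sum(lookups_32bit.values())

# Tower leaf layers are padded to a power of two, so 9 lookups need 16 leaves
# while 5 lookups fit in 8 -- the leaf layer is halved.
def leaf_size(n: int) -> int:
    return 1 << (n - 1).bit_length()

print(total_16, leaf_size(total_16))  # 9 16
print(total_32, leaf_size(total_32))  # 5 8
```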
As a side effect, we also save a bunch of witin columns that currently exist only to hold 16-bit limbs, which also benefits mPCS since there are fewer polynomials to commit.
Design Rationales
On the right-hand side of the LogUp formula we have m(x) and T(x). One nice property of the 32-bit range check table T(x) is that we can skip its commitment & PCS, because the verifier can evaluate T(r) succinctly via the tricks here. The remaining challenge is how to deal with the huge & sparse polynomial m(x).
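The succinct T(r) evaluation can be modeled in a few lines. This is a minimal sketch, assuming a stand-in 61-bit prime field (not Ceno's actual field): the 32-bit range table is T(b) = Σ_i b_i·2^i on the boolean cube, which is already multilinear, so its multilinear extension at any point r is the same linear expression — O(32) field operations for the verifier, no commitment needed.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(2)

def t_eval(r):
    """Succinct MLE evaluation of the 32-bit range table at point r.

    On the cube, T(b) = sum_i b_i * 2^i enumerates every u32 value; since this
    expression is linear in each variable, it IS the multilinear extension.
    """
    return sum(ri * (1 << i) % P for i, ri in enumerate(r)) % P

r = [random.randrange(P) for _ in range(32)]
t_r = t_eval(r)  # verifier-side, 32 mul-adds

# Spot check: at a boolean point the MLE must equal the table entry.
b = [(0xDEADBEEF >> i) & 1 for i in range(32)]
assert t_eval(b) == 0xDEADBEEF
```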
Via the Spartan paper, p. 29, §7.2.2 (sparse polynomial commitment, SPARK), we can view the sparse m(x) as a tuple of 3 dense vectors [(i, j, M(i,j))] and commit to 3 dense polynomials respectively, even though the dense size of the original m(x) is 2^32. The magic insight of SPARK is that by splitting into i and j polynomials, each one's size matches just the number of non-zero entries of m(x). I think the key innovation behind breaking variables into row, col is the SPARK offline memory check (memory-in-the-head), which reduces the audit_ts_(row/col) dense size from 2^32 to 2^16.
Originally my question was: why row + col instead of just row? Is it only to deal with R1CS matrices? After some thought I found it's not: the key point of splitting into row and col is that the |audit_ts| size can be reduced exponentially, from 2^32 to 2^16. The math magic is that the identity polynomial eq is splittable into a row factor and a col factor, each with exponentially reduced size, and each can be evaluated separately.
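A minimal numeric sketch of that splitting argument (toy parameters: a stand-in prime field `P`, n = 8 variables per half instead of 16, and made-up non-zero triples): evaluating the sparse m at a random point touches only the non-zero triples, and eq over 2n variables factors exactly into a row part and a col part, so neither side ever materializes a table bigger than 2^n.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(0)

def eq(r, x):
    """Multilinear identity polynomial: eq(r, x) = prod_i (r_i*x_i + (1-r_i)*(1-x_i))."""
    acc = 1
    for ri, xi in zip(r, x):
        acc = acc * ((ri * xi + (1 - ri) * (1 - xi)) % P) % P
    return acc

def bits(v, n):
    return [(v >> k) & 1 for k in range(n)]

n = 8  # toy size; the real m(x) has 16 + 16 variables
r = [random.randrange(P) for _ in range(2 * n)]
r_row, r_col = r[:n], r[n:]

# Sparse m(x) as triples (row, col, value): evaluation work is O(nnz), not O(2^(2n)).
triples = [(3, 7, 5), (100, 42, 2), (255, 255, 9)]
v = sum(eq(r_row, bits(i, n)) * eq(r_col, bits(j, n)) * m for i, j, m in triples) % P

# The splitting identity: eq over 2n variables factors into a row part and a
# col part, each only 2^n in dense size, evaluable independently.
i, j, m = triples[0]
assert eq(r, bits(i, n) + bits(j, n)) == eq(r_row, bits(i, n)) * eq(r_col, bits(j, n)) % P
```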
The prover needs to commit to i, j, M, along with read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col, as required by the SPARK protocol.
With SPARK, the e2e table proving flow will look like this:
- prover commits to the original polynomials for the table proof, plus a few polynomials for SPARK: i, j, M, read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col
- derive challenges
- generate the tower sumcheck proof; at the end, for logup m(x) we have a random point $r$ and evaluation $v = m(r)$
- generate the mPCS proof for the original polynomials
- generate spark_proof($r$, $v = m(r)$, [i, j, M, read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col]):
  - tower_product_sumcheck: for the row, col offline memory checks following SPARK
  - one sumcheck: $v = \sum_k e_{row}(k) \cdot e_{col}(k) \cdot M(k)$
  - check the offline memory check formulas for row and col are satisfied
  - derive a new random point $r'$ and 9 evaluations i($r'$), j($r'$), M($r'$), read_ts_row($r'$), write_ts_row($r'$), ...
- generate the mPCS proof for the SPARK protocol
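The row-side offline memory check inside spark_proof can be modeled concretely. This is a hedged sketch, not Ceno's implementation: the field modulus, the toy trace, and the linear fingerprint hash are all illustrative. The prover reads e_row(k) = eq(r_row, row(k)) out of a memory initialized with eq values, and the multiset identity init ∪ writes = reads ∪ audit is enforced as a grand-product equality over randomized fingerprints — the products the tower_product_sumcheck would prove.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(1)

def eq(r, x):
    """Multilinear identity polynomial: eq(r, x) = prod_i (r_i*x_i + (1-r_i)*(1-x_i))."""
    acc = 1
    for ri, xi in zip(r, x):
        acc = acc * ((ri * xi + (1 - ri) * (1 - xi)) % P) % P
    return acc

def bits(v, n):
    return [(v >> k) & 1 for k in range(n)]

n = 4  # toy address space 2^4; the real row space is 2^16
r_row = [random.randrange(P) for _ in range(n)]
rows = [3, 7, 3, 12, 7]  # row indices of the non-zero entries (repeats allowed)

# Prover-side trace: memory holds eq(r_row, a) at each address a, timestamp 0.
mem_val = {a: eq(r_row, bits(a, n)) for a in range(1 << n)}
mem_ts = {a: 0 for a in range(1 << n)}
reads, writes = [], []
for k, a in enumerate(rows):
    reads.append((a, mem_val[a], mem_ts[a]))   # (addr, e_row(k), read_ts)
    mem_ts[a] = k + 1                          # write back with a fresh timestamp
    writes.append((a, mem_val[a], mem_ts[a]))

init = [(a, eq(r_row, bits(a, n)), 0) for a in range(1 << n)]
audit = [(a, mem_val[a], mem_ts[a]) for a in range(1 << n)]

# Multiset equality init ∪ writes == reads ∪ audit, checked as a grand-product
# equality over random-challenge fingerprints gamma, tau.
gamma, tau = random.randrange(P), random.randrange(P)
def fingerprint(tuples):
    acc = 1
    for a, v, t in tuples:
        acc = acc * ((a * gamma % P * gamma + v * gamma + t - tau) % P) % P
    return acc

assert fingerprint(init) * fingerprint(writes) % P == fingerprint(reads) * fingerprint(audit) % P
```

Note that |init| and |audit| are 2^n (here 2^16 after the row/col split, instead of 2^32), while |reads| and |writes| scale only with the number of non-zero entries.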
What's the overhead
Since the table proof gains a new SPARK proof flow on the critical path, sequentially, its overall latency will increase. However, every opcode proof benefits significantly from 32-bit range checks, and since opcode proofs dominate the overall cost, the extra overhead in the table proof is probably negligible.
In a more detailed analysis, the SPARK proving time is dominated by the sizes |read_ts_row|, |write_ts_row|, |read_ts_col|, |write_ts_col|, which scale with the number of non-zero entries of m(x). In real-world workloads, the more repeated values get range-checked, the fewer non-zero entries m(x) has, so the cost drops substantially. The worst case is when all looked-up values are distinct.
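The nnz dependence can be illustrated with a toy count (workload numbers are made up): nnz(m) equals the number of distinct values actually looked up, so heavy repetition shrinks the read_ts/write_ts vectors SPARK has to commit to.

```python
from collections import Counter

# nnz(m) = number of distinct looked-up values, which bounds
# |read_ts_row| etc. in the SPARK cost analysis.
repeated_workload = [x % 100 for x in range(10_000)]  # heavy value repetition
distinct_workload = list(range(10_000))               # worst case: all distinct

nnz_repeated = len(Counter(repeated_workload))
nnz_distinct = len(Counter(distinct_workload))
print(nnz_repeated, nnz_distinct)  # 100 10000
```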
Sub-task breakdown
- preparation: justify that the solution is worthwhile with some mock data, e.g. hardcode riscv_add lookups from 9 -> 5 and measure the latency improvement. [Simulation to support #702] experiment latency improvement for lookup size reduced to half #704
- implement SPARK sparse polynomial commitment & benchmark
- opcode circuits do 32-bit range check: Refactor/using UInt32 as default UInt to reduce witness size #686
- introduce SPARK into the table proof flow
Other side effects
This feature relies on the base field being able to hold 32-bit rv32 values, therefore we need to stick with Goldilocks64.
So the future roadmap will be Goldilocks64 -> Binary Field, without Mersenne31/BabyBear as transition fields.