Description
Background
Ceno relies heavily on LogUp for range checks, e.g. 16-bit range checks.
See circuit statistics here: #585 (comment)
Most lookup operations come from 16-bit range checks. Take the add opcode as an example: its 9 lookups all come from
- rs1, rs2, rd offline memory check, asserting timestamp < global timestamp => (3 * 2) = 6 of 16-bit range check
- rd split into 2 of 16-bit range check
- program fetch: 1 lookup
So in total there are 6 + 2 + 1 = 9 lookups.
If we do 32-bit range checks instead:
- rs1, rs2, rd offline memory check timestamp check => 3 lookups
- rd range: 1 lookup
- program fetch: 1 lookup
Overall that is 5 lookups, which fits within the 2^3 = 8 boundary (versus 2^4 = 16 for 9 lookups). The tower sumcheck leaf layer size is cut in half, so the expected latency should roughly halve as well.
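The lookup tally above can be sketched as a quick sanity check (the per-category counts simply mirror the breakdown in this issue; the padding rule is the usual power-of-two tower leaf layer):

```python
# Hedged tally of per-opcode lookups for riscv add, following the counts above.
lookups_16bit = {
    "mem ts check (rs1, rs2, rd)": 3 * 2,  # each ts comparison splits into two 16-bit limbs
    "rd value range": 2,                   # 32-bit rd split into two 16-bit limbs
    "program fetch": 1,
}
lookups_32bit = {
    "mem ts check (rs1, rs2, rd)": 3,      # one 32-bit check per timestamp comparison
    "rd value range": 1,
    "program fetch": 1,
}

total_16 = sum(lookups_16bit.values())
total_32 = sum(lookups_32bit.values())

# Tower leaf layers are padded to a power of two, so 9 lookups need 16 leaves
# while 5 lookups fit in 8 -- the leaf layer is halved.
def leaf_size(n: int) -> int:
    return 1 << (n - 1).bit_length()

print(total_16, leaf_size(total_16))  # 9 16
print(total_32, leaf_size(total_32))  # 5 8
```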
As a side effect, we also save a bunch of witin columns that currently exist only to hold 16-bit limbs, which also benefits mPCS since there are fewer polynomials to commit.
Design Rationales
On the right-hand side of the LogUp formula we have m(x) and T(x). One nice property of the 32-bit range check table T(x) is that we can skip its commitment & PCS, because the verifier can evaluate T(r) succinctly via the tricks here. The remaining challenge is how to deal with the huge & sparse polynomial m(x).
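The succinct T(r) evaluation can be modeled in a few lines. This is a minimal sketch, assuming a stand-in 61-bit prime field (not Ceno's actual field): the 32-bit range table is T(b) = Σ_i b_i·2^i on the boolean cube, which is already multilinear, so its multilinear extension at any point r is the same linear expression — O(32) field operations for the verifier, no commitment needed.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(2)

def t_eval(r):
    """Succinct MLE evaluation of the 32-bit range table at point r.

    On the cube, T(b) = sum_i b_i * 2^i enumerates every u32 value; since this
    expression is linear in each variable, it IS the multilinear extension.
    """
    return sum(ri * (1 << i) % P for i, ri in enumerate(r)) % P

r = [random.randrange(P) for _ in range(32)]
t_r = t_eval(r)  # verifier-side, 32 mul-adds

# Spot check: at a boolean point the MLE must equal the table entry.
b = [(0xDEADBEEF >> i) & 1 for i in range(32)]
assert t_eval(b) == 0xDEADBEEF
```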
Via the Spartan paper, p. 29, §7.2.2 (sparse polynomial commitment, SPARK), we can view the sparse m(x) as a tuple of 3 dense vectors [(i, j, M(i,j))] and commit to 3 dense polynomials respectively, even though the dense size of the original m(x) is 2^32. The magic insight of SPARK is that by splitting into i and j polynomials, each one's size matches just the number of non-zero entries of m(x). I think the key innovation behind breaking variables into row, col is the SPARK offline memory check (memory-in-the-head), which reduces the audit_ts_(row/col) dense size from 2^32 to 2^16.
Originally my question was: why row + col instead of just row? Is it only to deal with R1CS matrices? After some thought I found it's not: the key point of splitting into row and col is that the |audit_ts| size can be reduced exponentially, from 2^32 to 2^16. The math magic is that the identity polynomial eq is splittable into a row factor and a col factor, each with exponentially reduced size, and each can be evaluated separately.
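A minimal numeric sketch of that splitting argument (toy parameters: a stand-in prime field `P`, n = 8 variables per half instead of 16, and made-up non-zero triples): evaluating the sparse m at a random point touches only the non-zero triples, and eq over 2n variables factors exactly into a row part and a col part, so neither side ever materializes a table bigger than 2^n.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(0)

def eq(r, x):
    """Multilinear identity polynomial: eq(r, x) = prod_i (r_i*x_i + (1-r_i)*(1-x_i))."""
    acc = 1
    for ri, xi in zip(r, x):
        acc = acc * ((ri * xi + (1 - ri) * (1 - xi)) % P) % P
    return acc

def bits(v, n):
    return [(v >> k) & 1 for k in range(n)]

n = 8  # toy size; the real m(x) has 16 + 16 variables
r = [random.randrange(P) for _ in range(2 * n)]
r_row, r_col = r[:n], r[n:]

# Sparse m(x) as triples (row, col, value): evaluation work is O(nnz), not O(2^(2n)).
triples = [(3, 7, 5), (100, 42, 2), (255, 255, 9)]
v = sum(eq(r_row, bits(i, n)) * eq(r_col, bits(j, n)) * m for i, j, m in triples) % P

# The splitting identity: eq over 2n variables factors into a row part and a
# col part, each only 2^n in dense size, evaluable independently.
i, j, m = triples[0]
assert eq(r, bits(i, n) + bits(j, n)) == eq(r_row, bits(i, n)) * eq(r_col, bits(j, n)) % P
```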
The prover needs to commit to i, j, M, along with read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col, as required by the SPARK protocol.
With SPARK, the e2e table proving flow will look like this:
- prover commits to the original polynomials for the table proof, plus a few polynomials for SPARK: i, j, M, read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col
- derive challenges
- generate the tower sumcheck proof; at the end, for logup m(x) we have a random point $r$ and evaluation $v = m(r)$
- generate the mPCS proof for the original polynomials
- generate spark_proof($r$, $v = m(r)$, [i, j, M, read_ts_row, write_ts_row, audit_ts_row, read_ts_col, write_ts_col, audit_ts_col]):
  - tower_product_sumcheck: for the row, col offline memory checks following SPARK
  - one sumcheck: $v = \sum_k e_{row}(k) \cdot e_{col}(k) \cdot M(k)$
  - check the offline memory check formulas for row and col are satisfied
  - derive a new random point $r'$ and 9 evaluations i($r'$), j($r'$), M($r'$), read_ts_row($r'$), write_ts_row($r'$), ...
- generate the mPCS proof for the SPARK protocol
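The row-side offline memory check inside spark_proof can be modeled concretely. This is a hedged sketch, not Ceno's implementation: the field modulus, the toy trace, and the linear fingerprint hash are all illustrative. The prover reads e_row(k) = eq(r_row, row(k)) out of a memory initialized with eq values, and the multiset identity init ∪ writes = reads ∪ audit is enforced as a grand-product equality over randomized fingerprints — the products the tower_product_sumcheck would prove.

```python
import random

P = (1 << 61) - 1  # stand-in prime field, not Ceno's actual field
random.seed(1)

def eq(r, x):
    """Multilinear identity polynomial: eq(r, x) = prod_i (r_i*x_i + (1-r_i)*(1-x_i))."""
    acc = 1
    for ri, xi in zip(r, x):
        acc = acc * ((ri * xi + (1 - ri) * (1 - xi)) % P) % P
    return acc

def bits(v, n):
    return [(v >> k) & 1 for k in range(n)]

n = 4  # toy address space 2^4; the real row space is 2^16
r_row = [random.randrange(P) for _ in range(n)]
rows = [3, 7, 3, 12, 7]  # row indices of the non-zero entries (repeats allowed)

# Prover-side trace: memory holds eq(r_row, a) at each address a, timestamp 0.
mem_val = {a: eq(r_row, bits(a, n)) for a in range(1 << n)}
mem_ts = {a: 0 for a in range(1 << n)}
reads, writes = [], []
for k, a in enumerate(rows):
    reads.append((a, mem_val[a], mem_ts[a]))   # (addr, e_row(k), read_ts)
    mem_ts[a] = k + 1                          # write back with a fresh timestamp
    writes.append((a, mem_val[a], mem_ts[a]))

init = [(a, eq(r_row, bits(a, n)), 0) for a in range(1 << n)]
audit = [(a, mem_val[a], mem_ts[a]) for a in range(1 << n)]

# Multiset equality init ∪ writes == reads ∪ audit, checked as a grand-product
# equality over random-challenge fingerprints gamma, tau.
gamma, tau = random.randrange(P), random.randrange(P)
def fingerprint(tuples):
    acc = 1
    for a, v, t in tuples:
        acc = acc * ((a * gamma % P * gamma + v * gamma + t - tau) % P) % P
    return acc

assert fingerprint(init) * fingerprint(writes) % P == fingerprint(reads) * fingerprint(audit) % P
```

Note that |init| and |audit| are 2^n (here 2^16 after the row/col split, instead of 2^32), while |reads| and |writes| scale only with the number of non-zero entries.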
What's the overhead
Since the table proof gains a new SPARK proof flow on the critical path, sequentially, its overall latency will increase. However, every opcode proof benefits significantly from 32-bit range checks, and since opcode proofs dominate the overall cost, the extra overhead in the table proof is probably negligible.
In a more detailed analysis, the SPARK proving time is dominated by the sizes |read_ts_row|, |write_ts_row|, |read_ts_col|, |write_ts_col|, which scale with the number of non-zero entries of m(x). In real-world workloads, the more repeated values get range-checked, the fewer non-zero entries m(x) has, so the cost drops substantially. The worst case is when all looked-up values are distinct.
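The nnz dependence can be illustrated with a toy count (workload numbers are made up): nnz(m) equals the number of distinct values actually looked up, so heavy repetition shrinks the read_ts/write_ts vectors SPARK has to commit to.

```python
from collections import Counter

# nnz(m) = number of distinct looked-up values, which bounds
# |read_ts_row| etc. in the SPARK cost analysis.
repeated_workload = [x % 100 for x in range(10_000)]  # heavy value repetition
distinct_workload = list(range(10_000))               # worst case: all distinct

nnz_repeated = len(Counter(repeated_workload))
nnz_distinct = len(Counter(distinct_workload))
print(nnz_repeated, nnz_distinct)  # 100 10000
```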
Sub-task breakdown
- preparation: justify that the solution is worthwhile with some mock data, e.g. hardcode riscv_add lookups from 9 -> 5 and measure the latency improvement. [Simulation to support #702] experiment latency improvement for lookup size reduced to half #704
- implement SPARK sparse polynomial commitment & benchmark
- opcode circuits do 32-bit range check: Refactor/using UInt32 as default UInt to reduce witness size #686
- introduce SPARK into the table proof flow
Other side effects
This feature relies on the base field being able to hold 32-bit rv32 values, therefore we need to stick with Goldilocks64.
So the future roadmap will be Goldilocks64 -> Binary Field, without Mersenne31/BabyBear as transition fields.