Accelerating Polygon zkEVM

We report the first end-to-end, FPGA-accelerated Polygon zkEVM prover.

4/24/24

Irreducible Team

Benchmarking Performance and Cost

Leveraging our previously reported self-contained FPGA modules for accelerated Merkle tree construction and LDE computation, as well as memory layout modifications to the Polygon zkProver for improved data transpose performance, we generated a 500-transaction batch proof in 84 seconds. The table below indicates how this performance compares to the non-accelerated zkProver.

Last year, Polygon’s zkProver demonstrated an impressive 500-transaction batch-proof generation time of under 2 minutes on a 224-thread Google Cloud Platform (GCP) server. Our FPGAs are currently integrated with a less powerful 64-thread server. Still, we found that our accelerated zkProver achieved a ~2x performance increase compared to the unmodified version on the same gen1-64 and, more importantly, a 1.4x increase compared to the more powerful GCP instance.

For protocols, this means that Irreducible’s accelerated zkProver can provide the same performance as the unmodified version, but at a much lower cost. Since hardware capex is proprietary, we can compare cost efficiency on the basis of service provider pricing. The table below uses current GCP spot and reserved prices to illustrate the cost of generating a proof.

GCP pricing can be found here, and assumes 1-year commitment in the Iowa (US-central1) region

Polygon’s report quoted spot instance pricing. Because Irreducible can offer a lower cost of $0.05 per proof, it’s more cost-effective to use our accelerated service than to run the zkProver on Google Cloud. This is particularly true given that reserved pricing may be the better benchmark. Spot instances are notoriously difficult to manage due to service interruptions, capacity issues, and volatile pricing. So customers needing reliable and robust proof generation will have to pay for reserved or on-demand instances, which are generally 3-4x more expensive. Irreducible offers proving-as-a-service to meet this need, not only at a significantly lower price than current reserved/on-demand rates, but additionally saving customers the expense and hassle of managing their own infrastructure.

Our accelerated zkProver is not only compatible with Polygon’s imminent zkEVM type 1 upgrade, Polygon Miden, and zkSync Boojum, but also anticipates further improved performance with additional enhancements. By applying FPGA acceleration to the codeword batching and quotient polynomial computation stages, we predict a proof generation time under 60 seconds.

Technical Overview

Polygon zkEVM is a complex system with several stages of the proof generation process. We focused our acceleration efforts for this initial proof-of-concept on the polynomial commitment operation, which is the most expensive and accounts for 63% of the overall proving time (before acceleration). This operation commits a batch of polynomials, and it executes four separate times over the course of one proof on polynomial batches of different sizes. All of the polynomial batches contain polynomials with 223 coefficients. The first batch is 669 polynomials, the second 132, the third 372, and the fourth 6 polynomials. The polynomial commitment operation begins with a low-degree extension (LDE) column-wise over the matrix, producing an extended matrix with 224 rows and 669 columns. A low-degree extension consists of multiple Number Theoretic Transforms (NTT). The prover then performs a leaf-hashing operation, hashing each row of the extended matrix with the Poseidon hash, and finally computes a Merkle tree over each row digest. The total size of the input data over all four rounds is 73.7 GiB, and the total size of the LDE output (which must be stored until the end of the proof generation) is 147 GiB.

The integrated system accelerates the LDE and leaf-hashing operations using two FPGA cards. They are connected to each other directly via QSFP and communicate using AMD’s Aurora link-layer protocol. The host streams data via PCIe 4.0 to one FPGA device, which performs the LDE. Once the LDE operation completes, the device streams the output both back to the host and to the second device, which runs the leaf-hashing engine (LHE). This architecture, shown below, both reduces the time to compute the LDE and almost completely hides the latency of leaf-hashing.

This integration requires several modifications to Polygon’s open-source C++ zkevm-prover software, which has a sophisticated memory management strategy, storing all polynomials in a pre-allocated arena at precomputed offsets. The prover implementation stores the polynomial batches in interleaved order, meaning each polynomial is not contiguous in memory. This is convenient during the trace generation and Merkleization steps, and also assists parallelization of the LDE computation. Because our FPGAs only have 16 GiB of High Bandwidth Memory (HBM), we must transpose the stored polynomials before transferring them to the device. Similarly, we transpose the polynomials received from the device before writing them back to program memory. We took two measures to minimize the cost of the transposes. First, we modified the zkevm-prover memory layout so that every row of the matrices are padded to 64-byte boundaries to align with CPU cache lines. This enables our high-throughput, AVX2-accelerated matrix transpose implementation. Second, rather than performing the entire transposes before the LDE begins and after it completes, we pipeline the transposes with the computation in the device to hide the software latency. The software must both transpose the data sent to and received from the LDE module concurrently in separate threads until all polynomials in a batch are processed. We emphasize that even with our software modifications, the proofs we generate are completely valid when checked by the reference verifier.

Ultimately, these improvements enabled us to generate a proof ~40% faster than ever previously reported by Polygon’s zkProver. With further FPGA-based acceleration addressing two additional computation steps, we’ll be able to deliver proofs in under one minute flat.