# Sieving Hardware for the NFS: Architectures and their Bottlenecks

Willi Geiselmann

## The Number Field Sieve

- Precomputation
- Relation collection
- Linear Algebra (Matrix step)
- Postprocessing

## **Relation Collection**

Given F<sub>1</sub>(x,y), F<sub>2</sub>(x,y) ∈ Z[x,y], homogeneous polynomials, e.g. of degree 5 and 1
Find (a,b) ∈ Z x N with F<sub>1</sub>(a,b) and F<sub>2</sub>(a,b) smooth, gcd(a,b) = 1

#### Parameters for 1024 Bit (from TWIRL, 2003)

Smoothness bounds: B<sub>1</sub> = 2.6 • 10<sup>10</sup> (algebraic), B<sub>2</sub> = 3.5 • 10<sup>9</sup> (rational). (≈ 2 x 5 GByte)
Sieving region: A = 5.5 • 10<sup>14</sup>, -A < a < A; B = 2.7 • 10<sup>8</sup>, 0 < b < B. (≈ 1400 TByte)</li>

## Suggested Hardware Designs

- TWINKLE [Shamir 1999; Shamir, Lenstra 2000] not designed for 1024 bit numbers
- TWIRL [Shamir, Tromer 2003] full wafer design
- Mesh-based sieving [G., St. 2003, 2004] not feasible for 1024 bit numbers
- SHARK [Franke et al. 2005] elaborated butterfly transport system
- SmallChips (Non-Wafer-Scale Sieving HW) [G., St. 2007] more realistic, but high inter-chip communication

## Sieving (Eratosthenes)



# Sieving (TWINKLE/TWIRL)



## **TWIRL - Types of Primes**

 Largish primes (rational): 2<sup>19</sup> < p (algebraic): 2<sup>22</sup> < p</li>
 Medium primes: 256 < p</li>
 Smallish primes: p < 256</li>

# TWIRL - Largish Stations DRAM



 Rotates / reads with constant speed
 Writes to the "correct" address

# TWIRL - Largish Stations Adder Array

Rational: 2 100 lines 4 096 columns  $\approx 2^{23}/2$  adders Algebraic: 14 900 lines 32 768 columns  $\approx 2^{29}/64$  adders



#### **TWIRL - Med./Small Stations**

• For  $p < 2^{19}$  (2<sup>22</sup>)

 Stations are different: smaller DRAM, more logic to generate "multiple hits"
 Adder Array similar (≈ 500 lines)

# TWIRL - Parts/Communication (rational)

- Size: 160 cm<sup>2</sup>
   (60 cm<sup>2</sup> DRAM)
- L. Stations:
   60 cm<sup>2</sup> (incl. DRAM)
- Adder Array:
   64 cm<sup>2</sup> + 35 cm<sup>2</sup>

subtraction of the property of

(when using a 0.13  $\mu$ m process)

# TWIRL - Parts/Communication (algebraic)

- Size: 659 cm<sup>2</sup>
   (435 cm<sup>2</sup> DRAM)
- L. Stations:
   490 cm<sup>2</sup> (incl. DRAM)
- Adder Array:
   130 cm<sup>2</sup> + 39 cm<sup>2</sup>

(when using a 0.13  $\mu$ m process)

#### **TWIRL - Performance**

Total chip area 8 x 160 cm<sup>2</sup> + 659 cm<sup>2</sup>
One sieving line in 33 sec (1 GHz)
Sieving of a 1024 bit number with 194 devices in one year
Fastest design (time x area)

## TWIRL – Problems/Solutions

- Devices can not (easily) be cut into pieces (I/O bandwidth of chips).
- Larger Factorbasis increases this problem.
- Production errors especially in Adder Array cause problems: Redundancy required (increases size).
- Smaller production process and/or significant increase in I/O bandwidth would help.



## SmallChips - Idea

Sieve intervals of a given size (2<sup>26</sup>)
Generate (the rare) hits of large primes on different chips
Collect the hits in the memory cell responsible for this sieve location

### SmallChips - Types of Primes

 Largish primes I: 2<sup>27.2</sup> 1</sub> < 2<sup>35</sup> ...Type II/III: 1.5 • 10<sup>7</sup> 27.2</sup>
 Medium primes: 2<sup>13</sup> 7</sup>
 Smallish primes: p < 2<sup>13</sup>

# **SmallChips - Largish Stations**



### **SmallChips - Largish Stations**

256 stations for p > 1.5 • 10<sup>7</sup> ≈ 2<sup>27.2</sup>
Distributed on 32 chips: size: 472 mm<sup>2</sup> (0.13 µm process) output: 448 bit per clock cycle memory: 99%, logic: 1%
DRAM to store both FBs: 160 cm<sup>2</sup> SmallChips - Medium/ **Smallish Stations** Different type of storage: First (p,r)-pair are stored, others are calculated For p < 2<sup>20</sup>: calculated in the collection unit (reduces communication, increases storage/area

# **SmallChips - Collection Unit**

- Distributed on 4 chips, each holding
- 4 arrays of 32 x 32 counting units.
- Each unit is in charge of 2<sup>12</sup> sieve locations,
- and adding up the log(p) values.



## SmallChips - Collection Unit (area estimates)

Distributed on 4 chips: size: 493 mm<sup>2</sup> (0.13 μm process) input: 3584 bit / cc memory: 94% logic: 6%



# **SmallChips - Performance** Total silicon area 172 cm<sup>2</sup> One subinterval (S=2<sup>26</sup>) in 53,000 cc One sieve line in 25 min (600 MHz) Sieving of a 1024 bit number with 8300 devices in one year 3.5 x more silicon area than TWIRL or 2.0 x more after modification

# SmallChips - Problems/ Solutions

- Very fast communication as input to collection unit -> distribute collection unit on more chips.
- Smaller process reduces chip size and/or allows to increase FB, communication will not increase much.
   4% FB ↔ 0.4% communication 100% FB ↔ 10% communication

## Conclusion

 SmallChips seems to be feasible
 Design/production costs are high
 Running costs are very high: 8300 devices require 1.6 MWatt (200 W per device seems optimistic)
 -> 1 400 000 € per Factorization