# Introduction to Efficient and Secure Arithmetic Circuits 

Arnaud TISSERAND

CNRS, Lab-STICC

Conférence rentrée informatique ENS Paris-Scalay. Sept. 2021

## Course Language

Slides have been prepared in English.
Some words/remarks are also given in FR French in case of not immediate translation or specific feature.

Questions, comments and help requests are welcome in both French and English.

## Licence

This document is licensed under a Creative Commons Attribution NonCommercial 4.0 International License.
https://creativecommons.org/licenses/by-nc/4.0/

All figures and tables not from the author are presented with their source.

## Summary

## Computer Arithmetic

Preliminaries on Digital Circuits

Addition \& Multiplication

Introduction to Physical Attacks

Protections at the Arithmetic Level

References

References to books, articles and links are given throughout and at the end of this document.

## Computer Arithmetic

## What is Computer Arithmetic? (Personal Definition)

Branch of computer engineering/science that deals with:

- representations of numbers: formats, coding and behavior for (subsets of) $\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \mathbb{R}, \mathbb{C}, \mathbb{F}_{q}, \ldots$, fixed vs multiple precision;
- algorithms for operations: $\pm, \times, \div \sqrt{ }, \frac{1}{x}, \frac{1}{\sqrt{x}}, \frac{1}{\sqrt{x^{2}+y^{2}}}$, exp, log, sin, cos, mod, $\operatorname{gcd},(a+b) \bmod p$, conversions, $\ldots$;
- implementations in hardware or(/and) software;
- quality: error/accuracy, specific cases (div. by 0), reproducibility;
- speed: delay, latency, throughput;
- costs: silicon area, code/data memory, power/energy consumption;
- methods and tools: study, coding, validation, verification, porting, evaluation, ...;
- training of programmers and users;
- ?


## Computer Arithmetic Overview

## representations of numbers $\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \simeq \mathbb{R}, \mathbb{F}_{\mathscr{Q}}$

# Computer Arithmetic Overview 

```
representations of numbers \(\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \simeq \mathbb{R}, \mathbb{F}_{q}\)
```


## algorithms

```
\pm, }\times,\div,\sqrt{q}{,},mo
, e
```

implementation
soft GPP $/$ SP,
ASIC, FPGA

## Computer Arithmetic Overview

## representations of numbers $\mathbb{N}, \mathbb{Z}, \mathbb{Q}, \simeq \mathbb{R}, \mathbb{F}_{q}$


implementation soft GPP/SP, ASIC, FPGA


## Computer Arithmetic Overview



## Computer Arithmetic Overview



## Computer Arithmetic Overview



## Computer Arithmetic Overview

speed, throughput, latency


## Computer Arithmetic Overview

speed, throughput, latency


## Computer Arithmetic Close Domains

- microelectronics for digital circuits;
- computer architecture for processor design (units, instructions, registers, interruptions, counters ... );
- programming languages and compilation;
- numerical computing and applied mathematics;
- formal proofs and verification methods;
- computer algebra ( $\mathbb{F R}$ calcul formel );
- specific application domains such as signal and image processing, AI
- and probably other domains...


## Computer Arithmetic in Software (Example SW1)

The following Python code:

```
a, b = 1, 9
c = a + b
print(c, type(c))
from math import *
x = pi + 1.0
print(x, type(x))
print([ sin(pi/n) for n in [4, 6, 12] ])
```

produces (using Python 3.7):
10 <class 'int'>
4.141592653589793 <class 'float'>
[0.7071067811865475, $0.49999999999999994,0.25881904510252074$

Warning : do not perform large computations using "raw" Python, use NumPy standard library (see also Numba or PyPy)!

## Computer Arithmetic in Software (Example SW2)

The following C code:

```
#include <stdio.h>
#include <math.h>
int main() {
    double n = 4.0;
    double x = M_PI;
    double y = sin(x/n);
    printf("y = %f\n", y);
    return 0;
}
```

compiled (gcc 8.3) using:
gcc -lm example_sin.c
produces:
$\mathrm{y}=0.707107$

## Computer Arithmetic in Software (Example SW3)

The following Python code:

```
a = 1.0
b = 12.345e50
c = 9.8765e-40
v = [a, -a, b, -b, c, -c]
print(sum(v))
from itertools import permutations
print(sorted(set( [ sum(e) for e in permutations(v) ] )))
produces (using Python 3.7):
0.0
[-1.0, -9.8765e-40, 0.0, 9.8765e-40, 1.0]
```

Warning : associativity does not (necessarily) hold for floating-point arithmetic! See for instance: David Goldberg. What every computer scientist should know about floating-point arithmetic. ACM Comput. Surv. 23, March 1991, DOI: https://doi.org/10.1145/103162.103163

## Computer Arithmetic in Hardware (Example HW1)

Overview of one core in an Intel Xeon processor, source: https:
//www.hc32.hotchips.org/assets/program/conference/day1/ HotChips2020_Server_Processors_Intel_Irma_ICX-CPU-final3.pdf


Link to other examples: https://en.wikichip.org/wiki/WikiChip

## Computer Arithmetic in Hardware (Example HW2)

Source: NVIDIA TURING GPU ARCHITECTURE white paper (WP-09183-001_v01)


See also: https://developer.nvidia.com/blog/nvidia-turing-architecture-in-depth/

Preliminaries on Digital Circuits

## Logic Values: Representation

The logic values $\{0,1\}$ are represented using voltages:

- $0 \Longleftrightarrow$ reference voltage or ground ( $V_{S S}$, m 1 mor )
- $1 \Longleftrightarrow$ supply voltage ( $V_{D D}>0$ or $\uparrow$ )

Due to the noise in the circuit (from many sources), the logic values must be represented using voltage intervals (noise margins): digital vs. analog



## CMOS Logic

CMOS = complementary MOS
N and P transistors are only used for passing strong signals


The simplest gate: only 2 transistors ( 1 N and 1 P )

| A | Y |
| :---: | :---: |
| 0 | 1 |
| 1 | 0 |

circuit:

behavior:



## Logic Gate: NAND2 (2-input not-and)



| A | B | Y |
| :---: | :---: | :---: |
| 0 | 0 | 1 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |



## Logic Gate: NAND3 (3-input NAND)



| A | B | C | Y |
| :---: | :---: | :---: | :---: |
| 0 | 0 | 0 | 1 |
| 0 | 0 | 1 | 1 |
| 0 | 1 | 0 | 1 |
| 0 | 1 | 1 | 1 |
| 1 | 0 | 0 | 1 |
| 1 | 0 | 1 | 1 |
| 1 | 1 | 0 | 1 |
| 1 | 1 | 1 | 0 |



The number of transistors in series is limited (3 to 5)

## Memory Elements

There are many types of memory elements. Here, we will only focus on standard flip-flops


| CLK | D | Q(t+1) | QN(t+1) |
| :---: | :---: | :---: | :---: |
| 1 | $X$ | $Q(t)$ | $Q N(t)$ |
| 0 | $X$ | $Q(t)$ | $Q N(t)$ |
| $\uparrow$ | 0 | 0 | 1 |
| $\uparrow$ | 1 | 1 | 0 |

Remark:
$\uparrow$ is the rising clock edge


## Setup, Hold and Propagation Delays



- setup delay ( $\mathrm{t}_{\text {setup }}$ ): data should be held steady before clock edge
- hold delay ( $\mathrm{t}_{\text {hold }}$ ): data should be held steady after clock edge
- propagation delay ( $\mathrm{t}_{\text {propag }}$ ): propagation time from D to Q


## Fanout ( ER sortance )

The gate delay (change output state) depends on the output load. Fanout measures this load as the number of inputs of gate connected to the output (normalized w.r.t. an inverter)




## Power Consumption: Basic Definitions

Instantaneous power:

$$
P(t)=i_{D D}(t) V_{D D}
$$

Energy over some time interval T:

$$
E=\int_{0}^{T} i_{D D}(t) V_{D D} d t
$$

Average power over interval T:

$$
P_{a v g}=\frac{E}{T}=\frac{1}{T} \int_{0}^{T} i_{D D}(t) V_{D D} d t
$$



Units:

- current A
- voltage V
- power W
- energy J or Wh


## Power Consumption: Components

Power dissipation in CMOS circuits comes from 2 main components:

- static dissipation:
- sub-threshold conduction through OFF transistors
- leakage current through P-N junctions
- tunneling current through gate oxide
- ...
- dynamic dissipation:
- charging and discharging of load capacitances (useful + parasitic)
- short-circuit current

$$
P_{\text {total }}=P_{\text {static }}+P_{\text {dynamic }}
$$

## Charging and Discharging Load Capacitances

There are capacitances everywhere in the circuit: transistor gate, routing, parasitics. . .


## Solutions:

- design small circuits (small transistor, short wires, technology shrinking)
- reduce the activity (algorithms, data coding, sleep mode)
- reduce $V_{D D}$ (without lowering speed)


## Simple Power Consumption Model

Average dynamic power dissipation (no leakage, no short circuit):

$$
P=\alpha \times C \times f \times V_{D D}^{2}
$$

where

- $\alpha$ is the activity factor
- $C$ is the average switched capacitance (at each cycle)
- $f$ is the circuit frequency
- $V_{D D}$ is the supply voltage

Remark: the gate delay is $d=\gamma \times \frac{C \times V_{D D}}{\left(V_{D D}-V_{T}\right)^{2}} \approx \frac{1}{V_{D D}}$

- gate and/or input reordering (reduce glitching power):

- use complex gates (reduce internal capacitances and area):


Addition \& Multiplication

## Positional Number System(s)

$$
X=\sum_{i=-m}^{n-1} x_{i} \beta^{i}=\left(x_{n-1} x_{n-2} \cdots x_{1} x_{0} \cdot x_{-1} x_{-2} \cdots x_{-m}\right)
$$

- radix $\beta$ (usually a power of 2 )
- digits $x_{i}(\in \mathbb{N})$ in the $\operatorname{digit}$ set $\mathcal{D}$
- rank or position $i$, weight $\beta^{i}$
- $n$ integer digits, $m$ fractional digits

Examples:

- $\beta=10, \mathcal{D}=\{0,1,2,3,4,5,6,7,8,9\}$
- $\beta=2, \mathcal{D}=\{0,1\}$


## Positional Number System(s)

$$
X=\sum_{i=-m}^{n-1} x_{i} \beta^{i}=\left(x_{n-1} x_{n-2} \cdots x_{1} x_{0} \cdot x_{-1} x_{-2} \cdots x_{-m}\right)
$$

- radix $\beta$ (usually a power of 2 )
- digits $x_{i}(\in \mathbb{N})$ in the $\operatorname{digit}$ set $\mathcal{D}$
- rank or position $i$, weight $\beta^{i}$
- $n$ integer digits, $m$ fractional digits

Examples:

- $\beta=10, \mathcal{D}=\{0,1,2,3,4,5,6,7,8,9\}$
- $\beta=2, \mathcal{D}=\{0,1\}$
- carry save: $\beta=2, \mathcal{D}_{\mathrm{cs}}=\{0,1,2\}$
- borrow save: $\beta=2, \mathcal{D}_{\mathrm{bs}}=\{-1,0,1\}$


## Positional Number System(s)

$$
X=\sum_{i=-m}^{n-1} x_{i} \beta^{i}=\left(x_{n-1} x_{n-2} \cdots x_{1} x_{0} \cdot x_{-1} x_{-2} \cdots x_{-m}\right)
$$

- radix $\beta$ (usually a power of 2 )
- digits $x_{i}(\in \mathbb{N})$ in the $\operatorname{digit}$ set $\mathcal{D}$
- rank or position $i$, weight $\beta^{i}$
- $n$ integer digits, $m$ fractional digits

Examples:

- $\beta=10, \mathcal{D}=\{0,1,2,3,4,5,6,7,8,9\}$
- $\beta=2, \mathcal{D}=\{0,1\}$
- carry save: $\beta=2, \mathcal{D}_{\mathrm{cs}}=\{0,1,2\}$
- borrow save: $\beta=2, \mathcal{D}_{\mathrm{bs}}=\{-1,0,1\}$
- signed digits: $\beta>2, \mathcal{D}_{\text {sd }, \alpha, \beta}=\{-\alpha, \ldots, \alpha\}$ with $2 \alpha+1 \geq \beta$


## Positional Number System(s)

$$
X=\sum_{i=-m}^{n-1} x_{i} \beta^{i}=\left(x_{n-1} x_{n-2} \cdots x_{1} x_{0} \cdot x_{-1} x_{-2} \cdots x_{-m}\right)
$$

- radix $\beta$ (usually a power of 2 )
- digits $x_{i}(\in \mathbb{N})$ in the $\operatorname{digit}$ set $\mathcal{D}$
- rank or position $i$, weight $\beta^{i}$
- $n$ integer digits, $m$ fractional digits

Examples:

- $\beta=10, \mathcal{D}=\{0,1,2,3,4,5,6,7,8,9\}$
- $\beta=2, \mathcal{D}=\{0,1\}$
- carry save: $\beta=2, \mathcal{D}_{\text {cs }}=\{0,1,2\}$
- borrow save: $\beta=2, \mathcal{D}_{\text {bs }}=\{-1,0,1\}$
- signed digits: $\beta>2, \mathcal{D}_{\text {sd }, \alpha, \beta}=\{-\alpha, \ldots, \alpha\}$ with $2 \alpha+1 \geq \beta$
- theoretical systems: $\beta=\frac{1+\sqrt{5}}{2}, \beta=1+i \ldots$


## Radix-2 Signed Integers

- sign and magnitude (absolute value)

$$
A=\left(s_{a} a_{n-2} \ldots a_{1} a_{0}\right)=(-1)^{s_{a}} \times \sum_{i=0}^{n-2} a_{i} 2^{i}
$$

- 2's complement

$$
A=\left(a_{n-1} a_{n-2} \ldots a_{1} a_{0}\right)=-a_{n-1} 2^{n-1}+\sum_{i=0}^{n-2} a_{i} 2^{i}
$$

- biased (usually $B=2^{n-1}-1$ )

$$
A=A_{\text {math }}+B
$$

## Signed Integers

| integer | representations |  |  |
| :---: | :---: | :---: | :---: |
|  | sign/magnitude | 2's complement | biased <br> $(\mathrm{B}=7)$ |
|  | - | 1000 |  |
| -7 | 1111 | 1001 | 0000 |
| 6 | 1110 | 1010 | 0001 |
| -5 | 1101 | 1011 | 0010 |
| -4 | 1100 | 1100 | 0011 |
| -3 | 1011 | 1101 | 0100 |
| -2 | 1010 | 1110 | 0101 |
| -1 | 1001 | 1111 | 0110 |
|  | 0000 | 0000 | 0111 |
| 1 | 0001 | 0001 | 1000 |
| 2 | 0010 | 0010 | 1001 |
| 3 | 0011 | 0011 | 1010 |
| 4 | 0100 | 0100 | 1011 |
| 5 | 0101 | 0101 | 1100 |
| 6 | 0110 | 0110 | 1101 |
| 7 | 0111 | 0111 | 1110 |
| 8 |  |  |  |

## Fixed-Point Representations

Widely used in DSPs and digital integrated circuits for higher speed, lower silicon area and power consumption compared to floating point


Typical fixed-point formats: 16, 24, 32 and 48 bits

## Floating-Point Representation(s)

Radix- $\beta$ floating-point representation of $x$ :

- sign $s_{x}$, 1-bit encoding: $0 \Rightarrow x>0$ and $1 \Rightarrow x<0$
- exponent $e_{x} \in \mathbb{N}$ on $k$ digits and $e_{\min } \leq e_{x} \leq e_{\max }$
- mantissa $m_{x}$ on $n+1$ digits
- encoding:

$$
\begin{aligned}
x & =(-1)^{s_{x}} \times m_{x} \times \beta^{e_{x}} \\
m_{x} & =x_{0} \cdot x_{1} x_{2} x_{3} \cdots x_{n} \\
x_{i} & \in\{0,1, \ldots, \beta-1\}
\end{aligned}
$$

For accuracy purpose, the mantissa must be normalized $\left(x_{0} \neq 0\right)$
Then $m_{x} \in[1, \beta[$ and a specific encoding is required for the number 0

## IEEE-754: basic formats

Radix $\beta=2$, the first bit of the normalized mantissa is always a " 1 " (non-stored implicit bit)

| format | number of bits |  |  |  |
| :---: | :---: | :---: | :---: | :---: |
|  | total | sign | exponent | mantissa |
| double precision | 64 | 1 | 11 | $52+1$ |
| simple precision | 32 | 1 | 8 | $23+1$ |

double precision


IEEE-754: Exponent and Special Values

| format | $\begin{gathered} \text { size } \\ k \end{gathered}$ | bias <br> b |  | unbiased |  | biased |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  |  |  | $e_{\text {min }}$ | $e_{\text {max }}$ | $e_{\text {min }}$ | $e_{\text {max }}$ |
| SP | 8 | 127 | $\left(=2^{8-1}-1\right)$ | -126 | 127 | 1 | 254 |
| DP | 11 | 1023 | $\left(=2^{11-1}-1\right)$ | -1022 | 1023 | 1 | 2046 |


| -0 | 1 | 00000000 | 00000000000000000000000 |
| :---: | :--- | :--- | :--- | :--- |
| +0 | 0 | 0000000 | 00000000000000000000000 |
| $-\infty$ | 1 | 11111111 | 00000000000000000000000 |
| $+\infty$ | 0 | 11111111 | 00000000000000000000000 |
| NaN | 0 | 11111111 | 00000000000000000000001 (for instance) |

Not a Number ( NaN ) is the result of invalid operations such as $0 / 0, \sqrt{-1}$ or $0 \times \infty$

## Basic Cells for Addition

Useful circuit element in computer arithmetic: counter
A $(m, k)$-counter is a cell that counts the number of 1 on its $m$ inputs (result expressed as a $k$-bit integer)

$$
\sum_{i=0}^{m-1} a_{i}=\sum_{j=0}^{k-1} s_{j} 2^{j}
$$



Standard counters:

- half-adder or HA is a $(2,2)$-counter
- full-adder or FA is a $(3,2)$-counter


## FA Cell



Arithmetic equation:

$$
2 c+s=a+b+d
$$

Logic equation:

$$
\begin{aligned}
& s=a \oplus b \oplus d \\
& c=a b+a d+b d
\end{aligned}
$$

Articles about FA in IEEE Journals


There many implementations of the FA cell

## Carry Ripple Adder (CRA)

Very simple architecture: $n$ FA cells connected in series


|  | complexity |
| :---: | :---: |
| delay | $O(n)$ |
| area | $O(n)$ |

Warning: Sometimes a CRA is also called Carry Propagate Adder (CPA), but CPA also means a non-redundant adder (that propagates)

## Useless Activity in a Carry Ripple Adder



Very simple architecture:
$n$ FA cells connected in series


Theoretical models (equiprobable and uniform distribution of inputs):

- worst case $n^{2} / 2$ transitions
- average $3 n / 2$ transitions and only $n / 2$ useful


## Carry-Select Adder

Idea: computation of the higher half part for the 2 possible input carries ( 0 and 1) and selection when the output carry from lower half part is known


Recursive version $\longrightarrow O(\log n)$ delay but there is a fanout problem...

## Carry Lookahead Adder

Idea: compute all carries as fast as possible (instead of propagating them)
At rank $i$, the input carry $c_{i}$ is 1 in the following cases:

- rank $i-1$ generates a carry
$\hookrightarrow g_{i-1}=1$
- rank $i-1$ propagates a carry generated at rank $i-2$
$\hookrightarrow p_{i-1}=g_{i-2}=1$
- ranks $i-1$ and $i-2$ propagate a carry generated at rank $i-3$
$\hookrightarrow p_{i-1}=p_{i-2}=g_{i-3}=1$
- ranks $i-1$ to 0 propagate the adder input carry $c_{0}$ (set to 1 )
$\hookrightarrow p_{i-1}=p_{i-2}=\ldots=p_{1}=p_{0}=c_{0}=1$

All carries can be computed using the relation $\left(c_{i}=g_{i-1}+c_{i-1} p_{i-1}\right)$ :

$$
c_{i}=g_{i-1}+p_{i-1} g_{i-2}+p_{i-1} p_{i-2} g_{i-3}+\ldots+p_{i-1} \cdots p_{1} g_{0}+p_{i-1} \cdots p_{0} c_{0}
$$

CLA architecture: parallel evaluation of

- $\left(g_{i}, p_{i}\right)$ for all $i$
- carries $c_{i}$ for all $i$ using the above equation
- sums using $s_{i}=a_{i} \oplus b_{i} \oplus c_{i}=p_{i} \oplus c_{i}$


Carry Lookahead Adder: 4-Bit Example

$$
\begin{aligned}
& c_{1}=g_{0}+p_{0} c_{0} \\
& c_{2}=g_{1}+p_{1} g_{0}+p_{1} p_{0} c_{0} \\
& c_{3}=g_{2}+p_{2} g_{1}+p_{2} p_{1} g_{0}+p_{2} p_{1} p_{0} c_{0} \\
& c_{4}=g_{3}+p_{3} g_{2}+p_{3} p_{2} g_{1}+p_{3} p_{2} p_{1} g_{0}+p_{3} p_{2} p_{1} p_{0} c_{0}
\end{aligned}
$$



## Parallel-Prefix Problems

The $n$ outputs $\left(y_{n-1}, y_{n-2}, \cdots, y_{0}\right)$ are computed using the $n$ inputs $\left(x_{n-1}\right.$, $x_{n-2}, \cdots, x_{0}$ ) and the associative operator $\square$ :

$$
\begin{aligned}
y_{0} & =x_{0} \\
y_{1} & =x_{1} \square x_{0} \\
y_{2} & =x_{2} \square x_{1} \square x_{0} \\
& \vdots \\
y_{n-1} & =x_{n-1} \square x_{n-2} \square \cdots \square x_{1} \square x_{0}
\end{aligned}
$$



## Parallel-Prefix Addition: Standard Architectures




## Redundant or Constant Time Adders

To speed-up the addition, one solution consists in "saving" the carries and using them (this makes sense only in case of multiple additions)

In 1961, Avizienis suggested to represent numbers in radix $\beta$ with digits in $\{-\alpha,-\alpha+1, \ldots, 0, \ldots, \alpha-1, \alpha\}$ instead of $\{0,1,2, \ldots, \beta-1\}$ with $\alpha \leq \beta-1$

Using this representation, if $2 \alpha+1>\beta$ some numbers have several possible representation at the bit level. For instance, the value 2345 (in the standard representation) can be represented in radix 10 with digits in $\{-5,-4,-3,-2,-1,0,1,2,3,4,5\}$ by the values $2345,235(-5)$ or $24(-5)(-5)$

Such a representation is said redundant
In a redundant number system there is constant-time addition algorithm (without carry propagation) where all computations are done in parallel

## Addition

Q: How can we speed up addition?


## Addition

Q: How can we speed up addition?
R: Save the carries!


## Addition

Q: How can we speed up addition?
R: Save the carries!


The computation time does not depend on $n$

## Addition using the carry-save representation

Q: How can we speed up addition?
R: Save the carries!


The computation time does not depend on $n$

## Addition using the carry-save representation

Q: How can we speed up addition?
$R$ : Save the carries!


$$
\begin{aligned}
& X+Y+Z=S+R=\sum_{i=0}^{n}\left(s_{i}+r_{i}\right) 2^{i} \\
& =W=\sum_{i=0}^{n} w_{i} 2^{i} \text { avec } w_{i}=s_{i}+r_{i} \in\{0,1,2\} \\
& =\left(\begin{array}{l|l|l|l|l}
w_{n} w_{n-1} \ldots w_{1} w_{0}
\end{array}\right)_{\mathrm{cs}}=\left(\begin{array}{|c|c|c|c|c|c|c|c}
s_{n} & s_{n-1} & s_{1} & s_{0} \\
r_{n} & r_{n-1} \\
r_{1} & r_{0} \\
\hline
\end{array}\right)_{\mathrm{cs}}
\end{aligned}
$$

The computation time does not depend on $n$
$T(n)=O(1)$

Addition of 2 Carry-Save Numbers


## Carry-Save Trees

Example with 3 inputs: $A, B$ and $C$


Carry-save reduction tree: $n(h)$ non-redundant inputs can be reduced by a $h$-level carry-save tree where $n(h)=\lfloor 3 n(h-1) / 2\rfloor$ and $n(0)=2$

| $h$ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| $n(h)$ | 3 | 4 | 6 | 9 | 13 | 19 | 28 | 42 | 63 | 94 | 141 |

## Shift-And-Add Multiplication

The product $P=A \times B$ can be performed using additions and shifts with the following (parallel-serial) algorithm:

$$
\begin{aligned}
& P \longleftarrow 0 \\
& \text { for } i \text { from } 0 \text { to } n-1 \text { do } \\
& P \longleftarrow P+a_{i} B 2^{i}
\end{aligned}
$$

Remark: This algorithm requires a shifter operator (variable shift amount) Simplification (constant shift):

Operation on line 4 is virtual

Shift-And-Add Multiplication: Implementation


|  | complexity |
| :---: | :---: |
| delay | $O(n)$ |
| area | $O(n)$ |

## Fast Multipliers

1. partial products generation $a_{i} b_{j}$ (with or without recoding) $\hookrightarrow$ delay in $O(1)$ (fanout $a_{i}, b_{j}$ $O(\log n))$
2. sum of the partial products using a carry-save reduction tree $\hookrightarrow$ delay in $O(\log n)$
3. assimilation of the carries using a fast adder
$\hookrightarrow$ delay in $O(\log n)$


Multiplication delay $O(\log n)$, area $O\left(n^{2}\right)$

## Power Consumption in Fast Multipliers



- $30 \%$ to $70 \%$ of redundant transitions (useless)
- place and route steps based on the internal arrival time
- add a pipeline stage


## MAC and FMA

MAC: multiply and accumulate $P(t)=A \times B+P(t-1)$
$A, B$ are $n$-bit values and $P$ a $m$-bit with $m \gg n$ (e.g.,
$16 \times 16+40 \longrightarrow 40$ in some DSPs)
FMA: fused multiply and add $P=A \times B+C$ where $A, B, C$ and $P$ can be stored in different registers


## Squarer

## Multiplication by Constants (1/2)

Problem: substitute a complete multiplier by an optimized sequence of shifts and additions and/or subtractions
Example: $p=111463 \times x$

| algo. | $p=111463 \times x=$ | \#op. |
| ---: | :--- | :---: |
| direct | $(x \ll 16)+(x \ll 15)+(x \ll 13)+(x \ll 12)+(x \ll 9)$ | $10 \pm$ |
|  | $+(x \ll 8)+(x \ll 6)+(x \ll 5)+(x \ll 2)+(x \ll 1)+x$ |  |
| CSD | $(x \ll 17)-(x \ll 14)-(x \ll 12)+(x \ll 10)$ | $7 \pm$ |
|  | $-(x \ll 7)-(x \ll 5)+(x \ll 3)-x$ | $5 \pm$ |
| Bernstein | $\left(\left(\left(t_{2} \ll 2\right)+x\right) \ll 3\right)-x$ |  |
|  | where |  |
|  | $t_{1}=(((x \ll 3)-x) \ll 2)-x$ | $4 \pm$ |
|  | $t_{2}=t_{1} \ll 7+t_{1}$ |  |
| Our | $\left(t_{2} \ll 12\right)+\left(t_{2} \ll 5\right)+t_{1}$ |  |
|  | where |  |
|  | $t_{1}=(x \ll 3)-x$ | $t_{2}=\left(t_{1} \ll 2\right)-x$ |

CSD: canonical signed digit, $111463=11011001101100111_{2}=100 \overline{1} 0 \overline{1} 0100 \overline{1} 0 \overline{1} 0100 \overline{1}_{2}$

## Multiplication by Constants (2/2)

FIR (1,5,5,1)

Power savings: 30 up to $60 \%$

| operator | init. | $[1]$ | $[2]$ | our |
| :---: | :---: | :---: | :---: | :---: |
| DCT 8b | 300 | 94 | 73 | 56 |
| DCT 12b | 368 | 100 | 84 | 70 |
| DCT 16b | 521 | 129 | 114 | 89 |
| DCT 24b | 789 | 212 | - | 119 |

Power savings: 10\%

| operator | init. | $[1]$ | $[2]$ | our |
| :---: | :---: | :---: | :---: | :---: |
| $8 \times 8$ Had. | 56 | 24 | - | 24 |
| $(16,11)$ R.-M. | 61 | 43 | 31 | 31 |
| $(15,7)$ BCH | 72 | 48 | 47 | 44 |
| $(24,12,8)$ Golay | 76 | - | 47 | 45 |

Power savings: up to $40 \%$

| operator | init. | [22] | our |
| :---: | :---: | :---: | :---: |
| 8 bits | 35 | 32 | 24 |
| 16 bits | 72 | 70 | 46 |

Parks-McClellan filter



$$
x_{x[t]}^{\longrightarrow \times 4) \rightarrow+\square}
$$

$$
\stackrel{\mathrm{xlt]}}{\square}
$$

A

## Example: $\sqrt{x}$ over $[1,2]$ and $\mu \leq 8 \mathrm{sb}$

Selection of coefficients leading to sparse recodings
$p^{*}=1.00076383+0.48388463 x-0.071198745 x^{2}$
$p=1+(0.10000 \overline{1})_{2} x-(0.0001001)_{2} x^{2}$
replace $\times$ by a small number of $\pm$


| solution | area | period | \#cycles | latency | power |
| :---: | :---: | :---: | :---: | :---: | :---: |
| wo. tools | 1.00 | 1.00 | 2 | 1.00 | 1.00 |
| w. tools | 0.59 | 0.97 | 1 | 0.48 | 0.45 |

## Modular Exponentiation for RSA

Computation of operations such as : $a^{b} \bmod n$

$$
a^{b}=\underbrace{a \times a \times a \times a \times \ldots \times a \times a \times a}_{a \text { appears } b \text { times }}
$$

Order of magnitude of exponents: $2^{\text {size of exponent }} \rightsquigarrow 2^{2048} \ldots 2^{4096}$
Fast exponentiation principle:

$$
\begin{aligned}
a^{b} & =\left(a^{2}\right)^{\frac{b}{2}} & & \text { when } b \text { is even } \\
& =a \times\left(a^{2}\right)^{\frac{b-1}{2}} & & \text { when } b \text { is odd }
\end{aligned}
$$

Least significant bit of the exponent: bit $=0 \rightsquigarrow$ even and bit $=1 \rightsquigarrow$ odd

## Square and Multiply Algorithm

```
input: a, b, n where b= ( bt-1 bt-2 \ldots. b b b b ) < 
output: ab}\operatorname{mod}
r=1
for i from 0 to t-1 do
    if }\mp@subsup{b}{i}{}=1\mathrm{ then
        r=r\cdota mod}
    endif
    a= a}\mp@code{modn
    endfor
    return r
```

This is the right to left version (there exists a left to right one)

## Elliptic Curve Cryptography in 1 Slide. . .



Elliptic Curve Cryptography in 1 Slide. . .


Elliptic Curve Cryptography in 1 Slide. . .


Elliptic Curve Cryptography in 1 Slide. . .

## encryption



$$
E: y^{2}=x^{3}+4 x+20 \text { over } \mathbb{F}_{1009}
$$

$$
\text { points: } \mathbf{P}, \mathbf{Q}=(x, y) \text { or }(x, y, z) \text { or } \ldots
$$

$$
\text { coordinates: } x, y, z \in \mathbb{F}_{q}
$$

$$
\mathbb{F}_{p}, \mathbb{F}_{2^{m}, t}: 200-600 \text { bits }
$$

$$
k=\left(k_{t-1} k_{t-2} \ldots k_{1} k_{0}\right)_{2} \in \mathbb{N}
$$

Scalar multiplication operation

$$
\text { for } i \text { from } 0 \text { to } t-1 \text { do }
$$

$$
\text { if } k_{i}=1 \text { then } \mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q})
$$

$$
\mathbf{P}=\operatorname{DBL}(\mathbf{P})
$$

Elliptic Curve Cryptography in 1 Slide. . .


## encryption



$$
E: y^{2}=x^{3}+4 x+20 \text { over } \mathbb{F}_{1009}
$$

$$
\begin{aligned}
& \text { points: } \mathbf{P}, \mathbf{Q}=(x, y) \text { or }(x, y, z) \text { or } \ldots \\
& \text { coordinates: } x, y, z \in \mathbb{F}_{q} \\
& \mathbb{F}_{p}, \mathbb{F}_{2^{m}}, t: 200-600 \text { bits } \\
& k=\left(k_{t-1} k_{t-2} \ldots k_{1} k_{0}\right)_{2} \in \mathbb{N} \\
& \text { Scalar multiplication operation } \\
& \text { for i from } 0 \text { to } t-1 \text { do } \\
& \text { if } k_{i}=1 \text { then } \mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q}) \\
& \mathbf{P}=\operatorname{DBL}(\mathbf{P})
\end{aligned}
$$

## Point addition/doubling operations

sequence of finite field operations DBL: $v_{1}=z_{1}^{2}, v_{2}=x_{1}-v_{1}, \ldots$
ADD: $w_{1}=z_{1}^{2}, w_{2}=z_{1} \times w_{1}$,

Elliptic Curve Cryptography in 1 Slide. . .

## protocol level

## encryption



$$
E: y^{2}=x^{3}+4 x+20 \text { over } \mathbb{F}_{1009}
$$

$$
\text { points: } \mathbf{P}, \mathbf{Q}=(x, y) \text { or }(x, y, z) \text { or } \ldots
$$

coordinates: $x, y, z \in \mathbb{F}_{q}$

$$
\begin{aligned}
& \mathbb{F}_{p}, \mathbb{F}_{2^{m}, t}: 200-600 \text { bits } \\
& k=\left(k_{t-1} k_{t-2} \ldots k_{1} k_{0}\right)_{2} \in \mathbb{N}
\end{aligned}
$$

$$
\begin{aligned}
& \text { Scalar multiplication operation } \\
& \hline \text { for i from } 0 \text { to } t-1 \mathrm{do} \\
& \text { if } k_{i}=1 \text { then } \mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q}) \\
& \mathbf{P}=\operatorname{DBL}(\mathbf{P}) \\
& \hline
\end{aligned}
$$

## Point addition/doubling operations

sequence of finite field operations
DBL: $v_{1}=z_{1}^{2}, v_{2}=x_{1}-v_{1}, \ldots$
ADD: $w_{1}=z_{1}^{2}, w_{2}=z_{1} \times w_{1}$,

## $\mathbb{F}_{p}$ or $\mathbb{F}_{2^{m}}$ operations <br> operation modulo large prime $\left(\mathbb{F}_{p}\right)$ or irreducible polynomial $\left(\mathbb{F}_{2^{m}}\right)$

## Introduction to Physical Attacks

## Attacks



## Attacks



## Attacks


$\mathrm{EMR}=$ Electromagnetic radiation

## Attacks


$\mathrm{EMR}=$ Electromagnetic radiation

## Attacks


$\mathrm{EMR}=$ Electromagnetic radiation

## Side Channel Attacks (SCAs) (1/2)

Attack: attempt to find, without any knowledge about the secret:

- the message (or parts of the message)
- informations on the message
- the secret (or parts of the secret)


## Side Channel Attacks (SCAs) (1/2)

Attack: attempt to find, without any knowledge about the secret:

- the message (or parts of the message)
- informations on the message
- the secret (or parts of the secret)
"Old style" side channel attacks:



## Side Channel Attacks (2/2)



General principle: measure external parameter(s) on running device in order to deduce internal informations

## Side Channel Attacks (2/2)



General principle: measure external parameter(s) on running device in order to deduce internal informations

## What Should be Measured?

Answer: everything that can "enter" and/or "get out" in/from the device

- power consumption
- electromagnetic radiation
- temperature
- sound
- computation time
- number of cache misses
- number and type of error messages

The measured parameters may provide informations on:

- global behavior (temperature, power, sound...)
- local behavior (EMR, \# cache misses...)


## Power Consumption Analysis

## General principle:

1. measure the current $i(t)$ in the cryptosystem
2. use those measurements to "deduce" secret informations



Source: [8] Kocher, Jaffe and Jun. Differential Power Analysis, Crypto99


- algorithm $\longrightarrow$ decomposition into steps
- detect loops
- constant time for the loop iterations
- non-constant time for the loop iterations

Source: [8] Kocher, Jaffe and Jun. Differential Power Analysis, Crypto99

## Differences \& External Signature

An algorithm

$$
\begin{aligned}
& r=c_{0} \\
& \text { for } i \text { from } 1 \text { to } n \text { do } \\
& \text { if } a_{i}=0 \text { then } \\
& r=r+c_{1} \\
& \text { else } \\
& \quad r=r \times c_{2}
\end{aligned}
$$

## Differences \& External Signature

An algorithm has a current signature

$$
\begin{aligned}
& r=c_{0} \\
& \text { for } i \text { from } 1 \text { to } n \text { do } \\
& \text { if } a_{i}=0 \text { then } \\
& r=r+c_{1} \\
& \text { else } \\
& \quad r=r \times c_{2}
\end{aligned}
$$



## Differences \& External Signature

An algorithm has a current signature and a time signature:

$$
\begin{aligned}
& r=c_{0} \\
& \text { for } i \text { from } 1 \text { to } n \text { do } \\
& \text { if } a_{i}=0 \text { then } \\
& \quad r=r+c_{1} \\
& \text { else } \\
& \quad r=r \times c_{2}
\end{aligned}
$$



## Simple Power Analysis (SPA)



Source: [8]

## Simple Power Analysis (SPA)



Source: [8]

## SPA in Practice

## General principle:

```
algorithm
```


## SPA in Practice

General principle:


## SPA in Practice

General principle:


## SPA in Practice

General principle:


## SPA in Practice

## General principle:



Methods: interpretation of the differences in

- control signals
- computation time
- operand values


## Limits of the SPA

Example of behavior difference: (activity into a register)


## Limits of the SPA

Example of behavior difference: (activity into a register)


## Limits of the SPA

Example of behavior difference: (activity into a register)


Important: a small difference may be evaluated has a noise during the measurement $\rightarrow$ traces cannot be distinguished

Question: what can be done when differences are too small?

## Limits of the SPA

Example of behavior difference: (activity into a register)


Important: a small difference may be evaluated has a noise during the measurement $\rightarrow$ traces cannot be distinguished

Question: what can be done when differences are too small?

Answer: use statistics over several traces

## Differential Power Analysis (DPA)

```
cryptosystem
```


## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



## Differential Power Analysis (DPA)



Differential Power Analysis (DPA) Example


## Template Attack

```
cryptosystem
```





Template Attack


Template Attack


Template Attack


## Electromagnetic Radiation Analysis (1/2)

General principle: use a probe to measure the EMR


EMR measurement:

## Electromagnetic Radiation Analysis (1/2)

General principle: use a probe to measure the EMR


## EMR measurement:

- global EMR with a large probe


## Electromagnetic Radiation Analysis (1/2)

General principle: use a probe to measure the EMR


## EMR measurement:

- global EMR with a large probe
- local EMR with a micro-probe


## Electromagnetic Radiation Analysis (2/2)

EMR analysis methods:

- simple electromagnetic analysis: SEMA
- differential electromagnetic analysis: DEMA

Local EMR analysis may be used to determine internal architecture details, and then select weak parts of the circuit for the attack
$\rightarrow$ X-Y table


## Side Channel Attack on ECC



Side Channel Attack on ECC



| Scalar multiplication operation |
| :--- |
| for $i$ from 0 to $t-1$ do |
| if $k_{i}=1$ then $\mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q})$ |
| $\mathbf{P}=\operatorname{DBL}(\mathbf{P})$ |

Side Channel Attack on ECC



| Scalar multiplication operation |
| :--- |
| for $i$ from 0 to $t-1$ do |
| if $k_{i}=1$ then $\mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q})$ |
| $\mathbf{P}=\operatorname{DBL}(\mathbf{P})$ |

Side Channel Attack on ECC

\section*{| 응 |
| :--- |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |}

encryption



| Scalar multiplication operation |
| :--- |
| for $i$ from 0 to $t-1$ do |
| if $k_{i}=1$ then $\mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q})$ |
| $\mathbf{P}=\operatorname{DBL}(\mathbf{P})$ |

Side Channel Attack on ECC

\section*{| 잉 |
| :--- |
| 0 |
| 0 |
| 0 |
| 0 |
| 0 |}

encryption



- simple power analysis (\& variants)


## Side Channel Attack on ECC

## protocol level

encryption



$$
\begin{array}{|l}
\hline \text { Scalar multiplication operation } \\
\hline \text { for } i \text { from } 0 \text { to } t-1 \text { do } \\
\text { if } k_{i}=1 \text { then } \mathbf{Q}=\operatorname{ADD}(\mathbf{P}, \mathbf{Q}) \\
\mathbf{P}=\operatorname{DBL}(\mathbf{P})
\end{array}
$$

- simple power analysis (\& variants)
- differential power analysis (\& variants)
- horizontal/vertical/templates/... attacks

Protections at the Arithmetic Level

## Countermeasure

## Principles for preventing attacks:

- embed additional protection blocks
- modify the original circuit into a secured version
- application levels: circuit, architecture, algorithm, protocol...


## Countermeasure

## Principles for preventing attacks:

- embed additional protection blocks
- modify the original circuit into a secured version
- application levels: circuit, architecture, algorithm, protocol...


## Countermeasures:

- electrical shielding
- detectors, estimators, decoupling
- use uniform computation durations and power consumption
- use detection/correction codes (for fault injection attacks)
- provide a random behavior (algorithms, representation, operations...)
- add noise (e.g. masking, useless instructions/computations)
- circuit reconfiguration (algorithms, block location, representation of values...)


## Low-Level Coding and Circuit Activity

## Assumptions:

- $b$ is a bit (i.e. $b \in\{0,1\}$, logical or mathematical value)
- electrical states for a wire $\quad$ : $V_{D D}$ (logical 1) or GND (logical 0)


## Low-Level Coding and Circuit Activity

## Assumptions:

- $b$ is a bit (i.e. $b \in\{0,1\}$, logical or mathematical value)
- electrical states for a wire $: V_{D D}($ logical 1$)$ or GND (logical 0)

Low-level codings of a bit:

|  | $b=0$ | $b=1$ |
| :--- | :--- | :--- |
| standard | GND | $-V_{D D}$ |

## Low-Level Coding and Circuit Activity

## Assumptions:

- $b$ is a bit (i.e. $b \in\{0,1\}$, logical or mathematical value)
- electrical states for a wire $\quad V_{D D}($ logical 1$)$ or GND (logical 0$)$

Low-level codings of a bit:

|  | $b=0$ | $b=1$ |
| :--- | :--- | :--- |
| standard | $G G D$ | $V_{D D}$ |
| dual rail | $\left.\begin{array}{r}r_{0}=V_{D D} \\ r_{1}=G N D\end{array}\right](1,0)_{\mathrm{DR}}$ | $\left.\begin{array}{r}r_{0}=\mathrm{GND} \\ r_{1}=V_{D D}\end{array}\right](0,1)_{\mathrm{DR}}$ |

## Low-Level Coding and Circuit Activity

## Assumptions:

- $b$ is a bit (i.e. $b \in\{0,1\}$, logical or mathematical value)
- electrical states for a wire $-V_{D D}($ logical 1$)$ or GND (logical 0)

Low-level codings of a bit:

|  | $b=0$ | $b=1$ |
| :--- | :--- | :--- |
| standard | $G \mathrm{GND}$ | $V_{D D}$ |
| dual rail | $\left.\begin{array}{r}r_{0}=V_{D D} \\ r_{1}=\mathrm{GND}\end{array}\right](1,0)_{\mathrm{DR}}$ | $\left.\begin{array}{r}r_{0}=\mathrm{GND} \\ r_{1}=V_{D D}\end{array}\right](0,1)_{\mathrm{DR}}$ |



## Low-Level Coding and Circuit Activity

## Assumptions:

- $b$ is a bit (i.e. $b \in\{0,1\}$, logical or mathematical value)
- electrical states for a wire $-V_{D D}($ logical 1$)$ or GND (logical 0)

Low-level codings of a bit:

|  | $b=0$ | $b=1$ |
| :---: | :---: | :---: |
| standard | - GND | $=V_{D D}$ |
| dual rail | $\left[\begin{array}{l} r_{0}=V_{D D} \\ r_{1}=G N D \end{array}\right](1,0)_{\mathrm{DR}}$ | $\left[\begin{array}{l} r_{0}=\mathrm{GND} \\ r_{1}=V_{D D} \end{array}\right](0,1)_{\mathrm{DR}}$ |



## Circuit Logic Style

Countermeasure principles: uniformize circuit activity and exclusive coding

## Circuit Logic Style

Countermeasure principles: uniformize circuit activity and exclusive coding

## Solution based on precharge logic and dual-rail coding:



## Circuit Logic Style

Countermeasure principles: uniformize circuit activity and exclusive coding

Solution based on precharge logic and dual-rail coding:


Solution based on validity line and dual-rail coding:


Important overhead: silicon area and local storage (registers)

## Countermeasure: Architecture

## Increase internal parallelism:

- replace one fast but big operator
- by several instances of a small but slow one



## Protected Multipliers

Unprotected


## Protected Multipliers

Unprotected

Protected
Overhead:
Area/time < 10 \%
References:
PhD D. Pamula [9]
Articles: [12], [11], [10]



## Protected (Old) Accelerator



Warning: old dedicated accelerator (similar behavior is expected for our new one)

## Circuit-Level Protections for Arithmetic Operators



References: [5] and [6]

## Arithmetic Level Countermeasures

## Redundant number system $=$

- a way to improve the performance of some operations
- a way to represent a value with different representations


Important property: $\forall i \quad\left[R_{i}(k)\right] \mathbf{P}=[k] \mathbf{P}$

Proposed solution: use random redundant representations of $k$

## Double-Base Number System

Standard radix-2 representation:

$$
k=\sum_{i=0}^{t-1} k_{i} 2^{i}=\begin{array}{|l|l|l|l|l|l|}
\hline k_{t-1} & k_{t-2} & \cdots & k_{2} & k_{1} & k_{0} \\
\hline
\end{array}
$$

## Double-Base Number System

Standard radix-2 representation:

$$
k=\sum_{i=0}^{t-1} k_{i} 2^{i}=\begin{array}{|l|l|l|l|l|l|l|}
\hline 2^{t-1} & 2^{t-2} & \cdots & 2^{2} & 2^{1} & 2^{0} & \text { implicit weights } \\
k_{t-1} & k_{t-2} & \cdots & k_{2} & k_{1} & k_{0} \\
\hline
\end{array}
$$

Digits: $k_{i} \in\{0,1\}$, typical size: $t \in\{160, \ldots, 600\}$

## Double-Base Number System

Standard radix-2 representation:

$$
k=\sum_{i=0}^{t-1} k_{i} 2^{i}=\begin{array}{|l|l|l|l|l|l|l|}
2^{t-1} & 2^{t-2} & \cdots & 2^{2} & 2^{1} & 2^{0} & \text { implicit weights } \\
k_{t-1} & k_{t-2} & \cdots & k_{2} & k_{1} & k_{0} \\
\hline
\end{array}
$$

Digits: $k_{i} \in\{0,1\}, \quad$ typical size: $t \in\{160, \ldots, 600\}$

Double-Base Number System (DBNS):

$$
k=\sum_{j=0}^{n-1} k_{j} 2^{a_{j}} 3^{b_{j}}=
$$

## Double-Base Number System

Standard radix-2 representation:

$$
k=\sum_{i=0}^{t-1} k_{i} 2^{i}=\begin{array}{|l|l|l|l|l|l|l|}
\hline 2^{t-1} & 2^{t-2} & \cdots & 2^{2} & 2^{1} & 2^{0} & \text { implicit weights } \\
\hline k_{t-1} & k_{t-2} & \cdots & k_{2} & k_{1} & k_{0} \\
\hline
\end{array}
$$

Digits: $k_{i} \in\{0,1\}, \quad$ typical size: $t \in\{160, \ldots, 600\}$

Double-Base Number System (DBNS):

$$
k=\sum_{j=0}^{n-1} k_{j} 2^{a_{j}} 3^{b_{j}}=\begin{array}{|c|c|c|c|}
\hline k_{n-1} & \cdots & k_{1} & k_{0} \\
a_{n-1} & \cdots & n(2,3)-\text { terms } \\
a_{1} & a_{0} & \begin{array}{l}
n \\
\text { explicit "digits" } \\
b_{n-1}
\end{array} & \cdots
\end{array} b_{1} \quad b_{0} \begin{aligned}
& \text { explicit ranks }
\end{aligned}
$$

$a_{j}, b_{j} \in \mathbb{N}, \quad k_{j} \in\{1\}$ or $k_{j} \in\{-1,1\}, \quad$ size $n \approx \log t$

## Double-Base Number System

Standard radix-2 representation:

$$
k=\sum_{i=0}^{t-1} k_{i} 2^{i}=\begin{array}{|l|l|l|l|l|l|l|}
2^{t-1} & 2^{t-2} & \cdots & 2^{2} & 2^{1} & 2^{0} & \text { implicit weights } \\
k_{t-1} & k_{t-2} & \cdots & k_{2} & k_{1} & k_{0} \\
\hline
\end{array}
$$

Digits: $k_{i} \in\{0,1\}$, typical size: $t \in\{160, \ldots, 600\}$

Double-Base Number System (DBNS):

$$
k=\sum_{j=0}^{n-1} k_{j} 2^{a_{j}} 3^{b_{j}}=\begin{array}{|c|c|c|c|}
\hline k_{n-1} & \cdots & k_{1} & k_{0} \\
a_{n-1} & \cdots & n(2,3)-\text { terms } \\
a_{1} & a_{0} & \begin{array}{l}
n \\
\text { explicit "digits" } \\
b_{n-1}
\end{array} & \cdots
\end{array} b_{1} \quad b_{0} \begin{aligned}
& \text { explicit ranks }
\end{aligned}
$$

$a_{j}, b_{j} \in \mathbb{N}, \quad k_{j} \in\{1\}$ or $k_{j} \in\{-1,1\}, \quad$ size $n \approx \log t$
DBNS is a very redundant and sparse representation: $\quad 1701=(11010100101)_{2}$

$$
\begin{aligned}
1701 & =243+1458=2^{0} 3^{5}+2^{1} 3^{6}=(1,0,5),(1,1,6) \\
& =1728-27=2^{6} 3^{3}-2^{0} 3^{3}=(1,6,3),(-1,0,3) \\
& =729+972=2^{0} 3^{6}+2^{2} 3^{5}=(1,0,6),(1,2,5)
\end{aligned}
$$

Randomized DBNS Recoding of the Scalar $k$

## encryption

$\mid$ signature

On-the-fly DBNS random recoding for the scalar $k$
randomly recode windows of the scalar $k$ on-the-fly:
$1+2 \leftrightarrows 3 \quad 1+3 \leftrightarrows 2^{2} \quad 1+2^{3} \leftrightarrows 3^{2}$
control number of reductions $(\leftarrow)$ and expansions $(\rightarrow)$

possible rules recoding rules


| Point tripling operation |
| :--- |
| $\mathbf{Q}=\operatorname{TPL}(\mathbf{P})=\mathbf{P}+\mathbf{P}+\mathbf{P}$ |

Randomized DBNS Recoding of the Scalar $k$

## encryption



On-the-fly DBNS random recoding for the scalar $k$
randomly recode windows of the scalar $k$ on-the-fly:
$1+2 \leftrightarrows 3 \quad 1+3 \leftrightarrows 2^{2} \quad 1+2^{3} \leftrightarrows 3^{2}$
control number of reductions $(\leftarrow)$ and expansions $(\rightarrow)$

possible rules recoding rules
random choice $\cdots \cdots \cdot \gg \cdots$ recoded $k_{i}\left(, k_{i+1}\right)$

| Point tripling operation |
| :--- |
| $\mathbf{Q}=\operatorname{TPL}(\mathbf{P})=\mathbf{P}+\mathbf{P}+\mathbf{P}$ |

Randomized DBNS Recoding of the Scalar $k$

## encryption



On-the-fly DBNS random recoding for the scalar $k$
randomly recode windows of the scalar $k$ on-the-fly:
$1+2 \leftrightarrows 3 \quad 1+3 \leftrightarrows 2^{2} \quad 1+2^{3} \leftrightarrows 3^{2}$
control number of reductions $(\leftarrow)$ and expansions $(\rightarrow)$

possible rules recoding rules
random choice $\cdots \cdots \cdot \ggg>$ recoded $k_{i}\left(, k_{i+1}\right)$

| Point tripling operation |
| :--- |
| $\mathbf{Q}=\operatorname{TPL}(\mathbf{P})=\mathbf{P}+\mathbf{P}+\mathbf{P}$ |

DBNS is redundant $\Rightarrow$ security DBNS is sparse $\Rightarrow 20-30 \%$ speed $\nearrow$ Ref: [3] Chabrier, Pamula \& Tisserand. Asilomar 2009

## References

## References I

Surveys: Proc. IEEE 2006 [1], Proc. IEEE 2012 [2], IEEE TVLSI 2013 [7]
[1] H. Bar-EI, H. Choukri, D. Naccache, M. Tunstall, and C. Whelan.
The sorcerer's apprentice guide to fault attacks.
Proceedings of the IEEE, 94(2):370-382, February 2006.
[2] A. Barenghi, L. Breveglieri, I. Koren, and D. Naccache.
Fault injection attacks on cryptographic devices: Theory, practice, and countermeasures.
Proceedings of the IEEE, 100(11):3056-3076, November 2012.
[3] T. Chabrier, D. Pamula, and A. Tisserand.
Hardware implementation of DBNS recoding for ECC processor.
In Proc. 44rd Asilomar Conference on Signals, Systems and Computers, pages 1129-1133, Pacific Grove, California, U.S.A., November 2010. IEEE.
[4] T. Chabrier and A. Tisserand.
On-the-fly multi-base recoding for ECC scalar multiplication without pre-computations.
In A. Nannarelli, P.-M. Seidel, and P. T. P. Tang, editors, Proc. 21st Symposium on Computer Arithmetic (ARITH), pages 219-228, Austin, TX, U.S.A, April 2013. IEEE Computer Society.
[5] J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana. Robust sub-powered asynchronous logic.
In J. Becker and M. R. Adrover, editors, Proc. 24th International Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), pages 1-7, Palma de Mallorca, Spain, September 2014. IEEE.
[6] J. Chen, A. Tisserand, E. M. Popovici, and S. Cotofana.
Asynchronous charge sharing power consistent Montgomery multiplier.
In J. Sparso and E Yahya, editors, Proc. 21st IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC), pages 132-138, Mountain View, California, USA, May 2015.
[7] D. Karaklajic, J.-M. Schmidt, and I. Verbauwhede.
Hardware designer's guide to fault attacks.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21(12):2295-2306, December 2013.

## References II

[8] P. C. Kocher, J. Jaffe, and B. Jun.
Differential power analysis.
In Proc. Advances in Cryptology (CRYPTO), volume 1666 of LNCS, pages 388-397. Springer, August 1999.
[9] D. Pamula.
Arithmetic Operators on GF $\left(2^{m}\right)$ for Cryptographic Applications: Performance - Power Consumption - Security Tradeoffs.
Phd thesis, University of Rennes 1 and Silesian University of Technology, December 2012.
[10] D. Pamula, E. Hrynkiewicz, and A. Tisserand.
Analysis of $\mathrm{GF}\left(2^{233}\right)$ multipliers regarding elliptic curve cryptosystem applications.
In 11th IFAC/IEEE International Conference on Programmable Devices and Embedded Systems (PDeS), pages 271-276, Brno, Czech Republic, May 2012.
[11] D. Pamula and A. Tisserand.
$\mathrm{GF}\left(2^{m}\right)$ finite-field multipliers with reduced activity variations.
In 4th International Workshop on the Arithmetic of Finite Fields, volume 7369 of LNCS, pages 152-167, Bochum, Germany, July 2012. Springer.
[12] D. Pamula and A. Tisserand.
Fast and secure finite field multipliers.
In Proc. 18th Euromicro Conference on Digital System Design (DSD), pages 653-660, Madeira, Portugal, August 2015.
[13] J. Proy, N. Veyrat-Charvillon, A. Tisserand, and N. Meloni.
Full hardware implementation of short addition chains recoding for ECC scalar multiplication.
In Actes Conférence d'informatique en Parallélisme, Architecture et Système (ComPAS), Lille, France, June 2015.

## Good Books (in French)

Micro et nano-électronique
Bases, Composants, Circuits
Hervé Fanet
2006
Dunod
ISBN: 2-10-049141-5


Arithmétique des ordinateurs Jean-Michel Muller
1989
Masson

ISBN: 2-225-81689-1
(web version)

## CMOS VLSI Design

A Circuits and Systems Perspective
Neil Weste and David Harris
3rd edition, 2004
Addison Wesley
ISBN: 0-321-14901-7


## Power Analysis Attacks

Revealing the Secrets of Smart Cards Stefan Mangard, Elisabeth Oswald and Thomas Popp 2007
Springer
ISBN:978-0-387-30857-9

## Good Books (in English)

## Digital Arithmetic

Milos Ercegovac and Tomas Lang 2003
Morgan Kaufmann
ISBN: 1-55860-798-6

Digital Arithmetic


## Thank you!

Contact (will change in a few months):

- mailto:arnaud.tisserand@univ-ubs.fr
- http://www-labsticc.univ-ubs.fr/~tisseran

