"For Internal E3S Use Only. These Slides May Contain Prepublication Data and/or Confidential Information."

Exceptional service in the national interest













### Designing an Analog Crossbar based Neuromorphic Accelerator

Sapan Agarwal, Alexander Hsia, Robin Jacobs-Gedrim, David R. Hughart, Steven J. Plimpton, Conrad D. James, Matthew J. Marinella

Sandia National Laboratories







Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

### The Von Neumann Bottleneck





orders of magnitude more energy!

Optical interconnects 100 fJ to 1 pJ

# Use Resistive Memories for Local Computation





$$I = G \times V$$
 multiplication

- A resistive memory or ReRAM is a programmable resistor
  - Applying a small voltage allows the conductance to be read: I = G × V
  - Applying a large voltage changes the resistance



## Directly Process in the Memory Itself





Analog computation efficiently and naturally combines computation and data access

Effectively, large-scale processing in memory with a multiplier and adder at each real-valued memory location
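As a rough illustration of the "multiplier and adder at each memory location" idea above, the sketch below models an ideal crossbar read with NumPy. Each device multiplies its row voltage by its conductance (Ohm's law), and each column wire sums the resulting currents (Kirchhoff's current law). The function name `crossbar_vmm` and the conductance and voltage values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def crossbar_vmm(voltages, conductances):
    """Column currents of an ideal crossbar: I_j = sum_i V_i * G_ij."""
    voltages = np.asarray(voltages, dtype=float)
    conductances = np.asarray(conductances, dtype=float)
    # Ohm's law gives each device current V_i * G_ij;
    # the column wire adds the currents of all devices attached to it.
    return voltages @ conductances

# Illustrative values only: a 3-row by 2-column array of conductances (siemens)
G = np.array([[1e-6, 2e-6],
              [3e-6, 4e-6],
              [5e-6, 6e-6]])
V = np.array([0.1, 0.2, 0.3])  # read voltages applied to the rows (volts)
I = crossbar_vmm(V, G)         # one current per column (amperes)
```

All rows are driven at once, so the whole vector matrix product comes out of a single parallel read in this idealized picture.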

# Crossbars Can Perform Parallel Reads and Writes





Energy to charge the crossbar is $CV^2$:

$$E \propto C \propto \text{number of RRAMs} \propto N \times M$$

$$E \sim O(N \times M)$$

# SRAM Arrays Require Charging Columns Multiple Times





M columns

SRAMs must be read one row at a time, charging M columns on each read. Each column wire length is O(N).

Energy = N rows × M columns × O(N) wire length

Energy ~ O(N<sup>2</sup>×M), which is O(N) times worse than a crossbar!
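The scaling argument above can be checked with a few lines of arithmetic. This is a minimal sketch that only counts proportionality factors: `e_cell` is a hypothetical normalized energy unit, and the constant prefactors of real arrays are ignored.

```python
def crossbar_energy(n_rows, n_cols, e_cell=1.0):
    # One parallel operation charges every device once: E ~ N * M
    return e_cell * n_rows * n_cols

def sram_energy(n_rows, n_cols, e_cell=1.0):
    # N sequential row reads, each charging M column wires of length ~N:
    # E ~ N * M * N
    return e_cell * n_rows * n_cols * n_rows

# Ratio for a 1024 x 1024 array (the size used later in the deck)
ratio = sram_energy(1024, 1024) / crossbar_energy(1024, 1024)
```

Under these assumptions the ratio equals the number of rows, which is the O(N) gap stated above.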

# Want To Accelerate Many Different **Neural Algorithms**



**Backpropagation** 

**Sparse** Coding

### **Liquid State Machine**








## General Purpose Neural Architecture



#### Neuromorphic core:

- Evaluate vector matrix multiplies along rows or columns
- Train based on input vectors

#### **Digital Core:**

- Process neural core inputs/outputs
- For NxN crossbar, the crossbar accelerates
   O(N<sup>2</sup>) operations leaving only O(N) operations
   for the digital core

# Can Run Neural Networks on this Architecture





## **Back Propagation**








#### **Vector Matrix Multiply**



#### Matrix Vector Multiply



#### Outer product Update





#### Comparator



## Row & Column Driver Circuitry



#### **Row Driver Logic**



Voltage level shifter (drive high V transistor with low V)



#### Column Driver Logic



#### Array driver pass transistors



## **Compare Architectures**



 $1024 \times 1024 = 1\text{M}$  array operations, summed over one training cycle of three operations:

- Vector Matrix Multiply
- Matrix Vector Multiply
- Outer Product Update
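To make the three kernels concrete, here is a minimal NumPy sketch of one training cycle on a single layer. It assumes a linear layer with a simple delta-rule update; the layer size, learning rate `lr`, and target are illustrative assumptions, and a real network would add activation functions in the digital core.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.uniform(-0.1, 0.1, size=(4, 3))  # weights stored on the crossbar
x = rng.uniform(-1.0, 1.0, size=4)       # input vector applied to the rows
target = np.array([1.0, 0.0, 0.0])       # illustrative training target
lr = 0.1                                 # illustrative learning rate

y = x @ W                           # 1. vector matrix multiply (forward pass)
err = target - y                    # output error, computed digitally
back = W @ err                      # 2. matrix vector multiply (error sent backward)
W_new = W + lr * np.outer(x, err)   # 3. outer product update of every weight
```

Each numbered line touches all N × M weights, which is why these three operations dominate the cost that the crossbar is meant to absorb.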



Used a commercial 14/16 nm PDK

\*\*\*Requires 100 MΩ on-state devices

## Neural Core Energy Analysis





# Multiscale Model of a Neural Training Accelerator





### **CrossSim**

#### https://cross-sim.sandia.gov



#### Simple Python API:

result = neural\_core.run\_xbar\_mvm(vector) # Do a matrix vector multiplication



Learning Algorithm



Neural Core Simulator

- Xyce crossbar circuit model: detailed but slow
- Physical hardware crossbar
- Numeric crossbar simulator: fast but approximate

#### Measured **Devices**



#### **Algorithmic Performance**



## Simple API to model crossbars



neural\_core = MakeCore(params=params) # Create a neural\_core object that models a crossbar

neural\_core.set\_matrix(weights) # set the initial weights
result = neural\_core.run\_xbar\_vmm(vector) # Do a vector matrix multiply
result = neural\_core.run\_xbar\_mvm(vector) # Do the transpose, a matrix vector mult.
neural\_core.update\_matrix(vector1,vector2) # Do an outer product update

All crossbar details are transparent to the user
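For readers without CrossSim installed, the sketch below is a hypothetical, noise-free stand-in that mirrors the four method names shown above. It is not the CrossSim implementation: the class name `IdealCore` is invented here, `MakeCore` and `params` are not reproduced, and the sign and scaling of `update_matrix` are assumptions.

```python
import numpy as np

class IdealCore:
    """Hypothetical ideal stand-in for a neural core (not CrossSim itself)."""

    def set_matrix(self, weights):
        # Store a private floating-point copy of the initial weights
        self.matrix = np.array(weights, dtype=float)

    def run_xbar_vmm(self, vector):
        # Vector matrix multiply: drive the rows, read the columns
        return np.asarray(vector, dtype=float) @ self.matrix

    def run_xbar_mvm(self, vector):
        # Transpose operation: drive the columns, read the rows
        return self.matrix @ np.asarray(vector, dtype=float)

    def update_matrix(self, vector1, vector2):
        # Assumed convention: add the outer product to the stored weights
        self.matrix += np.outer(vector1, vector2)

core = IdealCore()
core.set_matrix([[1.0, 2.0], [3.0, 4.0]])
vmm = core.run_xbar_vmm([1.0, 0.0])   # selects row 0
mvm = core.run_xbar_mvm([1.0, 0.0])   # selects column 0
core.update_matrix([1.0, 0.0], [0.0, 1.0])
```

A stand-in like this can be useful for checking an algorithm's math before switching to a simulator that adds device noise and nonlinearity.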

### Go from Measurement to Accuracy





## Multi-ReRAM Synapse: Periodic Carry

If we need more bits per synapse, use multiple memristors

- Three 10-level ReRAMs could represent 1,000 values (0 to 999)!
- Adding to the weight requires reading every ReRAM to account for any carries and serially programming each ReRAM: VERY EXPENSIVE



- Use >10 levels to represent a base 10 system
- Ignore carry and program the crossbar in parallel.
- Periodically (once every few hundred cycles) read the ReRAM and perform the carry
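The periodic carry scheme described above can be sketched in a few lines. This is a simplified model under stated assumptions: devices hold integer levels, only the least significant device receives updates, and the carry period of 5 is chosen to keep the example short (the slides suggest once every few hundred cycles). The helper names are hypothetical.

```python
def weight_value(digits, base=10):
    # Total weight represented by the devices, least significant first
    return sum(d * base**k for k, d in enumerate(digits))

def carry(digits, base=10):
    # Read each device, move any overflow into the next more significant device
    out = list(digits)
    for k in range(len(out) - 1):
        c, out[k] = divmod(out[k], base)
        out[k + 1] += c
    return out

digits = [0, 0, 0]    # three devices, least significant first
carry_period = 5      # illustrative; much longer in practice
for step in range(1, 38):
    digits[0] += 1    # blind update: no read, no carry
    if step % carry_period == 0:
        digits = carry(digits)
```

In this run the least significant device reaches level 10 before a carry clears it, which is why each device needs more than 10 levels of headroom to represent a base 10 digit.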



## Periodic Carry Compensates for Write Noise





Read and reset every 100 pulses

Do 300,000 small (0.02% of weight range) updates, with a net of 1,500 positive training pulses

Noise Sigma = 1.4% for single device

- (from  $\sigma_{noise}/G_{range} = 0.1\sqrt{\Delta G/G_{range}}$ )
- Write noise applied during updates and carries

Learn from a 0.5% Signal

## Periodic Carry Mitigates Write Nonlinearity



### Alternating Pulses Cause Weight Decay



Use center linear range of weights
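The decay effect can be reproduced with a toy device model. The sketch below assumes a saturating update rule, where a positive pulse moves the normalized conductance a fraction `s` of the way toward 1 and a negative pulse moves it a fraction `s` of the way toward 0. This rule and the value of `s` are assumptions for illustration, not the measured device behavior.

```python
def pulse_up(g, s=0.1):
    # Assumed saturating model: step size shrinks as g approaches 1
    return g + s * (1.0 - g)

def pulse_down(g, s=0.1):
    # Assumed saturating model: step size shrinks as g approaches 0
    return g - s * g

g = 0.9          # start near the top of the normalized range
history = [g]
for _ in range(200):
    g = pulse_down(pulse_up(g))   # one up pulse, then one down pulse
    history.append(g)

# Fixed point of an up/down pair in this model: (1 - s) / (2 - s)
fixed_point = (1.0 - 0.1) / (2.0 - 0.1)
```

Although the pulses cancel in count, the weight drifts from 0.9 toward the middle of the range, because down steps are larger than up steps when the conductance is high. Near the center the two step sizes are nearly equal, which is consistent with the advice to use the center linear range.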





Train with 1% signal

Ideal result is 0.6

## TaO<sub>x</sub> Results









#### A/D and D/A are modeled, and serial operations are modeled

- When resetting weight, need to adjust pulse size based on current state to compensate for nonlinearity
- When reading a single weight, need to adjust readout range to be smaller (change capacitor on the integrator)

# Li-Ion Synaptic Transistor for Analog Computation (LISTA)





E. J. Fuller, et al, "Li-Ion Synaptic Transistor for Low Power Analog Computing," *Advanced Materials*, vol. 29, no. 4, p. 1604310, 2017.

## Summary





- Fundamental O(N) energy scaling advantage
- Use CrossSim to co-design from materials to algorithms
  - Use periodic carry to overcome noisy devices
- Need high-resistance (10-100 MΩ) devices
- Need low write nonlinearities



https://cross-sim.sandia.gov

## Extra Slides



## Overcoming the Power Limit





Integrate Processing and Memory

# The Noise Limited Energy to Read a Crossbar Column is Independent of Crossbar Size



Each of the N resistors carries the same current:

$$I_{o} = G_{o}V$$

$$\text{Thermal noise} = \langle \Delta I^2 \rangle = N \times (4k_b T \times G_o \times \Delta f)$$

$$SNR^2 = \frac{(NI_o)^2}{\langle \Delta I^2 \rangle}$$

$$\frac{1}{\Delta f} = 4k_b T \times SNR^2 \times \frac{1}{V^2 G_o \times N}$$

Measure N resistors and determine the total output current with some signal to noise ratio (SNR)\*

What is the minimum energy?

$$Energy = V^{2}G_{o} \times N \times \frac{1}{\Delta f}$$

Power in each resistor × number of resistors

Determined by noise and SNR

If we double the number of resistors, we can double the speed to get the same energy and SNR.

This is because the noise scales as sqrt(N) while the signal scales as N

$$Energy = 4k_bT \times SNR^2$$

<sup>\*</sup>we are assuming we need some fixed precision on the output, and don't need full floating point accuracy
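The cancellation of N can be confirmed numerically. The sketch below evaluates the two expressions from this slide for several column lengths. The read voltage of 0.1 V, the SNR of 100, and the temperature of 300 K are illustrative assumptions; the conductance of 10 nS corresponds to the 100 MΩ on-state device mentioned earlier.

```python
K_B = 1.380649e-23  # Boltzmann constant, J/K

def read_time(n, snr, v, g_o, temp=300.0):
    # 1/Δf = 4 k_b T × SNR² / (V² G_o N)
    return 4 * K_B * temp * snr**2 / (v**2 * g_o * n)

def read_energy(n, snr, v, g_o, temp=300.0):
    # Energy = V² G_o × N × (1/Δf): power per resistor × resistors × time
    return v**2 * g_o * n * read_time(n, snr, v, g_o, temp)

sizes = (64, 1024, 4096)
times = [read_time(n, snr=100, v=0.1, g_o=1e-8) for n in sizes]
energies = [read_energy(n, snr=100, v=0.1, g_o=1e-8) for n in sizes]
expected = 4 * K_B * 300.0 * 100**2   # 4 k_b T × SNR², about 1.7e-16 J
```

The read time shrinks as 1/N while the power grows as N, so every entry of `energies` matches `expected` to within floating-point rounding under these assumptions.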

## Experimental Device Non-idealities



Device: Write Variability, Write Nonlinearity, Asymmetry, Read Noise

Circuit: A/D, D/A noise, parasitics



#### **Read Noise**



### **Combined Effects of Nonidealities**





# What are the Neural ReRAM Device Requirements?



|                             | Small Images | Large Images | File Types |
|-----------------------------|--------------|--------------|------------|
| Read Noise σ (% Range)      | 3%           | 5%           | 9%         |
| Write Noise σ (% Range)     | 0.3%         | 0.4%         | 0.4%       |
| Asymmetric Nonlinearity (v) | 0.1          | 0.1          | 0.1        |
| Symmetric Nonlinearity (v)  | >20          | 5            | 5          |
| Maximum Current             | 160 nA       | 13 nA        | 40 nA      |





## **Full System Simulation**

# A/D & D/A Have Minimal Impact



|            | Range         | Bits |
|------------|---------------|------|
| Row Input  | -1 to 1       | 8    |
| Col Output | -6 to 6       | 8    |
| Col Input  | -1 to 1       | 8    |
| Row Output | -4 to 4       | 8    |
| Row Update | -0.01 to 0.01 | 7    |
| Col Update | -1 to 1       | 5    |
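One plausible reading of the range and bit-width table above is a uniform quantizer that clips each signal to its range and rounds it to one of 2^bits levels. The sketch below assumes exactly that; the slides do not state how the converters are modeled, and the function name `quantize` is hypothetical.

```python
import numpy as np

def quantize(x, lo, hi, bits):
    """Clip x to [lo, hi] and round to one of 2**bits evenly spaced levels."""
    intervals = 2**bits - 1                  # number of steps between levels
    x = np.clip(np.asarray(x, dtype=float), lo, hi)
    step = (hi - lo) / intervals
    return lo + np.round((x - lo) / step) * step

# Row input settings from the table: range -1 to 1, 8 bits
row_input = quantize([0.1234, -2.0, 0.999], lo=-1.0, hi=1.0, bits=8)
```

With 8 bits over a range of 2, the step is about 0.0078, so in-range values move by at most half a step while out-of-range values saturate at the limits.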

| Data set   | Training / Test Examples | Network Size |
|------------|--------------------------|--------------|
| File Types | 4,501 / 900              | 256×512×9    |
| MNIST      | 60,000 / 10,000          | 784×300×10   |

## TaO<sub>x</sub> Results



Carry once every 1000 updates for the LSB, and every 2 updates on others



A/D and D/A are modeled, and serial operations are modeled

- When resetting weight, need to adjust pulse size based on current state to compensate for nonlinearity
- When reading a single weight, need to adjust readout range to be smaller (change capacitor on the integrator)

## LISTA Results







- Carry once every 1000 updates
- Use a single device per weight and subtract a reference current
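The reference-current idea in the last bullet can be sketched as follows. Conductances are always positive, so a signed weight is stored as an offset from a reference conductance, and the current through a reference column is subtracted from each column current. The midpoint reference, the conductance range, and the names used here are assumptions for illustration.

```python
import numpy as np

def signed_read(voltages, conductances, g_ref):
    v = np.asarray(voltages, dtype=float)
    # Column currents from the (all-positive) device conductances
    i_col = v @ np.asarray(conductances, dtype=float)
    # Current through a reference column whose devices all sit at g_ref
    i_ref = g_ref * v.sum()
    return i_col - i_ref

g_min, g_max = 1e-8, 3e-8            # illustrative conductance range (siemens)
g_ref = 0.5 * (g_min + g_max)        # assumed reference: middle of the range
half_range = 0.5 * (g_max - g_min)

W = np.array([[0.5, -1.0],
              [-0.25, 1.0]])         # signed weights in [-1, 1]
G = g_ref + W * half_range           # one positive conductance per weight
V = np.array([0.1, 0.2])
I = signed_read(V, G, g_ref)         # proportional to V @ W
```

After subtraction the output equals `half_range * (V @ W)`, so negative weights are represented without a second device per weight.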

## **Neural Core Latency Analysis**









## Neural Core Area Analysis





For the ReRAM, high-voltage transistors require 8X area. Improving this could give ~2X area savings.