

# **Crosslayer Design and Modeling for Future Optical Interconnects**

MODSIM 2017, August 10th, Seattle, WA

Keren Bergman, Lightwave Research Lab, Columbia University, New York









#### **Trends in extreme HPC**

- Evolution of the Top 10 over the last six years:
  - Average total compute power:
    - 0.86 PFlops → 21 PFlops
    - ~24x increase
  - Average nodal compute power:
    - 31 GFlops → 600 GFlops
    - ~19x increase
  - Average number of nodes:
    - 28k → 35k
    - ~1.3x increase



[top500.org, S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Elsevier PARCO 64, 2017]

→ Node compute power is the main contributor to performance growth
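
As a quick consistency check on the figures above: total compute is compute per node times number of nodes, so the growth factors should roughly multiply. A minimal sketch using only the rounded factors quoted on the slide; the log-scale "share" metric is an illustrative choice, not something taken from the source.

```python
import math

# Rounded growth factors quoted on the slide (Top 10 averages over six years).
node_power_growth = 19.0    # 31 GFlops -> 600 GFlops
node_count_growth = 1.3     # 28k -> 35k nodes
total_growth_quoted = 24.0  # 0.86 PFlops -> 21 PFlops

# Total compute = (compute per node) x (number of nodes), so factors multiply.
total_growth_estimate = node_power_growth * node_count_growth

# Share of the growth attributable to node power, measured on a log scale
# (illustrative metric, not from the source).
node_share = math.log(node_power_growth) / math.log(total_growth_estimate)

print(f"estimated total growth: ~{total_growth_estimate:.1f}x "
      f"(slide quotes ~{total_growth_quoted:.0f}x)")
print(f"share of growth from node power (log scale): {node_share:.0%}")
```

The product of the two rounded factors lands close to the quoted total, which is what the "main contributor" statement relies on.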





#### Interconnect trends

- Top 10 average node-level evolution:
  - Average node compute power:
    - 31 GFlops → 600 GFlops
    - ~19x increase
  - Average bandwidth available per node:
    - 2.7 GB/s → 7.8 GB/s
    - ~3.2x increase
  - Average byte-per-flop ratio:
    - 0.06 B/Flop → 0.01 B/Flop
    - ~6x decrease
    - Sunway TaihuLight (#1) shows 0.004 B/Flop

#### → Growing gap in interconnect bandwidth



[top500.org, S. Rumley, et al. Optical Interconnects for Extreme Scale Computing Systems, Elsevier PARCO 64, 2017]
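
The byte-per-flop ratio is bandwidth divided by compute, so its change follows directly from the two growth factors. A small sketch, assuming only the rounded factors and endpoints quoted on the slide:

```python
# Rounded growth factors quoted on the slide.
compute_growth = 19.0    # node compute power, ~19x
bandwidth_growth = 3.2   # bandwidth per node, ~3.2x

# Byte-per-flop = bandwidth / compute, so it shrinks by compute_growth / bandwidth_growth.
bpf_decrease = compute_growth / bandwidth_growth

# Cross-check against the quoted byte-per-flop endpoints (0.06 -> 0.01 B/Flop).
bpf_decrease_quoted = 0.06 / 0.01

print(f"byte-per-flop decrease from growth factors:   ~{bpf_decrease:.1f}x")
print(f"byte-per-flop decrease from quoted endpoints: ~{bpf_decrease_quoted:.1f}x")
```

Both routes give roughly the same factor, consistent with the "~6x decrease" on the slide.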



#### **Exascale interconnects – power and cost constraints**

- Real Exascale goal: reaching performance
  - ...while satisfying constraints (20MW, \$200M)
  - ...with reasonably useful applications
- Required bandwidth:
  - 1.25 ExaFLOP x 0.01 B/FLOP = 125 Pb/s injection BW
  - 125 Pb/s x 4 hops = 500 Pb/s installed BW
- Assume 15% of \$ budget for interconnect:
  - 15% x \$200M / 500 Pb/s = 6 ¢/Gb/s
  - Bi-directional links must thus be sold for ~10 ¢/Gb/s
  - Today: optical 10 \$/Gb/s, electrical 0.1-1 \$/Gb/s

- Assume 15% of power budget for interconnect:
  - 15% x 20MW / 125 Pb/s = 24 mW/Gb/s = 24 pJ/bit
    = budget for communicating a bit end-to-end

→ 6 pJ/bit per hop

| Per-hop budget             | Today              |
|----------------------------|--------------------|
| 4 pJ/bit for switching     | ~20 pJ/bit         |
| 2 pJ/bit for transmission  | ~10 pJ/bit (elec)  |
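
The cost and energy budgets on this slide can be reproduced in a few lines. This sketch takes the slide's inputs as given (\$200M, 20 MW, a 15% interconnect share, 125 Pb/s injection bandwidth, 4 hops); the variable names are illustrative.

```python
# Inputs as stated on the slide.
cost_budget_usd = 200e6      # total system cost budget ($200M)
power_budget_w = 20e6        # total system power budget (20 MW)
interconnect_share = 0.15    # fraction of each budget assumed for the interconnect
injection_bw_bps = 125e15    # 125 Pb/s injection bandwidth, taken as given
hops = 4                     # hops per end-to-end transfer

installed_bw_bps = injection_bw_bps * hops  # 500 Pb/s installed bandwidth

# Cost per unit of installed bandwidth, in cents per Gb/s.
cost_cents_per_gbps = (interconnect_share * cost_budget_usd * 100
                       / (installed_bw_bps / 1e9))

# Energy per bit end-to-end: watts / (bits per second) = joules per bit.
energy_pj_per_bit = interconnect_share * power_budget_w / injection_bw_bps * 1e12
energy_pj_per_hop = energy_pj_per_bit / hops

# Gap between the slide's "today" figures and the per-hop targets.
switching_gap = 20 / 4      # ~20 pJ/bit today vs 4 pJ/bit target
transmission_gap = 10 / 2   # ~10 pJ/bit today (electrical) vs 2 pJ/bit target

print(f"cost target:   {cost_cents_per_gbps:.1f} cents per Gb/s")
print(f"energy target: {energy_pj_per_bit:.1f} pJ/bit end-to-end, "
      f"{energy_pj_per_hop:.1f} pJ/bit per hop")
print(f"required improvement: switching {switching_gap:.0f}x, "
      f"transmission {transmission_gap:.0f}x")
```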



### **Exascale supercomputing node**



# Silicon Photonics: all the parts

- Silicon as core material
  - High refractive index; high contrast; sub-micron cross-section; small bend radius
- Small footprint devices
  - 10 µm to 1 mm scale, compared to cm-level scale for telecom
- Low power consumption
  - Can reach <1 pJ/bit per link
- Aggressive WDM platform
  - Bandwidth densities of 1-2 Tb/s pin IO



- Silicon wafer-scale CMOS
  - Integration, density scaling
  - CMOS fabrication tools
  - 2.5D and 3D platforms

S. Rumley et al. "Silicon Photonics for Exascale Systems", IEEE JLT 33 (4), 2015.



#### **Optically-Connected Memory Architecture**



## PhoenixSim: Integrated Multi-Level Modeling and Design Environment

- Integrated design/modeling environment across three layers:
  - Application IO primitives
    - Copy memory array to remote location
    - Send, multicast, broadcast messages
    - Thread synchronization (e.g. barrier)
  - Network architecture and protocols
    - Link locking mechanisms (frame detection)
    - Network topology (routing)
    - Arbitration of shared buses, switches
  - Si Photonic Hardware implementations
    - Silicon photonics modulators, switches
- Complete "toolbox" of models at each layer
  - Ensure interoperability among models
  - Cross-layer co-optimization is key







#### **Methodology - Abstraction of Physical Devices**



# **Physical - Silicon Photonic Link Design**

- Co-existence of <u>Electronics</u> and <u>Photonics</u>
- Energy-Bandwidth optimization



Silicon Photonic Interconnects," IEEE JLT 34 (12), 2016.





#### Utilization of Optical Power Budget





#### **Considering the electronics**







#### **All-Parameter Optimization: Max Bandwidth Design**







#### **All-Parameter Optimization: Min Energy Design**





## Cost per bandwidth – declining but slowly



- Today (2017):
  - 100G (EDR) has the best \$/Gb/s figure
  - Copper cables have shorter reach due to the higher bit rate
  - Optics: Not even ½ order of magnitude price drop over 4 years
  - But electrical-optical gap is shrinking





### **Beyond the Link: Photonic Switching**

#### MEMS-based Switches



[Lucent Technologies' Lambda Router]

- Free-space propagation
- High actuating voltage
- Broadband
- Low loss/low crosstalk
- Bulky
- Slow
- Scalable
- Cost ultimately limited by installation & calibration

#### SiP-based PIC Switches



[Benjamin Lee, OFC 2013, PDP paper]

- Planar lightwave circuits
- Broadband/Wavelength Selective
- High integration (small footprint)
- Fast (E-O effect)
- Lossy/relatively high crosstalk
- Rather Scalable
- CMOS/PIC monolithic integration
- Cost can be low benefiting from mature CMOS industry





#### **MRR Element Model**



- Drop loss: 0.35 dB
- Thru loss: 0.1 dB; Xtalk : -29.3 dB



- Drop loss: 0.17 dB; Xtalk: -32.4 dB
- Thru loss: 0.19 dB; Xtalk : -31.7 dB
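
Per-element losses like these add up in dB along an optical path, which is how an element model feeds a switch-level loss estimate. A minimal sketch, assuming the first set of values above (0.1 dB thru, 0.35 dB drop); the path shape (seven rings passed in the thru state, one drop) is hypothetical and not taken from the source.

```python
def path_loss_db(n_thru, n_drop, thru_loss_db=0.1, drop_loss_db=0.35):
    """Insertion loss (dB) of a path that passes n_thru rings in the thru
    state and n_drop rings in the drop state; losses add in dB."""
    return n_thru * thru_loss_db + n_drop * drop_loss_db


def db_to_fraction(loss_db):
    """Fraction of optical power remaining after loss_db of attenuation."""
    return 10 ** (-loss_db / 10)


# Hypothetical path: 7 rings passed in the thru state, then 1 drop.
loss = path_loss_db(n_thru=7, n_drop=1)
remaining = db_to_fraction(loss)
print(f"path loss: {loss:.2f} dB, power remaining: {remaining:.1%}")
```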





## Transitioning to Novel Modular Architectures...

- Modular architecture and control plane
- Avoids on-chip crossings
- Fully non-blocking
- Path-independent insertion loss
- Low crosstalk

[Figure labels: SiP Devices, Electrical PCB, …ed integration method]



[Dessislava Nikolova\*, David M. Calhoun\*, Yang Liu, Sébastien Rumley, Ari Novack, Tom Baehr-Jones, Michael Hochberg, Keren Bergman, Modular architecture for fully non-blocking silicon photonic switch fabric, Nature Microsystems & Nanoengineering **3** (1607) (Jan 2017).]




#### **Clos-of-Switch-and-Select Architecture**



The architecture offers a suitable balance: it keeps the number of stages to a modest three while substantially reducing the required number of MRRs.





#### **Network layer**

- Implemented circuit level arbitration
  - Data or packets emitted by application layer delayed while circuit is set up
    - Circuit setup is assumed to take a predefined time  $\Delta t$
  - Includes prediction mechanism:
    - Keep circuit on if high probability of being reused
    - Prefetch next circuit if next destination highly probable
  - Supposes a fully non-blocking physical layer
    - A circuit can always be established as long as input and output ports are free
- Consider a 5 node example:





### **Circuit arbitration - visualization**

- Arbitration at play under random packet arrivals (30% load)
  - Correlated destinations: packet goes to next index with prob 50%
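
The effect of the prediction mechanism under correlated destinations can be illustrated with a toy model. This is a sketch under stated assumptions, not the PhoenixSim arbitration logic: a single source with four possible destinations (a 5-node system), each packet going to the next destination index with probability 50% and to a uniformly random destination otherwise, and a prefetched circuit that is always ready by the time the next packet arrives (arrival timing and load are ignored). Delay is measured in units of the setup time Δt. The function and parameter names are hypothetical.

```python
import random


def mean_setup_delay(num_packets, num_dests=4, p_next=0.5,
                     setup_time=1.0, prefetch=True, seed=0):
    """Average circuit-setup delay per packet for one source node."""
    rng = random.Random(seed)  # seeded for reproducibility
    dest = 0                   # destination of the previous packet
    ready = None               # destination whose circuit is already up
    total_delay = 0.0
    for _ in range(num_packets):
        # Correlated destinations: next index with probability p_next,
        # otherwise uniformly random.
        if rng.random() < p_next:
            dest = (dest + 1) % num_dests
        else:
            dest = rng.randrange(num_dests)
        # A packet waits setup_time unless its circuit was prefetched.
        if dest != ready:
            total_delay += setup_time
        # Prediction: prefetch the circuit to the most probable next destination.
        ready = (dest + 1) % num_dests if prefetch else None
    return total_delay / num_packets


no_prediction = mean_setup_delay(10_000, prefetch=False)
with_prefetch = mean_setup_delay(10_000, prefetch=True)
print(f"mean setup delay without prediction: {no_prediction:.3f}")
print(f"mean setup delay with prefetch:      {with_prefetch:.3f}")
```

Without prediction every packet pays the full setup time; with prefetching, only packets that break the predicted pattern do.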



#### **Photonic Interconnected Memory - ModSim**

SST's event-based simulation allows accurate and tractable evaluation of system performance.

- Efficacy of different interconnect topologies evaluated, with user-defined system components
- Performance of memory reads, writes, etc. simulated for performance evaluation
- Simulation results from optically-connected memories evaluated against conventional buses and electrical networks

[Plot: STREAM[Read]]

[Logo: Laboratory for Physical Sciences]

[Figure labels, optical devices: loss, grating couplers, modulators, fibers, filters, photodiode, power penalties, required laser power]

[Figure labels, electrical circuits: transmitter (thermal tuning, serialization, modulator driver), receiver (thermal tuning, amplifier (TIA), deserialization), receiver sensitivity, required electrical power]

[Figure labels, link: bit rate, aggregation, total efficiency, total energy (pJ/bit)]
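
The component labels on this slide (losses, power penalties, receiver sensitivity, required laser power, required electrical power, bit rate, total energy in pJ/bit) suggest a conventional link power budget. The sketch below shows one way such a budget could be assembled. It is an assumption about the structure of the model, not the authors' implementation, and every numeric value is a hypothetical placeholder.

```python
def dbm_to_mw(p_dbm):
    """Convert optical power from dBm to milliwatts."""
    return 10 ** (p_dbm / 10)


def required_laser_power_dbm(sensitivity_dbm, losses_db, penalties_db):
    """Laser power needed so that the receiver still sees its sensitivity
    level after all losses and power penalties (everything adds in dB)."""
    return sensitivity_dbm + sum(losses_db.values()) + sum(penalties_db.values())


# Hypothetical placeholder values, for illustration only.
losses_db = {"grating_couplers": 6.0, "modulator": 4.0, "fiber": 0.5, "filter": 1.0}
penalties_db = {"modulator_extinction": 2.0, "crosstalk": 0.5}
sensitivity_dbm = -17.0
laser_wall_plug_efficiency = 0.1
electrical_mw = {"modulator_driver": 1.0, "serialization": 1.5, "tia": 2.0,
                 "deserialization": 1.5, "thermal_tuning": 2.0}
bit_rate_gbps = 10.0

laser_dbm = required_laser_power_dbm(sensitivity_dbm, losses_db, penalties_db)
laser_electrical_mw = dbm_to_mw(laser_dbm) / laser_wall_plug_efficiency
total_mw = laser_electrical_mw + sum(electrical_mw.values())

# mW divided by Gb/s is numerically equal to pJ/bit.
energy_pj_per_bit = total_mw / bit_rate_gbps

print(f"required laser power: {laser_dbm:.1f} dBm")
print(f"total link energy:    {energy_pj_per_bit:.2f} pJ/bit")
```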



#### **Application layer**







## Hybrid switching interconnects

- Do NOT use optical switches *in place* of packet routers!
- → Use optical switches in addition to packet routers
  - Coarse bandwidth steering across network clients: optical switching
  - Fine (per packet) bandwidth allocation: packet routing
- Bandwidth steering: equivalent to connectivity "re-wiring"
  - No need for large number of ports R
    - Allows for cheap, (soon) easy-to-fabricate silicon photonics switches



## Flexfly: A Reconfigurable Dragonfly

- Incorporates photonic switching at inter-group level
- Reconfigure topology towards application traffic






### Adapting topology for GTC application







#### Flexfly – simulated performance



## Implementation: AIM SiP Tapeout Run

[Figure: tapeout image; recoverable dimension label: 4.2 mm]

| # | Device                                    | Area            |
|---|-------------------------------------------|-----------------|
| 1 | 4x4x4 λ space-and-wavelength switch       | 1.9 mm x 2.6 mm |
| 2 | 4x4 Si space switch                       | 1.4 mm x 2.3 mm |
| 3 | 4x4 Si/SiN two-layered space switch       | 1.5 mm x 2.3 mm |
| 4 | 2x2 double-gated/single-gated ring switch | 0.8 mm x 1.4 mm |
| 5 | Crossing and escalator test structure     | 0.6 mm x 1 mm   |
| 6 | 1x2x8 λ MUX with rings                    | 1.2 mm x 0.2 mm |
| 7 | 1x2x4 λ MUX with micro-disks              | 0.6 mm x 0.2 mm |
| 8 | 2x2 double-gated MZM switch               | 3 mm x 0.4 mm   |






#### **Our FPGA-Controlled Switch Test-Bed**







#### **Flexfly – testbed implementation**





## Conclusions

- Ultra-large-scale interconnects are in great need of bandwidth
  - Interconnect bandwidth limitations are among the main HPC scalability threats
- Optics is playing a role and will continue to do so
  - But beware of costs and power consumption
  - Packaging is particularly important
  - Cost…
- Modeling/design must be cross-layer
- Optical switching in HPC:
  - Photonics switching for bandwidth steering
  - Flexfly: low port-count and cheap silicon photonics switches in HPC interconnects

