

## Hyperion: A Unified, Zero-CPU Data-Processing Unit (DPU)

Marco Spaziani Brunella, Marco Bonola, and **Animesh Trivedi**

CompSys 2022





## **The CPU as the Performance Workhorse**





- Stalling of Moore's Law and Dennard Scaling
- Turing Tax: **the cost of generalization**
- **Security** considerations
- Energy needs

Rise of accelerator-centric computing

## **Imagine this setup**



**Disaggregated clients** 

Network protocols

Interaction among the accelerators

## The Key Challenges with the CPU in the Loop

#### 1. The CPU coordinates the control path and resource allocation

- a. Coordinate control flow among accelerators: which buffers to allocate, pin, and DMA
- b. Control the data transfer among accelerators: when and how to initiate it
- c. This has been done for pair-wise accelerator integrations, but what about multiple accelerators?
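One way to read point (c) is that if every pair of accelerators needs its own integration, the number of integrations grows quadratically with the number of devices. The sketch below only counts those pairs; the device names and the `pairwise_integrations` helper are hypothetical and do not come from the talk.

```python
from itertools import combinations

def pairwise_integrations(devices):
    """Return every unordered device pair that would need its own integration."""
    return list(combinations(devices, 2))

# Hypothetical device set, for illustration only.
devices = ["NIC", "GPU", "FPGA", "NVMe"]
pairs = pairwise_integrations(devices)

# n devices give n * (n - 1) / 2 pairs.
print(len(pairs), "pairwise integrations for", len(devices), "devices")
for a, b in pairs:
    print(f"{a} <-> {b}")
```

With four devices there are already six pairs to integrate, and ten devices would need forty-five.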

#### 2. The CPU dictates the computing abstractions

- a. Shared memory, virtual memory, processes, context switches, files
- b. Keeping memory coherent between the host's view and the accelerator's view

#### 3. The CPU limits innovation and imagination

- a. Active and passive disaggregation
- b. Designing a new interconnect, network discovery protocols
- c. Scalable energy needs

## Hyperion: A Zero-CPU Data Processing Unit (DPU)

#### Hardware:

- FPGA + NIC + Storage = DPU

#### Software:



- A new compiler
- eBPF as an **IR** for <u>(any)</u> hardware
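To make the idea of an intermediate representation concrete, here is a toy sketch of a register-machine IR in Python. It borrows only two well-known eBPF conventions (eleven registers `r0` to `r10`, with `r1` carrying the input context and `r0` holding the return value). The opcode names and the tuple encoding are assumptions made for illustration; they are not real eBPF bytecode and do not represent the Hyperion compiler.

```python
# Toy register-machine IR, loosely modeled on eBPF's r0..r10 register file.
# The opcodes and tuple encoding are hypothetical, not real eBPF bytecode.

def run(program, ctx=0):
    """Interpret a list of (op, dst, src) tuples and return r0 on exit."""
    regs = [0] * 11   # eBPF exposes 11 registers, r0 through r10
    regs[1] = ctx     # by convention r1 carries the program's input context
    for op, dst, src in program:
        if op == "mov_imm":
            regs[dst] = src
        elif op == "mov_reg":
            regs[dst] = regs[src]
        elif op == "add_imm":
            regs[dst] += src
        elif op == "add_reg":
            regs[dst] += regs[src]
        elif op == "exit":
            return regs[0]   # r0 holds the return value
        else:
            raise ValueError(f"unknown op: {op}")
    raise RuntimeError("program ended without exit")

# Computes r0 = ctx + 2.
program = [
    ("mov_reg", 0, 1),
    ("add_imm", 0, 2),
    ("exit", 0, 0),
]
print(run(program, ctx=40))
```

The point of the sketch is that the same instruction list could be handed to different back ends: a software interpreter such as this one, or a compiler that emits a hardware pipeline.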

#### **Client:**

- Disaggregated clients
- Network protocols: NVMoF
- Application-level: KV, NFS, DSes



## **Disaggregation and Slicing**



#### Innovation in discovery, reconfiguration, slicing, virtualization, communication, etc.

## **Comments on the Reviews**

First of all, thank you :)

- Target application domain?
  - Disaggregated cloud storage and processing
  - Mostly well-defined, requires multi-tenancy and dynamic reconfiguration
- Limited FPGA resources, esp. on-chip memories
  - Needs data-staging primitives across SRAM, DRAM, HBM, and then NVMe storage

- Development complexity
  - Target well-defined data structures as the basic building blocks: B+ trees, hash tables, arrays, LSM trees, heaps, extent trees, etc.
  - **Compiler development:** challenging, but feasible
- "I wonder if this approach can really fully eliminate CPUs"
  - We do not know either. We think it can, but we are open to hearing counterarguments.

## Where are we going from here?

#### 5-page vision:

Hyperion: A Case for Unified, Self-Hosting, Zero-CPU Data-Processing Units (DPUs)

https://arxiv.org/abs/2205.08882



[Screenshot: first page of the arXiv preprint (arXiv:2205.08882v1, May 2022), showing the authors and affiliations (Marco Spaziani Brunella, University of Rome Tor Vergata and Axbryd; Marco Bonola, CNIT/Axbryd; Animesh Trivedi, VU Amsterdam), the abstract, the start of the introduction, and a table of related work on integrating network, storage, and accelerator devices.]

**Acknowledgements**: NWO XS OCENW.XS3.030, and the Xilinx University Donation Program (XUP)

### **Call for a Revolution!**

