

# **Exploring the End-to-End Storage Stack on Modern Storage Hardware**

### **Animesh Trivedi**

Assistant Professor (tenured)

https://animeshtrivedi.github.io/

December, 2023

(presenting on behalf of many in the research team at VU Amsterdam)

# **Data is Essential to our Society**



### A Minute on the Internet



# 200 Zettabytes

(by 2025)

**200** × 1,000,000,000,000,000,000,000 bytes



Scientists estimate that the Earth contains **7.5 quintillion** <u>sand grains</u>. That is 75 followed by **17 zeros**. [See <u>here</u>]

https://localig.com/blog/what-happens-in-an-internet-minute/

https://www.bondhighplus.com/2022/01/08/what-happen-in-an-internet-minute/

### Assume (illustration purposes only):

- 1 grain of rice is 1 byte of data
- **Kilobyte**: a cup of rice
- **Megabyte**: 8 bags of rice
- **Gigabyte**: 3 semi-trucks
- **Terabyte**: 2 container ships
- **Petabyte**: covers Maastricht
- **Exabyte**: covers NL + DE + FR
- **Zettabyte**: fills up the Pacific Ocean
- **Yottabyte**: an Earth-sized rice ball

By 2030



# Non-Volatile Memory (NVM) Storage to the Rescue...



### **Novel Systems Designs (CXL)**

### Workload-specific Configurations

#### FlashNeuron: SSD-Enabled Large-Batch Training of Very Deep Neural Networks

Jonghyun Bae<sup>†</sup> Jongsung Lee<sup>†‡</sup> Yunho Jin<sup>†</sup> Sam Son<sup>†</sup> Shine Kim<sup>†‡</sup> Hakbeom Jang<sup>‡</sup>
Tae Jun Ham<sup>†</sup> Jae W. Lee<sup>†</sup>

<sup>†</sup>Seoul National University <sup>‡</sup>Samsung Electronics

#### Abstract

Deep neural networks (DNNs) are widely used in various AI application domains such as computer vision, natural language processing, autonomous driving, and bioinformatics. As DNNs continue to get wider and deeper to improve accuracy, the limited DRAM capacity of a training platform like GPU often becomes the limiting factor on the size of DNNs and batch size, called the memory capacity wall. Since increasing the batch size is a popular technique to improve hardware utilization, this can yield a suboptimal training throughput. Recent proposals address this problem by offloading some of the intermediate data (e.g., feature maps) to the host memory. However, they fail to provide robust performance as the training process on a GPU contends with applications running on a CPU for memory bandwidth and capacity. Thus, we propose FlashNeuron, the first DNN training system using an NVMe SSD as a backing store. To fully utilize the limited SSD write bandwidth, FlashNeuron introduces an offloading scheduler, which selectively offloads a set of intermediate data to the SSD in a compressed format without increasing DNN evaluation time. FlashNeuron causes minimal interference to CPU processes as the GPU and the SSD directly communicate for data transfers. Our evaluation of FlashNeuron with four state-of-the-art DNNs shows that FlashNeuron can increase the batch size by a factor of 12.4× to 14.0× over the maximum allowable batch size on NVIDIA Tesla V100 GPU with 16GB DRAM. By employing a larger batch size, FlashNeuron also improves the training throughput by up to 37.8% (with an average of 30.3%) over the baseline using GPU memory only, while minimally disturbing applications running on CPU.

#### 1 Introduction

Deep neural networks (DNNs) are the key enabler of emerging AI-based applications and services such as computer vision [19,22,38,33,54], natural language processing [2,11,13,51,67], and bioinformatics [46,73]. With a relentless pursuit of higher accuracy, DNNs have become wider and deeper to increase network size [65]. [...]

DNNs must be trained before deployment to find optimal network parameters that minimize the error rate. Stochastic Gradient Descent (SGD) is the dominant algorithm used for DNN training [15]. In SGD, the entire dataset is divided into multiple (mini-)batches, and weight gradients are calculated and applied to the network parameters (weights) for each batch via backward propagation. Unlike inference, the training algorithm reuses the intermediate results (e.g., feature maps) produced by a forward propagation during the backward propagation, thus requiring a lot of memory space [55].

This GPU memory capacity wall [33] often becomes the limiting factor on DNN size and its throughput. Specifically, such a large memory capacity requirement forces a GPU device to operate at a relatively small batch size, which often adversely affects its throughput. [...]

This memory capacity problem in DNN training has drawn much attention from the research community. The most popular approach is to utilize the host CPU memory as a backing store to offload some of the tensors that are not immediately used [8, 9, 44, 25, 56, 2]. However, this buffering-on-memory approach fails to provide robust performance as the training process on the GPU contends with applications running on the CPU for memory bandwidth and capacity (e.g., data augmentation tasks [5, 41, 57, 61] to boost training accuracy). Moreover, these proposals focus mostly on increasing batch size but less on improving training throughput. Therefore, they often yield a low training throughput as the cost of CPU-GPU data transfers outweighs a larger batch's benefits.

Thus, we propose FlashNeuron, the first DNN training system using a high-performance SSD as a backing store. While NVMe SSDs are a promising alternative to substitute or augment DRAM, they have at least an order of magnitude [...]

#### Overcoming the Memory Wall with CXL-Enabled SSDs

Shao-Peng Yang, Minjae Kim, Sanghyun Nam, Juhyung Park, Jin-yong Choi, Eyee Hyun Nam, Eunji Lee, Sungjin Lee, Bryan S. Kim

Syracuse University, DGIST, Soongsil University, FADU Inc.

#### Abstract

This paper investigates the feasibility of using inexpensive flash memory on new interconnect technologies such as CXL (Compute Express Link) to overcome the memory wall. We explore the design space of a CXL-enabled flash device and show that techniques such as caching and prefetching can help mitigate the concerns regarding flash memory's performance and lifetime. We demonstrate using real-world application traces that these techniques enable the CXL device to have an estimated lifetime of at least 3.1 years and serve 68-91% of the memory requests under a microsecond. We analyze the limitations of existing techniques and suggest system-level changes to achieve a DRAM-level performance using flash.

#### 1 Introduction

The growing imbalance between computing power and memory capacity requirement in computing systems has developed into a challenge known as the memory wall [23, 34, 52]. Figure 1, based on the data from Gholami et al. [34] and expanded with more recent data [11, 30, 43], illustrates the rapid growth in NLP (natural language processing) models (14.1× per year), which far outpaces that of memory capacity (1.3× per year). The memory wall forces modern data-intensive applications such as databases [8, 10, 14, 20], data analytics [1,35], and machine learning (ML) [45,48,66] to either be aware of their memory usage [61] or implement user-level memory management [66] to avoid expensive page swaps [37,53]. As a result, overcoming the memory wall in an application-transparent manner is an active research avenue; approaches such as creating an ML-centric system [45,48,61], building a memory disaggregation framework [36, 37, 52, 69], and designing new memory architecture [23,42] are actively [...]

We question whether it is possible to overcome the memory wall using flash memory, a memory technology that is typically used in storage due to its high density and capacity scaling [59]. While DRAM can only scale to gigabytes in capacity, a flash memory-based solid-state drive (SSD) is in the terabyte scale [23], a sufficiently large capacity to address the memory wall challenge. The use of flash memory as main memory is enabled by the recent emergence of interconnect technologies such as CXL [3], Gen-Z [7], CCIX [23], and OpenCAPI [12], which allow PCIe (Peripheral Component Interconnect Express) devices to be accessed directly by the CPU through load/store instructions. Furthermore, these technologies promise excellent scalability as more PCIe devices can be attached across switches [13], unlike DIMM (Dual Inline Memory Module) used for DRAM.

Figure 1: The trend in memory requirements for NLP applications [11, 30, 34, 43]. The number of parameters increases by a factor of 14.1× per year, while the memory capacity in GPUs only grows by a factor of 1.3× every year.

However, there are three main challenges to using flash memory as CPU-accessible main memory. First, there is a granularity mismatch between memory requests and flash memory. This results in a significant traffic amplification on top of the existing need for indirection in flash [23, 33]; for example, a 64B cache line flush to the CXL-enabled flash would result in a 16KiB flash memory page read, a 64B update, and a 16KiB flash program to a different location (assuming a 16KiB page-level mapping). Second, flash memory is still [...]
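The granularity mismatch in the paragraph above can be put into numbers. The following is a minimal sketch of that arithmetic, assuming the 64B cache line and 16KiB page-level mapping quoted in the text; the function and key names are made up for illustration and do not come from the paper.

```python
CACHE_LINE = 64          # bytes flushed by the CPU (from the text)
FLASH_PAGE = 16 * 1024   # 16 KiB flash page, page-level mapping (from the text)


def flush_traffic(cache_line: int = CACHE_LINE, page: int = FLASH_PAGE) -> dict:
    """Flash traffic caused by one cache-line flush (read-modify-write)."""
    flash_read = page      # the whole page is read to merge in the 64B update
    flash_program = page   # and programmed again at a different location
    return {
        "flash_read_bytes": flash_read,
        "flash_program_bytes": flash_program,
        # bytes programmed to flash per byte the CPU actually flushed
        "write_amplification": flash_program / cache_line,
        # read plus program traffic per byte the CPU actually flushed
        "total_amplification": (flash_read + flash_program) / cache_line,
    }


result = flush_traffic()
print(result)
```

Under these assumptions a single 64B flush moves 32KiB inside the device, which is why the abstract leans on caching and prefetching in front of the flash.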

USENIX Association

2023 USENIX Annual Technical Conference 601

### Sustainable Computing

#### When Poll is More Energy Efficient than Interrupt

Bryan Harris and Nihat Altiparmak

Dept. of Computer Science & Engineering, University of Louisville

{bryan.harris.l,nihat.altiparmak}@louisville.edu

#### ABSTRACT

Polling is commonly indicated to be a more suitable IO completion mechanism than interrupt for ultra-low latency storage devices. However, polling's impact on overall energy efficiency has not been thoroughly investigated. In this paper, contrary to common belief, we show that polling can also be more energy efficient than interrupt. To do so, we systematically investigate the energy efficiency of all available Linux IO completion mechanisms, including interrupt, classic polling, and hybrid polling using a real ultra-low latency storage device, a power meter, and various workload behaviors. Our experimental results indicate that although hybrid polling provides a good trade-off in CPU utilization, it is the least energy efficient, whereas classic polling is the most energy efficient for low latency IO requests. To the best of our knowledge, this is the first paper classifying polling as more energy efficient than interrupt for a real secondary storage device, and we hope that our observations will lead to more energy efficient IO completion mechanisms for new generation storage device characteristics.

#### CCS CONCEPTS

Information systems → Storage power management; Storage class memory.

#### KEYWORDS

IO completion, energy efficiency

#### ACM Reference Format:

Bryan Harris and Nihat Altiparmak. 2022. When Poll is More Energy Efficient than Interrupt. In 14th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage '22), June 27–28, 2022, Virtual Event, USA. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3538643.3539747


#### 1 INTRODUCTION

With the most recent advancements in data storage technology, a new category of Solid-State Drives (SSDs) has emerged. These devices are referred to as Ultra-Low Latency (ULL) SSDs and are broadly classified as providing data access in less than 10 µs [17]. Various vendors including Intel, Samsung, and Toshiba have representative ULL SSDs [3, 4, 20], where Intel's latest generation of the Optane SSD is advertised to deliver read IO in 5 µs and write IO in 6 µs [6]. ULL IO performance providing sub-10 µs data access latency renders the performance of the traditional, interrupt-based IO completion mechanism questionable. Both industry and academia suggested replacing interrupts with polling-based IO completion methods for improved latency in such devices [11, 13, 15, 19, 22, 25-27], where polling has also been supported by the Linux kernel since version 4.4. However, one must also consider the relationship between IO performance and power consumption, as power saving methods may not be worth the resulting loss in IO performance.

Despite greater performance, polling is commonly believed to be more costly and less energy efficient than interrupt since polling wastes CPU cycles. The primary assumption behind this is that reduced CPU usage directly correlates to reduced power consumption. Therefore, with kernel version 4.10, Linux introduced a hybrid polling mechanism, which sleeps the task before starting to poll so that fewer CPU cycles are wasted [13].
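The difference between classic and hybrid polling can be illustrated in user space. This is only a toy sketch, not the kernel implementation: `wait_for_completion` and `make_request` are hypothetical names, the 2 ms request latency is invented, and the sleep is arbitrarily set to half of that latency.

```python
import time


def wait_for_completion(is_done, sleep_s: float = 0.0) -> int:
    """Optionally sleep first, then busy-poll until is_done() returns True.

    sleep_s == 0 models classic polling; sleep_s > 0 models hybrid polling.
    Returns the number of poll iterations, a rough proxy for burned CPU cycles.
    """
    if sleep_s > 0:
        time.sleep(sleep_s)  # give up the CPU before starting to poll
    polls = 0
    while True:
        polls += 1
        if is_done():
            return polls


def make_request(latency_s: float):
    """Fake IO request that completes latency_s seconds from now."""
    deadline = time.monotonic() + latency_s
    return lambda: time.monotonic() >= deadline


classic_polls = wait_for_completion(make_request(0.002))
hybrid_polls = wait_for_completion(make_request(0.002), sleep_s=0.001)
print(f"classic: {classic_polls} polls, hybrid: {hybrid_polls} polls")
```

The hybrid variant typically needs far fewer poll iterations, which is the CPU-utilization trade-off the text describes; whether that also saves energy is exactly what the paper questions.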

In this paper, we study the energy implications of the three IO completion mechanisms available in Linux, including interrupts, classic polling, and hybrid polling techniques, specifically for ULL disk IO. Our empirical evaluation using a real ULL device, a power meter, various workload behaviors, and the most recent long-term Linux kernel relies on IO performance measured per energy unit, bytes transferred per joule. Considering both performance and energy in a single metric, we make observations laying out the most energy efficient IO completion mechanisms. We hope that our observations and analysis can lead to more energy efficient storage stack designs in the future.
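The metric named above, bytes transferred per joule, follows directly from a power-meter reading. A minimal sketch, assuming an average power over the measurement window is available; the function name and the sample numbers are invented for illustration and are not results from the paper.

```python
def bytes_per_joule(bytes_transferred: int, avg_power_watts: float,
                    duration_s: float) -> float:
    """Energy efficiency of an IO run: bytes moved per joule consumed."""
    energy_joules = avg_power_watts * duration_s  # 1 W sustained for 1 s = 1 J
    return bytes_transferred / energy_joules


# Hypothetical run: 4 GiB transferred in 10 s at an average draw of 50 W.
efficiency = bytes_per_joule(4 * 1024**3, avg_power_watts=50.0, duration_s=10.0)
print(f"{efficiency / 1e6:.2f} MB per joule")
```

Because both throughput and power enter the ratio, a mechanism that burns more CPU can still score higher if it finishes the same IO sufficiently faster.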

#### 2 IO COMPLETION IN LINUX KERNEL

In this section, we outline the working mechanisms of available Linux IO completion mechanisms for the most commonly used Linux-native synchronous IO interface.


USENIX Association 19th USENIX Conference on File and Storage Technologies 387

# **Rise of Domain-Specific Computing**







### Rise of accelerator-centric computing

- Specialized hardware
- Energy/Perf. gains over the CPU

# Position: Workload-Specialized Storage Software Will Emerge





### KV-SSDs



Storage Device (flash and NVM storage)



[Part - 1/3]: Performance and Scheduling Challenges

[Part - 2/3] : New Interfaces - Zone Namespace (ZNS) SSDs

[Part - 3/3]: (WiP) Building Workload-Specialized Storage Stacks

### **Workload-NVMe Interaction**



# **Results: Pure Performance**





- There is a large gap (10x) in CPU efficiency between the SPDK and io\_uring stacks
- The Linux kernel and its block I/O layer are the primary consumers of CPU cycles

### So What's Wrong with SPDK?

Takes a pure performance-based approach

Highly CPU inefficient (only poll, 100% CPU utilization)

Scaling performance can be fragile beyond CPU cores

Does not have a file system

Does not have multi-tenancy (only single process)

No support for any other kind of devices except NVMe

No provision for the kernel supported services:

- Caching, buffering, security
- Importantly: Sharing and I/O Scheduling



# What are the Scheduling Challenges?



(a) IOPS performance of schedulers; P95 latency with background (b) read and (c) write traffic

- No scheduling (NOOP) helps with pure performance scaling
- No scheduling (NOOP) has poor performance isolation with <u>interfering tasks</u>

# The Interference Control (or Delivering Quality-of-Service)



I/O Scheduling interference and overheads

### Inside an SSD

- Mixing of data (lifetime, workloads)
- I/O Scheduling
- Interference from GC
- Over provisioning
- Parallelism management
- ...

### [Part - 1/3]: Performance and Scheduling Challenges

[Part - 2/3] : New Interfaces - Zone Namespace (ZNS) SSDs

[Part - 3/3]: (WiP) Building Workload-Specialized Storage Stacks

### **ZNS:** The New Storage Interface and Capabilities





https://zonedstorage.io/docs/introduction/zns

Standardized in the NVMe 1.4, July 2021

A ZNS SSD is divided into Zones

Each zone has its size and a write pointer



Each zone must be written sequentially

Limited intra-zone parallelism (only 1 write at a time)



New I/O Command: Append

Multiple Append commands can be issued to a zone (high intra-zone parallelism)





New zone-management commands: **Finish** and **Reset** 

**Finish**: makes it read-only (release write resources)

**Reset**: garbage collect the zone
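The zone rules from these slides (a write pointer per zone, sequential writes, Append, Finish, and Reset) can be captured in a small host-side model. This is a deliberately simplified sketch: the class and method names are hypothetical, only three states are modeled, and the states of the real ZNS state machine beyond these three are left out.

```python
from enum import Enum


class ZoneState(Enum):
    EMPTY = "empty"
    OPEN = "open"
    FULL = "full"


class Zone:
    """Toy model of one ZNS zone; sizes and addresses are in blocks."""

    def __init__(self, start_lba: int, capacity: int):
        self.start_lba = start_lba
        self.capacity = capacity
        self.write_pointer = start_lba
        self.state = ZoneState.EMPTY

    def _advance(self, nblocks: int) -> int:
        end = self.start_lba + self.capacity
        if self.state is ZoneState.FULL:
            raise OSError("zone is full and read-only until it is reset")
        if self.write_pointer + nblocks > end:
            raise OSError("request exceeds the zone capacity")
        lba = self.write_pointer
        self.write_pointer += nblocks
        self.state = ZoneState.FULL if self.write_pointer == end else ZoneState.OPEN
        return lba

    def write(self, lba: int, nblocks: int) -> None:
        """Write: the host names the LBA, which must equal the write pointer."""
        if lba != self.write_pointer:
            raise OSError("writes must land exactly at the write pointer")
        self._advance(nblocks)

    def append(self, nblocks: int) -> int:
        """Append: the device picks the LBA and reports it back to the host."""
        return self._advance(nblocks)

    def finish(self) -> None:
        """Finish: the zone becomes read-only, releasing write resources."""
        self.write_pointer = self.start_lba + self.capacity
        self.state = ZoneState.FULL

    def reset(self) -> None:
        """Reset: all data in the zone is discarded and the pointer rewinds."""
        self.write_pointer = self.start_lba
        self.state = ZoneState.EMPTY


zone = Zone(start_lba=0, capacity=8)
zone.write(lba=0, nblocks=2)
first = zone.append(2)
second = zone.append(2)
zone.finish()
state_after_finish = zone.state
zone.reset()
print(first, second, state_after_finish, zone.state)
```

The model shows why Append allows several outstanding commands per zone: the host does not need to know the write pointer in advance, whereas each Write must name it exactly.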





Zone-N

NVMe Flash Zone Namespace (ZNS) SSD

# **Zone Namespace (ZNS) Devices: The State Machine**



### State of the ZNS Software





Idea: Different zones help isolate workloads from each other and deliver better Quality-of-Service (QoS)

#### **<u>But:</u>** There are multiple ways ZNS devices can be integrated

- Should I use **Append** or **Write**? How do I manage **parallelism**? Intra-zone or Inter-zone?
- What is the cost of **Reset** and **Finish**? And the state machine implementation
- Does ZNS deliver on its promise of isolation?

#### Performance Characterization of NVMe Flash Devices with Zoned Namespaces (ZNS)

Krijn Doekemeijer\*<sup>1</sup>, Nick Tehrany\*<sup>1,2</sup>, Balakrishnan Chandrasekaran<sup>1</sup>, Matias Bjørling<sup>3</sup>, and Animesh Trivedi<sup>1</sup>

<sup>1</sup>Vrije Universiteit Amsterdam, Amsterdam, the Netherlands <sup>2</sup>Delft University of Technology, Delft, the Netherlands <sup>3</sup>Western Digital, Copenhagen, Denmark

{k.doekemeijer, n.a.tehrany, b.chandrasekaran, a.trivedi}@vu.nl, matias.bjorling@wdc.com

Abstract: The recent emergence of NVMe flash devices with Zoned Namespace support, ZNS SSDs, represents a significant new advancement in flash storage. ZNS SSDs introduce a new storage abstraction of append-only zones with a set of new I/O (i.e., append) and management (zone state machine transition) commands. With the new abstraction and commands, ZNS SSDs offer more control to the host software stack than a non-zoned SSD for flash management, which is known to be complex (because of garbage collection, scheduling, block allocation, parallelism management, overprovisioning). ZNS SSDs are, consequently, gaining adoption in a variety of applications (e.g., file systems, key-value stores, and databases), particularly latency-sensitive big data applications. Despite this enthusiasm, there has yet to be a systematic characterization of ZNS SSD performance with its zoned storage model abstractions and I/O operations. This work addresses this crucial shortcoming. We report on the performance features of a commercially available ZNS SSD (13 key observations), explain how these features can be incorporated into publicly available state-of-the-art ZNS emulators, and recommend guidelines for ZNS SSD application developers. All artifacts (code and data sets) of this study are publicly available at https://github.com/stonet-research/NVMeBenchmarks.

Index Terms: Measurements, NVMe storage, Zoned Namespace Devices

#### I. INTRODUCTION

The emergence of fast flash storage in data centers, HPC, and commodity computing has fundamentally caused changes in every layer of the storage stack, and led to a series of new developments such as a new host interface (NVM Express, NVMe) [1], [2], [3], a high-performance block layer [4], [5], [6], [7], new storage I/O abstractions [8], [9], [10], [11], [12], [13], [14], and re/co-design of storage application stacks [15], [16], [17], [18], [19], [20], [21]. Today, flash-based solid-state drives (SSDs) can support very low latencies (i.e., a few microseconds), and multi GiB/s bandwidth with millions of I/O operations per second [22], [23], [24].

Despite these advancements, the conceptual model of a storage device remains unchanged since the introduction of hard disk drives (HDDs) more than half a century ago. A storage device supports only two necessary operations: write and read data in units of sectors (or blocks) [25]. Data can be read from and written to anywhere on the device, hence supporting random and sequential I/O operations. Though this model works with conventional HDDs, it is not apt for flash-based storage devices as flash internally does not support overwriting data [26], [27], [28]. Flash devices offer the illusion of "overwritable" storage via the flash translation layer (FTL), a software component that runs within the device. The FTL enables easy integration of flash devices (by allowing them to masquerade as fast HDDs), albeit it introduces unpredictability in performance [29], [30], [31], [32], [33], [34] and complicates device lifetime management [35]. These challenges are defined as the unwritten contracts of SSDs [26]. As data centers have largely transitioned to SSDs for fast, reliable storage [36], [37], and modern big data applications have high QoS demands [38], [39], there is a dire need to address these unwritten contracts.

\*Equal contributions, joint first authors. Nick was with TU Delft during this work.

Researchers and practitioners advocate for open flash SSD interfaces beyond block I/O [40] to address these challenges. Examples include Open-Channel SSDs (OCSSD) [41], multistream SSDs [9], and, more recently, Zoned Namespaces (ZNS) [11]. The focus of this work is on NVMe devices that support ZNS, which are commercially available today [42], [43]. ZNS promises a low and stable tail latency [11] and a high device longevity, and, hence, addresses the needs of modern big data workloads. There is, unsurprisingly, a rich body of active and recent work on ZNS [44], [11], [45], [46], [47], [48], [49], [50], [51], [52], [53], [54], [55], [56]. Despite this enthusiasm, there has not been a systematic performance and operational characterization of ZNS SSDs. This lack of an extensive performance and operational characterization of ZNS SSDs severely limits the utilization and application of ZNS devices in big data workloads. In this work, we bridge this gap by presenting the performance characterization of a commercially-available NVMe ZNS device.

We complement this characterization of a physical device with an investigation of emulated ZNS devices, since they are widely used in research [51], [57], [58], [55]. Emulated devices enable researchers to explore the ZNS design space without being constrained by device-specific characteristics. Such unconstrained explorations are crucial since ZNS is a new interface and the selection of available configurations in a real SSD is, unsurprisingly, quite limited. The research validity of all of these works hinge on an emulator's ability to mimic

#### ZINC - A ZNS Interference-aware NVMe Command Scheduler

Nick Tehrany (n.a.tehrany@vu.nl), Krijn Doekemeijer (k.doekemeijer@vu.nl), Zebin Ren (z.ren@vu.nl), Animesh Trivedi (a.trivedi@vu.nl)

Computer Science, Vrije Universiteit Amsterdam, Amsterdam, The Netherlands

Abstract: NVMe Zoned Namespaces (ZNS) is a new NVMe standard designed to open up the rigid block-based host-device interface and offer a new zone-based device interface to host software. ZNS introduces a set of new I/O (append) and flash management (reset, finish, open, close) commands to host software (i.e., block layer, file system, or application). The flash management commands allow managing flash-based SSDs directly and offer a possibility for better interference management between flash- and user-issued I/O operations. In this paper, we demonstrate that, despite ZNS's promises, its new commands create complex interference patterns with I/O and each other that lead to significant losses in application performance. We introduce a first-of-its-kind interference model for ZNS and use this model to report 3 interference observations made on a physical ZNS SSD. Based on our interference study, we propose ZINC, a ZNS interface-aware NVMe command scheduler that mitigates the impact of interference by prioritizing user I/O commands over flash management commands (configurable). ZINC delivers up to 56.87% lower interference in fio-based micro-benchmarks compared to the state-of-the-practice mq-deadline, and a 9.81% throughput improvement for RocksDB + ZenFS, a ZNS-enabled KV-store for ZNS SSDs. With concurrent reset operations, RocksDB's throughput degrades from 80 KIOPS to 72 KIOPS with mq-deadline, but remains at 80 KIOPS with ZINC. We open-sourced ZINC at https://anonymous.4open.science/r/zinc

#### I. INTRODUCTION

The emergence of solid-state drives (SSDs) has presented a significant advancement in storage technology over prior technologies (e.g., hard disk drives), with modern SSDs capable of achieving millions of I/O operations per second, gigabytes of bandwidth per second, and sub-microsecond access latencies [40]. Despite their popularity [41], a perennial challenge with flash-based SSDs is delivering both predictable and consistent performance to a variety of workloads. The internal structure of flash-based SSDs is at the root of this challenge: flash requires significant management effort [31] (i.e., location mapping, garbage collection, parallelism management, bad block, and ECC) and flash management is typically only done in firmware, known as the flash-translation layer (FTL) running within SSDs [28]. Traditional block-based NVMe SSDs (non-ZNS) have an interface that only accounts for block-based NVMe read and write commands. This interface does not offer any insight or host-level control over data placement or garbage collection [20]; these are all done inside of the SSD, run in the background, and can be issued at any point in time. Consequently, when the firmware inside the SSD executes these background management commands, these commands interfere with the foreground host-issued I/O commands. There has been a series of efforts to curtail the impact of the interference (collectively termed as Unwritten Contracts [17]) with better resource allocation, scheduling techniques [3], [10], [14], [26], [34] and even new host-device interfaces such as Open-Channel SSDs [23], StreamSSD [13], and more recently, Zoned Namespace (ZNS) SSDs [2], [24]. Among them, ZNS has attracted a significant amount of research interest [8], [9], [15], [20], [21], [33], [36] and has become an industry-standard with the NVMe 2.0 specification.

Figure 1: Conventional (a) NVMe vs. (b) ZINC scheduling.

The unique aspect of NVMe ZNS devices is that they offer a new I/O abstraction, append-only zones, with a set of new I/O (append) and zone management (reset, finish, open, close) commands. Zone management commands closely imitate how flash chips are managed internally and offer more direct control over garbage collection and data placement within zones to the host systems software (i.e., block layer, file systems, or applications; see §II). The zone interface provides opportunities, as the host system software now controls data placement by explicitly identifying which zone to store data in, thus following the "Grouping by Death Time" unwritten contract [17], [30]. Furthermore, the host also implicitly controls garbage collection, as a zone is the unit of garbage collection: a host can decide when a zone needs to be reclaimed by issuing an explicit ZNS command

### Result [1 / 3]: Write vs Append Parallelism Management



mq-deadline merges adjacent writes

**Single Zone Parallelism (intra-zone)** 



- Intra-Zone parallelism has higher performance
- Writes have better performance scalability than Appends (!)
- Append scalability is independent of intra- or inter-zone, but limited in performance

## Result [2 / 3]: The Cost of Reset and Finish Operations





- Zone utilization is a very important factor
- Finish is an extremely expensive operation (100s to 1,000s of milliseconds)
- Leverage intra-zone parallelism (minimize half-written zones)

## Result [3 / 3]: Read-Write Isolation on ZNS



- ZNS provides good read-write isolation when operating on multiple zones
- Stable performance (in comparison to NVMe)



## **New Interference**: Reset on I/O Operations





(a) on write with inter-zone concurrency.

Concurrent resets with a controlled rate on a different zone


Modeling and quantifying interference with a first-order Earth Mover's Distance (EMD)-style model
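The slide does not spell out the interference model, so the following is only the textbook one-dimensional Earth Mover's Distance between two latency samples, offered as a rough illustration of the idea. The function name and the latency values are invented.

```python
def emd_1d(baseline, interfered):
    """EMD between two equally sized 1-D samples.

    For equal-weight samples of the same size, the distance reduces to the
    mean absolute gap between the sorted values of the two samples.
    """
    if not baseline or len(baseline) != len(interfered):
        raise ValueError("samples must be non-empty and of equal size")
    a, b = sorted(baseline), sorted(interfered)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical write latencies in microseconds.
baseline_us = [10, 11, 12, 13]     # no concurrent resets
interfered_us = [10, 12, 15, 23]   # resets running on a different zone
distance = emd_1d(baseline_us, interfered_us)
print(f"EMD = {distance} us")
```

A distance of zero means the two latency distributions are identical; larger values mean the reset traffic shifted more of the latency mass, including the tail.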

#### Interference Results: Micro- and Workload-level





| 0% Reset   | 50% Reset  | Change     |
|------------|------------|------------|
| 78.9 KIOPS | 72.1 KIOPS | 8.7% drop  |

Workload-level interference!

# ZINC: Zone-Interface aware NVMe I/O Command Scheduler

#### All NVMe commands need scheduling for QoS

ZINC is derived from mq-deadline scheduler with a Kyber-style Reset-throttling logic

Extra code to connect the io\_uring passthrough commands to the I/O scheduler in the Linux block layer

Open source, based on Linux v6.3

The paper is under review



### The Design of ZINC





#### The decision process in each epoch:

- (1) Are all the write tokens consumed? If yes, then issue the Reset command to the ZNS device
- (2) If an epoch is reached and the Reset command is still held, then increase its priority
- (3) In any epoch, if the Reset command has a priority higher than "x" (configurable count), then issue it immediately
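The three rules above can be sketched as a per-epoch decision function. This is a minimal illustration, not the ZINC kernel code: `PendingReset`, `on_epoch`, `write_tokens_left`, and `max_priority` (the "x" from rule 3) are hypothetical names, and how write tokens are refilled and consumed is left out because the slide does not specify it.

```python
from dataclasses import dataclass


@dataclass
class PendingReset:
    """A Reset command held back by the scheduler (hypothetical structure)."""
    zone_id: int
    priority: int = 0  # grows by one for every epoch the reset stays held


def on_epoch(reset: PendingReset, write_tokens_left: int, max_priority: int) -> bool:
    """Return True if the held reset should be issued in this epoch."""
    # Rule (1): all write tokens are consumed, so the reset may go out.
    if write_tokens_left == 0:
        return True
    # Rule (2): the epoch ended with the reset still held; raise its priority.
    reset.priority += 1
    # Rule (3): once the priority exceeds the threshold, issue regardless.
    return reset.priority > max_priority
```

With `max_priority=2` and writes that never drain their tokens, a held reset is issued in the third epoch, which bounds how long writes can starve it.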

## **Impact of using ZINC**

#### Controlled degradation



### **Impact of using ZINC**





|      | 0% Reset   | 50% Reset  |
|------|------------|------------|
| MQ-D | 78.9 KIOPS | 72.1 KIOPS |
| ZINC | 78.2 KIOPS | 80.0 KIOPS |

~11% gain

- ZINC helps to control the interference between NVMe I/O and NVMe zone-management commands
- ZINC helps to deliver workload-level performance gains
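As a quick check of the table, the ~11% gain is the relative difference between ZINC and MQ-D at 50% Reset. The short sketch below recomputes it from the rounded KIOPS values shown in the table; the helper name `relative_change` is only for illustration.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


mqd_50 = 72.1    # KIOPS, MQ-D at 50% Reset
zinc_0 = 78.2    # KIOPS, ZINC at 0% Reset
zinc_50 = 80.0   # KIOPS, ZINC at 50% Reset

# ZINC compared with MQ-D under the same 50% Reset load.
gain_vs_mqd = relative_change(mqd_50, zinc_50)
# ZINC at 50% Reset compared with its own 0% Reset baseline.
zinc_own_change = relative_change(zinc_0, zinc_50)

print(f"ZINC vs MQ-D at 50% Reset: {gain_vs_mqd:+.1f}%")
print(f"ZINC, 0% -> 50% Reset:     {zinc_own_change:+.1f}%")
```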

[Part - 1/3]: Performance and Scheduling Overheads

[Part - 2/3]: New Interfaces - Zone Namespace (ZNS) SSDs

[Part - 3/3]: (WiP) Building Workload-Specialized Storage Stacks

### **Constructing an End-to-End Picture**





Workload

File-A File-B File-C File-D

File System: F2FS

Storage stack

NVMe / ZNS Devices


#### **ZNS-Tools: eBPF-powered whole-stack tracing framework**

Collects traces for workload-level data operations

- Workload: RocksDB (WAL, compaction, GC)
- File system: F2FS (log, GC)
- Block layer: ZNS (read, write, reset scheduling)
- Device Driver: NVMe (command issuing)

Builds an offline data location and movement profile

https://github.com/stonet-research/zns-tools

### **ZNS-tools: I/O and Trace visualization**



#### **ZNS-tools: Zone Utilization**



Same workload (YCSB-A), but very different ZNS utilization and placement.

Data grouping interference!



**Result**: Not all ZNS software stacks are equal, hence software specialization matters!

## [1 / 2] Workload-Specialized Control - msF2FS



#### <u>Multistream F2FS (msF2FS)</u> optimized for NVMe ZNS devices

- Gives control over the file ⇒ F2FS zone ⇒ ZNS zone mapping (exclusive or shared)
- Physical separation in zones
- Performance scaling with inter-zone parallelism (F2FS does not support Appends)
- https://github.com/stonet-research/msF2FS

## [2 / 2] Workload-Specialized Control - zWAL



zWAL: A ZNS-native Write-Ahead-Log (WAL) design with Appends (parallel I/O)

Idea: write entries in any order using Append, tag each entry with ordering information, then sort them out later when reading
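The idea can be illustrated with a small simulation. This is a sketch under assumptions, not the zWAL implementation: it assumes the ordering information is a per-entry sequence number, and it uses a seeded shuffle as a stand-in for the device placing parallel appends in arbitrary order. The names `append_out_of_order` and `replay` are hypothetical.

```python
import random
from typing import List, Tuple

Record = Tuple[int, bytes]  # (sequence number, payload)


def append_out_of_order(entries: List[bytes], seed: int = 0) -> List[Record]:
    """Simulate parallel appends: the device decides the on-zone order."""
    tagged = [(seq, payload) for seq, payload in enumerate(entries)]
    random.Random(seed).shuffle(tagged)  # stand-in for device-side reordering
    return tagged


def replay(zone_contents: List[Record]) -> List[bytes]:
    """Recover the logical WAL order by sorting on the sequence number."""
    ordered = sorted(zone_contents, key=lambda record: record[0])
    return [payload for _, payload in ordered]


entries = [b"put a", b"put b", b"del a", b"put c"]
zone_contents = append_out_of_order(entries, seed=1)
recovered = replay(zone_contents)
```

The trade-off is that the write path no longer has to serialize on the write pointer, while the sorting cost moves to the (rare) recovery path.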

Open-sourced code: https://github.com/Krien/ZenFS-append/tree/appends

## [2 / 2] Workload-Specialized Control - zWAL



(a) Micro-benchmarks

(b) Replay cost

(c) YCSB workload

[Part - 1/3]: Performance and Scheduling Overheads

[Part - 2/3]: New Interfaces - Zone Namespace (ZNS) SSDs

<del>[Part - 3/3]: (WiP) Building Workload-Optimized Storage Stacks</del>

### Workload-specialized QoS in an End-to-End Manner



#### **Conclusion**

Vision: use your favorite workload-specialized data structure I/O stack!

The era of workload-specialized storage stacks is here

We are exploring:

- Workload-specialized storage software abstractions
- Mapping software interfaces to the available hardware interfaces
  - NVMe ZNS, KV-SSD, CXL (new)

**WiP:** [Network (CXL) + Storage = Disaggregation] File system, Key-value store, and ML workloads

## Thank you!

#### https://stonet-research.github.io/





Acknowledgments: Work generously funded by the Dutch Research Council (NWO) grants and donations from Xilinx, Western Digital, Mellanox, AWS, and VU Amsterdam.

# **Backup**

### Revisiting Storage APIs: Rise of io\_uring



#### libaio:

- (+) Async I/O
- (+) Any files/FSes
- (+) Any device: HDD, NVMe
- (-) Async only with direct I/O
- (-) Performance
- (-) Metadata management



#### SPDK:

- (+) Performance
- (+) Close application integration
- (+) No syscall or interrupts
- (-) Only NVMe
- (-) No kernel assistance
- (-) Scalability and brittleness



NVMe Device

#### io\_uring:

- + Command-based interface
- + Extensible

**Best of both worlds?** 

### Three Modes of io\_uring API







- (a) default with syscalls
- (b) [iou+p] with completion polling
- (c) [iou+k] with submission polling

## **Benchmarking Setup**

#### Setup 1 [Systor'22]:

- 2x Intel® Xeon® E5-2630 (Sandy Bridge), 10 cores/socket ⇒ 20 CPU cores
- 20 Intel® DC P3600 400GB NVMe Flash SSDs
   ⇒ ~6 Million IOPS

#### Setup 2 [CHEOPS'23]:

- 2x Intel® Xeon® Silver 4210R (Cascade Lake), 10 cores/socket ⇒ 20 CPU cores
- 7x Intel® Optane 900P NVMe SSDs ⇒ 4.2 Million IOPS

### **Results: Scalability**

Systor'22



io\_uring kernel polling: Performance collapses when not enough cores to poll

### **Results: Scalability**

Systor'22 CHEOPS'23





io\_uring kernel polling: Performance collapses when not enough cores to poll

CPU efficiency is still bad: 10x more CPU cores needed to match the SPDK performance

#### **Results: CPU Profile**



The Block layer takes a big chunk of the CPU cycles

The kernel overheads with blocking interfaces

For SPDK, fio itself becomes the bottleneck

## **Results: Efficiency (single CPU core)**



## **Analysis: CPU Profile**





Poor scheduling, and CPU sharing - Careful!

SPDK is still 5x more efficient

## Results: Efficiency with <u>TWO</u> CPU cores



[ aio < iou < iou with polling < iou with kernel poll < SPDK ]

Normal service order can be resumed (**but** at the cost of 2x CPU cores)!

### Result [1 / 4]: Write vs Append Latencies



- 4KiB block size has lower latencies (up to 2x)
- Writes have lower latencies than Append operations in our experiments
- SPDK has lower latencies than the Linux I/O stack (none, mq-deadline)

## Write and Append: Bandwidth



## **New Interference**: Reset on I/O Operations



(a) on write with inter-zone concurrency.



(b) on append with intra-zone concurrency.



(c) on read with intra-zone concurrency.

- Concurrent Reset commands slow down I/O (writes, appends, reads)
- Namespace-based isolation does not help

## **New Interference**: Finish on I/O Operations







(b) on append with intra-zone concurrency.



(c) on read with intra-zone concurrency.

Make a first-order linear model using the EMD distance:



#### **ZINC Interference Model**

$$Z^{Inter} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{\alpha \times (\Delta T_i)^2 + \beta \times (\Delta L_i)^2} \tag{1}$$

With:

$$[\alpha + \beta = 1, \quad 0 \le \alpha \le 1, \quad 0 \le \beta \le 1]$$

$$\Delta T_i = \left(\frac{T_i^{int} - T_i^{iso}}{T_i^{iso}}\right) \tag{2}$$

$$\Delta L_i = \left(\frac{L_i^{int} - L_i^{iso}}{L_i^{iso}}\right) \tag{3}$$
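Equations (1) to (3) can be evaluated directly. The sketch below is a plain transcription of the formulas; the slides do not define T and L, so reading them as throughput and latency measured in isolation (`iso`) and under interference (`int`) is an assumption, as are the function and parameter names.

```python
import math
from typing import Sequence


def relative_delta(interfered: float, isolated: float) -> float:
    """Equations (2) and (3): change relative to the isolated run."""
    return (interfered - isolated) / isolated


def zinc_interference(
    t_int: Sequence[float],
    t_iso: Sequence[float],
    l_int: Sequence[float],
    l_iso: Sequence[float],
    alpha: float = 0.5,
) -> float:
    """Equation (1): mean weighted distance over n paired measurements."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    beta = 1.0 - alpha  # enforces alpha + beta = 1
    n = len(t_iso)
    if n == 0 or not (len(t_int) == len(l_int) == len(l_iso) == n):
        raise ValueError("need equally sized, non-empty measurement lists")
    total = 0.0
    for ti, t0, li, l0 in zip(t_int, t_iso, l_int, l_iso):
        d_t = relative_delta(ti, t0)
        d_l = relative_delta(li, l0)
        total += math.sqrt(alpha * d_t ** 2 + beta * d_l ** 2)
    return total / n


# One sample: T drops by 10% and L rises by 20% under interference.
z = zinc_interference([90.0], [100.0], [120.0], [100.0], alpha=0.5)
```

A value of zero means the interfered run is indistinguishable from the isolated one; `alpha` shifts the weight between the T and L terms.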

## Impact of using ZINC



|      | 0% Reset   | 50% Reset  | Change    |
|------|------------|------------|-----------|
| MQ-D | 78.9 KIOPS | 72.1 KIOPS | 8.7% drop |
| ZINC | 78.2 KIOPS | 80.0 KIOPS | 2.3% gain |



- ZINC helps to control the interference between I/O & zone-management commands
- ZINC helps to deliver workload-level performance gains

#### **ZINC: Reset Profile**

