By Damien Le Moal and Javier González

This June, NVM Express™ ratified the NVMe Zoned Namespace (ZNS) Command Set, which divides the NVMe namespace into independent zones that support specific data or workload types. NVMe ZNS is a breakthrough in the Zoned Storage ecosystem and will provide several performance benefits for SSDs, including reduced write amplification, improved throughput and latency, reduced media over-provisioning and more.

We recently participated in the NVM Express Q3 webcast to educate the public on this new technology. We covered the ZNS command set, the Linux zoned storage ecosystem and the tools and libraries to enable ZNS SSDs.

We received several great audience questions that we were unable to cover during the live webcast. You can find the answers to these below:

Testing

  • Was the ZNS-related work tested in QEMU (Quick EMUlator) only, or on actual ZNS drives?

Both QEMU and actual ZNS drives were used for testing.

Implementation

  • If a hybrid drive has both a ZNS namespace and random write namespaces, can f2fs (flash-friendly file system) work directly on this type of drive?

Yes. Each namespace will be exposed by the Linux host as a separate block device: one regular block device and one zoned block device. These two block devices can be used together to format a single f2fs volume (with f2fs-tools, the second device can be added to the volume at format time, e.g. using the mkfs.f2fs -c option). In this case, the regular namespace will be used by f2fs to store its fixed-location metadata blocks. From the Linux host perspective, this is the same as having two different SSDs, one with a conventional namespace and one with a zoned namespace.

  • What happens if a zone is full? Does the write from the application fail in this scenario, or will a new zone be allocated and linked to the first zone?

Writing to a full zone results in a write error. The user (an application or kernel component) must keep track of the amount of unwritten space in a zone to ensure that such an error does not happen. If the zone being written becomes full (that is, the zone write pointer reaches the zone capacity), the user must direct further write operations to another, non-full zone.
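As an illustration, a host application can check a zone's condition and write pointer with the BLKREPORTZONE ioctl before issuing writes. The following is a minimal sketch only, not production code: the device path is hypothetical, and zone capacity reporting assumes Linux 5.9 or later.

    /*
     * Minimal sketch (not production code): query the first zone of a zoned
     * block device with the BLKREPORTZONE ioctl and compute the writable
     * space left before the zone becomes full. The device path is
     * hypothetical; zone capacity reporting requires Linux 5.9 or later.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY); /* hypothetical ZNS namespace */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* Room for the report header plus one zone descriptor. */
        struct blk_zone_report *rep =
            calloc(1, sizeof(*rep) + sizeof(struct blk_zone));
        if (!rep)
            return 1;
        rep->sector = 0;   /* start reporting from the first zone */
        rep->nr_zones = 1;

        if (ioctl(fd, BLKREPORTZONE, rep) < 0) {
            perror("BLKREPORTZONE");
            return 1;
        }

        struct blk_zone *z = &rep->zones[0];
        if (z->cond == BLK_ZONE_COND_FULL) {
            /* Writing here would fail: the user must pick another zone. */
            printf("zone at sector %llu is full\n",
                   (unsigned long long)z->start);
        } else {
            /* Sectors left between the write pointer and start + capacity. */
            unsigned long long left = z->start + z->capacity - z->wp;
            printf("zone at sector %llu: %llu sectors writable\n",
                   (unsigned long long)z->start, left);
        }

        free(rep);
        close(fd);
        return 0;
    }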

  • What is the behavior of the Linux kernel if the zone size is not a power of 2? What is the behavior of the Linux kernel if zone capacity is not available for a device while in use?

Linux only supports zoned block devices with a constant zone size that is a power-of-2 number of LBAs (logical block addresses); this allows sector-to-zone translations to be done with shifts and masks instead of divisions, as illustrated in the sketch after this answer. Zoned block devices with a zone configuration that does not match this constraint will not be exposed as a block device file and will only be accessible by user applications through a passthrough interface; that is, file systems and device mappers will not work with such a device. Note that a device can expose a zone capacity that is smaller than the zone size and not a power of 2.

Only ZBC (Zoned Block Commands in SCSI) and ZAC (Zoned Device ATA Command Set in SATA) SMR (Shingled Magnetic Recording) hard disks do not define a zone capacity. For these devices, the Linux kernel uses the zone size as the zone capacity, that is, zone capacity is always reported as being equal to the zone size. For NVMe ZNS devices, reporting a zone capacity is mandatory.
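To illustrate why the power-of-2 zone size constraint is useful, the minimal sketch below (the device path is hypothetical) retrieves the zone size with the BLKGETZONESZ ioctl and maps a sector to its zone with a shift and a mask instead of a division and a modulo, which is the fast path this constraint enables.

    /*
     * Minimal sketch (not production code): because the zone size is a power
     * of 2, a sector can be mapped to its zone with a shift and a mask
     * rather than a division and a modulo. The device path is hypothetical.
     * BLKGETZONESZ returns the zone size in 512 B sectors (0 if the device
     * is not zoned).
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDONLY); /* hypothetical ZNS namespace */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        unsigned int zone_sectors = 0;
        if (ioctl(fd, BLKGETZONESZ, &zone_sectors) < 0 || zone_sectors == 0) {
            fprintf(stderr, "not a zoned block device\n");
            return 1;
        }

        /* zone_sectors is a power of 2, so its log2 is a valid shift count. */
        unsigned int shift = __builtin_ctz(zone_sectors);

        unsigned long long sector = 123456789ULL; /* any 512 B sector number */
        unsigned long long zone_idx = sector >> shift;            /* sector / zone size */
        unsigned long long in_zone = sector & (zone_sectors - 1); /* sector % zone size */

        printf("sector %llu -> zone %llu, offset %llu\n",
               sector, zone_idx, in_zone);
        close(fd);
        return 0;
    }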

  • If the logical block size is 4KB and flash page size is 16KB, how is the write/append of a single logical block persisted? Two options I can think of are (1) partial page programming and (2) buffering in NVRAM (non-volatile random-access memory). Can you please comment on the feasibility of these and other options?

This is a device-level implementation problem: the vendor must decide on the most appropriate way to handle writes smaller than a flash page. The NVMe ZNS standard only specifies that the device controller must accept write operations as small as a single LBA.

  • How is GC (garbage collection) handled with btrfs (b-tree file system)? Is there over-provisioning?

The current btrfs rebalancing operation is reused. This operation copies the valid blocks from one block group (zone) to another, empty block group (zone), compacting the blocks in the destination block group. Some zones are reserved to guarantee that empty block groups/zones are always available, so that a rebalancing operation can always be executed.
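The reclaim step at the end of such an operation follows the generic pattern used on any Linux zoned block device: once the valid blocks of a victim zone have been copied to an empty zone, the victim zone is reset for reuse. The sketch below shows only this last step with the BLKRESETZONE ioctl; it is a generic illustration, not btrfs internal code, and the device path and zone geometry are hypothetical.

    /*
     * Minimal sketch (not btrfs code): reset a zone once its valid blocks
     * have been relocated, making the zone writable again from its start.
     * Device path and zone geometry are hypothetical.
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <unistd.h>
    #include <linux/blkzoned.h>

    int main(void)
    {
        int fd = open("/dev/nvme0n1", O_RDWR); /* hypothetical ZNS namespace */
        if (fd < 0) {
            perror("open");
            return 1;
        }

        /* ... copy the valid blocks of the victim zone to an empty zone ... */

        /* Reset the victim zone: its write pointer moves back to the zone
         * start and the zone's data is discarded. Values are illustrative. */
        struct blk_zone_range range = {
            .sector     = 524288,  /* victim zone start, in 512 B sectors */
            .nr_sectors = 524288,  /* one zone worth of sectors */
        };
        if (ioctl(fd, BLKRESETZONE, &range) < 0) {
            perror("BLKRESETZONE");
            return 1;
        }

        close(fd);
        return 0;
    }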

Performance

  • How does the btrfs ZNS implementation performance compare with non-ZNS?

Our initial performance evaluation shows that performance is similar for most workloads (on drives with comparable raw performance). For some workloads, performance improvements of up to 60% can be observed with btrfs on ZNS. So far, only a single workload (an extreme case) shows a performance drop of 10-20%.

  • Do SMR drives give better performance than CMR drives?

This depends on the SMR zone model being used; in all cases, SMR and CMR drives have very similar raw (mechanical) performance profiles. Comparing host-managed SMR disks, which have a per-zone sequential write constraint (unlike host-aware and drive-managed SMR disks), with regular CMR disks, drives from the same generation are mostly equivalent mechanically (seek performance) and have comparable linear bit density (maximum throughput). This generally results in similar I/O rates and throughput for equal read and write patterns. For host-aware and drive-managed SMR disks, performance may differ from CMR disks depending on the workload.

Applications

  • Is there any application on the market now that can hit optimized ZNS performance?

We are not aware of any at this time. However, open source application work, for instance with RocksDB, shows optimal drive performance without any added overhead from the zone interface.

  • Which companies are making these ZBD and ZNS drives other than Western Digital?

Some companies have recently announced ZNS NVMe SSD products. In general, storage device vendors should be contacted for information about the availability of products and the technology used.

Future Plans

  • What is the plan to support TP4076 in the block I/O stack?

TP4076 is still a work in progress. It will be supported either in-kernel or through the passthrough interface.

  • When will the ZNS standard be publicly available?

The ZNS Command Set is available as a ratified technical proposal (TP 4053) in the NVMe 1.4a specification and does not require an NVM Express membership to download. It can be accessed from the NVM Express specifications page.

We hope we were able to answer all your questions on NVMe ZNS. If you were unable to attend the live webcast, we recommend you watch the full video recording on the NVM Express YouTube channel. For the latest specification updates and technical proposals, you can visit the NVM Express website.