blk-mq paper note
The current generation of IO-intensive algorithms and systems is built on two fundamental characteristics of hard disk drives:
- Random accesses that require disk head movement are slow, and
- sequential accesses that only require rotation of the disk platter are fast.
Three contributions to solve the block layer bottleneck:
- The current request queue lock is a single coarse lock, and it is the main bottleneck. The single-lock design is especially painful on multi-core CPUs, because all cores must agree on the state of the request queue lock, which quickly results in significant performance degradation.
- A new design for IO management: multiple IO submission/completion queues that minimize cache coherence traffic across CPU cores. It introduces two levels of queues: (i) software queues manage the IOs submitted from a given CPU core, and (ii) hardware queues map onto the underlying SSD device driver's submission queues.
- A new no-op block driver that allows developers to investigate OS block layer improvements in isolation (a hedged sketch follows this list). The two-level locking design reduces the number of cache and pipeline flushes compared to a single-level design.
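A minimal sketch of what such a no-op driver's submission path could look like with the multi-queue (blk-mq) API: each request is acknowledged and completed immediately without touching any hardware, so any measured latency is pure block-layer overhead. This is an illustrative, untested fragment (prototypes vary across kernel versions), not the paper's actual driver.

```c
#include <linux/blk-mq.h>

/* Submission handler of a no-op driver: acknowledge and complete the request
 * without issuing any device IO, so measured latency is block-layer overhead. */
static blk_status_t noop_queue_rq(struct blk_mq_hw_ctx *hctx,
                                  const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);           /* mark the request as in flight */
        blk_mq_end_request(rq, BLK_STS_OK); /* complete it immediately       */
        return BLK_STS_OK;
}
```

Mainline Linux later gained a null_blk driver that plays essentially this role.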
Three main problems within the current block layer:
- Request Queue Locking: The block layer fundamentally synchronizes shared accesses to an exclusive resource: the IO request queue. (i) Whenever a block IO is inserted or removed from the request queue, this lock must be acquired. (ii) Whenever the request queue is manipulated via IO submission, this lock must be acquired. (iii) As IOs are submitted, the block layer also performs optimizations such as plugging, (iv) IO accounting, and (v) fairness scheduling, all under the same lock. This is a major source of contention (a simplified model of the single-lock path follows this list).
- Hardware Interrupts: Most of today's storage devices are designed such that one core is responsible for handling all hardware interrupts and forwarding them to other cores as soft interrupts, regardless of which CPU issued or completed the IO. As a result, a single core may spend considerable time handling these interrupts, context switching, and polluting the L1 and L2 caches that applications could otherwise rely on for data locality. The other cores must then also take an IPI to run the IO completion routine. In many cases, two interrupts and context switches are therefore required to complete a single IO.
- Remote Memory Accesses: Request queue lock contention is exacerbated when it forces remote memory accesses across CPU cores (or across sockets in a NUMA architecture).
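To make the locking problem concrete, here is a simplified userspace model (not kernel code; the struct and function names are illustrative): every submitting thread, standing in for a CPU core, must take the same spinlock to insert into one shared request list, so submission is serialized and the lock's cache line ping-pongs between cores.

```c
#include <pthread.h>
#include <stdio.h>

struct request { long sector; struct request *next; };

struct request_queue {
        pthread_spinlock_t lock;  /* the single, coarse queue lock          */
        struct request *head;     /* one shared staging list for all cores  */
} reqq;

static void submit_io(struct request *rq)
{
        pthread_spin_lock(&reqq.lock);   /* every core contends here for     */
        rq->next = reqq.head;            /* insertion, merging, accounting,  */
        reqq.head = rq;                  /* scheduling, ...                  */
        pthread_spin_unlock(&reqq.lock);
}

int main(void)
{
        pthread_spin_init(&reqq.lock, PTHREAD_PROCESS_PRIVATE);
        struct request r = { .sector = 0 };
        submit_io(&r);
        printf("queued sector %ld\n", reqq.head->sector);
        return 0;
}
```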
Block multi-queue architecture:
Distribute the contention on the single request queue lock across multiple queues through the use of two levels of queues:
- Software Staging Queues: Block IO requests are now maintained in a collection of one or more request queues. These staging queues can be configured per socket or per core on the system.
- Hardware Dispatch Queues: The number of hardware dispatch queues will typically match the number of hardware contexts supported by the device driver. Because IO ordering is not supported within the block layer, any software queue may feed any hardware queue without needing to maintain a global ordering. This allows hardware to implement one or more queues that map onto NUMA nodes or CPUs directly, and provides a fast IO path from application to hardware that never has to access remote memory on any other node.
This two-level design explicitly separates the two buffering functions of the staging area that were previously merged into a single queue in the Linux block layer: (i) support for IO scheduling (software level) and (ii) a means to adjust the submission rate (hardware level) to prevent device buffer overrun.
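A conceptual sketch of the two-level structure (all names are illustrative, not the actual blk-mq structs): each core stages IO in its own software queue, and software queues are mapped many-to-one onto the hardware dispatch queues exported by the driver.

```c
#include <stdio.h>

#define NR_CPUS      8   /* one software staging queue per core    */
#define NR_HW_QUEUES 2   /* dispatch queues exported by the device */

struct sw_queue { int cpu;   /* staged requests would live here          */ };
struct hw_queue { int index; /* mirrors one device submission queue      */ };

/* blk-mq builds a similar static cpu -> hardware queue map, so a core always
 * dispatches to the same hardware queue and stays NUMA-local on the fast path. */
static int map_sw_to_hw(int cpu)
{
        return cpu % NR_HW_QUEUES;
}

int main(void)
{
        struct sw_queue swq[NR_CPUS];
        struct hw_queue hwq[NR_HW_QUEUES];

        for (int i = 0; i < NR_HW_QUEUES; i++)
                hwq[i].index = i;
        for (int cpu = 0; cpu < NR_CPUS; cpu++) {
                swq[cpu].cpu = cpu;
                printf("sw queue %d -> hw queue %d\n",
                       cpu, hwq[map_sw_to_hw(cpu)].index);
        }
        return 0;
}
```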
Device driver extensions to achieve optimal performance (a hedged registration sketch follows this list):
- HW dispatch queue registration: The device driver must export the number of submission queues that it supports as well as the size of these queues, so that the block layer can allocate the matching hardware dispatch queues.
- HW submission queue mapping function: The device driver must export a function that returns the mapping between a given software-level queue and the appropriate hardware dispatch queue.
- IO tag handling: The device driver's tag management mechanism must be revised so that it accepts tags generated by the block layer. While not strictly required, using a single tag shared between the block layer and the device driver results in optimal CPU usage.
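A hedged sketch of how these three extensions surface in the blk-mq driver API: the driver fills in a tag set describing its queue count and depth, the default CPU-to-hardware-queue map (or a driver-supplied one) handles the mapping, and per-request driver data is keyed by the block-layer tag. Exact signatures vary by kernel version; the mydrv_* names are placeholders, and the constants are examples only.

```c
#include <linux/blk-mq.h>
#include <linux/module.h>

/* Per-request driver state; blk-mq allocates cmd_size bytes per request and
 * the driver reaches it via blk_mq_rq_to_pdu(rq), indexed by the block-layer tag. */
struct mydrv_cmd {
        int opcode;
};

static blk_status_t mydrv_queue_rq(struct blk_mq_hw_ctx *hctx,
                                   const struct blk_mq_queue_data *bd)
{
        struct request *rq = bd->rq;

        blk_mq_start_request(rq);
        /* rq->tag is the tag generated by the block layer; a real driver would
         * reuse it as the device command identifier instead of allocating its own. */
        blk_mq_end_request(rq, BLK_STS_OK);
        return BLK_STS_OK;
}

static const struct blk_mq_ops mydrv_mq_ops = {
        .queue_rq = mydrv_queue_rq,
        /* .map_queues could be set to override the default CPU-to-hw-queue map */
};

static struct blk_mq_tag_set mydrv_tag_set;

/* HW dispatch queue registration: export how many queues the device supports and
 * how deep they are, so the block layer can allocate matching dispatch queues. */
static int mydrv_register_queues(void)
{
        mydrv_tag_set.ops          = &mydrv_mq_ops;
        mydrv_tag_set.nr_hw_queues = 4;    /* submission queues on the device (example) */
        mydrv_tag_set.queue_depth  = 64;   /* size of each hardware queue (example)     */
        mydrv_tag_set.numa_node    = NUMA_NO_NODE;
        mydrv_tag_set.cmd_size     = sizeof(struct mydrv_cmd);

        return blk_mq_alloc_tag_set(&mydrv_tag_set);
}
```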
Refer to the paper: Linux Block IO: Introducing Multi-queue SSD Access on Multi-core Systems