# Fixing ext4 IO distribution on RAID 0 arrays

Nicolae Vartolomei · 2023/02

Some time ago, I investigated a performance issue with a database. Said database was running on an ext4 file system on top of a RAID 0 array of 12 hard disk drives. All disk metrics (such as IOPS and latency) displayed an anomaly, as shown in the following diagram:

For a certain period, the disk IO distribution went haywire, with three out of twelve disks becoming saturated and restricting the overall system throughput. I will attempt to explain the underlying cause of this and suggest methods for avoiding this issue.

Is the database itself responsible for creating these hotspots? This is highly unlikely, particularly given that the exact same behaviour (i.e., the same three out of twelve disks receiving more IOs or being saturated) was observed on hundreds of other hosts.

Intuition pointed towards the interplay between the RAID 0 array and the file system. And indeed, that was the case. Although the documentation stated that the ext4 file system should take the underlying device into account precisely to avoid this situation, it failed to do so properly in this case.

## RAID 0

RAID 0, also known as striping, is a type of RAID (Redundant Array of Independent Disks) that distributes data blocks across multiple devices. This allows for parallel read and write operations, which can significantly increase the speed of data transfer compared to a single disk.

When creating a RAID 0 array (mdadm), there are two important variables to consider:

1. chunk size (also known as stride, defaults to 512 KiB on modern Linux): the number of consecutive blocks written to each device before moving on to the next one,
2. stripe size: the chunk size multiplied by the number of devices in the array.
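The mapping from a logical block to a disk can be sketched in a few lines. This is a minimal model (the helper name is mine) using the 512 KiB chunk size, 4 KiB blocks, and 12 devices that appear later in this note:

```python
CHUNK_KIB = 512
BLOCK_KIB = 4
N_DISKS = 12
chunk_size_blocks = CHUNK_KIB // BLOCK_KIB  # 128 file-system blocks per chunk

def disk_for_block(block_ix):
    """Disk index holding a given logical 4 KiB block of the array."""
    chunk_ix = block_ix // chunk_size_blocks  # which chunk the block falls into
    return chunk_ix % N_DISKS                 # chunks are dealt out round-robin

print([disk_for_block(b) for b in (0, 127, 128, 1536)])
# [0, 0, 1, 0]: blocks 0 and 127 share the first chunk on disk 0,
# block 128 begins the next chunk on disk 1, block 1536 wraps back to disk 0
```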

## ext4 file system

File systems can be quite intricate, with various abstractions and indirections used to organize data for optimal efficiency and performance. To keep this note brief, I will only highlight the crucial components. More extensive overviews are available elsewhere.

ext4 allocates storage space in units of blocks. Blocks are grouped into block groups of predefined size (in 64-bit mode with 4 KiB blocks this defaults to 32768).

I’m ignoring the flexible block groups concept, as it doesn’t change the problem.
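Since groups are laid out back to back, a group's first block is just its index times the group size. A quick check with the default of 32768 blocks per group:

```python
blocks_per_group = 32768  # ext4 default with 4 KiB blocks

# First block of each of the first five block groups.
starts = [g * blocks_per_group for g in range(5)]
print(starts)
# [0, 32768, 65536, 98304, 131072]; group 4 starting at block 131072
# matches the dumpe2fs output shown further down
```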

In general, most of the IO will be dispersed across the entire device, with good locality on a per-file/directory basis. However, creating/updating/deleting files will entail alterations to the bitmaps situated in the block group headers. This presents an opportunity for hotspots to arise on the device.

Where are these block group headers located?

```python
disk_hits = [0] * 12
chunk_size_blocks = 512 // 4  # 512 KiB chunk / 4 KiB block
blocks_per_group = 32768

for group_ix in range(1000):
    # Disk that holds the first block (and thus the header) of this group.
    group_on_disk_ix = ((group_ix * blocks_per_group) // chunk_size_blocks) % len(disk_hits)
    disk_hits[group_on_disk_ix] += 1

print(disk_hits)
```

The result of the script is this:

```
[334, 0, 0, 0, 333, 0, 0, 0, 333, 0, 0, 0]
```

D’oh. But is it indeed the case? The script only considers where block groups start. However, a chunk size of 512 KiB accommodates the whole header even in the case of flexible block groups, so the start position alone determines which disk takes the metadata IO.
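The three-disk pattern follows directly from the arithmetic of the defaults; under the same simplified model as the script above:

```python
chunk_size_blocks = 512 // 4   # 128 blocks per chunk
blocks_per_group = 32768
n_disks = 12

# Group starts are exactly chunk-aligned...
print(blocks_per_group % chunk_size_blocks)               # 0
# ...and each group begins a whole number of chunks after the previous one:
print((blocks_per_group // chunk_size_blocks) % n_disks)  # 256 % 12 = 4
# so successive group headers land on disks 0, 4, 8, 0, 4, 8, ...
```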

We could use the dumpe2fs utility to double-check where bitmaps land on a real system, just to be sure. We need to parse its output, which looks like this:

```
...
Group 4: (Blocks 131072-163839) csum 0xf5f6 [INODE_UNINIT, BLOCK_UNINIT, ITABLE_ZEROED]
  Block bitmap at 1179 (bg #0 + 1179), csum 0x00000000
  Inode bitmap at 1195 (bg #0 + 1195), csum 0x00000000
  Inode table at 3255-3766 (bg #0 + 3255)
  32768 free blocks, 8192 free inodes, 0 directories, 8192 unused inodes
  Free blocks: 131072-163839
  Free inodes: 32769-40960
...
```
```python
import re
import subprocess

fs_info = subprocess.run(['dumpe2fs', '/dev/md127'],
                         stdout=subprocess.PIPE).stdout.decode('utf-8')

bitmap_re = re.compile(r'bitmap at (\d+)')

disk_hits = [0] * 12
chunk_size_blocks = 512 // 4
for line in fs_info.splitlines():
    if 'bitmap at' not in line:
        continue
    # Block number the bitmap lives at, mapped to a disk of the array.
    block_ix = int(bitmap_re.search(line)[1])
    disk_ix = (block_ix // chunk_size_blocks) % len(disk_hits)
    disk_hits[disk_ix] += 1

print(disk_hits)
```

Output:

```
[6368, 0, 0, 0, 6400, 0, 0, 0, 6388, 32, 0, 0]
```

It appears that ext4 is unable to optimize metadata placement after all. I came across a commit that was somewhat related; it disabled an optimization that was linked to RAID arrays. However, I did not have enough time to delve deeply into it.

Instead, I came up with a custom `blocks_per_group` value to mitigate the problem when using 12 disks in a RAID 0 array: 32768 − 24 = 32744.

```python
disk_hits = [0] * 12
chunk_size_blocks = 512 // 4
blocks_per_group = 32768 - 24

for group_ix in range(1000):
    group_on_disk_ix = ((group_ix * blocks_per_group) // chunk_size_blocks) % len(disk_hits)
    disk_hits[group_on_disk_ix] += 1

print(disk_hits)
```

```
[83, 83, 84, 84, 83, 84, 83, 83, 82, 84, 84, 83]
```

It is worth noting that the severity of the issue is contingent on the number of devices or disks in the array. The new, less round value (i.e., $2^{15}-24$) completely eliminates the hotspot problem, regardless of the number of disks in the array.
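The "regardless of the number of disks" claim can be checked by sweeping the same simplified model over array widths. The cutoff of 16 below is my own loose bound; the observed spread is much smaller:

```python
chunk_size_blocks = 512 // 4  # 128 blocks per chunk

def spread(blocks_per_group, n_disks, n_groups=1000):
    """Difference between the most- and least-hit disks for group starts."""
    hits = [0] * n_disks
    for g in range(n_groups):
        hits[(g * blocks_per_group // chunk_size_blocks) % n_disks] += 1
    return max(hits) - min(hits)

for n in range(2, 17):
    assert spread(32744, n) <= 16, n  # close to uniform for every width
print('32744 keeps group starts balanced for 2..16 disks')
```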

Could this be a better default? Or, in general, does the disadvantage of block groups beginning at somewhat arbitrary chunk locations (instead of being aligned with chunk start) outweigh the advantages? Thus far, this adjustment has functioned smoothly for us.

## Conclusion

If you have an even number of disks in a RAID 0 array and are using the ext4 file system, you most likely want to pass the `-g 32744` option to mkfs when creating it, for optimal performance.

### Appendix A: Simulating with a synthetic workload

Create 12 sparse files of size 100G:

```shell
for i in $(seq 12); do
    dd if=/dev/zero of=disk$i.img bs=1 count=0 seek=100G
    losetup /dev/loop$i ./disk$i.img
done
```

Create the RAID 0 array:

```shell
mdadm --create /dev/md127 --level 0 --raid-devices 12 /dev/loop{1..12}
```

Create an ext4 filesystem on it and mount:

```shell
mkfs.ext4 /dev/md127 && mkdir -p /disk && mount /dev/md127 /disk
```

```
# df -h /disk
Filesystem      Size  Used Avail Use% Mounted on
/dev/md127      1.2T   28K  1.1T   1% /disk
```

Create 100K files:

```shell
cd /disk
for i in $(seq 100000); do touch file$i; done
```

Check the number of writes per loop device and spot the outliers:

```shell
for i in $(seq 12); do echo -n "loop$i: "; cat /sys/block/loop$i/stat | awk '{print$5}'; done
```

```
loop1: 9892 **
loop2: 8824
loop3: 8673
loop4: 8675
loop5: 9772 **
loop6: 8836
loop7: 8632
loop8: 8536
loop9: 9128 **
loop10: 8603
loop11: 8443
loop12: 8399
```

Now retry, but with the blocks-per-group value set to 32768 − 24 = 32744:

```shell
umount /disk  # unmount the file system from the previous run first
mkfs.ext4 -g 32744 /dev/md127 && mkdir -p /disk && mount /dev/md127 /disk
cd /disk
for i in $(seq 100000); do touch file$i; done
for i in $(seq 12); do echo -n "loop$i: "; cat /sys/block/loop$i/stat | awk '{print$5}'; done

Better:

```
loop1: 9151
loop2: 8856
loop3: 8974
loop4: 8975
loop5: 8881
loop6: 8777
loop7: 8937
loop8: 9345
loop9: 8945
loop10: 9009
loop11: 9075
loop12: 8827
```
And check bitmap placement with dumpe2fs, using the same script as before:

```python
import re
import subprocess

fs_info = subprocess.run(['dumpe2fs', '/dev/md127'],
                         stdout=subprocess.PIPE).stdout.decode('utf-8')

bitmap_re = re.compile(r'bitmap at (\d+)')

disk_hits = [0] * 12
chunk_size_blocks = 512 // 4
for line in fs_info.splitlines():
    if 'bitmap at' not in line:
        continue
    block_ix = int(bitmap_re.search(line)[1])
    disk_ix = (block_ix // chunk_size_blocks) % len(disk_hits)
    disk_hits[disk_ix] += 1

print(disk_hits)
```

Output:

```
[1570, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1600, 1632, 1600, 1600]
```