I/O Performance Benchmarking Linux

On Linux at least, any answer about synthetic benchmarking should mention fio - it really is a Swiss Army knife of an I/O generator.

A brief summary of its capabilities:

  • It can generate I/O to devices or files
  • Submitting I/O using a variety of different methods
    • Sync, psync, vsync
    • Native/posix aio, mmap, splice
  • Queuing I/O up to a specified depth
  • Specifying the size I/O is submitted in
  • Specifying I/O type
    • Sequential/random
      • If I/O is random, you can specify the distribution to skew it towards so it is more realistic
    • Reads/writes or some mixture of the two
    • I/O recorded with blktrace can be replayed
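
For example, a single fio invocation exercising several of the options above might look like the sketch below. The device path and run length are placeholders - point it at a scratch device or file, and stick to read workloads unless you can afford to overwrite it:

  # 4k random reads at queue depth 32, submitted with native async I/O (libaio),
  # bypassing the page cache, running for 60 seconds
  fio --name=randread-4k --filename=/dev/sdX \
      --ioengine=libaio --direct=1 \
      --rw=randread --bs=4k --iodepth=32 \
      --runtime=60 --time_based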

The statistics it gives you cover

  • Amount of I/O generated in MBytes
  • Average bandwidth
  • Submission/completion latency with minimum, maximum, average and standard deviation
  • IOPS
  • Average queue depth

The list of features and output goes on and on.

It won't produce a single unified number at the end that represents everything, but if you're serious about understanding storage performance you'll know that no single number can capture everything you need to know. Even Linus Torvalds thinks fio is good:

[G]et Jens' FIO code. It does things right [...] Anything else is suspect - forget about bonnie or other traditional tools.

Brendan Gregg (a Netflix Performance Engineer) has also mentioned fio positively:

My other favorite benchmarks are fio by @axboe [...]

PS: Are you about to publish benchmarks that you did using fio on a website/in a paper etc? Don't forget to follow https://github.com/axboe/fio/blob/master/MORAL-LICENSE !


I'd recommend using bonnie++ for disk performance testing. It's made specifically for doing that sort of thing.
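
As a rough illustration only (directory, size and user are placeholders, not something from the original answer), a typical invocation might look like this:

  # -d: directory on the filesystem under test
  # -s: test file size in MiB; use at least twice your RAM so the page cache
  #     doesn't hide the disk
  # -u: user to run as when started as root
  bonnie++ -d /mount-point/testdir -s 16384 -u someuser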


I suggest reading these two posts/articles:

http://www.linuxinsight.com/how_fast_is_your_disk.html
http://www.linuxforums.org/forum/red-hat-fedora-linux/153927-iscsi-raid1-lvm-setup-poor-write-performance.html

In particular:

First, I would suggest using a more accurate and controllable tool to test performance. hdparm was designed to change IDE device parameters, and the test it does is quite basic. You also can't tell what is going on when using hdparm on compound devices such as LVM volumes or iSCSI LUNs. In addition, hdparm does not test write speed, which is not correlated with read speed, since different optimizations apply to each (write-back caches, read-ahead and prefetching algorithms, etc.).
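
For reference, the basic hdparm test being criticized here is just its read-timing mode (device name is a placeholder): -T times reads served from the page cache and -t times buffered reads from the device; there is no write test at all.

  hdparm -tT /dev/sdX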

I prefer to use the good old dd command, which gives you fine control over block sizes, test length and use of the buffer cache. It also gives you a nice, short report of the transfer rate. You can even choose to test buffer-cache performance itself.

Also, do realize that there are several layers involved here, including the filesystem. hdparm only tests access to the RAW device.

TEST COMMANDS
I suggest using the following commands for tests:

a) For raw devices, partitions, LVM volumes, software RAIDs and iSCSI LUNs (initiator side). A block size of 1M is fine for testing bulk transfer speed on most modern devices. For TPS (transactions per second) tests, use small sizes like 4k instead. Increase the count to make the test more realistic (I suggest a long run, so you measure the sustained rate rather than transitory interference). The "direct" flag (oflag=direct for writes, iflag=direct for reads) avoids the buffer cache, so the test results should be repeatable.

Write test: dd if=/dev/zero of=/dev/ bs=1M count=1024 oflag=direct
Read test: dd if=/dev/ of=/dev/null bs=1M count=1024 iflag=direct

Example output for dd with 512x1M blocks:
536870912 bytes (537 MB) copied, 10.1154 s, 53.1 MB/s
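
For the small-block (TPS) variant mentioned above, the same pattern applies with bs=4k; a non-destructive read example might look like this (the device name is a placeholder):

  # ~400 MB of 4k direct reads; watch IOPS with iostat in another terminal
  dd if=/dev/sdX of=/dev/null bs=4k count=100000 iflag=direct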

The WRITE test is DESTRUCTIVE!!! You should only do it BEFORE CREATING A FILESYSTEM ON THE DEVICE!!! On raw devices, beware that the partition table will be erased. In that case you should force the kernel to reread the partition table (e.g. with fdisk) to avoid problems. However, performance on the whole device and on a single partition should be the same.
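
As an aside, the reread step can also be done without re-entering fdisk (the text above names fdisk; these are alternatives, and the device name is a placeholder):

  partprobe /dev/sdX
  # or, equivalently:
  blockdev --rereadpt /dev/sdX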

b) For a filesystem, just change the device for a file name under the mount point.
Write test: dd if=/dev/zero of=/mount-point/test.dat bs=1M count=1024 oflag=direct
Read test: dd if=/mount-point/test.dat of=/dev/null bs=1M count=1024 iflag=direct

Note that even though we are accessing a file, we are not using the buffer cache.

c) For the network, just test raw TCP sockets in both directions between the servers. Beware of the firewall blocking TCP port 5001.

server1# dd if=/dev/zero bs=1M count=1024 | netcat 5001
server2# netcat -l -p 5001 | dd of=/dev/null

TEST LAYERS
Now you have a tool to test disk performance for each layer. Just follow this sequence:

a) Test local disk performance on iSCSI servers.
b) Test network TCP performance between iSCSI targets and initiators.
c) Test disk performance on iSCSI LUNs on the iSCSI initiator (this is the final raw performance of the iSCSI protocol).
d) Test performance on the LVM logical volume.
e) Test performance on large files on top of the filesystem.

You should see a large performance gap between the layer responsible for the loss and the layer beneath it. But I don't think this is LVM; I suspect the filesystem layer.

Now some tips for possible problems:

a) You didn't say whether you defined a striped LVM volume on the iSCSI LUNs. Striping could create a bottleneck if synchronous writing is used on the iSCSI targets (see the atime issue below). Remember that the default iSCSI target behaviour is synchronous writes (no RAM caching).
b) You didn't describe the kind of access pattern to your files:
- Long sequential transfers of large amounts of data (100s of MB)?
- Sequences of small-block random accesses?
- Many small files?

I may be wrong, but I suspect that your system could be suffering from the effects of the "atime" issue. The atime issue is a consequence of "original ideas about Linux kernel design" that we have suffered in recent years, driven by people eager to participate in the design of an OS who are not familiar with the performance implications of design decisions.

In a few words: for almost 40 years, UNIX has updated the "last access time" (atime) of an inode every time a read/write operation is done on its file. The buffer cache holds data updates that don't propagate to disk for a while. However, in the Linux design, each update to an inode's atime has to be written SYNCHRONOUSLY AND IMMEDIATELY to disk. Just consider the implications of interleaving synchronous transfers in a stream of operations on top of the iSCSI protocol.
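
If you want to see the timestamp being discussed, stat shows it for any file (the path is a placeholder); the "Access:" line in its output is the atime in question:

  stat /mount-point/test.dat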

To check whether this applies to you, just do this test:
- Read a large file (for at least 30 seconds) without using the cache. With dd, of course!
- At the same time, monitor the I/O with "iostat -k 5".
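
Concretely, that could look like this (the file path is a placeholder):

  # terminal 1: a long uncached sequential read
  dd if=/mount-point/bigfile.dat of=/dev/null bs=1M iflag=direct
  # terminal 2: watch for a steady trickle of writes while the read runs
  iostat -k 5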

If you observe a small but continuous flow of write operations while reading data, it could be the inode updates.

Solution: things have got so awkward on Linux that a mount option has been added to some filesystems (XFS, ext3, etc.) to disable atime updates. Of course that makes filesystem semantics differ from the POSIX standard, and some applications that rely on the last access time of files could misbehave (mostly email readers and servers such as pine, elm, Cyrus, etc.). Just remount your filesystem with the options "noatime,nodiratime". There is also a "relatime" option on recent distributions, which only updates atime when it is older than the modification time and so greatly reduces the atime write traffic.
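
As a concrete example of that remount (the mount point is a placeholder):

  mount -o remount,noatime,nodiratime /mount-point
  # to make it permanent, add noatime,nodiratime to the options field of the
  # filesystem's line in /etc/fstab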

Please drop a note with the results of these tests and of your investigation.
