- 10-06-2009 #1Just Joined!
- Join Date
- Oct 2009
- Posts
- 1
iscsi / raid1 / lvm setup poor write performance
Hi,
I have one server with two 1 TB disks in RAID 1 built with mdadm, on which I created 100 GB LUN files that are exported to other systems over iSCSI. All boxes run CentOS 5.
hdparm -t on the RAID volume on the iSCSI target host gives around 60 MB/s.
On the iSCSI initiator boxes the resulting iSCSI drives show similar speeds with hdparm -t. The iSCSI devices are put into an LVM volume so I can later add more space to the filesystems when needed.
The resulting LVM filesystem has an hdparm -t read speed of about 45 MB/s, but once I start writing to it, the speed drops to something like 5 MB/s. With iptraf I can see that the network isn't carrying more than around 50 Mbit/s.
Both boxes are connected to a decent gigabit switch.
While testing I'm seeing around 50% wait state (wa) on the iSCSI target host, which is probably the source of the performance drop.
On the iSCSI initiator host I am also seeing a lot of wa, from 50 to 80%.
Is it iSCSI that is performing badly?
Any ideas on what I can try to pin down the actual problem and, even better, fix it?
Cheers,
- 10-29-2009 #2Just Joined!
- Join Date
- Oct 2009
- Posts
- 2
Testing performance
Hi,
I'm interested in complex devices based on iSCSI LUNs and their performance.
First, I would suggest using a more accurate and controllable tool to test performance. hdparm was designed to change IDE device parameters, and the test it does is quite basic. You also can't tell what is going on when using hdparm on compound devices built on LVM and iSCSI. Also, hdparm does not test write speed, which is unrelated to read speed because the two have different optimizations (write-back caches, read-ahead and prefetching algorithms, etc.).
I prefer the good old dd command, which gives you fine control over block sizes, test length and use of the buffer cache. It also gives you a nice, short report on the transfer rate. You can also choose to test buffer-cache performance.
Also, be aware that there are several layers involved here, including the filesystem. hdparm only tests access to the raw device.
TEST COMMANDS
I suggest using the following commands for tests:
a) For raw devices, partitions, LVM volumes, software RAIDs, iSCSI LUNs (initiator side). A block size of 1M is fine for testing bulk transfer speed on most modern devices. For TPS tests, use small sizes like 4k. Change count to make the test more realistic (I suggest a long test, so you measure the sustained rate rather than transient interference). The "direct" flag (oflag=direct / iflag=direct) bypasses the buffer cache, so the test results should be repeatable.
Write test: dd if=/dev/zero of=/dev/<device inode> bs=1M count=1024 oflag=direct
Read test: dd if=/dev/<device inode> of=/dev/null bs=1M count=1024 iflag=direct
Example output for dd with 512x1M blocks:
536870912 bytes (537 MB) copied, 10.1154 s, 53.1 MB/s
The WRITE test is DESTRUCTIVE! You should run it BEFORE CREATING A FILESYSTEM ON THE DEVICE! On raw devices, beware that the partition table will be erased. In that case you should force the kernel to reread the partition table (with fdisk) to avoid problems. However, performance on the whole device and on a single partition should be the same.
b) For a filesystem, just replace the device with a file name under the mount point.
Write test: dd if=/dev/zero of=/mount-point/test.dat bs=1M count=1024 oflag=direct
Read test: dd if=/mount-point/test.dat of=/dev/null bs=1M count=1024 iflag=direct
Note that even though we are accessing a file, we are not using the buffer cache.
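If you want to repeat the filesystem test with several block sizes, the two dd commands can be wrapped in a small function. This is only a sketch, assuming GNU dd; fs_dd_test is a made-up name, and the fallback to buffered I/O is my own addition for filesystems that reject O_DIRECT.

```shell
#!/bin/sh
# fs_dd_test FILE [BS] [COUNT]   (hypothetical helper, not a standard tool)
# Writes COUNT blocks of BS bytes to FILE, reads them back, then deletes FILE.
fs_dd_test() {
    file=$1; bs=${2:-1M}; count=${3:-1024}

    # Write pass: bypass the buffer cache with O_DIRECT. If the filesystem
    # rejects O_DIRECT (tmpfs does), fall back to a buffered write that is
    # flushed with fdatasync before dd reports its rate.
    w=$(dd if=/dev/zero of="$file" bs="$bs" count="$count" oflag=direct 2>&1) ||
    w=$(dd if=/dev/zero of="$file" bs="$bs" count="$count" conv=fdatasync 2>&1) ||
        return 1

    # Read pass: same idea. The buffered fallback may be served from cache,
    # so treat its number with suspicion.
    r=$(dd if="$file" of=/dev/null bs="$bs" count="$count" iflag=direct 2>&1) ||
    r=$(dd if="$file" of=/dev/null bs="$bs" count="$count" 2>&1) ||
        { rm -f "$file"; return 1; }

    rm -f "$file"
    # Only the last line of dd's report carries the transfer rate.
    printf 'write: %s\n' "$(printf '%s\n' "$w" | tail -n 1)"
    printf 'read:  %s\n' "$(printf '%s\n' "$r" | tail -n 1)"
}
```

For example, "fs_dd_test /mount-point/test.dat 1M 1024" for bulk transfers and "fs_dd_test /mount-point/test.dat 4k 262144" for small blocks. Both move 1 GiB; the test file is removed afterwards.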
c) For the network, just test raw TCP sockets in both directions between the servers. Start the listener on server2 first, then the sender on server1. Beware of firewalls blocking TCP port 5001.
server1# dd if=/dev/zero bs=1M count=1024 | netcat <server2's IP> 5001
server2# netcat -l -p 5001 | dd of=/dev/null
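One thing to keep in mind when comparing numbers: dd reports MB/s (megabytes), while iptraf and link speeds are in Mbit/s (megabits). A tiny conversion helper, as a sketch (mb_to_mbit is a made-up name; it assumes dd's decimal MB, i.e. 10^6 bytes):

```shell
#!/bin/sh
# mb_to_mbit MBYTES_PER_SEC   (hypothetical helper)
# Converts a dd-style rate in MB/s to Mbit/s: 1 byte = 8 bits.
mb_to_mbit() {
    awk -v mb="$1" 'BEGIN { printf "%.0f\n", mb * 8 }'
}

# mb_to_mbit 5    prints 40    (the 5 MB/s write rate from the first post)
# mb_to_mbit 125  prints 1000  (what a saturated gigabit link would carry)
```

So the 5 MB/s write rate corresponds to about 40 Mbit/s, which fits the roughly 50 Mbit/s seen in iptraf and is far below what a gigabit link can carry.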
TEST LAYERS
Now you have a tool to test disk performance for each layer. Just follow this sequence:
a) Test local disk performance on iSCSI servers.
b) Test network TCP performance between iSCSI targets and initiators.
c) Test disk performance on iSCSI LUNs on iSCSI initiator (this is the final raw performance of iSCSI protocol).
d) Test performance on LVM logical volume.
e) Test performance on large files on top of filesystem.
There should be a large performance gap between the layer responsible for the loss and the layer tested just before it. But I don't think it is LVM. I suspect the filesystem layer.
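The layer-by-layer comparison can be scripted so that the dd summary lines end up next to each other. A rough sketch, read-only and therefore non-destructive. read_layer is a made-up name, and the device paths in the usage comment are placeholders that will differ on your systems.

```shell
#!/bin/sh
# read_layer LABEL PATH [COUNT]   (hypothetical helper)
# Reads up to COUNT blocks of 1M from PATH and prints LABEL plus dd's
# summary line.
read_layer() {
    label=$1; path=$2; count=${3:-1024}
    # Try O_DIRECT first; fall back to a buffered read if it is rejected.
    out=$(dd if="$path" of=/dev/null bs=1M count="$count" iflag=direct 2>&1) ||
    out=$(dd if="$path" of=/dev/null bs=1M count="$count" 2>&1) ||
        return 1
    printf '%-12s %s\n' "$label" "$(printf '%s\n' "$out" | tail -n 1)"
}

# Example run; every path below is an assumption, adjust to your setup.
# On the iSCSI target:
#   read_layer raid1      /dev/md0
# On the iSCSI initiator:
#   read_layer iscsi-lun  /dev/sdb
#   read_layer lvm        /dev/vg0/lv0
#   read_layer filesystem /mount-point/test.dat
```

The first layer whose rate falls well below the one above it is the place to look.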
Now some tips for possible problems:
a) You didn't say whether you defined a striped LVM volume on the iSCSI LUNs. Striping could create a bottleneck if synchronous writes are used on the iSCSI targets (see the atime issue below). Remember that the default iSCSI target behaviour is synchronous write (no RAM caching).
b) You didn't describe the kind of access pattern to your files:
-Long sequential transfers of large amounts of data (100s of MB)?
-Sequences of small block random accesses?
-Many small files?
I may be wrong, but I suspect that your system is suffering from the "atime" issue. In my view, the "atime" issue is a consequence of "original ideas about Linux kernel design", which we have been suffering from in recent years because of people who are eager to take part in the design of an OS but are not familiar with the performance implications of design decisions.
In a few words: for almost 40 years, UNIX has updated the "last access time" (atime) of an inode each time its file is read. The buffer cache holds data updates, which don't propagate to disk for a while. However, as far as I can tell, in the Linux design each update to an inode's atime has to be written to disk SYNCHRONOUSLY AND IMMEDIATELY. Just think about the implications of interleaving synchronous transfers in a stream of operations on top of the iSCSI protocol.
To check if this applies, just do this test:
-Read a large file (big enough that the read takes at least 30 seconds) without using the cache. With dd, of course!
-At the same time, monitor the I/O with "iostat -k 5".
If you observe a small, but continuous flow of write operations while reading data, it could be the inode updates.
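As a complementary check (not the iostat test itself), you can look directly at whether reading a file moves its atime on that filesystem. A minimal sketch, assuming GNU stat; atime_probe is a made-up name.

```shell
#!/bin/sh
# atime_probe FILE   (hypothetical helper)
# Prints the file's atime (seconds since the epoch) before and after
# reading it once with dd.
atime_probe() {
    file=$1
    before=$(stat -c %X "$file") || return 1
    sleep 1   # %X has one-second resolution, so let the clock move on
    dd if="$file" of=/dev/null bs=1M 2>/dev/null
    after=$(stat -c %X "$file")
    echo "atime before=$before after=$after"
}

# Usage: atime_probe /mount-point/test.dat
# The iostat test itself would run in two terminals:
#   dd if=/mount-point/test.dat of=/dev/null bs=1M iflag=direct
#   iostat -k 5
```

If "after" is larger than "before", reads on that filesystem do update the inode. If the two are equal, atime updates are already disabled or reduced by the mount options.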
Solution: Things have become so odd with Linux that a mount option has been added to some filesystems (XFS, ext3, etc.) to disable atime updates. Of course, that makes the filesystem semantics differ from the POSIX standard. Some applications that look at the last access time of files could fail (mostly email readers and servers like pine, elm, Cyrus, etc.).
Just remount your filesystem with the options "noatime,nodiratime". On recent distributions there is also a "relatime" option, which updates atime only when it is older than the file's modification or change time, so atime does not become completely obsolete.
Please post the results of these tests and of your investigation.
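The remount and a quick verification could look roughly like this. It is a sketch that assumes a Linux box with /proc/mounts; /mount-point is a placeholder, and atime_opts is a made-up helper name.

```shell
#!/bin/sh
# Remount with atime updates disabled (run as root):
#   mount -o remount,noatime,nodiratime /mount-point

# atime_opts MOUNTPOINT [TABLE]   (hypothetical helper)
# Prints the atime-related options in effect for MOUNTPOINT, read from
# TABLE (default /proc/mounts; the second argument is only for testing).
atime_opts() {
    mp=$1
    table=${2:-/proc/mounts}
    awk -v mp="$mp" '$2 == mp {
        n = split($4, opts, ",")
        out = ""
        for (i = 1; i <= n; i++)
            if (opts[i] ~ /atime$/)      # noatime, nodiratime, relatime, ...
                out = out (out == "" ? "" : ",") opts[i]
        print (out == "" ? "(none listed)" : out)
    }' "$table"
}

# Usage: atime_opts /mount-point
```

To make the change permanent, the same options go into the fourth field of the filesystem's line in /etc/fstab.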
Regards,
Guillermo
- 10-29-2009 #3Just Joined!
- Join Date
- Oct 2009
- Posts
- 2
I forgot one thing.
The "wait state" (wa) time can be time spent inside the kernel waiting for synchronous writes of inodes to disk.
Since synchronous writes require the data to be physically on disk before the operation returns, processes are blocked inside the kernel in a waiting state.
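To put a number on the wait state while a test is running, the iowait counter in /proc/stat can be sampled twice. A rough sketch, assuming the usual Linux layout of the first "cpu" line (user, nice, system, idle, iowait, irq, softirq, steal); both function names are made up.

```shell
#!/bin/sh
# iowait_between LINE1 LINE2   (hypothetical helper)
# Given two samples of the aggregate "cpu" line from /proc/stat, prints the
# percentage of CPU time spent in iowait between them.
iowait_between() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total = 0
            # Fields 2..9; the guest fields after them are already counted
            # inside user and nice.
            for (i = 2; i <= 9 && i <= NF; i++) total += $i
            t[NR] = total
            w[NR] = $6          # field 6 is iowait
        }
        END {
            d = t[2] - t[1]
            if (d > 0) printf "%.1f\n", 100 * (w[2] - w[1]) / d
            else print "0.0"
        }'
}

# iowait_sample [SECONDS]: sample /proc/stat now and again after SECONDS.
iowait_sample() {
    a=$(head -n 1 /proc/stat)
    sleep "${1:-5}"
    b=$(head -n 1 /proc/stat)
    iowait_between "$a" "$b"
}
```

Run "iowait_sample 5" on the target and on the initiator during a dd write test; the result should be close to the wa column that top and vmstat report.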

