real-time transfers to GPU/Fusion I/O card
I am working on a machine vision application and am trying to transfer data from host memory to a pair of PCIe cards on CentOS 5.5. I'm trying to understand the latency of these transfers and how I can reduce it. I picked the Linux kernel forum because I felt its subscribers would be best able to advise on this topic.
The first card is an NVIDIA GeForce 8600 (PCIe x16); I transfer data onto it so that I can perform image processing using CUDA. The second card is a Fusion I/O ioXtreme (PCIe x4), which I'm using for storage. I want to transfer 32 kB blocks of data to these cards. I'm using an Intel DX38BT motherboard, which has two x16 PCIe slots in which these cards reside.
The Intel motherboard uses an X38 memory controller hub, which provides registers in its PCIe configuration space that record transaction layer packets (TLPs) to and from the PCIe cards. These registers are called the PCI Express Sequence Status Registers. I wrote a user-space application that invokes transfers to the cards via the NVIDIA and Fusion I/O drivers. In a separate thread, I poll the X38 TLP counter registers to measure the progress of the transfers, and I also poll the system clock using clock_gettime(). These polls occur in a tight loop that exits after the transfer has completed. The application runs as root and sets the scheduler to the SCHED_FIFO policy at the highest priority.
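
For illustration, the measurement loop described above might look roughly like the following sketch. It is not the actual application code. The names sample, read_counter_fn, demo_counter and poll_counter are hypothetical, the real X38 register read is left behind a callback because the register offsets are not given here, and the sketch assumes a C11 compiler for <stdatomic.h>.

```c
#define _POSIX_C_SOURCE 200809L
#include <sched.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* One sample: a monotonic timestamp and the counter value read right after it. */
struct sample {
    uint64_t t_ns;
    uint32_t counter;
};

/* Hypothetical callback that returns the current TLP counter value.
 * The real implementation would read the X38 configuration space. */
typedef uint32_t (*read_counter_fn)(void *ctx);

/* Stand-in reader for trying the loop without hardware:
 * returns the value behind ctx and then increments it. */
uint32_t demo_counter(void *ctx)
{
    uint32_t *c = ctx;
    return (*c)++;
}

/* Switch the calling thread to SCHED_FIFO at the highest priority.
 * Returns 0 on success, -1 on failure (for example when not run as root). */
int set_fifo_max_priority(void)
{
    struct sched_param sp = { 0 };
    int max = sched_get_priority_max(SCHED_FIFO);
    if (max < 0)
        return -1;
    sp.sched_priority = max;
    return sched_setscheduler(0, SCHED_FIFO, &sp);
}

/* Current CLOCK_MONOTONIC time in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Tight polling loop: record (timestamp, counter) pairs until the
 * transfer thread sets *done or the buffer is full.
 * Returns the number of samples written to buf. */
size_t poll_counter(read_counter_fn read_counter, void *ctx,
                    atomic_int *done, struct sample *buf, size_t cap)
{
    size_t n = 0;
    while (n < cap && !atomic_load(done)) {
        buf[n].t_ns = now_ns();
        buf[n].counter = read_counter(ctx);
        n++;
    }
    return n;
}
```

The polling thread would call set_fifo_max_priority() once and then poll_counter() with a preallocated buffer, so that no allocation happens inside the loop. On older glibc versions, clock_gettime() requires linking with -lrt.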
Upon completion of the transfers, I collected all the timestamps and register values that I had sampled and plotted them. I've attached a couple of plots showing the results. In these plots, the red and blue traces show the packet count to and from the ioXtreme card, and the green and cyan traces show the packet count to and from the NVIDIA GeForce card. The packet count to both cards increases in short bursts, each of which is consistent with the size of a 32 kB memory block. A single frame transfer to the NVIDIA card takes about 50 microseconds, and there is a 100 microsecond delay associated with my CUDA program and driver overhead. The transfer to the Fusion I/O card takes a bit longer.
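
As a rough sanity check on these figures, the effective throughput to the NVIDIA card can be estimated as follows. This assumes that one frame corresponds to one 32 kB block, that 32 kB means 32768 bytes, and that the 100 microsecond overhead is paid once per block; none of these assumptions is stated explicitly above.

```latex
\[
\frac{32768\ \text{B}}{50\ \mu\text{s}} \approx 655\ \text{MB/s}
\quad \text{(transfer only)}, \qquad
\frac{32768\ \text{B}}{(50 + 100)\ \mu\text{s}} \approx 218\ \text{MB/s}
\quad \text{(including overhead)}
\]
```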
I would be very happy with this outcome if I could guarantee that the transfers always occurred at this rate. However, in the second plot it is apparent that in two spots the transfers are stalling for hundreds of microseconds. Unfortunately this behavior appears regularly in these transfers, and these delays are too long for my real-time application.
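
One way to locate such stalls in the collected samples, rather than reading them off the plot, is sketched below. It is an illustration only: struct stall and find_stalls are hypothetical names, the samples are assumed to be stored as parallel arrays of timestamps and counter values, and a "stall" is assumed to mean any interval longer than a chosen threshold during which the counter did not change.

```c
#include <stddef.h>
#include <stdint.h>

/* One detected stall: index of the sample where the counter last
 * changed, and how long it then stayed unchanged. */
struct stall {
    size_t start;
    uint64_t gap_ns;
};

/* Scan n samples and record every interval longer than threshold_ns
 * between two consecutive counter changes. At most out_cap stalls are
 * written to out; the return value is the number written.
 * A stall that is still in progress at the last sample is not reported. */
size_t find_stalls(const uint64_t *t_ns, const uint32_t *counter, size_t n,
                   uint64_t threshold_ns, struct stall *out, size_t out_cap)
{
    size_t found = 0;
    size_t last_change = 0;

    for (size_t i = 1; i < n; i++) {
        if (counter[i] == counter[i - 1])
            continue; /* counter idle, keep waiting for the next change */

        uint64_t gap = t_ns[i] - t_ns[last_change];
        if (gap > threshold_ns && found < out_cap) {
            out[found].start = last_change;
            out[found].gap_ns = gap;
            found++;
        }
        last_change = i;
    }
    return found;
}
```

With a threshold of, say, 200 microseconds, the expected 100 microsecond gap between transfers would not be flagged, while stalls of several hundred microseconds would be. The threshold value is a choice for illustration, not something derived from the measurements.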
My question to this forum: can anyone suggest where I can go from here to try to avoid these stalls?