  1. #1
    kernelexplorer - Just Joined! | Join Date: Mar 2013 | Posts: 1

    Latency spikes with SCHED_RR (RHEL 6.3)


    Recently a crucial low-latency application was moved to a new server, and it is suffering from sporadic, huge latency spikes. We think this might be caused by some change in the behavior of SCHED_RR in the newer kernel and hope you can help us figure this out.

    Here are the details:

    Old server: 2.6.32-279.2.1.el6.x86_64 - 8 physical CPUs
    New server: 2.6.32-279.19.1.el6.x86_64 - 16 physical CPUs
    Both servers run Red Hat Enterprise Linux 6.3

    The application has 15 threads. 4 threads are always spinning/polling on I/O or the network, and 1 thread is blocking but usually active. The remaining 10 threads are blocking and not very active. On the old server we saw consistent CPU utilization of 504% with low (sub-millisecond) latency in the 5 very active threads. On the new server, CPU utilization sporadically drops from roughly 500% to 400%, correlated with latency spikes of more than 5 seconds in the blocking thread and two of the spinning threads.

    This bad behavior occurs when we run the process with all of its threads at SCHED_RR priority 50. We used the following commands to give the threads that priority (here $PIDS is the list of every thread's ID in the process; see the sketch after the loop for one way to collect it):

    for pid in $PIDS; do
    ionice -c 1 -n 7 -p $pid   # realtime I/O scheduling class, lowest priority within that class
    chrt -r -p 50 $pid         # SCHED_RR at static priority 50
    done
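
    For reference, a minimal sketch of one way to build that $PIDS list from the application's main PID (APP_PID and myapp below are placeholder names, not from the original post; per-thread IDs are listed under /proc/<pid>/task):

    # Hypothetical example: collect every thread ID of the process before running the loop above
    APP_PID=$(pgrep -x myapp)        # 'myapp' stands in for the real binary name
    PIDS=$(ls /proc/$APP_PID/task)   # one entry per thread in the process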

    On the other hand, if on the new server we instead use SCHED_OTHER with a niceness of -20,

    for pid in $PIDS; do
    ionice -c 1 -n 7 -p $pid   # same realtime I/O scheduling class as above
    renice -20 -p $pid         # SCHED_OTHER with the most favorable nice value
    done

    the application works well again, giving us performance consistent with the old server's SCHED_RR setup. However, this configuration does not provide the realtime guarantees that we want and that we believe SCHED_RR is meant to provide. Note that the old server has only half as many physical CPUs! We also run this application under SCHED_RR with 2.6.32-279.2.1.el6.x86_64 on a 24-core machine with excellent results. Only the 2.6.32-279.19.1 server gives us trouble with SCHED_RR.
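
    In case it is useful, the policy and priority that actually landed on each thread can be double-checked with chrt's query mode:

    # Query the current scheduling policy/priority of every thread in $PIDS
    for pid in $PIDS; do
    chrt -p $pid
    done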

    Are you aware of any differences between 2.6.32-279.2.1 and 2.6.32-279.19.1 that could explain or even relate to this behavior? Do you have any other thoughts on what might be causing this difference in behavior?

  2. #2
    Linux Engineer | Join Date: Apr 2012 | Location: Virginia, USA | Posts: 917
    I'll be honest with you: I have nothing but an elementary understanding of these topics. I 100% recommend opening a case with Red Hat; that's what the subscription is for, after all.

    Anyway, you might find the following information useful:
    https://access.redhat.com/knowledge/...s_Binding.html

    Also, I would review /etc/sysctl.conf on both systems.
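
    For example, since /etc/sysctl.conf only lists explicit overrides, it may be easier to compare the effective scheduler tunables directly (the hostnames below are placeholders):

    # Compare the kernel.sched_* tunables actually in effect on each machine
    ssh oldserver 'sysctl -a 2>/dev/null | grep ^kernel.sched' > /tmp/old-sched.txt
    ssh newserver 'sysctl -a 2>/dev/null | grep ^kernel.sched' > /tmp/new-sched.txt
    diff /tmp/old-sched.txt /tmp/new-sched.txt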

  3. #3
    henrikau - Just Joined! | Join Date: Mar 2009 | Location: Norway | Posts: 67
    Quote Originally Posted by kernelexplorer View Post
    Recently a crucial low-latency application was moved to a new server, and it is suffering from sporadic, huge latency spikes. We think this might be caused by some change in the behavior of SCHED_RR in the newer kernel and hope you can help us figure this out.

    Here are the details:

    Old server: 2.6.32-279.2.1.el6.x86_64 - 8 physical CPUs
    New server: 2.6.32-279.19.1.el6.x86_64 - 16 physical CPUs
    Both servers run Red Hat Enterprise Linux 6.3
    This could be a scheduling anomaly. The thing is, you are running static priorities, and the old setup could have "just worked" by luck; these things can be quite delicate... Also, AFAIK that kernel is PREEMPT_VOLUNTARY. Have you tried running a fully preemptible kernel? What about a kernel with PREEMPT_RT?
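
    A quick way to check which preemption model a kernel was built with is to look at its config file (on RHEL it is installed under /boot alongside the kernel):

    # Show the preemption-related options the running kernel was built with
    grep -E 'CONFIG_PREEMPT' /boot/config-$(uname -r)
    # CONFIG_PREEMPT_NONE / CONFIG_PREEMPT_VOLUNTARY / CONFIG_PREEMPT=y indicate the
    # preemption model; a PREEMPT_RT kernel sets an additional RT preemption option.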

    Quote Originally Posted by kernelexplorer View Post
    ...this configuration does not provide the realtime guarantees that we want and that we believe SCHED_RR is meant to provide. [...] Are you aware of any differences between 2.6.32-279.2.1 and 2.6.32-279.19.1 that could explain or even relate to this behavior? Do you have any other thoughts on what might be causing this difference in behavior?
    SCHED_RR is not meant to provide real-time guarantees out of the box! YOUR application needs to be verified; YOU need to know what happens, when, and why. What sched_rt tries to give you is a system that is as deterministic as possible, nothing more. If you decide to blow your foot off, that's your thing. See?

    What you need to do is start tracing your system to see what's going on. The kernel has excellent support for both event tracing and function tracing. I _highly_ recommend you check that out - knowledge trumps everything!
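
    As a rough sketch (assuming debugfs is mounted at /sys/kernel/debug, its usual location on RHEL 6), the scheduler event tracers can be turned on like this:

    # Record scheduler events across a window that should contain a latency spike
    cd /sys/kernel/debug/tracing
    echo 1 > events/sched/sched_switch/enable
    echo 1 > events/sched/sched_wakeup/enable
    echo 1 > tracing_on
    sleep 10                          # adjust to cover a spike
    echo 0 > tracing_on
    cat trace > /tmp/sched-trace.txt  # shows which task ran when, on each CPU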

    edit: looking back, I'm not trying to be harsh or anything, but you _really_ need to profile your system and make sure you understand how it behaves.
    Last edited by henrikau; 03-18-2013 at 09:40 PM.
