  1. #1

    Tuning for high throughput reverse-proxy server

    I have an enormous quad-core machine with 16 GB of RAM and dual gigabit NICs. It used to run MySQL, but we have upgraded the whole database infrastructure, so this server is now left floating.

    I had the great idea of turning it into a reverse proxy (using Apache mod_proxy), and it really handles a ton of requests. But I have a feeling that we are not getting the most out of what it can offer. Our traffic consists of a few thousand very small (less than 10 byte) ajax calls per second, and frequently I find we are running out of the memory the kernel allocates to the network stack to handle all the requests. Often we get the kern.log warning "possible SYN flooding on port 80. Sending cookies." and other things like this. Obviously we are not being SYN flooded; we just have very high demand.

    So far I have found a few kernel tuning guides that tell the kernel to allocate more of the base system memory to networking, but every guide I have found aims at increasing performance over WAN links (direct backbones between offices etc.), usually with very large file sizes as the priority.

    One such (great) write-up is here:
    I was hoping some people could provide further input, along the lines of disabling nf_conntrack (to speed up socket setup/teardown time) or anything else that will speed up a high-throughput proxy like mine.

    Any links to studies or benchmarks comparing different configurations or hardware get extra points!

  2. #2
    Kloschüssel (Linux Engineer, joined Oct 2005)
    The problem is not simple, and neither are the solutions. As always, you may have to pull several levers to achieve better performance. But first I need you to answer a few more questions.

    1] How many different users usually query the server per second?
    2] What is the overall load of the system?
    3] Approximately how long does each of these steps take?
    * establish connection
    * request send through proxy
    * process request (by whoever)
    * send response back
    4] What does `uname -r` report?

    I ask because, in my opinion, you are searching for a solution in the wrong place. Playing around with the TCP settings can improve performance in one place but lose a lot in others. The default configuration works well for more than 99.999% of the world, so why not for you?

    What I am trying to say is that most problems are not solved in the IP stack but on layer 7 (the application). Maybe something there hurts performance, such as automatic reverse DNS lookups of client requests in order to have the client's hostname in the log files? Or maybe a client sends multiple requests where they could be bundled into one that returns all results at once?

    Anyway, you can crawl /proc/sys/net/ipv4/* for a list of all configuration options available in your kernel.
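    A minimal sketch of such a crawl, assuming a POSIX shell. The function name `show_settings` and the `name = value` output format are illustrative choices, not something from this thread:

```shell
#!/bin/sh
# Print "name = value" for every readable regular file in a sysctl directory.
# Defaults to /proc/sys/net/ipv4 when no directory is given.
show_settings() {
    dir=${1:-/proc/sys/net/ipv4}
    for f in "$dir"/*; do
        # Skip subdirectories (such as conf/ or neigh/) and unreadable entries.
        [ -f "$f" ] && [ -r "$f" ] || continue
        # Multi-value settings are tab-separated; show them with spaces.
        printf '%s = %s\n' "${f##*/}" "$(tr '\t' ' ' 2>/dev/null < "$f")"
    done
}
```

    Calling `show_settings | grep syn` would narrow the list to entries such as tcp_max_syn_backlog and tcp_syncookies, which relate to the SYN warning mentioned in the first post.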

    For memory trouble, for example, you could decrease the minimum value (the first of the three numbers) in tcp_rmem and tcp_wmem to match the typical minimum size of your requests.
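    To see what that triple currently holds, something like the following could be used. It is only a sketch: `describe_triple` is a made-up helper name, and it assumes the usual "min default max" layout (in bytes) of these two files:

```shell
#!/bin/sh
# Label the three fields of a tcp_rmem / tcp_wmem style setting.
# Input: one line "min default max" on stdin (tab or space separated).
describe_triple() {
    read -r min def max
    printf 'min=%s default=%s max=%s\n' "$min" "$def" "$max"
}
```

    Usage would be `describe_triple < /proc/sys/net/ipv4/tcp_rmem`, and likewise for tcp_wmem. Changing the values needs root, for example through `sysctl -w`, and is best tried one setting at a time so the effect can be measured.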
    Last edited by Kloschüssel; 01-20-2011 at 11:54 AM.

  3. #3
    1] I am currently averaging about 2,000-3,000 unique users querying the server per second.
    2] The system load seems fine for a quadcore system. load average: 0.74, 0.66, 0.52
    3] I do not know of a good way to test each leg of the round trip as you have broken it down. The best I was able to do was this Perl script, run from a remote location to simulate an average user:
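    #!/usr/bin/env perl
    # Proxy Tuning Step 1
    # (Loop this many times for fun and profit)
    # Example use:
    # time ./
    use strict;
    use warnings;
    use IO::Socket;
    # Open a TCP connection to the proxy (the address is left blank here).
    my $httpSock = IO::Socket::INET->new(
            PeerAddr => '',
            PeerPort => '80',
            Proto    => 'tcp',
    ) or die "connect failed: $@";
    # Send a minimal HEAD request, then read the reply until the server closes.
    my $content = "HEAD / HTTP/1.0\r\nHost:\r\n\r\n";
    print $httpSock $content;
    my @lines = $httpSock->getlines();
    foreach my $line (@lines) {
            print $line;
    }
    close($httpSock);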
    After running it 5 times and averaging the results:
    real 0m0.323s
    user 0m0.020s
    sys 0m0.009s
    Do you have a better way to time each part of the request the way you described it?

    4] `uname -r` is 2.6.24-23-xen. I realize that virtualizing the OS is a rather effective performance killer, but I didn't have a choice in this so I am working with what I have. The base system is Citrix XenServer but it is currently only running this single VM so at least I am not sharing resources with potentially hungry applications.

    To answer the other questions: my Apache is not doing rDNS lookups, and I'm fairly sure every request is handled in a single transaction without having to split packets or requests. It's not that I am experiencing problems with the current user load on the proxy, but I want to *know* (and not guess) that I am handling requests efficiently. I want to be able to serve the absolute most users that the hardware can handle.

    I have tried a few HTTP benchmark tools like Apache `ab` and HP Labs `httperf`, but I have no idea what to change on the server between test runs. I will try lowering the minimum memory allocation as you suggest and see if this changes anything.

    Also, thank you very much for your response. I am not very knowledgeable in this area, so what you have said is already a huge help.
    Last edited by cipherus; 01-21-2011 at 03:57 PM.

  5. #4
    Kloschüssel (Linux Engineer)
    In my experience, the bottleneck is usually not the transfer of data but the logic that computes a meaningful response. That usually means querying a database, doing some calculations, etc. I would therefore recommend performance testing your services. How to do that depends on the service you run (Tomcat, plain PHP, ...).
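    One possible way to time the steps listed earlier in the thread is curl's `-w` timers. This is a sketch only: the helper names `phase_breakdown` and `time_request` are invented here, and it assumes curl and awk are installed on the test machine.

```shell
#!/bin/sh
# Split curl's cumulative timers (seconds since the request started)
# into per-phase durations.
# Arguments: time_connect, time_starttransfer, time_total.
phase_breakdown() {
    awk -v c="$1" -v s="$2" -v t="$3" 'BEGIN {
        # connect:  name lookup plus TCP handshake
        # wait:     request sent, proxied, processed, first byte received
        # transfer: remainder of the response
        printf "connect=%.3f wait=%.3f transfer=%.3f total=%.3f\n", c, s - c, t - s, t
    }'
}

# Fetch a URL once, discard the body, and print the breakdown.
time_request() {
    timings=$(curl -s -o /dev/null \
        -w '%{time_connect} %{time_starttransfer} %{time_total}' "$1") || return 1
    # Intentionally unquoted: split the three numbers into separate arguments.
    phase_breakdown $timings
}
```

    It would be called as `time_request http://proxy.example/` (a placeholder URL). The "wait" figure lumps the proxy and the backend together, so comparing it with the same request sent directly to an application server is one way to estimate what the proxy itself adds.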

  6. #5
    The application layer is not what I'm worried about. I currently have 8 application servers behind the proxy. I can scale out as much as I need and *very* quickly: a new application server can be up and serving live users in under 20 minutes.

    I'm focusing on *only* the proxy server performance because scaling out more proxy servers is not as easy.

  7. #6
    Kloschüssel (Linux Engineer)
    I understand. Then the biggest bottlenecks I can think of are the virtualization and the use of a proxy server itself. When one has to deal with lots of users, one usually tries to reduce the traffic that goes to any single server as much as possible.

    That could be done with multiple DNS entries, so that you pool your workers and the workers' IP addresses are shared equally among all users.

    Or of course (the better solution) you could have a master that negotiates a connection contract for each client session and refers the client to the worker it should use.

    Both solutions spread the load and avoid a single point of failure, and both should perform much better without the single bottleneck of a proxy server.

    By the way: 0.7 is high load. It would mean that 70% of the overall system resources are currently needed to get the job done fast enough to be called realtime.
    Last edited by Kloschüssel; 01-21-2011 at 09:05 PM.

  8. #7
    Load (computing) - Wikipedia, the free encyclopedia

    For single-CPU systems that are CPU-bound, one can think of load average as a percentage of system utilization during the respective time period. For systems with multiple CPUs, one must divide the number by the number of processors in order to get a comparable percentage.


    In a system with four CPUs, a load average of 3.73 would indicate that there were, on average, 3.73 processes ready to run, and each one could be scheduled into a CPU.
    So a load of 4.0 on a quad-core computer == 100% utilization.

    Originally posted by Kloschüssel:
    By the way: 0.7 is high load. It would mean that 70% of the overall system resources are currently needed to get the job done fast enough to be called realtime.
    0.7 is 70% load on a single-CPU/single-core system. On a quad-core system the equivalent load would be 2.8.
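    That arithmetic can be sketched as a small shell function. The name `load_percent` is made up for illustration, and it assumes awk is available:

```shell
#!/bin/sh
# Express a load average as a percentage of the available CPU cores.
# Arguments: load average, number of cores.
load_percent() {
    awk -v avg="$1" -v cores="$2" 'BEGIN { printf "%.1f%%\n", 100 * avg / cores }'
}
```

    With the figures from this thread, `load_percent 0.74 4` prints 18.5% and `load_percent 2.8 4` prints 70.0%. On a live Linux box the inputs could come from `cut -d' ' -f1 /proc/loadavg` and `getconf _NPROCESSORS_ONLN`.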

    For the rest, I do like the idea of having a geographical location based distribution of traffic similar to but I'm just not there yet. I already have 3 primary proxies (3 DNS A records as you mentioned), and it works okay. It is slow to change IP addresses this way.

    If you say that the vanilla kernel settings of any major Linux distribution are already as efficient as they will get, then I have nothing to say otherwise. But it makes me sad :(

  9. #8
    Kloschüssel (Linux Engineer)
    Well, there may be things you can do, but I don't know these tricks. Let us know what you find out!
