  1. #1

    odd network issues

    I'm playing as support for some Linux test servers for some software developers. (I say "playing" because I have little experience with Linux, but extensive with networking and Windows Server.) They now want me to actually troubleshoot a networking issue. I have been able to solve quite a few networking issues with these machines in the past, but this particular one is beyond my abilities.

    During their testing, they need to make connections between machines and copy large amounts of data over several hours. The machines are Dell R710 servers hooked to a Dell PowerConnect 6224 switch at standard 1Gb speeds, running RHEL 5. For a brief time they'll connect and transfer data properly, but after about 5-20 minutes (a seemingly random period) both will disconnect and quit transferring. Those machines are then unable to connect to any other machines until the network service is restarted; however, they can still ping other machines and surf the internet, and they can be pinged.

    I'm stumped on this. It's partially disabling the network service, but not entirely. Originally, the machines were hooked to a Dell PowerConnect 2724 along with several FC RAID controllers' management interfaces, and the 6224 switch just had some more of the RAID controllers. I moved the connections around so the servers were on the 6224 and all the RAID controllers were on the 2724. That didn't make any difference. The developers tell me they updated the drivers to the latest version, but that made no difference either.

    It just doesn't seem like a hardware or network issue to me. If it were hardware, it would lose the connection entirely, not partially like this. If it were the network, the transfers would likely stop right away, or at the very least the switch move should have solved it.

    I don't know what to check next. Any suggestions?

  2. #2
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Looking at the R710 specs, it is not entirely clear to me whether they come with Intel or Broadcom NICs onboard.
    I remember problems with Broadcoms under high load, but that was several years ago and should hopefully be fixed by now (can't tell from recent experience; we run 100% Intel now).

    Another thought might be this:
    Are you sure about the MTU, aka jumbo frames, in your network?
    And I am not only talking about the OS side.
    Some NICs try to be extra smart and assemble several normal packets into jumbo frames *on their own*, without the OS.
    The reason they do this is to reduce interrupts.

    However, if not all components speak jumbo frames, or ICMP is blocked, then there is trouble.

    You might be able to verify or exclude that with tcpdump.
    Look for ICMP error type 3, code 4: "destination unreachable, fragmentation needed, don't fragment set".

    One extra wrinkle: the TCP packets you see with tcpdump on the Linux box will have regular size in this case, as these are the packets the NIC has disassembled and passed up to the OS.
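    As a concrete starting point, a capture filter like this would catch those fragmentation-needed errors (eth0 is a placeholder for whichever interface carries the transfers; tcpdump needs root):

    ```shell
    # Watch for ICMP "destination unreachable, fragmentation needed"
    # (type 3, code 4) on the transfer interface
    tcpdump -ni eth0 'icmp[icmptype] == 3 and icmp[icmpcode] == 4'

    # Or capture full packets during a transfer for later analysis,
    # e.g. in wireshark (<otherserverip> = the peer machine)
    tcpdump -ni eth0 -s 0 -w /tmp/transfer.pcap host <otherserverip>
    ```

    Run the first one on both boxes while a transfer is going; if nothing shows up by the time the connection dies, you can probably rule the MTU theory out.
    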

    The behaviour of such NIC features can be configured via options when loading the appropriate kernel module.
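    ethtool is another way to poke at these features without reloading the module. A sketch (eth0 is again a placeholder, and exact feature names vary by driver and kernel version):

    ```shell
    # List which offload features the NIC currently has enabled
    ethtool -k eth0

    # Tentatively turn off the receive-side packet-merging offloads
    # (these are the "assemble several packets on their own" features)
    ethtool -K eth0 lro off gro off

    # For module options, check what the driver accepts first,
    # e.g. for the Broadcom bnx2 driver common on Dell servers:
    modinfo bnx2
    ```

    If the disconnects stop with the offloads disabled, you've found your culprit and can make the setting permanent via the module options or an ifup script.
    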

    Another approach is this:
    Replace your app with netcat,
    i.e., saturate the network but nothing else (CPU, I/O), and see if you can reproduce the error.

    Choose one machine to be the server:
    netcat -l 2222 > /dev/null
    And another to be the client:
    cat /dev/zero | netcat <serverip> 2222

    Note: netcat is simply nc on some distributions, and some variants want the listen port after -p (netcat -l -p 2222).
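    If you'd rather have the test stop on its own than run until interrupted, a bounded variant of the same idea (placeholders and flags as above; some netcat variants need -q 0 to exit on end of input):

    ```shell
    # Server: receive and discard everything on port 2222
    nc -l -p 2222 > /dev/null

    # Client: push 10 GB of zeros, then exit; "time" gives rough throughput
    time dd if=/dev/zero bs=1M count=10240 | nc -q 0 <serverip> 2222
    ```

    Repeat with increasing counts until you pass the 5-20 minute window where the disconnects happen.
    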

    Edit: typo in client port number
    Last edited by Irithori; 06-30-2010 at 07:58 PM.
    You must always face the curtain with a bow.

  3. #3
    Thanks for the suggestions.

    I can't say for certain, but I believe the R710 uses 2 dual-port Broadcom controllers. We're only using 1 of the 4 ports on each.

    Funny thing is, we're actually moving away from an Intel 10Gb controller on these same machines to a Broadcom 10Gb because of that load problem. The Intel chip will hang if large loads are placed on it. (These are test machines, so they switch back and forth between 1Gb and 10Gb connections.)

    I won't be able to test it myself. I just got 14 servers in that I need to rack and configure. However, I have forwarded this on to the devs to see if they can do anything with it. They usually handle the Linux issues, but were unable to find a solution to this one.

  5. #4
    The guys decided to move the machines to a SLES core. They experimented with two under SLES instead of RHEL and didn't get the problem, so they're giving RHEL the boot. This solves their problem, but not quite the way I would have preferred.

  6. #5
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    SLES core? Did I miss something?
    I know SLES and Fedora Core, but not "SLES core".

    If it works out of the box with another distribution, then these reasons come to mind:
    - a config issue with the NICs
    - the RHEL kernel modules for these NICs have an unfixed bug

    Both seem fixable, as your RHEL subscription gives access to Red Hat's commercial support,
    which is actually quite knowledgeable and helpful.
    Once you get past the level 1 guys, of course.

    But if
    - the intended application is available for the new distro as well
    - and your ops team is willing and able to support yet another platform
    then this is an acceptable quick fix, imho.

    A unix is a unix, so meh

    And sure, it is better to know the reasons and gain experience, but then there are these strange guys in suits demanding results.
    You must always face the curtain with a bow.

  7. #6
    Sorry, local terminology. The devs refer to the host OS as the "core" they're using.
