  1. #1
    Just Joined!
    Join Date
    Apr 2012
    Posts
    4

    Copied files are corrupted


    I have a very strange problem on one of our servers:
    When a file is copied it gets corrupted, regardless of whether I use cp or dd. Small files (a few MB) are usually OK, but with bigger files (~1 GB or more) the corruption occurs reliably.

    Last night I ran memtest86+ for about 15 hours and it showed no errors; the RAID controller reports no errors either. fsck sometimes reports errors, but even after correcting them this "file-corruption-by-copying" problem recurs. Does anyone have an idea what else to check?

    System: Gentoo, Filesystem: ext3 on RAID5 (hardware raid controller)

    I really hope someone can give me another hint...

    Cheers,
    Dominik


    Example:
    Code:
    > md5sum refseq_rna.00.tar.gz
    e0220d9623f2940c1f112b249e5285a0  refseq_rna.00.tar.gz
    
    > cp refseq_rna.00.tar.gz refseq_rna.00.tar.gz.2
    
    >  md5sum refseq_rna.00.tar.gz.2
    1079307091f6796725ed830ee6e6e55a  refseq_rna.00.tar.gz.2
    
    # the sizes are equal
    > ls -l
    -rw-r--r-- 1 root root 1122490886 Apr  3 15:50 refseq_rna.00.tar.gz
    -rw-r--r-- 1 root root 1122490886 Apr  3 15:52 refseq_rna.00.tar.gz.2
    
    # but the differences are quite big, not just a single bit flip:
    >  cmp -l refseq_rna.00.tar.gz refseq_rna.00.tar.gz.2
    593540065 121 226
    593540066   5 134
    593540067   2 200
    ...
    1113101310  47  43
    1113101311  70 323
    1113101312  34 376
    
    # in total 85 corrupted bytes
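
A long `cmp -l` listing is easier to read when consecutive offsets are grouped into runs. The following is only a sketch: the function name `diff_runs` is made up for illustration, and it assumes a `cmp` that prints the 1-based decimal offset in the first column, as in the output above. Run lengths and start offsets can hint at whether the damage comes in bursts.

```shell
# diff_runs FILE1 FILE2
# Summarize `cmp -l` output into contiguous runs of differing bytes.
# Prints one line per run: start offset (1-based, as cmp reports it) and length.
diff_runs() (
    # cmp exits non-zero when the files differ; only awk's output matters here.
    cmp -l "$1" "$2" | awk '
        NR == 1        { start = $1; prev = $1; next }   # first differing byte
        $1 == prev + 1 { prev = $1; next }               # run continues
                       { print start, prev - start + 1   # run ended, report it
                         start = $1; prev = $1 }
        END            { if (NR > 0) print start, prev - start + 1 }
    '
)
```

Usage would be something like `diff_runs refseq_rna.00.tar.gz refseq_rna.00.tar.gz.2`; identical files produce no output.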

  2. #2
    Linux Newbie reginaldperrin's Avatar
    Join Date
    Oct 2010
    Location
    Christchurch, New Zealand
    Posts
    123
    One thing that comes to mind is the PSU. A PSU that doesn't deliver proper voltages or enough power can produce all manner of weird problems in any computer.
    Hope this helps.

  3. #3
    Just Joined!
    Join Date
    Apr 2006
    Posts
    19
    I'd still try different memory if you have some available! It sure sounds like flaky memory to me. I had this happen to me on a brand new computer and lost several photos because of it. Didn't realize it had happened until I deleted the backups. Of course by then it was too late.

    Webwalker

  5. #4
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,737
    So, you are copying files from a client to a server? Have you checked the client system for memory problems? What about the network? Is it corrupting/dropping packets? When we see this sort of problem in our networks (1000's of servers serving millions of clients), it is usually a network problem, such as a gateway or firewall dropping packets because of a NIC (network interface controller) problem.
    Sometimes, real fast is almost as good as real time.
    Just remember, Semper Gumbi - always be flexible!

  6. #5
    Just Joined!
    Join Date
    Feb 2011
    Posts
    7
    I have to disagree with Rubberman. If TCP/IP is used on the network, dropped packets are retransmitted.
    Corrupted packets fail checksum verification on receipt and are resent as well. TCP/IP is very reliable and I really don't think it is the culprit.
    If data corruption occurs during such a transfer, then the client and/or the server has a problem.
    I totally agree that a marginal PSU and/or a memory problem is a likely cause.
    Before changing any memory, check whether your BIOS setup has been altered; maybe you just have to relax the memory timings.

    Cheers,

  7. #6
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,737
    Even though TCP (unlike UDP) will retransmit if an ACK is not received within a given amount of time, this is still an issue and can cause serious problems, depending on how your applications are written. Is TCP absolutely reliable? Close to it. However, many systems have timeout parameters that will disconnect or raise an error if a packet is not acknowledged within a given time frame, or if too many are dropped (no ACKs). Packet corruption is also detected with checksums in the headers; a segment that fails its checksum is discarded and therefore retransmitted as well.

    In this case, however, it is entirely possible that the data source system is corrupting the data before it is sent via TCP/IP to the receiver, so the checksum is correct for the packet, but the data was corrupted before the checksum was computed: garbage in, garbage out. This would indicate that there is a memory or other problem with the system in question. In any case, a proper root-cause analysis is needed to determine exactly where the problem originates.
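
The garbage-in, garbage-out point can be shown with a toy example. This is a sketch under loud assumptions: `cksum` (a CRC) stands in for the TCP checksum, a `sed` substitution stands in for a memory fault on the sender, and the function name `demo_gigo` is invented. The transport-level check passes because the checksum was computed over data that was already bad; only a reference taken before the corruption reveals the problem.

```shell
# demo_gigo: simulate a sender whose memory corrupts data *before* the
# checksum is computed. cksum is only a stand-in for the TCP checksum.
demo_gigo() (
    tmp=$(mktemp -d) || exit 1
    printf 'original payload\n' > "$tmp/source"
    reference=$(cksum < "$tmp/source")        # what the data should hash to

    # "Sender" reads the file through faulty memory: one byte is changed.
    sed 's/payload/paxload/' "$tmp/source" > "$tmp/in_memory"
    sent_sum=$(cksum < "$tmp/in_memory")      # checksum computed over bad data

    # "Network" delivers the bytes unchanged; the receiver verifies them.
    cp "$tmp/in_memory" "$tmp/received"
    recv_sum=$(cksum < "$tmp/received")

    [ "$recv_sum" = "$sent_sum" ] && echo "transport check: ok"
    [ "$recv_sum" = "$reference" ] || echo "end-to-end check: MISMATCH"
    rm -rf "$tmp"
)
```

Calling `demo_gigo` prints "transport check: ok" followed by "end-to-end check: MISMATCH", which is why comparing against a known-good checksum (like the md5sum in the first post) matters more than trusting the transport.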

  8. #7
    Just Joined!
    Join Date
    Apr 2012
    Posts
    4
    Thanks for your answer!
    Yes, I'll try to replace the memory modules and see if this fixes the errors!

    Btw, the RAM modules currently installed are ECC modules... shouldn't ECC RAM prevent this kind of error anyway?

    The network can't be the cause because these file corruptions also happen when big files are copied to another destination even on the same hard disk partition. Potential network problems were also my first idea, as I noticed this problem the first time when downloading files with wget; but unfortunately this can't be the underlying problem.

    Cheers,
    Dominik
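
To take the network out of the picture entirely, the local copy test can be repeated in a loop. A minimal sketch, assuming `md5sum` and a `mktemp` that accepts a template path; the name `copy_check` is hypothetical. Note that the reference checksum itself passes through the same RAM, so a single mismatch only says that something between disk and memory is unreliable, not which copy is the good one.

```shell
# copy_check SRC ROUNDS
# Copy SRC repeatedly into a scratch directory on the same filesystem and
# print how many copies have a different MD5 sum than SRC.
copy_check() (
    src=$1
    rounds=$2
    workdir=$(mktemp -d "$(dirname "$src")/copycheck.XXXXXX") || exit 1
    want=$(md5sum < "$src" | cut -d' ' -f1)   # reference checksum
    bad=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        cp "$src" "$workdir/copy"
        got=$(md5sum < "$workdir/copy" | cut -d' ' -f1)
        [ "$got" = "$want" ] || bad=$((bad + 1))
        i=$((i + 1))
    done
    rm -rf "$workdir"
    echo "$bad"
)
```

For example, `copy_check refseq_rna.00.tar.gz 20` prints the number of corrupted copies out of 20; on a healthy machine it should print 0.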

  9. #8
    Just Joined!
    Join Date
    Apr 2012
    Posts
    4
    FYI: After extensive testing of different RAM modules and slot combinations I found that the error was caused by two particular RAM modules running together in two dual-channel slots; separating these two modules seems to have fixed the problem. Very weird... But I guess it'll be better in the long run to replace all of the RAM.

    Cheers,
    Dominik

  10. #9
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,737
    This is one of the reasons why I use ECC RAM in my systems, and won't consider any that don't support it. If something is wonky with the RAM, current Linux kernels will disable those banks, send you a message about the problem, and continue to operate with the remaining modules.

  11. #10
    Just Joined!
    Join Date
    Apr 2012
    Posts
    4
    Again FYI: Now, about three months later, the same problem occurred again. I've replaced the memory modules with new ones. So it seems better to just discard the modules when these strange copy errors happen. By the way, the modules were ECC RAM... the Linux kernel didn't disable them... maybe I have to enable the respective option first.
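
Whether the kernel is seeing ECC events at all can be checked through the EDAC sysfs counters. This is a sketch that assumes the kernel was built with EDAC support and that a driver for the memory controller is loaded; without one, the directory is empty and ECC errors are not reported this way. The function name `edac_report` and its optional base-directory argument are illustrative only.

```shell
# edac_report [BASE]
# Print corrected (ce) and uncorrected (ue) error counts for every memory
# controller exposed by the kernel's EDAC subsystem.
edac_report() (
    base=${1:-/sys/devices/system/edac/mc}
    found=0
    for mc in "$base"/mc*; do
        # Skip the unexpanded glob and anything without readable counters.
        [ -r "$mc/ce_count" ] || continue
        found=1
        echo "$(basename "$mc"): ce=$(cat "$mc/ce_count") ue=$(cat "$mc/ue_count")"
    done
    [ "$found" -eq 1 ] || echo "no EDAC memory controllers found under $base"
)
```

Running `edac_report` with no arguments reads the default sysfs location. A rising `ce` count would mean ECC is correcting errors; a "no EDAC memory controllers found" message suggests the reporting driver is not active.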
