  1. #1
    Just Joined!
    Join Date: Sep 2012 | Posts: 1

    Program to find the same file Multiple times


    Hello,

    I've been searching the internet for a very long time for an open-source program that can help me locate duplicate files on my hard drive. I am running Debian, if that helps with any ideas.

    Thank you for any assistance!

    AbsoluteZ3r0

  2. #2
    Trusted Penguin
    Join Date: May 2011 | Posts: 4,353
    I seem to recall reading about such a program somewhere recently, but now I can't remember where. That makes for a good opportunity to write some code to do it, though!

    Caveat: this will take a while to run and is terribly inefficient.

    This is a Perl script that uses MD5 checksums to determine the uniqueness of every file it finds. Copy it to a file named "find-dupes.pl" and make it executable (chmod +x find-dupes.pl). It takes a single argument: the name of the directory to search. Use "/" to search your entire system. Test it out first on something small such as /tmp, e.g.:

    Code:
    ./find-dupes.pl /tmp
    Here's the code:
    Code:
    #!/usr/bin/perl
    use strict;
    use warnings;
    
    $| = 1; # enable autoflush so progress output appears immediately
    
    # the directory to search (use "/" to search your entire system)
    my $dir = shift || die "Usage: $0 <DIRECTORY>\n";
    die "$dir: No such directory\n" unless(-d$dir);
    
    # hash to hold lists of files by checksum
    my %hash;
    
    # file counter
    my $cnt = 0;
    
    # find all non-empty files and get their MD5 checksums
    print 'Finding all files...';
    open(PH,'find '.$dir.' -type f ! -size 0 -exec md5sum {} \;|')
      or die "can't run 'find': $!\n";
    while(<PH>){
      chomp;
      # split into checksum and filename; the limit of 2 keeps filenames
      # that contain spaces intact
      my($cksum,$filename) = split(/\s+/,$_,2);
    #  print "FILE $filename == $cksum\n";
      push(@{$hash{$cksum}},$filename); # save to list (array) in the hash
      $cnt += 1;
    }
    close(PH);
    print "done.\n";
    
    # tally up the totals
    print "Found [",$cnt,"] files\n";
    print "Found [",scalar keys %hash,"] unique file checksums\n";
    
    # see if no duplicates were found
    if($cnt == scalar keys %hash){
      print "Not a single duplicate file was found...er, this is unlikely!\n";
      exit(0);
    }
    
    # display duplicates
    for my $key(keys %hash){
      my @files = @{$hash{$key}};
    
      # skip checksums that only have one unique file associated with them
      next if($#files<1);
    
      # display the duplicate files
      print "\nChecksum $key has ",$#files + 1," files:\n";
      print "\t",$_,"\n" for(@files);
    }
    Like I said earlier, this will take a while to run. You will probably want to pipe the output to tee and/or redirect it to a log file. You may also want to run it inside a screen session because it can take so long. For example:
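
    A rough sketch of what that might look like (the log file name "dupes.log" and the screen session name "dupes" are just examples; adjust to taste):

    Code:
    # keep a copy of the output in a log while still watching it scroll by
    ./find-dupes.pl / 2>&1 | tee dupes.log
    
    # or run it inside a screen session so it keeps going if your terminal disconnects
    screen -S dupes
    ./find-dupes.pl / > dupes.log 2>&1
    You can detach from the screen session with Ctrl-a d and reattach later with "screen -r dupes" to check on it.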
