  1. #1
    Just Joined!
    Join Date
    Feb 2013
    Posts
    34

    find same size file


    I have got a folder filled with thousands of videos, and many of them are the same file just with different names. How could I find files with the same size and list them?

    thank you

  2. #2
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,445
    The easiest, but manual, method is to list the directory by size:
    Code:
    ls -laS
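    A non-interactive variant (my sketch, not part of the original reply; it assumes GNU find and filenames without embedded tabs or newlines) lists only the files that share a size:
    Code:
    find . -maxdepth 1 -type f -printf '%s\t%p\n' | sort -n |
        awk -F'\t' 'NR > 1 && $1 == prevsize { if (!shown) print prevline; print; shown = 1; next }
                    { prevsize = $1; prevline = $0; shown = 0 }'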
    A slower, but more reliable, method would be to generate an md5sum for every file and sort/compare on that.
    You must always face the curtain with a bow.

  3. #3
    Just Joined!
    Join Date
    Feb 2013
    Posts
    34
    Quote Originally Posted by Irithori View Post
    A slower, but more reliable, method would be to generate an md5sum for every file and sort/compare on that.
    How can I apply this?

  4. #4
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,445
    If all files are in one folder with no subfolders, then something like this should do:
    Code:
    cd <DIRECTORY_WITH_VIDEOS>
    md5sum * | sort > ~/duplicate_check
    less ~/duplicate_check
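    To see only the duplicates instead of paging through the whole list, one possible refinement (my addition, assuming GNU uniq; an md5 hash is always 32 characters) compares just the hash column:
    Code:
    md5sum * | sort | uniq -w32 --all-repeated=separate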
    You must always face the curtain with a bow.

  5. #5
    drl
    drl is offline
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Slackware, {Free, Open, Net}BSD, Solaris
    Posts
    1,304
    Hi.
    Code:
    FDUPES(1)                                                            FDUPES(1)
    
    NAME
           fdupes - finds duplicate files in a given set of directories
    
    SYNOPSIS
           fdupes [ options ] DIRECTORY ...
    
    DESCRIPTION
           Searches the given path for duplicate files. Such files are found by
           comparing file sizes and MD5 signatures, followed by a byte-by-byte
           comparison.
    
    -- excerpt from man fdupes, q.v.
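    In practice (my example, not part of the man page excerpt; the flags are from the fdupes manual), listing the duplicate groups is a one-liner:
    Code:
    # Recurse into subdirectories and show the size of the files in each duplicate group:
    fdupes -rS <DIRECTORY_WITH_VIDEOS>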
    Best wishes ... cheers, drl
    Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
    90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
    We look forward to helping you with the challenge of the other 10%.
    ( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )

  6. #6
    Just Joined!
    Join Date
    Feb 2013
    Posts
    34
    OK, fdupes has generated the list of files. How can I remove the duplicated files?

  7. #7
    Just Joined!
    Join Date
    Apr 2013
    Posts
    53
    Quote Originally Posted by umbloaded View Post
    OK, fdupes has generated the list of files. How can I remove the duplicated files?
    Like this:

    Code:
    #!/bin/bash
    # Collect an md5sum for every regular file in the current directory,
    # sort the list, then delete every file whose checksum matches the
    # previous line, keeping the first file of each duplicate group.
    rm -f /tmp/sums.txt
    for fn in *
    do
        # Skip this script itself and any directories.
        if [ "$fn" = "mshowdupfiles" ] || [ -d "$fn" ]
        then
            continue
        fi
        md5sum "$fn" >> /tmp/sums.txt
    done
    sort /tmp/sums.txt > /tmp/srtdsums.txt

    linecount=0
    # IFS= and -r keep leading whitespace and backslashes in names intact.
    while IFS= read -r line
    do
        (( linecount++ ))
        if [ "$linecount" -eq 1 ]
        then
            previoussum=${line%% *}
            continue
        fi
        currentsum=${line%% *}
        currentfile=${line#"$currentsum"  }   # md5sum separates hash and name with two spaces
        if [ "$currentsum" = "$previoussum" ]
        then
            rm -f "$currentfile"
        fi
        previoussum=$currentsum
    done < /tmp/srtdsums.txt

  8. #8
    Linux Guru Rubberman's Avatar
    Join Date
    Apr 2009
    Location
    I can be found either 40 miles west of Chicago, in Chicago, or in a galaxy far, far away.
    Posts
    11,755
    If many are the same video (not just the same subject, but REALLY identical), then a checksum is the best way to suss this out. Fastest is the old cksum tool. Better, but slower, is md5sum. Much better, but much slower, is one of the SHA tools (sha1sum, sha256sum, etc.). Myself, for this sort of requirement I'd use cksum: it outputs the checksum, size, and name in that order, so sorting the output will place identical files together in sequence, i.e.:
    Code:
    cksum * | sort | less
    This may take a while, depending upon the number of files in the directory.
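    To filter that output down to just the duplicate groups, a possible follow-up (my sketch, not part of the original reply; it keys on the checksum and size columns) is:
    Code:
    cksum * | sort -n | awk '
        { key = $1 " " $2 }
        NR > 1 && key == prevkey { if (!shown) print prevline; print; shown = 1; next }
        { prevkey = key; prevline = $0; shown = 0 }'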
    Sometimes, real fast is almost as good as real time.
    Just remember, Semper Gumbi - always be flexible!

  9. #9
    drl
    drl is offline
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Slackware, {Free, Open, Net}BSD, Solaris
    Posts
    1,304
    Hi.

    Long post, demonstrating standard utilities to remove duplicate files. Here is fdupes. Note that some files in the same directory are duplicates, and a file in another directory is a duplicate of those:
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate duplicate file removal, fdupes.
    
    # Utility functions: print-as-echo, print-line-with-visual-space, debug.
    # export PATH="/usr/local/bin:/usr/bin:/bin"
    pe() { for _i;do printf "%s" "$_i";done; printf "\n"; }
    pl() { pe;pe "-----" ;pe "$*"; }
    edges() { local _f _n _l;: ${1?"edges: need file"}; _f=$1;_l=$(wc -l $_f);
      head -${_n:=3} $_f ; pe "--- ( $_l: lines total )" ; tail -$_n $_f ; }
    db() { ( printf " db, ";for _i;do printf "%s" "$_i";done;printf "\n" ) >&2 ; }
    db() { : ; }
    C=$HOME/bin/context && [ -f $C ] && $C tree fdupes
    
    # Create trees.
    ./create-tree
    
    pl " Results, tree before:"
    tree -s m
    
    pl " Content of files:"
    head m/d1/* >f1
    pe ; pe "---" ; pe
    head m/d2/* > f2
    paste f1 f2 | expand -t30
    
    pl " Results for fdupes:"
    fdupes --recurse --delete --noprompt m
    
    pl " Results, tree after:"
    tree -s m
    
    exit 0
    producing:
    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 3.0.0-1-amd64, x86_64
    Distribution        : Debian GNU/Linux wheezy/sid 
    bash GNU bash 4.1.5
    tree v1.6.0 (c) 1996 - 2011 by Steve Baker, Thomas Moore, Francesc Rocher, Kyosuke Tokoro 
    fdupes 1.50-PR2
    
    
    -----
     Results, tree before:
    m
    |-- [       4096]  d1
    |   |-- [          6]  f1
    |   |-- [          6]  f2
    |   |-- [          4]  f3
    |   `-- [          6]  f4
    `-- [       4096]  d2
        |-- [          6]  f1
        |-- [          6]  f2
        `-- [          8]  f3
    
    2 directories, 7 files
    
    -----
     Content of files:
    
    ---
    
    ==> m/d1/f1 <==               ==> m/d2/f1 <==
    a                             a
    b                             b
    c                             c
                                  
    ==> m/d1/f2 <==               ==> m/d2/f2 <==
    a                             c
    b                             a
    d                             b
                                  
    ==> m/d1/f3 <==               ==> m/d2/f3 <==
    a                             a
    b                             b
                                  c
    ==> m/d1/f4 <==               d
    a                             
    b                             
    c                             
    
    -----
     Results for fdupes:
                                            
       [+] m/d2/f1
       [-] m/d1/f1
       [-] m/d1/f4
    
    
    -----
     Results, tree after:
    m
    |-- [       4096]  d1
    |   |-- [          6]  f2
    |   `-- [          4]  f3
    `-- [       4096]  d2
        |-- [          6]  f1
        |-- [          6]  f2
        `-- [          8]  f3
    
    2 directories, 5 files
    The utility rdfind is more verbose, which some people like:
    Code:
    ... same framework, except:
    rdfind -deleteduplicates true m
    ...
    producing:
    Code:
    % ./s2
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 3.0.0-1-amd64, x86_64
    Distribution        : Debian GNU/Linux wheezy/sid 
    bash GNU bash 4.1.5
    tree v1.6.0 (c) 1996 - 2011 by Steve Baker, Thomas Moore, Francesc Rocher, Kyosuke Tokoro 
    rdfind (local) 1.3.1
    
    
    -----
     Results, tree before:
    m
    |-- [       4096]  d1
    |   |-- [          6]  f1
    |   |-- [          6]  f2
    |   |-- [          4]  f3
    |   `-- [          6]  f4
    `-- [       4096]  d2
        |-- [          6]  f1
        |-- [          6]  f2
        `-- [          8]  f3
    
    2 directories, 7 files
    
    -----
     Content of files:
    
    ---
    
    ==> m/d1/f1 <==               ==> m/d2/f1 <==
    a                             a
    b                             b
    c                             c
                                  
    ==> m/d1/f2 <==               ==> m/d2/f2 <==
    a                             c
    b                             a
    d                             b
                                  
    ==> m/d1/f3 <==               ==> m/d2/f3 <==
    a                             a
    b                             b
                                  c
    ==> m/d1/f4 <==               d
    a                             
    b                             
    c                             
    
    -----
     Results for rdfind:
    Now scanning "m", found 7 files.
    Now have 7 files in total.
    Removed 0 files due to nonunique device and inode.
    Now removing files with zero size from list...removed 0 files
    Total size is 42 bytes or 42 b
    Now sorting on size:removed 2 files due to unique sizes from list.5 files left.
    Now eliminating candidates based on first bytes:removed 2 files from list.3 files left.
    Now eliminating candidates based on last bytes:removed 0 files from list.3 files left.
    Now eliminating candidates based on md5 checksum:removed 0 files from list.3 files left.
    It seems like you have 3 files that are not unique
    Totally, 12 b can be reduced.
    Now making results file results.txt
    Now deleting duplicates:
    Deleted 2 files.
    
    -----
     Results, tree after:
    m
    |-- [       4096]  d1
    |   |-- [          6]  f2
    |   `-- [          4]  f3
    `-- [       4096]  d2
        |-- [          6]  f1
        |-- [          6]  f2
        `-- [          8]  f3
    
    2 directories, 5 files
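    For a cautious first pass, rdfind can also report what it would do without deleting anything (my note, based on the rdfind manual rather than the demonstration above):
    Code:
    rdfind -dryrun true m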
    See man pages for details.

    Best wishes ... cheers, drl
    Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
    90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
    We look forward to helping you with the challenge of the other 10%.
    ( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )
