  1. #1
    Just Joined!
    Join Date
    Oct 2007
    Posts
    5

    Basics: How to compare files (ignoring line number)

    I have two files I want to compare, but I want diff (or a similar program) to ignore the order of the lines. In other words, I don't care what line the text is on: if a line matches in both documents (wherever it appears in them), I don't want to know about it. I only want the lines that appear in one of the documents and not the other. Sorting the files first doesn't work with diff because the lines still don't line up after the first few lines.

    Anyone know how to do this with Bash / GNU Utils? Thank you.

  2. #2
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    This sounds like a situation where comm would be useful. See man comm for details -- don't forget to sort the files first ... cheers, drl
    Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
    90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
    We look forward to helping you with the challenge of the other 10%.
    ( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )

  3. #3
    Just Joined!
    Join Date
    Oct 2007
    Posts
    5
    Thanks for the response, but the program still doesn't do what I need it to do. comm is comparing line by line. I want a program to take line 1 from document A and see if that line matches _any_ line from document B. And the opposite: take line 1 from document B and see if it matches any line from document A. Then take line 2, and so on. I want to know about lines unique to each document, with no regard to where (line-wise) those lines appear in the document.
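
The check described here (does a line match any line of the other file, wherever it is?) can also be written directly with grep. This is only a minimal sketch, assuming a POSIX grep; the file names a.txt and b.txt and their contents are made up for illustration.

```shell
#!/bin/sh
# Sketch: report lines unique to each file, ignoring line position.
# a.txt and b.txt are hypothetical sample files created just for this demo.
tmp=$(mktemp -d)
printf 'b\na\nd\nc\n' > "$tmp/a.txt"
printf 'e\nf\nd\ng\n' > "$tmp/b.txt"

# -F: patterns are fixed strings, -x: a pattern must match the whole line,
# -v: print the lines that do NOT match, -f FILE: read patterns from FILE.
only_in_a=$(grep -Fxv -f "$tmp/b.txt" "$tmp/a.txt")
only_in_b=$(grep -Fxv -f "$tmp/a.txt" "$tmp/b.txt")

echo "only in a.txt:"; echo "$only_in_a"
echo "only in b.txt:"; echo "$only_in_b"
rm -rf "$tmp"
```

Unlike the comm approach, this needs no sorting and keeps the lines in their original order. Behaviour with an empty pattern file can differ between grep implementations, so it is worth testing that case before relying on it.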

  4. #4
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    Here's an example of how it works for me:
    Code:
    #!/usr/bin/env sh
    
    # @(#) s2       Demonstrate basic use of comm.
    
    set -o nounset
    echo
    
    ## Use local command version for the commands in this demonstration.
    
    echo "(Versions of codes used in this script -- local code \"version\")"
    version bash comm
    
    echo
    
    echo " File data3:"
    cat data3
    
    echo
    echo " File data4:"
    cat data4
    
    echo
    echo " Results from comm:"
    echo " Columns: unique to data3, unique to data4, common"
    
    sort data3 >f1
    sort data4 >f2
    comm f1 f2
    
    exit 0
    Producing:
    Code:
    % ./s2
    
    (Versions of codes used in this script -- local code "version")
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    comm (coreutils) 5.2.1
    
     File data3:
    b
    a
    d
    c
    
     File data4:
    e
    f
    d
    g
    
     Results from comm:
     Columns: unique to data3, unique to data4, common
    a
    b
    c
                    d
            e
            f
            g
    cheers, drl

  5. #5
    Linux Engineer wje_lf's Avatar
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192

    another solution

    I have two solutions for you.

    To use one of these scripts, enter its name on the command line, followed (on the same command line) by the names of the two files you wish to compare.

    The first solution assumes that within each file no two lines are the same. The second solution makes no such assumption.

    Disclaimer: I have tested these scripts, but you may wish to test them too.

    Solution one:
    Code:
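    #!/bin/bash
    
    # Print the lines of file $1 that do not occur in file $2.
    # Assumes that no line is repeated within a file.
    do_one_way()
    {
      echo "=== lines in $1 which are not in $2:"
      # Every line of $2 shows up at least twice, so uniq -u drops it.
      cat "$1" "$2" "$2" | sort | uniq -u
    }
    
    # Quote the file names so that names containing spaces still work.
    do_one_way "$1" "$2"
    do_one_way "$2" "$1"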
    Solution two:
    Code:
    #!/bin/bash
    
    do_one_way()
    {
      echo === lines in $1 which are not in $2:
      (uniq $1; cat $2 $2) | sort | uniq -u
    }
    
    do_one_way $1 $2
    do_one_way $2 $1
    Edit:
    The differences between drl's fine solution and mine are these:
    1. The above solution lists only the lines that are different, leaving out completely the lines that are not different.
    2. The above solution does not add any tabs or artificial sweeteners to the output. You can remove the === line from the script, and call function do_one_way just once instead of twice, to get exactly the unique lines going one way, suitable for processing elsewhere.
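
Why listing the second file twice works can be seen by counting occurrences before uniq -u filters them. The following is just an illustrative sketch; the files "one" and "two" and their contents are invented for the demonstration.

```shell
#!/bin/sh
# Sketch: show the occurrence counts behind the "cat f1 f2 f2" trick.
# "one" and "two" are hypothetical sample files.
tmp=$(mktemp -d)
printf 'a\nb\nc\nd\n' > "$tmp/one"
printf 'd\ne\n' > "$tmp/two"

# Count each line when "one" is read once and "two" is read twice.
# awk only tidies the "count line" output of uniq -c into "line=count".
counts=$(cat "$tmp/one" "$tmp/two" "$tmp/two" | sort | uniq -c |
         awk '{ print $2 "=" $1 }')
echo "$counts"

# uniq -u keeps only the lines whose count is exactly 1, that is,
# the lines that occur in "one" and nowhere in "two".
only_in_one=$(cat "$tmp/one" "$tmp/two" "$tmp/two" | sort | uniq -u)
echo "$only_in_one"
rm -rf "$tmp"
```

Lines from the second file always reach a count of at least 2, so they can never pass uniq -u; a line of the first file passes only if it is absent from the second file and not repeated in the first.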


    Hope this helps.
    Last edited by wje_lf; 10-16-2007 at 04:13 PM. Reason: comparison of solutions

  6. #6
    Just Joined!
    Join Date
    Oct 2007
    Posts
    5
    drl: That is working for me. I think I was just getting confused by duplicate lines within the file.

    wje_lf: Thanks, I will play around with that too, always trying to improve my scripting.

  7. #7
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi, wje_lf.

    In your solution # 2:
    Code:
      echo === lines in $1 which are not in $2:
      (uniq $1; cat $2 $2) | sort | uniq -u
    on the systems I use I would need to add a sort before the uniq on $1, because uniq only removes duplicates that are immediately adjacent. Without that sort, if $1 held two non-adjacent lines "a" and "a" that do not occur in $2, the first uniq would not remove the duplicate, but the final uniq -u would then drop both copies, omitting the report of "a" as being unique to $1 ... cheers, drl
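
The adjacency point can be checked with a small experiment. This is a sketch only; the files "one" and "two" are invented sample data, with "a" appearing twice in "one" but not on adjacent lines.

```shell
#!/bin/sh
# Sketch: uniq only collapses duplicates that sit on adjacent lines.
without_sort=$(printf 'a\nb\na\n' | uniq)         # both "a" lines survive
with_sort=$(printf 'a\nb\na\n' | sort | uniq)     # one "a" survives

# Effect on the pipeline under discussion (hypothetical sample files).
tmp=$(mktemp -d)
printf 'a\nb\na\n' > "$tmp/one"
printf 'b\n' > "$tmp/two"

# Plain uniq on the unsorted file: "a" is still there twice, so the
# final uniq -u discards it and nothing is reported.
buggy=$( { uniq "$tmp/one"; cat "$tmp/two" "$tmp/two"; } | sort | uniq -u )

# sort -u first: "a" is left exactly once and is reported as expected.
fixed=$( { sort -u "$tmp/one"; cat "$tmp/two" "$tmp/two"; } | sort | uniq -u )

echo "without sort: [$buggy]   with sort -u: [$fixed]"
rm -rf "$tmp"
```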

    ( edit 1: typo )

  8. #8
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi, miles800.

    If you have duplicates inside a file, then use
    Code:
    sort -u
    in place of a simple sort to remove them before using comm ... cheers, drl
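
A small sketch of what the duplicates do to comm, and how sort -u avoids it. The sample data is invented; LC_ALL=C is set here as an assumption so that sort and comm agree on the collating order.

```shell
#!/bin/sh
# Sketch: duplicate lines inside a file show up as "differences" in comm.
export LC_ALL=C          # same collation for sort and comm
tmp=$(mktemp -d)
printf 'a\nb\na\n' > "$tmp/data1"    # "a" occurs twice
printf 'b\na\n'    > "$tmp/data2"

# Plain sort keeps both copies of "a"; comm pairs one of them with the
# "a" in data2 and reports the other as unique to data1.
sort "$tmp/data1" > "$tmp/f1"
sort "$tmp/data2" > "$tmp/f2"
plain=$(comm -3 "$tmp/f1" "$tmp/f2")

# sort -u removes the duplicate first, so comm sees identical files.
sort -u "$tmp/data1" > "$tmp/f1"
sort -u "$tmp/data2" > "$tmp/f2"
dedup=$(comm -3 "$tmp/f1" "$tmp/f2")

echo "plain sort: [$plain]   sort -u: [$dedup]"
rm -rf "$tmp"
```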

  9. #9
    Linux Engineer wje_lf's Avatar
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192

    oops!

    drl, you're right.

    My amended second script:
    Code:
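    #!/bin/bash
    
    # Print the lines of file $1 that do not occur in file $2.
    do_one_way()
    {
      echo "=== lines in $1 which are not in $2:"
      # sort -u leaves one copy of each line of $1; $2 is listed twice so
      # that its lines are never unique and uniq -u drops them.
      (sort -u "$1"; cat "$2" "$2") | sort | uniq -u
    }
    
    # Quote the file names so that names containing spaces still work.
    do_one_way "$1" "$2"
    do_one_way "$2" "$1"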
    (blush)

  10. #10
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.
    Quote Originally Posted by wje_lf
    1 The above solution lists only the lines that are different, leaving out completely the lines that are not different.
    2 The above solution does not add any tabs or artificial sweeteners to the output.
    For both points, if all that is wanted is the simple difference (file1 - intersection(file1,file2)), then we can use:
    Code:
    comm -23 file1 file2
    which produces the plain text, no tabs or additives for the sorted files.
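
As a sketch of the column options, using the same sample data as the data3 and data4 files shown earlier in the thread (the temporary file names f1 and f2 are just for this demonstration, and LC_ALL=C is an assumption made so that sort and comm use the same ordering):

```shell
#!/bin/sh
# Sketch: select single comm columns with -1, -2 and -3.
export LC_ALL=C          # same collation for sort and comm
tmp=$(mktemp -d)
printf 'b\na\nd\nc\n' | sort -u > "$tmp/f1"
printf 'e\nf\nd\ng\n' | sort -u > "$tmp/f2"

# Each digit suppresses one column: 1 = only in the first file,
# 2 = only in the second file, 3 = common to both.
only_first=$(comm -23 "$tmp/f1" "$tmp/f2")    # keep column 1
only_second=$(comm -13 "$tmp/f1" "$tmp/f2")   # keep column 2
in_both=$(comm -12 "$tmp/f1" "$tmp/f2")       # keep column 3

echo "only in first:  $(echo $only_first)"
echo "only in second: $(echo $only_second)"
echo "in both:        $in_both"
rm -rf "$tmp"
```

When only one column is kept, comm prints it without any leading tabs.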

    As Ron Popeil would say: "But wait, there's more!"

    Consider the passes across the data files:
    Code:
    Suppose file1 has x characters, file2 has y characters.
    
    second solution:
    
    file1: sort sort    -> 2x
    file2: cat cat sort -> 3y
    
    comm solution:
    
    file1: sort comm awk -> 3x
    file2: sort comm awk -> 3y
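    (Note that the second solution's final step, sort | uniq -u, cannot be shortened to sort -u: uniq -u prints only the lines that occur exactly once, whereas sort -u prints one copy of every distinct line.)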
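
Whether sort | uniq -u and sort -u are interchangeable is easy to check on a tiny input. A quick sketch with made-up data:

```shell
#!/bin/sh
# Sketch: compare "sort | uniq -u" with "sort -u" on the same input.
# uniq -u prints only lines that occur exactly once;
# sort -u prints one copy of every distinct line.
once_only=$(printf 'a\na\nb\n' | sort | uniq -u)
distinct=$(printf 'a\na\nb\n' | sort -u)

echo "sort | uniq -u: $(echo $once_only)"
echo "sort -u:        $(echo $distinct)"
```

The two differ as soon as any line is repeated, and the second solution relies on exactly that: the repeated lines are the ones it wants to throw away entirely.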

    At first glance, the second solution seems to make one fewer pass over the data. However, note that the count for the comm solution includes an awk step.

    It is true that comm prefixes lines with tabs when more than one column is printed. However, that can work in our favor. With the addition of a small awk script, we can key on that tab to produce a separate file for each simple difference. To accomplish that with the second solution (as it stands), one would need to re-run the script, incurring double the number of passes over the data. That would sum up to:
    Code:
    second solution:
    
    file1: sort sort    -> 4x
    file2: cat cat sort -> 6y
    RP might then say: "Now how much would you pay?"

    Of course, for small files, the difference is negligible -- but a billion here, a billion there, pretty soon we're talking about real resources.

    Here is the comm-based script with the little awk trailer that splits up the data into separate files:
    Code:
    #!/usr/bin/env sh
    
    # @(#) s7       Demonstrate use of comm with an awk chaser.
    
    set -o nounset
    echo
    
    ## Use local command version for the commands in this demonstration.
    
    echo "(Versions of codes used in this script -- local code \"version\")"
    version bash comm awk
    
    echo
    
    echo " File data3:"
    cat data3
    
    echo
    echo " File data4:"
    cat data4
    
    echo
    echo " Results from comm, unique to both data3, data4:"
    
    sort -u data3 >f1
    sort -u data4 >f2
    comm -3 f1 f2 |
    awk '
    /^\t/   { sub(/^\t/,"") ;  print > "f4" ; next }
            { print > "f3" }
    '
    
    echo " Unique to data3, file f3:"
    cat f3
    
    echo
    echo " Unique to data4, file f4:"
    cat f4
    
    exit 0
    Producing:
    Code:
    % ./s7
    
    (Versions of codes used in this script -- local code "version")
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    comm (coreutils) 5.2.1
    GNU Awk 3.1.4
    
     File data3:
    b
    a
    d
    c
    
     File data4:
    e
    f
    d
    g
    
     Results from comm, unique to both data3, data4:
     Unique to data3, file f3:
    a
    b
    c
    
     Unique to data4, file f4:
    e
    f
    g
    This was an interesting exercise, but more care would need to be taken to identify input lines which already had leading tabs, for example, to make the kernel pipeline of s7 into a production script ... cheers, drl
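
The leading-tab caveat can be illustrated with a line that itself starts with a tab. This is a sketch with invented data, not a hardened script; it also shows one possible workaround, running comm once per column, at the cost of an extra pass over the sorted files.

```shell
#!/bin/sh
# Sketch: an input line with a leading tab confuses the awk split.
export LC_ALL=C          # same collation for sort and comm
tmp=$(mktemp -d)
printf '\tx\na\n' | sort -u > "$tmp/f1"   # first line is TAB followed by x
printf 'b\n'      | sort -u > "$tmp/f2"

# Keying on the leading tab: the tab-prefixed line from the FIRST file
# is mistaken for a column-2 line, i.e. for a line unique to the second.
misrouted=$(comm -3 "$tmp/f1" "$tmp/f2" |
            awk '/^\t/ { sub(/^\t/, ""); print }')

# Asking comm for one column at a time avoids parsing tabs at all.
only_first=$(comm -23 "$tmp/f1" "$tmp/f2")
only_second=$(comm -13 "$tmp/f1" "$tmp/f2")

echo "awk thinks these are unique to the second file: $(echo $misrouted)"
echo "comm -13 says: $only_second"
rm -rf "$tmp"
```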
