- 10-16-2007 #1 Just Joined!
- Join Date
- Oct 2007
- Posts
- 5
Basics: How to compare files (ignoring line number)
I have two files I want to compare, but I want diff (or a similar program) to ignore the order of the lines. In other words, I don't care what line the text is on, if it matches in both documents (wherever it appears in the documents) I don't want to know about that line--Just lines that appear in one of the documents and not the other. Sorting the files first doesn't work because the lines still don't line up after the first few lines.
Anyone know how to do this with Bash / GNU Utils? Thank you.
- 10-16-2007 #2 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi.
This sounds like a situation where comm would be useful. See man comm for details -- don't forget to sort the files first ... cheers, drl
Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
We look forward to helping you with the challenge of the other 10%.
( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )
- 10-16-2007 #3 Just Joined!
Thanks for the response, but the program still doesn't do what I need it to do. Comm is comparing line by line. I want a program to take line 1 from document A and see if that line matches _any_ line from document B. And the opposite: take line 1 from document B and see if it matches any line from document A. Then take line 2, and so on. I want to know about lines unique to each document, with no regard to where (line-wise) those lines appear in the document.
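The membership test described above (does this line match any line of the other file?) can also be sketched with grep, as an alternative to the comm-based answers in this thread. This is only a rough illustration under stated assumptions: a.txt and b.txt are hypothetical sample files, and lines are compared as exact fixed strings.

```shell
#!/bin/sh
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' b a d c > "$tmp/a.txt"
printf '%s\n' e f d g > "$tmp/b.txt"

# -F  treat each pattern as a fixed string, not a regular expression
# -x  a pattern must match the whole line
# -v  print the lines that match none of the patterns
# -f  read the patterns (one per line) from the named file
only_a=$(grep -Fxv -f "$tmp/b.txt" "$tmp/a.txt")   # in a.txt, not in b.txt
only_b=$(grep -Fxv -f "$tmp/a.txt" "$tmp/b.txt")   # in b.txt, not in a.txt

printf 'only in a.txt:\n%s\n' "$only_a"
printf 'only in b.txt:\n%s\n' "$only_b"
rm -r "$tmp"
```

No sorting is needed, and the surviving lines keep the order they had in their own file.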
- 10-16-2007 #4 Linux Engineer
Hi.
Here's an example of how it works for me:
Code:
    #!/usr/bin/env sh

    # @(#) s2       Demonstrate basic use of comm.

    set -o nounset
    echo

    ## Use local command version for the commands in this demonstration.

    echo "(Versions of codes used in this script -- local code \"version\")"
    version bash comm

    echo
    echo " File data3:"
    cat data3

    echo
    echo " File data4:"
    cat data4

    echo
    echo " Results from comm:"
    echo " Columns: unique to data3, unique to data4, common"
    sort data3 >f1
    sort data4 >f2
    comm f1 f2

    exit 0
Producing:
Code:
    % ./s2

    (Versions of codes used in this script -- local code "version")
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    comm (coreutils) 5.2.1

     File data3:
    b
    a
    d
    c

     File data4:
    e
    f
    d
    g

     Results from comm:
     Columns: unique to data3, unique to data4, common
    a
    b
    c
    		d
    	e
    	f
    	g
cheers, drl
- 10-16-2007 #5
another solution
I have two solutions for you.
To use one of these scripts, enter its name on the command line, followed (on the same command line) by the names of the two files you wish to compare.
The first solution assumes that within each file no two lines are the same. The second solution makes no such assumption.
Disclaimer: I have tested these scripts, but you may wish to test them too.
Solution one:
Code:
    #!/bin/bash
    do_one_way()
    {
      echo === lines in $1 which are not in $2:
      cat $1 $2 $2 | sort | uniq -u
    }
    do_one_way $1 $2
    do_one_way $2 $1
Solution two:
Code:
    #!/bin/bash
    do_one_way()
    {
      echo === lines in $1 which are not in $2:
      (uniq $1; cat $2 $2) | sort | uniq -u
    }
    do_one_way $1 $2
    do_one_way $2 $1
Edit:
The differences between drl's fine solution and mine are these:
- The above solution lists only the lines that are different, leaving out completely the lines that are not different.
- The above solution does not add any tabs or artificial sweeteners to the output. You can remove the === line from the script, and call function do_one_way just once instead of twice, to get exactly the unique lines going one way, suitable for processing elsewhere.
Hope this helps.
Last edited by wje_lf; 10-16-2007 at 04:13 PM. Reason: comparison of solutions
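To see why feeding the second file in twice works, here is a small sketch of solution one run on throwaway sample data. The file names a.txt and b.txt and the function name only_in_first are made up for illustration, and the sketch assumes, as solution one does, that the first file has no repeated lines.

```shell
#!/bin/sh
# Hypothetical sample data in a throwaway directory.
tmp=$(mktemp -d)
printf '%s\n' b a d c > "$tmp/a.txt"
printf '%s\n' e f d g > "$tmp/b.txt"

# Lines of $1 that do not occur in $2.
# $2 is concatenated twice, so every one of its lines occurs at least twice
# in the combined stream and is dropped by uniq -u. A line of $1 survives
# only if it occurs exactly once overall, i.e. it is absent from $2.
only_in_first() {
    cat "$1" "$2" "$2" | sort | uniq -u
}

only_a=$(only_in_first "$tmp/a.txt" "$tmp/b.txt")
only_b=$(only_in_first "$tmp/b.txt" "$tmp/a.txt")

printf 'only in a.txt:\n%s\n' "$only_a"
printf 'only in b.txt:\n%s\n' "$only_b"
rm -r "$tmp"
```

For a.txt against b.txt the sorted stream is a b c d d d e e f f g g, so only a, b and c occur once and are printed.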
- 10-16-2007 #6 Just Joined!
drl: That is working for me. I think I was just getting confused by duplicate lines within the file.
wje_lf: Thanks, I will play around with that too, always trying to improve my scripting.
- 10-16-2007 #7 Linux Engineer
Hi, wje_lf.
In your solution # 2:
Code:
    echo === lines in $1 which are not in $2:
    (uniq $1; cat $2 $2) | sort | uniq -u
on the systems I use I would need to add a sort before the uniq on $1 because uniq will only remove duplicates that are immediately adjacent. If I were not to do that and say I had 2 lines "a" and "a" in $1 that were not adjacent and not in $2, then the first uniq would not remove them, but the final uniq would, omitting the report of "a" as being unique in $1 ... cheers, drl
( edit 1: typo )
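The adjacency point can be checked with a three-line sample. This is only a sketch: dups.txt is a hypothetical file, and the second file is taken to be empty so that only the handling of the first file is visible.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# "a" occurs twice, but the two copies are not adjacent.
printf '%s\n' a b a > "$tmp/dups.txt"

plain_uniq=$(uniq "$tmp/dups.txt")       # both copies of "a" survive
sorted_uniq=$(sort -u "$tmp/dups.txt")   # one copy of each line

# Feed each result through the final "sort | uniq -u" stage.
# With plain uniq, the two surviving copies of "a" become adjacent after
# the sort and uniq -u drops them both; with sort -u, "a" is kept.
lost=$(printf '%s\n' "$plain_uniq" | sort | uniq -u)
kept=$(printf '%s\n' "$sorted_uniq" | sort | uniq -u)

printf 'after plain uniq:\n%s\n' "$lost"
printf 'after sort -u:\n%s\n' "$kept"
rm -r "$tmp"
```

With plain uniq only b is reported, even though a is also unique to the file.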
- 10-16-2007 #8 Linux Engineer
Hi, miles800.
If you have duplicates inside a file, then use
Code:
    sort -u
in place of a simple sort to remove them before using comm ... cheers, drl
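A short sketch of why duplicates inside a file can be confusing with comm, and what sort -u changes. The sample files a.txt and b.txt are hypothetical; both contain the same set of lines, but a.txt repeats one of them.

```shell
#!/bin/sh
tmp=$(mktemp -d)
printf '%s\n' a b a > "$tmp/a.txt"   # "a" is duplicated
printf '%s\n' b a   > "$tmp/b.txt"

# Plain sort keeps the duplicate, so comm pairs up one "a" from each file
# and reports the leftover "a" as unique to a.txt.
sort "$tmp/a.txt" > "$tmp/a.sorted"
sort "$tmp/b.txt" > "$tmp/b.sorted"
with_dups=$(comm -3 "$tmp/a.sorted" "$tmp/b.sorted")

# sort -u removes the duplicate first, so nothing is reported.
sort -u "$tmp/a.txt" > "$tmp/a.uniq"
sort -u "$tmp/b.txt" > "$tmp/b.uniq"
without_dups=$(comm -3 "$tmp/a.uniq" "$tmp/b.uniq")

printf 'with plain sort: [%s]\n' "$with_dups"
printf 'with sort -u:    [%s]\n' "$without_dups"
rm -r "$tmp"
```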
- 10-16-2007 #9
oops!
drl, you're right.
My amended second script:
Code:
    #!/bin/bash
    do_one_way()
    {
      echo === lines in $1 which are not in $2:
      (sort -u $1; cat $2 $2) | sort | uniq -u
    }
    do_one_way $1 $2
    do_one_way $2 $1
(blush)
- 10-16-2007 #10 Linux Engineer
Hi.
For both points raised by wje_lf, if the simple difference (file1 - intersection(file1,file2)) is all that is wanted, then we can use:
Code:
    comm -23 file1 file2
which produces the plain text, no tabs or additives, for the sorted files.
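As a rough sketch of that form in both directions, using the same sample data as the earlier demonstration (f1 and f2 here are scratch files in a temporary directory): -23 suppresses columns 2 and 3, leaving lines unique to the first file, and -13 suppresses columns 1 and 3, leaving lines unique to the second.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# comm needs sorted input; sort -u also removes duplicates within a file.
printf '%s\n' b a d c | sort -u > "$tmp/f1"
printf '%s\n' e f d g | sort -u > "$tmp/f2"

only_f1=$(comm -23 "$tmp/f1" "$tmp/f2")   # lines only in f1
only_f2=$(comm -13 "$tmp/f1" "$tmp/f2")   # lines only in f2

printf 'only in f1:\n%s\n' "$only_f1"
printf 'only in f2:\n%s\n' "$only_f2"
rm -r "$tmp"
```

Because only one column is printed in each call, neither result carries leading tabs.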
As Ron Popeil would say: "But wait, there's more!"
Consider the passes across the data files:
Code:
    Suppose file1 has x characters, file2 has y characters.

    second solution:
    file1: sort sort -> 2x
    file2: cat cat sort -> 3y

    comm solution:
    file1: sort comm awk -> 3x
    file2: sort comm awk -> 3y
(The count treats the second solution's final sort | uniq -u as a single pass. It cannot simply be shortened to sort -u, though: sort -u keeps one copy of every line, whereas uniq -u drops every line that occurs more than once.)
That makes it look as though the second solution has one fewer pass over the data. However, upon closer inspection, there is an awk step in the comm solution.
It is true that comm inserts leading tabs when it prints more than one column. However, that can work in our favor. With the addition of a small awk script, we can key on that tab to produce a separate file for each simple difference. To accomplish that with the second solution (as it stands), one would need to re-run the script, incurring double the number of passes over the data. That would sum up to:
Code:
    second solution:
    file1: sort sort -> 4x
    file2: cat cat sort -> 6y
RP might then say: "Now how much would you pay?"
Of course, for small files, the difference is negligible -- but a billion here, a billion there, pretty soon we're talking about real resources.
Here is the comm-based script with the little awk trailer that splits up the data into separate files:
Code:
    #!/usr/bin/env sh

    # @(#) s7       Demonstrate use of comm with an awk chaser.

    set -o nounset
    echo

    ## Use local command version for the commands in this demonstration.

    echo "(Versions of codes used in this script -- local code \"version\")"
    version bash comm awk

    echo
    echo " File data3:"
    cat data3

    echo
    echo " File data4:"
    cat data4

    echo
    echo " Results from comm, unique to both data3, data4:"
    sort -u data3 >f1
    sort -u data4 >f2
    comm -3 f1 f2 |
    awk '
    /^\t/   { sub(/^\t/,"") ; print > "f4" ; next }
            { print > "f3" }
    '

    echo " Unique to data3, file f3:"
    cat f3

    echo
    echo " Unique to data4, file f4:"
    cat f4

    exit 0
Producing:
Code:
    % ./s7

    (Versions of codes used in this script -- local code "version")
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    comm (coreutils) 5.2.1
    GNU Awk 3.1.4

     File data3:
    b
    a
    d
    c

     File data4:
    e
    f
    d
    g

     Results from comm, unique to both data3, data4:
     Unique to data3, file f3:
    a
    b
    c

     Unique to data4, file f4:
    e
    f
    g
This was an interesting exercise, but more care would need to be taken to identify input lines which already had leading tabs, for example, to make the kernel pipeline of s7 into a production script ... cheers, drl
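One possible way around the leading-tab concern is to skip comm altogether and let awk do the membership test with an array. This is a different technique from s7, sketched here under some assumptions: the helper name not_in is made up, the first file passed to it must not be empty (the NR == FNR test cannot tell the files apart otherwise), and that file's distinct lines must fit in memory.

```shell
#!/bin/sh
tmp=$(mktemp -d)
# Sample data; one line of data3 deliberately starts with a tab.
printf 'b\n\tindented\nd\nc\n' > "$tmp/data3"
printf 'e\nf\nd\ng\n' > "$tmp/data4"

# Print the lines of file $2 that never occur in file $1.
# While awk reads the first file (NR == FNR), each line is stored as an
# array key; lines of the second file are printed only if not stored.
not_in() {
    awk 'NR == FNR { seen[$0] = 1; next } !($0 in seen)' "$1" "$2"
}

only3=$(not_in "$tmp/data4" "$tmp/data3")   # unique to data3
only4=$(not_in "$tmp/data3" "$tmp/data4")   # unique to data4

printf 'unique to data3:\n%s\n' "$only3"
printf 'unique to data4:\n%s\n' "$only4"
rm -r "$tmp"
```

Lines are compared whole, so a leading tab is just part of the line, and no sorting is needed. The trade-off is memory use instead of sort passes.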

