- 01-06-2011 #1 Just Joined! (Join Date: Jan 2011, Posts: 4)
script is taking hours, hints on better alternative?
I need to compare two lists, and if a line exists in file 1, delete it from file 2.
Here is what I am using. It works on samples of a few hundred lines, but I need it to run faster on 50k lines: it has been running for 7 hours so far with no results.
result2 is the master list; result1 is the new list, in which I want every line to be completely unique, i.e. not in result2.
grep -vf result2.txt result1.txt | sed 's/ .*//' > diffs.txt
Any suggestions?
Thanks,
Rob
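A hedged aside on the grep approach: without -F, every line of result2.txt is treated as a regular expression, which is often slow with tens of thousands of patterns and can also match substrings of a line. A minimal sketch, assuming the goal is an exact whole-line comparison (the function name filter_new is hypothetical, and the sed step from the original command is left out):

```shell
# Hypothetical helper, not from the thread.
# Prints the lines of the candidate file ($2) that do not occur
# as whole lines in the master file ($1).
filter_new() {
    master=$1
    candidates=$2
    # -F: treat patterns as fixed strings, not regular expressions
    # -x: a pattern must match the entire line
    # -v: print only the lines that match no pattern
    # -f: read the patterns from the master file, one per line
    grep -F -x -v -f "$master" "$candidates"
}
```

It could be called as `filter_new result2.txt result1.txt > diffs.txt`. Note that grep exits with a non-zero status when nothing is left to print.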
- 01-06-2011 #2
That works? Really? It doesn't seem to work for me...
Probably the way to do it would be to load file 1 into an associative array (even with 50k lines that's only a couple megabytes of memory, right?) and then use that to filter individual lines in file 2...
Code:
#!/bin/bash
exec 3<$1
exec 4<$2
declare -A a
while read -r x <&3; do
    a[":$x"]=1
done
exec 3<&-
while read -r x <&4; do
    if [ -z ${a[":$x"]} ]; then
        echo "$x";
    fi
done
exec 4<&-
You can implement this sort of thing pretty easily in C++ or Perl, as well. There's a slight problem with this version, which is that "read" won't read the last line of the text file if there's not a newline at the end of it... So if your text files don't necessarily have a newline at the end of the last line, you might want to look into another implementation.
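On the missing-final-newline problem mentioned above: a common shell idiom is to also process whatever read left in the variable when it reaches end of file. A small sketch showing just the loop, not the full filter (print_lines is a hypothetical name):

```shell
# Hypothetical helper: echoes every line of the file "$1",
# including a last line that has no trailing newline.
print_lines() {
    # IFS= keeps leading/trailing whitespace; -r keeps backslashes literal.
    # read returns non-zero at end of file, but still stores a final
    # unterminated line in $line, so the -n test lets the loop handle it.
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s\n' "$line"
    done < "$1"
}
```

The same `|| [ -n "$x" ]` condition could be added to both read loops of the script.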
- 01-07-2011 #3 Just Joined! (Join Date: Jan 2011, Posts: 4)
Thank you. But I'm not sure where to define my two input files.
Is it 1=input1.txt and 2=input2.txt?
And can I write the results to a file?
Thank you!
- 01-07-2011 #4
- 01-07-2011 #5 Just Joined! (Join Date: Jan 2011, Posts: 4)
thanks, getting error
here is what i typed:
./uniques.sh r1.txt as1.txt > diff.txt
./uniques.sh: line 6: declare: -A: invalid option
declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
./uniques.sh: line 8: :testing123.txt: syntax error: operand expected (error token is ":testing123.txt")
")syntax error: operand expected (error token is ":testing123.txt
and here is the uniques.sh file:
#!/bin/bash
exec 3<$1
exec 4<$2
declare -A a
while read -r x <&3; do
a[":$x"]=1
done
exec 3<&-
while read -r x <&4; do
if [ -z ${a[":$x"]} ]; then
echo "$x";
fi
done
exec 4<&-
What am I doing wrong?
Rob
- 01-07-2011 #6
It looks like your version of Bash doesn't support associative arrays... They were added in Bash 4.0, so it has to be version 4 or later.
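If upgrading Bash is inconvenient, the same associative-array idea can be sketched in awk, whose arrays are associative by design. This is an alternative technique, not the script from post #2, and the function name filter_with_awk is hypothetical:

```shell
# Hypothetical helper: prints the lines of the candidate file ($2)
# that are not present in the master file ($1).
filter_with_awk() {
    master=$1
    candidates=$2
    awk '
        # While reading the first file named on the command line,
        # remember each line as a key and print nothing.
        FILENAME == ARGV[1] { seen[$0] = 1; next }
        # For the second file, print only the lines never seen in the first.
        !($0 in seen)
    ' "$master" "$candidates"
}
```

For example, `filter_with_awk master.txt new.txt > diff.txt` (both file names are placeholders). The order of the output follows the order of the second file.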
- 01-08-2011 #7 Just Joined! (Join Date: Jan 2011, Posts: 4)
OK, got BASH 4 installed. Got script working. Made a sample file with duplicates and uniques and script worked flawlessly on those two-digit numbers.
But on a real-world example with 40k email addresses in one file and 6k in another, it seems not to work. These are two opt-in email lists I have, and I don't want to send to addresses in the smaller file if I've already sent to them via the bigger file.
My testing indicates that if I copy a small subset out (maybe a few hundred cases), the program works. But on the full comparison it copies the smaller file completely over to the output file, even though on visual inspection the first dozen or more are obviously repeats.
At first I thought there was some invisible difference in text format or something between the files. But I've opened them in several programs and text editors and they seem to be identical in regard to formatting. I even copied the text from one file to the other to make sure they were exact duplicates, and it still gets identified as unique.
To emphasize: all testing I do on small files shows perfect performance of the script. But on the big files it falls apart. Even testing with email addresses in small files works fine.
This may be way more than you ever meant to help with, but if it makes you curious I'd love to get this solved.
Thanks,
Rob
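One possibility that nobody in the thread confirms: the two lists may differ in line endings (for example, Windows-style CRLF in one file), which most editors do not show but which makes otherwise identical lines compare as different. The garbled second error line in post #5 would be consistent with a stray carriage return. A sketch for checking and stripping them, with hypothetical helper names:

```shell
# Hypothetical helpers; this is a guess at the cause, not a confirmed diagnosis.

# Succeeds (exit status 0) if the file contains at least one carriage return.
has_cr() {
    grep -q "$(printf '\r')" "$1"
}

# Writes a copy of the file to standard output with all carriage returns removed.
strip_cr() {
    tr -d '\r' < "$1"
}
```

Running `has_cr` on each list and, if it succeeds, comparing cleaned copies made with `strip_cr file > file.clean` would rule this cause in or out.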
- 01-08-2011 #8 Just Joined! (Join Date: Jan 2011, Posts: 3)
I don't know if you have an RDBMS, but if you do or can install one without any hassles, it would simply be a matter of loading the files into two different tables and then running:
Code:
SELECT email FROM small_table MINUS SELECT email FROM big_table;
(MINUS is the Oracle spelling; many other databases call this operator EXCEPT.) You might need to index the tables before issuing this query, but after that, even 60K should be a piece of cake. Other than that, I would second tetsujin's suggestion of giving another language, such as Perl or C++, a shot.
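Without a database, the same set difference can be approximated with sort and comm. A minimal sketch, assuming it is acceptable for the output to come back sorted and de-duplicated; set_difference and the temporary file names are hypothetical:

```shell
# Hypothetical helper: prints lines that occur in the small file ($1)
# but not in the big file ($2), sorted and without duplicates.
set_difference() {
    small=$1
    big=$2
    tmpdir=$(mktemp -d) || return 1
    # comm needs both inputs sorted with the same collation, so the
    # C locale is forced for every step.
    LC_ALL=C sort -u "$small" > "$tmpdir/small.sorted"
    LC_ALL=C sort -u "$big" > "$tmpdir/big.sorted"
    # -2 hides lines unique to the big file, -3 hides lines common to both,
    # leaving only the lines unique to the small file.
    LC_ALL=C comm -23 "$tmpdir/small.sorted" "$tmpdir/big.sorted"
    rm -rf "$tmpdir"
}
```

Unlike a hash-based filter, this does not preserve the original order of the small file.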
- 01-09-2011 #9
I have created the following script for you (I was bored at work.)
Code:
#!/usr/bin/perl -w
#File is released under any OSI approved open source license. Apache, BSD, GNU GPL, and GNU LGPL are all explicitly approved.
#I am not a professional coder. I do not take responsibility for the effects of this script, you run this at your own risk.
#Please test using a non production environment to make sure script meets your needs.
###############################################################################
use strict;
$^I = ".bak"; #backup all files that are opened with <>, does not backup argument 1, as the file is shifted off the array before <>.
$_ = shift @ARGV; #remove the first file listed from the array, so it is not read by <>.
open FILE1, "$_" or die "Can't open ${_}: $!"; # open file matching first argument to the command line
my %file1_hash; #Create hash to hold unique lines of FILE1.
while (<FILE1>) { #read through each line of FILE1, and create a key in %file1_hash for each unique entry.
    chomp; # remove newline at end of each file line. I don't know how newlines affect hashes, so it's better to just get rid of it.
    $file1_hash{$_} += 1; #will only create 1 entry per unique line. Duplicates increment value by 1, but no new key is created.
}
while (<>) { #Compare each file listed on the command line to the lines of FILE1. If the line exists, do not print.
    chomp; #have to remove so it matches what's in the hash.
    print $_, "\n" if !(exists $file1_hash{$_});
}
close FILE1;
Save this to a file, and chmod it to be executable (I assume you know how to do that already since you've been scripting).
Run it like so:
Code:
/location/of/script ./master_file ./file_to_have_lines_deleted_from
or even like this:
Code:
/location/of/script ./master_file ./file1_to_have_lines_deleted_from ./file2_to_have_lines_deleted_from ./file3_to_have_lines_deleted_from
You can have as many secondary files as you wish, but only 1 master file per execution of the script.
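The script will automatically back up all files given after the master file, each with a .bak suffix. If you don't want the backups (which is crazy, but can be necessary for space reasons), change the
$^I = ".bak"
line to $^I = ""; instead of deleting it. Deleting the line turns off in-place editing altogether, so the filtered lines would just be printed to the terminal and the files left untouched. PLEASE PLEASE PLEASE back up the files manually first though. I'm not a professional coder, just a hobbyist. While I have tested this many times, I can't know what your environment may be like.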
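For comparison, the backup-then-rewrite behaviour can be sketched in shell as well. This is not the Perl script above, just an illustration of the same idea under the assumption that a copy with a .bak suffix is an acceptable backup; filter_in_place is a hypothetical name:

```shell
# Hypothetical helper: removes from the target file ($2) every line that
# also appears in the master file ($1), keeping the original as "$2.bak".
filter_in_place() {
    master=$1
    target=$2
    # Keep the original content; stop if the backup cannot be made.
    cp -p "$target" "$target.bak" || return 1
    # Rewrite the target from its backup, dropping lines found in the master.
    awk '
        FILENAME == ARGV[1] { seen[$0] = 1; next }
        !($0 in seen)
    ' "$master" "$target.bak" > "$target"
}
```

A call such as `filter_in_place master_file file_to_have_lines_deleted_from` overwrites any existing .bak file of the same name, so it is worth checking for one first.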
After running it on my system (with 2 files of 50k lines of junk information), I got this:
[meton@server1 temp]$ wc -l rand1 rand2
50001 rand1
50001 rand2
100002 total
[meton@server1 temp]$ time ./2uniq rand1 rand2
real 0m0.643s
user 0m0.613s
sys 0m0.031s
That's 6/10 of a second to complete. Your lines may be longer, and this was all numerical data, which may process faster, but it's still most likely quicker than an hour.
Last edited by meton_magis; 01-09-2011 at 11:12 AM.
New to the internet, technical forums, or the hacker / open source community??
Read this to learn good posting habits http://www.catb.org/~esr/faqs/smart-questions.html
RHCE for RHEL version 5
RHCT for RHEL version 4


