  1. #1
    Just Joined!
    Join Date
    Jan 2011
    Posts
    4

    script is taking hours, hints on better alternative?

    I need to compare two lists, and if a line exists in file 1, delete it from file 2.

    Here is what I am using, and it works on samples of a few hundred lines, but I have to get it to work faster on 50k lines. Has taken 7 hours so far and still no results.

    result2 is master list, result1 is new list where i want every line to be completely unique--not in result2.

    grep -vf result2.txt result1.txt | sed 's/ .*//' > diffs.txt

    Any suggestions?

    Thanks,

    Rob

  2. #2
    Linux Newbie tetsujin's Avatar
    Join Date
    Oct 2008
    Posts
    115
    Quote Originally Posted by rcr2011 View Post
    I need to compare two lists, and if a line exists in file 1, delete it from file 2.

    Here is what I am using, and it works on samples of a few hundred lines, but I have to get it to work faster on 50k lines. Has taken 7 hours so far and still no results.

    result2 is master list, result1 is new list where i want every line to be completely unique--not in result2.

    grep -vf result2.txt result1.txt | sed 's/ .*//' > diffs.txt
    That works? Really? It doesn't seem to work for me...

    Probably the way to do it would be to load file 1 into an associative array (even with 50k lines that's only a couple megabytes of memory, right?) and then use that to filter individual lines in file 2...

    Code:
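    #!/bin/bash
    # Usage: exclude.sh file1 file2
    # Prints every line of file2 that does not appear in file1.
    # Needs Bash 4 or later for associative arrays (declare -A).
    
    exec 3<"$1"
    exec 4<"$2"
    
    # Load every line of file1 as a key. The ":" prefix keeps the key
    # non-empty even for blank lines. IFS= preserves leading/trailing spaces.
    declare -A a
    while IFS= read -r x <&3; do
            a[":$x"]=1
    done
    
    exec 3<&-
    
    # Print the lines of file2 whose key was never set.
    while IFS= read -r x <&4; do
            if [[ -z ${a[":$x"]} ]]; then
                    printf '%s\n' "$x"
            fi
    done
    
    exec 4<&-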
    You can implement this sort of thing pretty easily in C++ or Perl, as well. There's a slight problem with this version, which is that "read" won't read the last line of the text file if there's not a newline at the end of it... So if your text files don't necessarily have a newline at the end of the last line, you might want to look into another implementation.
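
    The missing-final-newline problem can also be worked around inside the loop itself. A minimal sketch, assuming a POSIX shell; the function name print_all_lines is made up for illustration. read returns non-zero on a last line that has no newline, but it still fills the variable, so one extra test keeps that line:

    ```shell
    # print_all_lines FILE (hypothetical helper name)
    # Prints every line of FILE, including a last line with no trailing newline.
    print_all_lines() {
        # read fails at end of file but still stores a partial last line in $line;
        # the extra [ -n "$line" ] test lets the loop body run once more for it.
        while IFS= read -r line || [ -n "$line" ]; do
            printf '%s\n' "$line"
        done < "$1"
    }
    ```

    The same `|| [ -n "$x" ]` test could be added to both while loops in the script above.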

  3. #3
    Just Joined!
    Join Date
    Jan 2011
    Posts
    4
    Thank you. But I'm not sure where to define my two input files.

    is it 1=input1.txt and 2=input2.txt?

    and can i write results to file?

    thank you!

  4. #4
    Linux Newbie tetsujin's Avatar
    Join Date
    Oct 2008
    Posts
    115
    Save the code as a script (exclude.sh below), make it executable with chmod +x, and pass the two files as arguments; inside the script they are $1 and $2. Redirect the script's output to write the results to a file.

    Code:
    $ ./exclude.sh file1 file2 > file3

  5. #5
    Just Joined!
    Join Date
    Jan 2011
    Posts
    4

    thanks, getting error

    here is what i typed:

    ./uniques.sh r1.txt as1.txt > diff.txt

    ./uniques.sh: line 6: declare: -A: invalid option
    declare: usage: declare [-afFirtx] [-p] [name[=value] ...]
    ./uniques.sh: line 8: :testing123.txt: syntax error: operand expected (error token is ":testing123.txt")
    ")syntax error: operand expected (error token is ":testing123.txt

    and here is the uniques.sh file:
    #!/bin/bash

    exec 3<$1
    exec 4<$2

    declare -A a
    while read -r x <&3; do
    a[":$x"]=1
    done

    exec 3<&-

    while read -r x <&4; do
    if [ -z ${a[":$x"]} ]; then
    echo "$x";
    fi
    done

    exec 4<&-

    What am I doing wrong?

    Rob

  6. #6
    Linux Newbie tetsujin's Avatar
    Join Date
    Oct 2008
    Posts
    115
    It looks like your version of Bash doesn't support associative arrays... It has to be version 4, I think.
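
    If upgrading Bash is not an option, the same lookup-table idea can be sketched with awk, whose arrays are always associative. This is only an illustration, not something posted in the thread; the function name exclude_lines is made up, and it assumes the first file (the list of lines to drop) is not empty:

    ```shell
    # exclude_lines FILE1 FILE2 (hypothetical helper name)
    # Prints the lines of FILE2 that do not appear anywhere in FILE1.
    exclude_lines() {
        # NR == FNR is true only while awk reads the first file: remember each line.
        # For the second file, print a line only if it was never remembered.
        # Caveat: if FILE1 is empty, NR == FNR also holds for FILE2 and nothing prints.
        awk 'NR == FNR { seen[$0]; next } !($0 in seen)' "$1" "$2"
    }
    ```

    It would be called the same way as the Bash script, for example `exclude_lines file1 file2 > file3`.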

  7. #7
    Just Joined!
    Join Date
    Jan 2011
    Posts
    4
    OK, got BASH 4 installed. Got script working. Made a sample file with duplicates and uniques and script worked flawlessly on those two-digit numbers.

    But.....on real world example with 40k email addresses in one file and 6k in another it seems to not work. These are two opt-in email lists I have and I don't want to send to the smaller file if I've already sent to them--the bigger file.

    Testing I've done indicates that if I copy a small subset out, the program works--maybe a few hundred cases. But on the whole big file comparison it copies the smaller file completely over to the output file even though on visual inspection the first dozen or more are obviously repeats.

    At first i thought there was some invisible difference in text format or something between files. But I've opened them in several programs and text editors and they seem to be identical in regards to formatting. I even copied the text from one file to the other to make sure they were exactly duplicated and it still gets identified as unique.

    To emphasize: all testing I do on small files shows perfect performance of the script. But on the big files it falls apart. Even testing with email addresses in small files works fine.

    This may be way more than you ever meant to help with, but if it makes you curious I'd love to get this solved.

    Thanks,

    Rob
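
    One thing that might be worth ruling out (an assumption; the thread never confirms the cause) is a stray carriage return at the end of each line, as left by Windows-style line endings. Most editors hide it, but it makes otherwise identical lines compare as different, and the error output in post #5 that jumps back to the start of the line with `")` is consistent with one. A rough sketch for checking and stripping them; both function names are made up:

    ```shell
    # count_cr_lines FILE (hypothetical helper name)
    # Prints how many lines of FILE contain a carriage return.
    count_cr_lines() {
        cr=$(printf '\r')
        grep -c "$cr" "$1"
    }

    # strip_cr FILE (hypothetical helper name)
    # Writes FILE to standard output with every carriage return removed.
    strip_cr() {
        tr -d '\r' < "$1"
    }
    ```

    If the count is non-zero for either list, running `strip_cr r1.txt > r1.clean.txt` (and the same for the other file) and comparing the cleaned copies would show whether that was the difference.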

  8. #8
    Just Joined!
    Join Date
    Jan 2011
    Posts
    3
    I don't know if you have an RDBMS, but if you do or can install one without any hassles, it would simply be a matter of loading the files into two different tables and then do:

    Code:
    SELECT email FROM small_table
    MINUS
    SELECT email FROM big_table;
    MINUS is Oracle's spelling; standard SQL and most other databases call the same operator EXCEPT. You might need to index the tables before issuing this query, but after that, even 60K should be a piece of cake. Other than that, I would second tetsujin's suggestion of giving another language, such as Perl or C++, a shot.
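
    Roughly the same set difference can be had without a database. A sketch using sort and comm, assuming one address per line and that the order of the output does not matter; the function name list_minus is made up:

    ```shell
    # list_minus SMALL BIG (hypothetical helper name)
    # Prints the lines of SMALL that do not occur in BIG, sorted and de-duplicated.
    list_minus() {
        a=$(mktemp) || return 1
        b=$(mktemp) || return 1
        # comm needs both inputs sorted the same way; LC_ALL=C pins the ordering.
        LC_ALL=C sort -u "$1" > "$a"
        LC_ALL=C sort -u "$2" > "$b"
        # -2 drops lines found only in BIG, -3 drops lines found in both,
        # which leaves the lines found only in SMALL.
        LC_ALL=C comm -23 "$a" "$b"
        rm -f "$a" "$b"
    }
    ```

    For example, `list_minus small_list.txt big_list.txt > unsent.txt`, where the file names are placeholders.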

  9. #9
    Linux Enthusiast meton_magis's Avatar
    Join Date
    Oct 2006
    Location
    arizona
    Posts
    665
    I have created the following script for you (I was bored at work.)

    Code:
    #!/usr/bin/perl -w
    #File is released under any OSI approved open source license. Apache, BSD, GNU GPL, and GNU LGPL are all explicitly approved.
    #I am not a professional coder. I do not take responsibility for the effects of this script, you run this at your own risk.
    #Please test using a non production environment to make sure script meets your needs.
    ###############################################################################
    use strict;
    
    $^I = ".bak"; #backup all files that are opened with <>, does not backup argument 1, as the file is shifted off the array before <>.
    $_ = shift @ARGV; #remove the first file listed from the array, so it is not read by <>.
    open FILE1, '<', $_ or die "Can't open ${_}: $!"; # open the file named by the first command-line argument, read-only
    
    my %file1_hash; #Create hash to hold unique lines of FILE1.
    while (<FILE1>) { #read through each line of FILE1, and create a key in %file1_hash for each unique entry.
      chomp; # remove newline at end of each file line. I don't know how newlines affect hashes, so it's better to just get rid of it.
      $file1_hash{$_} += 1; #will only create 1 entry per unique line. Duplicates increment the value by 1, but no new key is created.
    }
    
    while (<>) { #Compare each file listed on the command line to the lines of FILE1. If the line exists, do not print.
      chomp; #have to remove so it matches what's in the hash.
      print $_, "\n" if !(exists $file1_hash{$_});
    }
    
    close FILE1;
    save this to a file, and chmod it to be executable (I assume you know how to do that already since you've been scripting.)

    run it like so

    Code:
     /location/of/script  ./master_file  ./file_to_have_lines_deleted_from
    or even like this

    Code:
     /location/of/script  ./master_file  ./file1_to_have_lines_deleted_from  ./file2_to_have_lines_deleted_from  ./file3_to_have_lines_deleted_from
    you can have as many secondary files as you wish, but only 1 master file per execution of the script.

    The script automatically backs up every file given after the master file. If you don't want the backups (which is crazy, but can be necessary for space reasons), don't delete the
    $^I = ".bak"
    line; change it to $^I = "" instead. Deleting it turns off in-place editing altogether, so the files are left untouched and the filtered lines go to standard output. PLEASE PLEASE PLEASE back up the files manually first though. I'm not a professional coder, just a hobbyist. While I have tested this many times, I can't know what your environment may be like.


    after running it on my system (with 2 files of 50k lines of junk information,) I got this.

    [meton@server1 temp]$ wc -l rand1 rand2
    50001 rand1
    50001 rand2
    100002 total
    [meton@server1 temp]$ time ./2uniq rand1 rand2

    real 0m0.643s
    user 0m0.613s
    sys 0m0.031s


    That's 6/10 of a second to complete. Your lines may be longer, and this was all numerical data, which may process faster, but it's still most likely quicker than an hour.
    Last edited by meton_magis; 01-09-2011 at 11:12 AM.
