  1. #1
    Just Joined!
    Join Date
    Mar 2006
    Posts
    65

    Filtering a list of certain words.

    I have 2 lists, one contains a list of words, the other contains a list of words that I would like to exclude. An example of the lists:

    word list
    Code:
    train
    car
    boat
    apple
    carrot
    drive
    ram
    cell
    exclude list
    Code:
    apple
    ram
    What I want to do is re-create the list without the words apple or ram. So I wrote this script (which is by far the most advanced thing I've ever written; I'm new to this):
    Code:
    #!/bin/bash
    for x in `cat wordlist`;do
    
            for y in `cat excludelist`; do
    
                    if [ "$x" != "$y" ]; then
                            echo "$x" >> finalwordlist
                    else
                            echo "word excluded"
                    fi
            done
    done
    So, after I execute the script, it prints "word excluded" twice to my terminal, but when I check the finalwordlist file I see this:

    Code:
    train
    train
    car
    car
    boat
    boat
    apple
    carrot
    carrot
    drive
    drive
    ram
    cell
    cell
    Every word is doubled, except apple and ram. I'm not sure how to go about this anymore. My bash experience ends right about here (actually, it really ended at if).

    Can anyone help me out?

  2. #2
    Zelmo (Linux Engineer)
    Join Date
    Jan 2006
    Location
    Riverton, UT, USA
    Posts
    1,001
    Your inner loop is executing once for each word in the exclude list. That means for each $x, you're comparing $y once for "apple" and once for "ram" and issuing an echo command each time. So when neither one matches, the echo to the finalwordlist gets called twice. Otherwise, it gets called once, and the "word excluded" message is echoed once.

    One way around this is to declare a flag that keeps track of whether the inner loop found an $x == $y condition. Then after the inner loop finishes, check the flag and call the appropriate echo. I'd give you code for that, but apparently my own bash skills aren't up to it.

    Someone else may have a more elegant solution, so I'd welcome any more input you can get.
    Stand up and be counted as a Linux user!

  3. #3
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    As Zelmo said, a flag is useful. If you get a match, set the flag and break out of the inner loop -- there is no use comparing against the rest of the exclude list. The inner loop therefore ends either on a match or at the end of the exclude list, and the flag tells you which. Here's a solution based on your script:
    Code:
    #!/bin/sh
    
    # @(#) s0       Demonstrate break.
    
    echo " sh version: $BASH_VERSION" >&2
    
    for x in `cat data1`
    do
      match=false
      for y in `cat data2`
      do
        if [ "$x" = "$y" ]
        then
          echo " ($x excluded)"
          match=true
          break
        fi
      done
      if [ "$match" != true ]
      then
        # echo "$x" >> finalwordlist
        echo "$x"
      fi
    done
    
    exit 0
    which produces:
    Code:
    % ./s0
     sh version: 2.05b.0(1)-release
    train
    car
    boat
     (apple excluded)
    carrot
    drive
     (ram excluded)
    cell
    Also, unless necessary I would write to STDOUT, and let the caller decide to append or not, but that's a personal preference.

    See man bash for details ... cheers, drl
    Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
    90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
    We look forward to helping you with the challenge of the other 10%.
    ( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )
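
A minimal sketch of the STDOUT suggestion above, assuming the same two-file layout as script s0. The function name filter_words and the temporary file names are hypothetical, chosen only for illustration. The function prints to standard output, and the caller picks between overwriting (>) and appending (>>).

```shell
#!/bin/sh
# Sketch only: filter_words is a hypothetical helper name, not from the thread.
# Usage: filter_words WORDLIST EXCLUDELIST
# Prints every line of WORDLIST that does not equal a line of EXCLUDELIST.
filter_words() {
    while IFS= read -r word; do
        match=false
        while IFS= read -r bad; do
            if [ "$word" = "$bad" ]; then
                match=true
                break
            fi
        done < "$2"
        if [ "$match" != true ]; then
            printf '%s\n' "$word"
        fi
    done < "$1"
}

# Demo with throwaway files so the sketch runs on its own.
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' train car boat apple carrot drive ram cell > "$tmp/data1"
printf '%s\n' apple ram > "$tmp/data2"

filter_words "$tmp/data1" "$tmp/data2" > "$tmp/final"    # caller overwrites
filter_words "$tmp/data1" "$tmp/data2" >> "$tmp/final"   # caller appends
```

Because the function never names an output file, the same code serves a pipeline, a fresh file, or an append without modification.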

  4. #4
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    Using utilities will be faster than reading lines in a script, especially for long files. If you can allow the files to be re-ordered, then the comm command is useful. It expects both inputs to be sorted, hence the two sort calls:
    Code:
    #!/bin/sh
    
    # @(#) s2       Demonstrate use of comm to exclude strings.
    
    set -o nounset
    echo " sh version: ${BASH_VERSION:-not bash}" >&2
    
    sort data1 >t1
    sort data2 >t2
    
    nl t1
    
    echo
    nl t2
    
    echo
    comm -23 t1 t2
    
    exit 0
    producing:
    Code:
    % ./s2
     sh version: 2.05b.0(1)-release
         1  apple
         2  boat
         3  car
         4  carrot
         5  cell
         6  drive
         7  ram
         8  train
    
         1  apple
         2  ram
    
    boat
    car
    carrot
    cell
    drive
    train
    cheers, drl
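
To make the -23 flags above less opaque, here is a small sketch, assuming a POSIX comm and sort. With no flags, comm prints three columns: lines only in the first file, lines only in the second, and lines in both. The -2 and -3 flags suppress the last two columns, which leaves the words unique to the first file. The temporary file names are illustrative, and LC_ALL=C is set only so that sort and comm agree on ordering.

```shell
#!/bin/sh
# Use one fixed collation so sort and comm agree on the order of lines.
LC_ALL=C
export LC_ALL

tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

# comm requires sorted input, so sort both lists first.
printf '%s\n' train car boat apple carrot drive ram cell | sort > "$tmp/t1"
printf '%s\n' apple ram | sort > "$tmp/t2"

echo "--- all three columns ---"
# Column 1: only in t1. Column 2: only in t2. Column 3: in both.
comm "$tmp/t1" "$tmp/t2"

echo "--- column 1 only (words to keep) ---"
kept=$(comm -23 "$tmp/t1" "$tmp/t2")
printf '%s\n' "$kept"
```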

  5. #5
    drl
    Linux Engineer drl's Avatar
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    I wrote something similar last week. Here's a version of that adapted to your problem. It has the advantage of not re-ordering your data and is fast, but at the cost of some complexity.

    It uses sed, the stream editor, twice. The first pass turns the list of excluded words into a sed script of delete commands. The second pass runs that script against the main file and drops every line that matches an excluded word from the output stream (sed does not modify the input data file):
    Code:
    #!/bin/sh
    
    # @(#) s3       Demonstrate creation of sed script to process deletes.
    
    set -o nounset
    echo " sh version: ${BASH_VERSION:-not bash}" >&2
    
    nl data1
    
    echo
    nl data2
    
    # Create the sed script file with sed itself, for example:
    # make
    # apple
    # into
    # /^apple$/d
    # (anchored with ^ and $ so that excluding "car" would not also delete "carrot")
    
    sed 's|\(.*\)|/^\1$/d|' data2 >script
    
    # Run the script against the main data file.
    
    echo
    sed -f script data1
    
    exit 0
    producing
    Code:
    % ./s3
     sh version: 2.05b.0(1)-release
         1  train
         2  car
         3  boat
         4  apple
         5  carrot
         6  drive
         7  ram
         8  cell
    
         1  apple
         2  ram
    
    train
    car
    boat
    carrot
    drive
    cell
    cheers, drl

  6. #6
    Cabhan (Trusted Penguin)
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    It seems to me that the simplest solution would be to use grep. After all, the purpose of grep is to look for a word or regex.
    Code:
    #!/bin/bash
    
    exec 3< wordlist
    
    while read line <&3; do
        if grep -q "$line" excludelist; then
            echo "word excluded" >&2;
        else
            echo "$line" >> finalwordlist
        fi
    done
    Basically, for each line in the wordlist, we check if it is listed in excludelist. If it is, then say "word excluded" on stderr, otherwise print the word to finalwordlist.
    DISTRO=Arch
    Registered Linux User #388732

  7. #7
    Just Joined!
    Join Date
    Mar 2006
    Posts
    65
    Thank you all for the excellent solutions.

  8. #8
    Linux User
    Join Date
    Aug 2006
    Posts
    458
    Code:
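    awk 'FNR==NR{ arr[$0] ; next}   # first file: record each excluded word as an array key
         {   
             # second file: skip words found in arr, print the rest
             if ( $0 in arr) { next }
             else { print }
         }
    ' "exclusion_list" "file"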

  9. #9
    Linux User
    Join Date
    Jun 2007
    Posts
    318
    Quote: originally posted by Cabhan
    It seems to me that the simplest solution would be to use grep. After all, the purpose of grep is to look for a word or regex.
    Code:
    #!/bin/bash
    
    exec 3< wordlist
    
    while read line <&3; do
        if grep -q "$line" excludelist; then
            echo "word excluded" >&2;
        else
            echo "$line" >> finalwordlist
        fi
    done
    Basically, for each line in the wordlist, we check if it is listed in excludelist. If it is, then say "word excluded" on stderr, otherwise print the word to finalwordlist.
    I think the RE should be "^$line\$" instead of "$line". Otherwise a word such as 'aaa' would be dropped whenever the exclude list contains a longer line such as 'aaaaaaaa', because grep matches substrings.
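
Following up on the anchoring point above, here is a hedged sketch, assuming a POSIX grep, that compares the substring loop with a single whole-line match. The sample exclude list (carrot, ram) differs from the thread's data and was picked purely to trigger the problem; the temporary file names are illustrative.

```shell
#!/bin/sh
tmp=$(mktemp -d)
trap 'rm -rf "$tmp"' EXIT

printf '%s\n' train car boat apple carrot drive ram cell > "$tmp/wordlist"
printf '%s\n' carrot ram > "$tmp/excludelist"

# Substring match, as in the quoted loop: "car" is found inside "carrot",
# so it is dropped even though it is not on the exclude list.
loose=$(while IFS= read -r line; do
    grep -q "$line" "$tmp/excludelist" || printf '%s\n' "$line"
done < "$tmp/wordlist")

# Whole-line, fixed-string match in a single grep call:
#   -v  print the lines that do NOT match
#   -x  a pattern must match the entire line
#   -F  treat patterns as literal strings, not regexes
#   -f  read the patterns from a file, one per line
strict=$(grep -vxFf "$tmp/excludelist" "$tmp/wordlist")

printf 'loose:  %s\n' "$(printf '%s' "$loose" | tr '\n' ' ')"
printf 'strict: %s\n' "$(printf '%s' "$strict" | tr '\n' ' ')"
```

The single grep call also keeps the original order of the word list and starts one process instead of one per word.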
