  1. #1
    Just Joined!
    Join Date
    Aug 2006
    Location
    Mexico
    Posts
    17

    Database filtering


    Hello everyone,

    Does anybody know how to filter databases? I have databases of maize sequences in FASTA format, but I'm only interested in the entries that mention endosperm in the ID line. The databases have this format:

    >id acc
    sequence

    How can I extract only the entries that mention endosperm in the ID? Thank you for your help.

  2. #2
    Linux Guru Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,252
    FASTA files can be a lot of fun. Fortunately, Linux has a number of utilities for getting what you want out of them.

    This particular problem, well, it depends on how you want to do this. For instance, the following line will get you all IDs that contain 'endosperm', case-insensitive.
    Code:
    egrep -i '^>.*endosperm' FILE
    This line does the following:
    egrep - The grep utility, using extended regular expressions. grep searches a file for lines that match a given pattern
    -i - Case-insensitive match
    ^>.*endosperm - Look for a line beginning with '>', followed by 0 or more of any character, followed by 'endosperm'.
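
    As a quick illustration of that pattern, the sketch below builds a tiny FASTA file and runs the same search on it. The file name sample.fa and its three records are made up for the example; they are not taken from the maize database. It uses grep -E, which is the same as egrep, and it only shows that matching ID lines are printed while sequence lines are not.

```shell
# Build a small, hypothetical FASTA file to try the pattern on.
printf '%s\n' \
  '>id1 AC001 Zea mays Endosperm cDNA' \
  'ACGTACGT' \
  '>id2 AC002 Zea mays leaf cDNA' \
  'TTGGCCAA' \
  '>id3 AC003 endosperm-specific clone' \
  'GGGCCC' > sample.fa

# Same search as above; grep -E is equivalent to egrep.
matches=$(grep -Ei '^>.*endosperm' sample.fa)
echo "$matches"
```

    Only the id1 and id3 header lines should come back, which is why this approach alone does not recover the sequences.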

    If you are looking for all sequences that have endosperm in the ID line, I suggest the following Bash script:
    Code:
    #!/bin/bash
    
    word=endosperm
    
    exec 3< "$1"
    state=out
    while read line <&3; do
      if echo "$line" | egrep -iq "^>.*${word}"; then
        state=in
        echo "$line"
        continue
      fi
    
      if [ state == 'out' ]; then
        continue
      fi
    
      if ! echo "$line" | egrep -iq '^>.*${word}'; then
        state=out
      else
        echo "$line"
      fi
    done
    This is untested, but should print out all ID lines and their accompanying sequences that include 'endosperm'.

  3. #3
    Just Joined!
    Join Date
    Aug 2006
    Location
    Mexico
    Posts
    17
    Hi Cabhan,

    I tried both methods you recommended. The first one ran without problems, but it only gave me the ID lines containing the word "endosperm" and didn't include the sequences. So I tried the script you sent me, and I have a question about it. After changing its attributes, I ran it as:

    rodrigo@belladonna:~$ ./endoextract.sh /blastdb/zmexp >> endo.zmexp

    but it didn't do anything. I waited for a while, but nothing happened. Is there something I'm doing wrong? Thanks again for your help.

  5. #4
    Linux Guru Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,252
    Well, I feel silly. The main problem was that, in my comparison, I was using "state == 'out'" rather than "$state == 'out'", so the test compared the literal word "state" instead of the variable's value. I have also made the script simpler:
    Code:
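    #!/bin/bash
    # Print every FASTA record whose ID line mentions $word (case-insensitive).
    # Usage: ./endoextract.sh FASTA_FILE
    
    word=endosperm
    
    # Open the input file on file descriptor 3.
    exec 3< "$1"
    state=out
    # IFS= and -r keep each line exactly as it appears in the file.
    while IFS= read -r line <&3; do
      # An ID line starts with '>'; decide whether this record is wanted.
      if echo "$line" | egrep -q "^>"; then
        if echo "$line" | egrep -iq "$word"; then
          state=in
        else
          state=out
        fi
      fi
    
      # Print ID and sequence lines while inside a wanted record.
      if [ "$state" == 'in' ]; then
        echo "$line"
      fi
    done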
    So basically, we start in the "out" state. We then read the file line by line. If a line begins with '>', it is an ID line, so we check whether it also mentions the word we're looking for. If so, we enter the "in" state. Otherwise, we enter (or remain in) the "out" state. Then, whenever we are in the "in" state, we print the line, and we keep printing until an ID line puts us back in the "out" state.

    Because we don't leave the "in" state until we reach an ID that doesn't contain the word we're looking for, if an ID _does_ have the right word, all following sequence lines will be printed as well.
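
    The same in/out logic can also be sketched as a single awk command, which avoids starting egrep for every line and so should be considerably faster on a large database. This is an alternative, not the script above. The file sample.fa and its records are made up for the example, and the variable name keep is arbitrary. Note that it looks for the word as a plain substring (compared in lower case) rather than as a regular expression.

```shell
# Hypothetical sample file; the first record has a two-line sequence.
printf '%s\n' \
  '>id1 AC001 Zea mays Endosperm cDNA' \
  'ACGTACGT' \
  'ACGT' \
  '>id2 AC002 Zea mays leaf cDNA' \
  'TTGGCCAA' \
  '>id3 AC003 endosperm-specific clone' \
  'GGGCCC' > sample.fa

# On each ID line, set keep to 1 if the line mentions the word, else 0.
# The bare "keep" pattern then prints every line while keep is 1.
filtered=$(awk -v word=endosperm '
  /^>/ { keep = (index(tolower($0), tolower(word)) > 0) }
  keep
' sample.fa)
echo "$filtered"
```

    With this sample, the id1 and id3 records should be printed with all of their sequence lines, and the id2 record should be skipped.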

    Does this make sense?
