- 09-03-2007 #1 Just Joined!
- Join Date
- Aug 2006
- Location
- Mexico
- Posts
- 17
Database filtering
Hello everyone,
Does anybody know how to filter databases? I have databases of maize sequences in FASTA format, but I'm only interested in the sequences that mention endosperm in the ID line. The format of the databases is:
>id acc
sequence
How could I extract only the sequences that mention endosperm in the ID? Thank you for your help.
- 09-04-2007 #2
FASTA files can be a lot of fun. Fortunately, Linux has a number of utilities for getting what you want out of them.
This particular problem, well, it depends on how you want to do this. For instance, the following line will get you all IDs that contain 'endosperm', case-insensitive.
Code:
egrep -i '^>.*endosperm' FILE
This line does the following:
egrep - The grep utility, using extended regular expressions. grep searches a file for lines that match a given pattern
-i - Case-insensitive match
^>.*endosperm - Look for a line beginning with '>', followed by 0 or more of any character, followed by 'endosperm'.
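To make the pattern above concrete, here is a small sketch that wraps the same search in a function. The name endosperm_ids is made up for this example, and it uses grep -E, which is the same thing as egrep. It prints only the matching ID lines, not the sequences under them.

```shell
#!/bin/bash
# endosperm_ids FILE: print only the ID lines that mention "endosperm".
# (endosperm_ids is a hypothetical helper name, not a standard tool.)
endosperm_ids() {
    # -E: extended regular expressions (equivalent to egrep)
    # -i: case-insensitive, so "Endosperm" and "ENDOSPERM" also match
    grep -Ei '^>.*endosperm' "$1"
}
```

You would call it as "endosperm_ids FILE", where FILE is your FASTA database.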
If you are looking for all sequences that have endosperm in the ID line, I suggest the following Bash script:
Code:
#!/bin/bash
word=endosperm
exec 3< "$1"
state=out
while read line <&3; do
    if echo "$line" | egrep -iq "^>.*${word}"; then
        state=in
        echo "$line"
        continue
    fi
    if [ state == 'out' ]; then
        continue
    fi
    if ! echo "$line" | egrep -iq '^>.*${word}'; then
        state=out
    else
        echo "$line"
    fi
done
This is untested, but should print out all ID lines that include 'endosperm', together with their accompanying sequences.
- 09-04-2007 #3 Just Joined!
Hi Cabhan,
I tried the two methods you recommended. The first one worked without problems, but it only gave me the ID lines containing the word "endosperm" and didn't include the sequences. So I tried the script you sent me, and I have a question about it. After changing the file attributes, I ran it as:
rodrigo@belladonna:~$ ./endoextract.sh /blastdb/zmexp >> endo.zmexp
But it didn't do anything. I waited for a while, but nothing happened. Is there something I'm doing wrong? Thanks again for your help.
- 09-04-2007 #4
Well, I feel silly. The main problem was that, in my comparison, I was using "state == 'out'", rather than "$state == 'out'", so the test compared the literal word "state" instead of the variable's value. I have also made the script simpler:
Code:
#!/bin/bash
word=endosperm
exec 3< "$1"
state=out
while read line <&3; do
    # An ID line starts with '>': decide whether this record is wanted
    if echo "$line" | egrep -iq "^>"; then
        if echo "$line" | egrep -iq "$word"; then
            state=in
        else
            state=out
        fi
    fi
    # Print ID and sequence lines while inside a wanted record
    if [ "$state" == 'in' ]; then
        echo "$line"
    fi
done
So basically, we start in the "out" state. We then begin reading the file. If a line begins with a '>' (that is, it is an ID line), we check whether it also mentions the word we're looking for. If so, we enter the "in" state. Otherwise, we enter (or remain in) the "out" state. Then, whenever we are in the "in" state, we print out every line, until we re-enter the "out" state.
Because we don't leave the "in" state until we reach an ID that doesn't contain the word we're looking for, if an ID _does_ have the right word, all following sequence lines will be printed as well.
Does this make sense?
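If the script above turns out to be slow on a big database (it starts egrep once or twice for every line), the same in/out idea can be sketched with Bash's built-in pattern matching instead. This is only a sketch under a few assumptions: the function name filter_fasta and its two arguments are made up for this example, and it relies on the Bash option nocasematch to keep the match case-insensitive.

```shell
#!/bin/bash
# filter_fasta WORD FILE: print every FASTA record whose ID line contains WORD.
# (filter_fasta is a hypothetical name chosen for this sketch.)
filter_fasta() {
    local word=$1 file=$2 line state=out
    shopt -s nocasematch    # [[ ... == pattern ]] now ignores case
    # "|| [[ -n $line ]]" keeps a last line that has no trailing newline
    while IFS= read -r line || [[ -n $line ]]; do
        if [[ $line == '>'* ]]; then
            # ID line: enter "in" if it mentions the word, otherwise "out"
            if [[ $line == *"$word"* ]]; then
                state=in
            else
                state=out
            fi
        fi
        # Print ID and sequence lines only while inside a wanted record
        if [[ $state == in ]]; then
            printf '%s\n' "$line"
        fi
    done < "$file"
    shopt -u nocasematch    # note: turns the option off even if it was on before
}
```

With the file from the earlier post, the call would look like "filter_fasta endosperm /blastdb/zmexp > endo.zmexp". Since the output is redirected, nothing shows up on the terminal while it runs; the results go into endo.zmexp.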


