- 05-31-2010 #1Just Joined!
- Join Date
- May 2010
- Posts
- 5
finding common entries in fields using awk
I want to extract all the common entries in the 3 columns (fields) of a file using awk. Can somebody tell me how this can be done in awk? The file is:
NC_000858.ptt NC_000858.fna NC_001403.rnt
NC_001362.ptt NC_001436.fna NC_001407.rnt
NC_001364.ptt NC_001503.fna NC_001488.rnt
NC_001402.ptt NC_001819.fna NC_001499.rnt
NC_001403.ptt NC_001831.fna NC_001502.rnt
NC_001407.ptt NC_003059.fna NC_001503.rnt
NC_001408.ptt NC_003323.fna NC_001506.rnt
NC_001413.ptt NC_004455.fna NC_001702.rnt
NC_001414.ptt NC_004994.fna NC_001866.rnt
NC_001436.ptt NC_006934.fna NC_001871.rnt
NC_001450.ptt NC_009889.fna NC_008094.rnt
NC_001452.ptt NC_010819.fna
NC_001463.ptt NC_010955.fna
NC_001482.ptt NC_011546.fna
NC_001488.ptt NC_011800.fna
NC_001494.ptt
NC_001499.ptt
NC_001500.ptt
NC_001501.ptt
NC_001502.ptt
NC_001503.ptt
NC_001506.ptt
NC_001511.ptt
NC_001514.ptt
NC_001549.ptt
NC_001550.ptt
NC_001618.ptt
NC_001702.ptt
NC_001722.ptt
NC_001724.ptt
NC_001802.ptt
NC_001819.ptt
NC_001831.ptt
NC_001866.ptt
NC_001867.ptt
NC_001870.ptt
NC_001871.ptt
NC_001885.ptt
NC_001940.ptt
NC_002201.ptt
NC_003323.ptt
NC_004455.ptt
NC_004994.ptt
NC_005947.ptt
NC_006934.ptt
NC_007015.ptt
NC_007654.ptt
NC_007815.ptt
NC_008094.ptt
NC_009424.ptt
NC_009889.ptt
NC_010819.ptt
NC_010955.ptt
NC_011546.ptt
NC_011800.ptt
- 05-31-2010 #2Just Joined!
- Join Date
- Jul 2008
- Posts
- 81
Before you get much help you will have to explain what you mean by "common entries". Maybe point out a couple of examples in the data.
- 06-01-2010 #3Just Joined!
- Join Date
- May 2010
- Posts
- 5
Meaning of common entries
By "common entries in the fields", I mean extracting the entries that are common to the columns of data like this:
NC_000858.ptt NC_000858.fna NC_001403.rnt
NC_001362.ptt NC_001436.fna NC_001407.rnt
NC_001364.ptt NC_001503.fna NC_001488.rnt
NC_001402.ptt NC_001819.fna NC_001499.rnt
NC_001403.ptt NC_001831.fna NC_001502.rnt
NC_001407.ptt NC_003059.fna NC_001503.rnt
NC_001408.ptt NC_003323.fna NC_001506.rnt
NC_001413.ptt NC_004455.fna NC_001702.rnt
NC_001414.ptt NC_004994.fna NC_001866.rnt
NC_001436.ptt NC_006934.fna NC_001871.rnt
NC_001450.ptt NC_009889.fna NC_008094.rnt
NC_001452.ptt NC_010819.fna
NC_001463.ptt NC_010955.fna
NC_001482.ptt NC_011546.fna
NC_001488.ptt NC_011800.fna
NC_001494.ptt
NC_001499.ptt
NC_001500.ptt
NC_001501.ptt
NC_001502.ptt
NC_001503.ptt
NC_001506.ptt
NC_001511.ptt
NC_001514.ptt
NC_001549.ptt
NC_001550.ptt
NC_001618.ptt
NC_001702.ptt
NC_001722.ptt
NC_001724.ptt
NC_001802.ptt
NC_001819.ptt
NC_001831.ptt
NC_001866.ptt
NC_001867.ptt
NC_001870.ptt
NC_001871.ptt
NC_001885.ptt
NC_001940.ptt
NC_002201.ptt
NC_003323.ptt
NC_004455.ptt
NC_004994.ptt
NC_005947.ptt
NC_006934.ptt
NC_007015.ptt
NC_007654.ptt
NC_007815.ptt
NC_008094.ptt
NC_009424.ptt
NC_009889.ptt
NC_010819.ptt
NC_010955.ptt
NC_011546.ptt
NC_011800.ptt
- 06-01-2010 #4Just Joined!
- Join Date
- Jan 2006
- Posts
- 3
Is this the result you expect ?
Code:
awk '{ for (i = 1; i <= NF; i++) { split($i, a, "."); printf "%s\t%s\n", a[1], $0 } }' yourfile.txt |
  sort |
  awk 'BEGIN { R = $1; T = $0; n = 0 }
       { if ($1 == R) { print T; n++ } else { if (n >= 1) print T; n = 0 } R = $1; T = $0 }
       END { if (n >= 1) print T }'
- 06-01-2010 #5Just Joined!
- Join Date
- May 2010
- Posts
- 5
Thanks, but the script does not seem to be working.
It gives some unusual results, whereas I want my result to look like:
NC_001503.ptt NC_001503.fna NC_001503.rnt
i.e. all the common entries should be displayed after the script runs.
- 06-01-2010 #6Just Joined!
- Join Date
- Jan 2008
- Location
- Iowa
- Posts
- 6
Is awk the appropriate tool for this?
As I understand it, an entry is considered "common" if the NAME part (i.e., that which precedes the last dot of the entire filename) occurs once in field 0, field 1, and field 2 of a given file. The example suggests that each column/field is associated with a unique file extension, which probably means that you don't need to store the extension part of the name. I will also assume that filenames do not contain spaces. In any case, it seems like a Perl or Python script that uses a dictionary data structure would be a better choice than trying to do this with awk, as follows:
As you encounter each NAME in field K of the file, add 2^K to a dictionary entry for NAME, i.e., dict[NAME]+=2**K. The common names are simply all keys in the dictionary whose value is 7, i.e. 1 + 2 + 4 (it is slightly more complicated than this if NAME can occur more than once in fields 0-2). I have seen awk scripts that simulate dictionary data structures, but it would be easier in a scripting language that supports that idea.
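To make the bit-mask idea concrete, here is a minimal sketch on made-up sample data (the names A, B and C are hypothetical and not taken from the file in this thread). It prints every name with its accumulated mask, using awk's 1-based field number i, so column i contributes 2^(i-1). Only a name seen in all three columns reaches 7.

```shell
# Hypothetical sample: A appears in all three columns, B in two, C in one.
masks=$(printf '%s\n' 'A.ptt A.fna B.rnt' 'B.ptt C.fna A.rnt' |
  awk '{
         for (i = 1; i <= NF; i++) {
           split($i, part, ".")          # part[1] is the name without its extension
           mask[part[1]] += 2 ^ (i - 1)  # column 1 adds 1, column 2 adds 2, column 3 adds 4
         }
       }
       END { for (name in mask) print name, mask[name] }' |
  sort)
printf '%s\n' "$masks"
```

Here A ends up with 7, B with 5 (columns 1 and 3) and C with 2 (column 2 only), so only A would be reported as common.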
- 06-01-2010 #7Just Joined!
- Join Date
- Jan 2008
- Location
- Iowa
- Posts
- 6
Using dictionaries in awk
Oops. I did not realize that arrays in awk ARE dictionaries (i.e., associative arrays). So what you want to do is easy in awk with the following script:
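# For each line, add a per-column bit to the entry for each name.
{
    for (i = 1; i <= NF; i++) {
        split($i, a, ".")            # a[1] is the name without its extension
        dict[a[1]] += 2 ^ (i - 1)    # ^ is the portable awk exponent operator (** is an extension)
    }
}
# A name seen once in each of the three columns sums to 1 + 2 + 4 = 7.
END {
    for (n in dict) {
        if (dict[n] == 7)
            printf "%s.ptt\t%s.fna\t%s.rnt\n", n, n, n
    }
}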
Thus, if the above code is in a file called common.awk and your data is in a file called datafile.txt, then one of the following should do it for you:
awk -f common.awk datafile.txt
--or--
cat datafile.txt | awk -f common.awk
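I am assuming here that .ptt files are in column 1, .fna files in column 2, and .rnt files in column 3. Lines in datafile.txt may be missing column 1 or 2 filenames as long as the tabs are still present, but only if awk is told to split on tabs (e.g. awk -F'\t' -f common.awk datafile.txt); with the default whitespace splitting, an empty leading column shifts the remaining fields to the left. Thus, your example, which shows more .ptt files than .fna files and more .fna files than .rnt files, need not always be the case.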
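If a name could appear more than once in a column, or if the column position cannot be trusted, the sum to 7 stops being reliable. One possible variant, offered only as a sketch, records each name/extension pair once and counts distinct extensions per name. It assumes every entry has the form NAME.ext with a single dot and that exactly three extensions occur in the data; the sample rows are made up.

```shell
# Made-up sample: NC_3.fna is deliberately repeated to show that duplicates are ignored.
common=$(printf '%s\n' \
    'NC_1.ptt NC_2.fna NC_3.rnt' \
    'NC_2.ptt NC_3.fna NC_2.rnt' \
    'NC_3.ptt NC_3.fna' |
  awk '{
         for (i = 1; i <= NF; i++) {
           n = split($i, part, ".")
           if (n != 2) continue            # skip anything that is not NAME.ext
           key = part[1] SUBSEP part[2]
           if (!(key in seen)) {           # count each name/extension pair only once
             seen[key] = 1
             exts[part[1]]++
           }
         }
       }
       END { for (name in exts) if (exts[name] == 3) print name }' |
  sort)
printf '%s\n' "$common"
```

On this sample, NC_2 and NC_3 are reported, while NC_1 (seen only as .ptt) is not.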
- 06-01-2010 #8Just Joined!
- Join Date
- Jul 2008
- Posts
- 81
Your problem is still not very well specified. Can there be duplicate entries?
If the only suffixes are .ptt .fna .rnt why do you need to see them separately in the output?
Barring other complications, a really simple pipeline of fundamental commands can do the job. Put each entry on a line by itself, sort, then keep the items whose first 9 characters (the name without its extension) occur exactly three times.
< data tr ' ' '\n' | sort | uniq -w9 -c | grep ' 3'
3 NC_001503.fna
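Note that `uniq -w` is a GNU extension, and `grep ' 3'` matches any line containing a space followed by a 3. A slightly more defensive sketch, still assuming no duplicate entries within a column, strips the extension first and compares the count numerically. The sample rows are made up.

```shell
names=$(printf '%s\n' \
    'NC_1.ptt NC_2.fna NC_3.rnt' \
    'NC_2.ptt NC_3.fna NC_2.rnt' \
    'NC_3.ptt' |
  tr -s ' \t' '\n\n' |      # one entry per line, squeezing repeated separators
  sed 's/\.[^.]*$//' |      # drop the extension after the last dot
  sort | uniq -c |
  awk '$1 == 3 { print $2 }')
printf '%s\n' "$names"
```

On this sample, NC_2 and NC_3 each occur three times and are printed; NC_1 occurs once and is dropped.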
- 06-01-2010 #9Just Joined!
- Join Date
- May 2010
- Posts
- 5
No, there can't be any duplicate entries, as each filename in a column is a unique identifier, so the problem of duplicates does not arise. Secondly, it is not necessary to see the extensions in the output.
But thanks, as the problem seems to be getting solved now. Thanks to all.