  1. #1
    Just Joined!
    Join Date
    May 2010
    Posts
    5

    Finding common entries in fields using awk

    I want to extract all the entries that are common to the 3 columns (fields) of a file using awk. Can somebody tell me how this can be done in awk? The file is:

    NC_000858.ptt NC_000858.fna NC_001403.rnt
    NC_001362.ptt NC_001436.fna NC_001407.rnt
    NC_001364.ptt NC_001503.fna NC_001488.rnt
    NC_001402.ptt NC_001819.fna NC_001499.rnt
    NC_001403.ptt NC_001831.fna NC_001502.rnt
    NC_001407.ptt NC_003059.fna NC_001503.rnt
    NC_001408.ptt NC_003323.fna NC_001506.rnt
    NC_001413.ptt NC_004455.fna NC_001702.rnt
    NC_001414.ptt NC_004994.fna NC_001866.rnt
    NC_001436.ptt NC_006934.fna NC_001871.rnt
    NC_001450.ptt NC_009889.fna NC_008094.rnt
    NC_001452.ptt NC_010819.fna
    NC_001463.ptt NC_010955.fna
    NC_001482.ptt NC_011546.fna
    NC_001488.ptt NC_011800.fna
    NC_001494.ptt
    NC_001499.ptt
    NC_001500.ptt
    NC_001501.ptt
    NC_001502.ptt
    NC_001503.ptt
    NC_001506.ptt
    NC_001511.ptt
    NC_001514.ptt
    NC_001549.ptt
    NC_001550.ptt
    NC_001618.ptt
    NC_001702.ptt
    NC_001722.ptt
    NC_001724.ptt
    NC_001802.ptt
    NC_001819.ptt
    NC_001831.ptt
    NC_001866.ptt
    NC_001867.ptt
    NC_001870.ptt
    NC_001871.ptt
    NC_001885.ptt
    NC_001940.ptt
    NC_002201.ptt
    NC_003323.ptt
    NC_004455.ptt
    NC_004994.ptt
    NC_005947.ptt
    NC_006934.ptt
    NC_007015.ptt
    NC_007654.ptt
    NC_007815.ptt
    NC_008094.ptt
    NC_009424.ptt
    NC_009889.ptt
    NC_010819.ptt
    NC_010955.ptt
    NC_011546.ptt
    NC_011800.ptt

  2. #2
    Just Joined!
    Join Date
    Jul 2008
    Posts
    81
    Before you get much help you will have to explain what you mean by "common entries". Maybe point out a couple of examples in the data.

  3. #3
    Just Joined!
    Join Date
    May 2010
    Posts
    5

    meaning of common entries::

    By common entries in the fields, I mean extracting entries like:

    NC_000858.ptt NC_000858.fna NC_001403.rnt
    NC_001362.ptt NC_001436.fna NC_001407.rnt
    NC_001364.ptt NC_001503.fna NC_001488.rnt
    NC_001402.ptt NC_001819.fna NC_001499.rnt
    NC_001403.ptt NC_001831.fna NC_001502.rnt
    NC_001407.ptt NC_003059.fna NC_001503.rnt
    NC_001408.ptt NC_003323.fna NC_001506.rnt
    NC_001413.ptt NC_004455.fna NC_001702.rnt
    NC_001414.ptt NC_004994.fna NC_001866.rnt
    NC_001436.ptt NC_006934.fna NC_001871.rnt
    NC_001450.ptt NC_009889.fna NC_008094.rnt
    NC_001452.ptt NC_010819.fna
    NC_001463.ptt NC_010955.fna
    NC_001482.ptt NC_011546.fna
    NC_001488.ptt NC_011800.fna
    NC_001494.ptt
    NC_001499.ptt
    NC_001500.ptt
    NC_001501.ptt
    NC_001502.ptt
    NC_001503.ptt
    NC_001506.ptt
    NC_001511.ptt
    NC_001514.ptt
    NC_001549.ptt
    NC_001550.ptt
    NC_001618.ptt
    NC_001702.ptt
    NC_001722.ptt
    NC_001724.ptt
    NC_001802.ptt
    NC_001819.ptt
    NC_001831.ptt
    NC_001866.ptt
    NC_001867.ptt
    NC_001870.ptt
    NC_001871.ptt
    NC_001885.ptt
    NC_001940.ptt
    NC_002201.ptt
    NC_003323.ptt
    NC_004455.ptt
    NC_004994.ptt
    NC_005947.ptt
    NC_006934.ptt
    NC_007015.ptt
    NC_007654.ptt
    NC_007815.ptt
    NC_008094.ptt
    NC_009424.ptt
    NC_009889.ptt
    NC_010819.ptt
    NC_010955.ptt
    NC_011546.ptt
    NC_011800.ptt

  4. #4
    Just Joined!
    Join Date
    Jan 2006
    Posts
    3

    Question

    Is this the result you expect ?
    Code:
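    awk '{
      # Emit one line per field: the name without its extension, then the full input line
      for ( i= 1; i<=NF; i++) {
        split( $i, a, ".");
        printf "%s\t%s\n",  a[1], $0;
      }
    }
    ' yourfile.txt| sort | awk '
    BEGIN{
      R=""; T=""; n= 0   # previous key, previous line, repeat count
    }
    {
      # Print the previous line whenever its key repeats (or ends a run of repeats)
      if ( $1==R){ print T; n++}
      else { if (n>= 1) print T; n= 0}
      R=$1; T=$0
    }
    END{
      if(n>=1) print T;
    }
    '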

  5. #5
    Just Joined!
    Join Date
    May 2010
    Posts
    5

    Thanks, but the script does not seem to be working

    The script is giving some unusual results instead, whereas I want my result to look like:

    NC_001503.ptt NC_001503.fna NC_001503.rnt

    i.e., all the common entries should be displayed after the script runs.

  6. #6
    Just Joined!
    Join Date
    Jan 2008
    Location
    Iowa
    Posts
    6

    Is awk the appropriate tool for this?

    As I understand it, an entry is considered "common" if the NAME part (i.e., that which precedes the last dot of the entire filename) occurs once in field 0, field 1, and field 2 of a given file. The example suggests that each column/field is associated with a unique file extension, which probably means that you don't need to store the extension part of the name. I will also assume that filenames do not contain spaces. In any case, it seems like a Perl or Python script that uses a dictionary data structure would be a better choice than trying to do this with awk, as follows:

    As you encounter each NAME in field K of the file, add 2^K to a dictionary entry for NAME, i.e., dict[NAME]+=2**K. The common names are simply all keys in the dictionary whose value is 7 (1 + 2 + 4, meaning the name was seen in fields 0, 1, and 2). It is slightly more complicated than this if NAME can occur more than once in fields 0-2. I have seen awk scripts that simulate dictionary data structures, but it would be easier in a scripting language that supports that idea.

  7. #7
    Just Joined!
    Join Date
    Jan 2008
    Location
    Iowa
    Posts
    6

    Using dictionaries in awk

    Oops. I did not realize that arrays in awk ARE dictionaries (i.e., associative arrays). So what you want to do is easy in awk with the following awk script:

    {
      for (i = 1; i <= NF; i++) {
        split($i, a, ".");
        dict[a[1]] += 2^(i-1);   # bit value 1, 2, or 4 for field 1, 2, or 3
      }
    }
    END {
      for (n in dict) {
        if (dict[n] == 7)        # 7 = 1 + 2 + 4: seen in all three fields
          printf "%s.ptt\t%s.fna\t%s.rnt\n", n, n, n;
      }
    }

    Thus, if the above code is in a file called common.awk and your data is in a file called datafile.txt, then one of the following should do it for you:

    awk -f common.awk datafile.txt
    --or--
    cat datafile.txt | awk -f common.awk

    I am assuming here that .ptt files are in column 1, .fna files in column 2, and .rnt files in column 3, and that lines in datafile.txt may be missing column 1 or 2 filenames as long as the tabs are still present and awk is told to split on tabs with -F'\t' (the default field splitting collapses consecutive whitespace, which would shift the remaining columns). Thus, your example that shows more .ptt files than .fna files and more .fna files than .rnt files need not always be the case.
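
    If the column positions cannot be relied on, one option is to key on the extension instead of the field number. The following is only a rough sketch under that assumption: the function name common_names is made up for illustration, and it assumes every entry has the form NAME.ext with exactly one dot. It also counts each name/extension pair once, so repeated entries should not inflate the count.

    ```shell
    # Hypothetical helper (name invented for this sketch): prints the names
    # that occur with all three extensions, wherever they appear on a line.
    common_names() {
      awk '
        {
          for (i = 1; i <= NF; i++) {
            n = split($i, part, ".")
            if (n != 2) continue            # assumption: entries look like NAME.ext
            name = part[1]; ext = part[2]
            if (!((name, ext) in seen)) {   # count each name/extension pair once
              seen[name, ext] = 1
              if (ext == "ptt" || ext == "fna" || ext == "rnt") count[name]++
            }
          }
        }
        END {
          for (name in count)
            if (count[name] == 3)           # seen as .ptt, .fna, and .rnt
              printf "%s.ptt\t%s.fna\t%s.rnt\n", name, name, name
        }
      ' "$@"
    }
    ```

    With the function defined in the current shell, "common_names datafile.txt" would be the equivalent of the commands above. The order of the output lines is whatever awk's "for (name in count)" happens to produce, so pipe the result through sort if the order matters.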

  8. #8
    Just Joined!
    Join Date
    Jul 2008
    Posts
    81
    Your problem is still not very well specified. Can there be duplicate entries?
    If the only suffixes are .ptt .fna .rnt why do you need to see them separately in the output?

    Barring other complications, a really simple pipeline of fundamental commands can do the job. Put each entry on a line by itself, sort, find items that are unique in the first 9 characters and occur exactly three times.

    < data tr ' ' '\n' | sort | uniq -w9 -c | grep ' 3'
    3 NC_001503.fna
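
    A possible variation on the same idea, in case the -w option of uniq is not available (as far as I know it is a GNU extension) or the names are not always 9 characters long: strip the extension first, then compare the count exactly instead of grepping for ' 3'. This is only a sketch; the function name names_in_all_columns is invented here, and it assumes no name is repeated within a column.

    ```shell
    # Hypothetical helper (name invented for this sketch): prints each name
    # that appears exactly three times once its extension is removed.
    names_in_all_columns() {
      tr -s ' \t' '\n\n' < "$1" |            # one entry per line, blanks squeezed
        sed -e '/^$/d' -e 's/\.[^.]*$//' |   # drop empty lines, strip the extension
        sort |
        uniq -c |                            # prefix each distinct name with its count
        awk '$1 == 3 { print $2 }'           # keep names counted exactly three times
    }
    ```

    It takes the data file as its only argument, e.g. "names_in_all_columns data", and prints just the names without extensions.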

  9. #9
    Just Joined!
    Join Date
    May 2010
    Posts
    5
    No, there can't be any duplicate entries, as each filename in a column is a unique identifier, so the problem of duplicate entries does not arise. Secondly, it is not necessary to see the extensions in the output.

    But thanks, as the problem seems to be getting solved now. Thanks to all.
