  1. #1
    Just Joined!
    Join Date
    May 2007
    Posts
    8

    file division using C

    Hello All,

    I am trying to split a data file into several files according to a rank file.

    For example.
    data file contains

    >seq A
    aagagagagtctctgtagatgatcgctgctcgctgct
    agctcgatcgtacgcatactacgactgacgactacgcat
    gctacgcatacgactcagcatgatcatgacgtactacg
    >seq B
    aagagagagtctctgtagatgatcgctgctcgctgct
    agctcgatcgtacgcatactacgactgacgactacgcat
    gctacgcatacgactcagcatgatcatgacgtactacg
    >seq C
    aagagagagtctctgtagatgatcgctgctcgctgct
    agctcgatcgtacgcatactacgactgacgactacgcat
    gctacgcatacgactcagcatgatcatgacgtactacg


    and rank file contains

    seqA :0.5
    seqB :1.5
    seqC :2.5


    What I want is for the sequences in the data file to be divided into files according to these ranks...

    i.e.

    if(0<rank of seq <1)
    put seq in file 1 (in this case the whole seq A would go in file1)

    if(1<rank of seq<2)
    put seq in file 2 (whole of seq B would go in file 2 from the above example)

    if(2<rank of seq <3)
    put seq in file 3 (whole of seq C would go in file 3 from the above example)


    and so on and so forth...


    I wrote a bash script to do all this and it works fine, but it takes a lot of time in bash, so I want the program to be in C.

    Any help in this regard from the C gurus would be highly appreciated, as I am not good at C...


    Thanks in advance

  2. #2
    Just Joined!
    Join Date
    May 2007
    Posts
    8
    Anyone, please help!

  3. #3
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Well, this seems not too tough.

    Are we assured that the order of sequences is the same in both files? This matters for efficiency. If we are assured of this, we can simply traverse both simultaneously. Otherwise we need to start from the beginning for each sequence.

    Now then:

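    I propose starting with the data file. The main reason is that if we are not guaranteed the same order, we have to rescan one of the files for every sequence, and it is much faster to restart from the beginning of the rank file (one short line per sequence) than to restart the data file.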

    So we read the data file line-by-line. This is a FASTA file, and therefore the headers all start with '>'. This is very easy to check for.

    We have found the header, and therefore we know the ID. We now read through the rank file until we find that ID and its accompanying rank. We can use strtod() (check the man page) to convert the rank into a double, and use that double in a comparison. Alternatively, a better bet may be to simply round the double up (using the ceil() function in math.h), and this is the number of the file. By using ceil() instead of a comparison, we can do this for any rank, rather than by writing a ton of if statements.
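
    A minimal sketch of that lookup, in case it helps. The function name lookup_rank, the 256-byte line buffer, and the choice to return -1.0 when the ID is missing are all assumptions made for illustration. It rewinds the rank file on every call, so it works even when the two files are not in the same order, and it tolerates spaces between the name and the colon.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: search the rank file for `id` and return its rank,
 * or -1.0 if no line of the form "id:rank" (or "id :rank") is found. */
double lookup_rank(FILE *rank, const char *id)
{
    char line[256];                 /* assumed maximum line length */
    size_t idlen = strlen(id);

    rewind(rank);                   /* always search from the beginning */
    while (fgets(line, sizeof line, rank) != NULL)
    {
        const char *p;

        if (strncmp(line, id, idlen) != 0)
            continue;               /* line does not start with this ID */

        p = line + idlen;
        while (*p == ' ')
            p++;                    /* skip spaces before the colon */

        if (*p == ':')
            return strtod(p + 1, NULL);   /* the number after the colon */
    }
    return -1.0;                    /* ID not present in the rank file */
}
```

    If the files are known to be in the same order, the rewind() can be dropped so that each lookup continues where the previous one stopped.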

    Do note that if ranks can be whole numbers, you will want to decide whether a sequence of rank exactly 2 goes in file 2 or file 3, and check for this explicitly. With ceil() alone, a rank of 2.0 stays 2, so it would land in file 2.
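
    One way the rank-to-file mapping might look. The name file_number and the whole_goes_up flag are hypothetical, and the sketch assumes ranks are non-negative and small enough to fit in an int. Note that it rounds up with an integer cast instead of calling ceil(); for non-negative ranks the result is the same, and it avoids having to link against the math library.

```c
/* Hypothetical helper: map a non-negative rank to a 1-based file number.
 * whole_goes_up chooses the (assumed) policy for whole-number ranks:
 *   0: a rank of 2.0 goes in file 2 (the result ceil() would give)
 *   1: a rank of 2.0 goes in file 3 */
int file_number(double rank, int whole_goes_up)
{
    int n = (int)rank;      /* truncates toward zero: 1.5 -> 1, 2.0 -> 2 */

    if (rank > (double)n)
        return n + 1;       /* fractional rank: round up, as ceil() would */

    return whole_goes_up ? n + 1 : n;   /* rank is a whole number */
}
```

    With the ranks from the first post, 0.5, 1.5 and 2.5 map to files 1, 2 and 3 under either policy; only whole-number ranks are affected by the flag.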

    This is not particularly difficult to write, but if you don't know much C, you may have trouble. Give it a try, and if you have difficulty, please let us know.
    DISTRO=Arch
    Registered Linux User #388732

  4. #4
    Just Joined!
    Join Date
    May 2007
    Posts
    8
    Thank you...
    If someone could help me write these two portions of the code, it would be an enormous help:

    how to read a sequence from ">nameofseq" to its end...

    how to read the rank of the sequence, i.e. how do I read the number after "nameofseq:number"?


    Thanks a lot for all the help. It is really appreciated.

  5. #5
    Trusted Penguin Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,230
    Alrighty. So you want to read the entire sequence. Fairly simple:
    Code:
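    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* the statements below belong inside a function such as main() */
    FILE *data = fopen("data_file", "r");
    char *line = calloc(81, sizeof(char)); /* I am assuming a line size of 80, feel free to make this number larger or smaller as needed */
    char *id = calloc(100, sizeof(char));

    if(data == NULL || line == NULL || id == NULL)
        exit(EXIT_FAILURE); /* could not open the file or allocate memory */

    while(fgets(line, 81, data) != NULL)
    {
        if(line[0] == '>') /* line starts with '>', so this is the ID */
        {
            strncpy(id, line + 1, 99);    /* copy everything after the '>' */
            id[strcspn(id, "\n")] = '\0'; /* drop the trailing newline */
        }
        else /* line does not start with '>', meaning that this line is a part of the actual sequence */
        {
            /* do something with the line */
        }
    }

    fclose(data);
    free(line);
    free(id);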
    Does that all make sense?
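
    One detail in the example files: the header reads ">seq A" while the rank file says "seqA". A rough sketch of normalizing the header before comparing the two is below. The helper name header_to_id is hypothetical, and dropping the spaces is only an assumption about how the two files are meant to match.

```c
#include <stddef.h>

/* Hypothetical helper: copy the ID from a FASTA header line into `id`
 * (a buffer of `size` bytes), leaving out the leading '>', any spaces or
 * tabs, and the trailing newline. Characters that do not fit are dropped. */
void header_to_id(const char *line, char *id, size_t size)
{
    size_t n = 0;

    if (size == 0)
        return;                     /* no room even for the terminator */
    if (*line == '>')
        line++;                     /* skip the header marker */

    for (; *line != '\0' && *line != '\n' && *line != '\r'; line++)
    {
        if (*line == ' ' || *line == '\t')
            continue;               /* assumed: spaces are not part of the ID */
        if (n + 1 < size)
            id[n++] = *line;
    }
    id[n] = '\0';
}
```

    Called as header_to_id(line, id, 100) inside the if(line[0] == '>') branch, this would turn ">seq A" into "seqA".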

    As far as reading the number:
    Code:
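    #include <stdio.h>
    #include <stdlib.h>

    /* the statements below belong inside a function such as main() */
    FILE *rank = fopen("rank_file", "r");
    char *line = calloc(81, sizeof(char));
    double ranknum;

    if(rank == NULL || line == NULL)
        exit(EXIT_FAILURE); /* could not open the file or allocate memory */

    while(fgets(line, 81, rank) != NULL)
    {
        char *s = line;
        while(*s != '\0' && *s != ':') s++; /* stop at the colon or at the end of the line */
        if(*s == '\0') continue;            /* no colon on this line, skip it */
        *s++ = '\0';
        printf("Name of Sequence: %s\n", line);
        ranknum = strtod(s, NULL);
        printf("Rank: %f\n", ranknum);
    }

    fclose(rank);
    free(line);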
    This one is a bit tougher. s traverses line until it reaches a ':'. When it does, it replaces the colon with a NUL byte, which makes line into just the name of the sequence (as a C string ends when it reaches the first NUL byte). s then bumps itself to point just after the NUL byte, and we convert that part of the string into a double. That's the rank.

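    Note that this assumes that there are no spaces before the ':'. Spaces after the colon are harmless, because strtod() skips leading whitespace on its own. Spaces before the colon (as in "seqA :0.5") are left at the end of the name, so you need to trim them off before comparing the name with the ID from the data file.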
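
    A small sketch of handling those spaces, under the assumption that a rank line looks like "name :rank" with optional spaces or tabs next to the colon. The name parse_rank_line and its return convention are made up for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split a rank line in place.
 * On success, `line` is cut down to the name (with trailing spaces/tabs
 * removed), *rank receives the number after the colon, and 1 is returned.
 * Returns 0 if the line contains no ':'. */
int parse_rank_line(char *line, double *rank)
{
    char *colon = strchr(line, ':');
    char *end;

    if (colon == NULL)
        return 0;                   /* not a "name:rank" line */
    *colon = '\0';                  /* terminate the name at the colon */

    end = colon;
    while (end > line && (end[-1] == ' ' || end[-1] == '\t'))
        *--end = '\0';              /* trim whitespace before the colon */

    *rank = strtod(colon + 1, NULL);    /* strtod skips leading whitespace */
    return 1;
}
```

    Inside the fgets() loop this would replace the manual pointer walk: call parse_rank_line(line, &ranknum) and skip the line if it returns 0.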

    Let me know if you have any questions.
