- 07-12-2007 #1 Just Joined!
- Join Date
- May 2007
- Posts
- 8
file division using C
Hello All,
I am trying to split a data file into different files using a rank file.
For example, the data file contains
>seq A
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq B
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
>seq C
aagagagagtctctgtagatgatcgctgctcgctgct
agctcgatcgtacgcatactacgactgacgactacgcat
gctacgcatacgactcagcatgatcatgacgtactacg
and the rank file contains
seqA :0.5
seqB :1.5
seqC :2.5
What I want is for the sequences in the data file to be divided into files according to these ranks, i.e.
if (0 < rank of seq < 1)
put seq in file 1 (in this case the whole of seq A would go in file 1)
if (1 < rank of seq < 2)
put seq in file 2 (the whole of seq B would go in file 2 in the above example)
if (2 < rank of seq < 3)
put seq in file 3 (the whole of seq C would go in file 3)
and so on and so forth...
I wrote a bash script to do all this and it works fine, but it takes a lot of time in bash, so I want the program to be in C.
Any help in this regard from the C gurus would be highly appreciated, as I am not good at C.
Thanks in advance
- 07-12-2007 #2 Just Joined!
- Join Date
- May 2007
- Posts
- 8
Any one please help!!!
- 07-12-2007 #3
Well, this seems not too tough.
Are we assured that the order of sequences is the same in both files? This matters for efficiency. If we are assured of this, we can simply traverse both simultaneously. Otherwise we need to start from the beginning for each sequence.
Now then:
I propose starting with the data file. The main reason for this is that if we are not guaranteed the same order, we have to rescan one of the files from its beginning for every sequence, and it is much faster to rescan the rank file than the data file.
So we read the data file line-by-line. This is a FASTA file, and therefore the headers all start with '>'. This is very easy to check for.
We have found the header, and therefore we know the ID. We now read through the rank file until we find that ID and its accompanying rank. We can use strtod() (check the man page) to convert the rank into a double, and use that double in a comparison. Alternatively, a better bet may be to simply round the double up (using the ceil() function in math.h), and this is the number of the file. By using ceil() instead of a comparison, we can do this for any rank, rather than by writing a ton of if statements.
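To make the lookup step concrete, here is a minimal sketch of scanning the rank file from the top for one ID. The function name lookup_rank and the 256-byte line buffer are my own assumptions, not something from the thread, and it expects lines shaped like "seqA :0.5" where the ID matches the name exactly.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: scan the rank file from the top for `id`.
   Returns 1 and stores the rank in *rank if found, 0 otherwise.
   Assumes each line looks like "name:rank" or "name :rank". */
int lookup_rank(FILE *ranks, const char *id, double *rank)
{
    char line[256];              /* assumed maximum line length */
    size_t n = strlen(id);

    rewind(ranks);               /* always restart from the beginning */
    while (fgets(line, sizeof line, ranks) != NULL) {
        const char *p;

        if (strncmp(line, id, n) != 0)
            continue;            /* line does not start with this ID */
        p = line + n;
        while (*p == ' ' || *p == '\t')
            p++;                 /* allow spaces between name and ':' */
        if (*p != ':')
            continue;            /* id was only a prefix of a longer name */
        *rank = strtod(p + 1, NULL);
        return 1;
    }
    return 0;
}
```

Because it rewinds on every call, this works whether or not the two files are in the same order. If they are guaranteed to be in the same order, dropping the rewind() lets you walk both files in a single pass.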
Do note that if ranks can be whole numbers, you will want to decide whether a sequence of rank 2 goes in file 2 or file 3, and check for this appropriately.
This is not particularly difficult to write, but if you don't know much C, you may have trouble. Give it a try, and if you have difficulty, please let us know.
DISTRO=Arch
Registered Linux User #388732
- 07-13-2007 #4 Just Joined!
- Join Date
- May 2007
- Posts
- 8
Thank you...
If someone could help me write these two portions of the code, it would be an enormous help:
how to read the sequences from ">nameofseq" to the end
how to read the rank of the sequence, i.e. how do I read the number after "nameofseq:number"
Thanks a lot for all the help. It is really appreciated.
- 07-13-2007 #5
Alrighty. So you want to read the entire sequence. Fairly simple:
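Code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    FILE *data = fopen("data_file", "r");
    /* I am assuming a line size of 80, feel free to make this number larger or smaller as needed */
    char *line = calloc(81, sizeof(char));
    char *id = calloc(100, sizeof(char));
    if (data == NULL || line == NULL || id == NULL)
        return 1;
    while (fgets(line, 81, data) != NULL) {
        line[strcspn(line, "\n")] = '\0'; /* strip the trailing newline */
        if (line[0] == '>') /* line starts with '>', so this is the ID */
            strcpy(id, line + 1); /* copy everything after the '>' */
        else { /* line does not start with '>', meaning that this line is a part of the actual sequence */
            /* do something with the line */
        }
    }
    free(line);
    free(id);
    fclose(data);
    return 0;
}
Does that all make sense?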
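One possible way to fill in the "do something with the line" step is to keep a single output file open and switch it whenever a new header comes along. This is only a sketch: the helper name switch_bucket and the "<prefix><n>" file naming are assumptions of mine for illustration.

```c
#include <stdio.h>

/* Hypothetical helper: close the current output file (if any) and open
   the file for bucket n in append mode. The "<prefix><n>" naming scheme
   is an assumption. Returns NULL if the file cannot be opened. */
FILE *switch_bucket(FILE *current, const char *prefix, int n)
{
    char path[256];

    if (current != NULL)
        fclose(current);
    snprintf(path, sizeof path, "%s%d", prefix, n);
    return fopen(path, "a");     /* append: several sequences may share a file */
}
```

The idea is to call this once per header line with the file number for that sequence, and then fputs() the header and every following sequence line to the returned stream. Since the files are opened in append mode, delete any old output before each run.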
As far as reading the number:
Code:
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    FILE *rank = fopen("rank_file", "r");
    char *line = calloc(81, sizeof(char));
    double ranknum;
    if (rank == NULL || line == NULL)
        return 1;
    while (fgets(line, 81, rank) != NULL) {
        char *s = line;
        while (*s != '\0' && *s != ':') /* stop at the colon or at the end of the line */
            s++;
        if (*s == '\0')
            continue; /* no colon on this line, so skip it */
        *s++ = '\0'; /* cut the line off at the colon */
        printf("Name of Sequence: %s\n", line);
        ranknum = strtod(s, NULL);
        printf("Rank: %f\n", ranknum);
    }
    free(line);
    fclose(rank);
    return 0;
}
This one is a bit tougher. s traverses line until it reaches a ':'. When it does, it replaces the colon with a NUL byte, which makes line into just the name of the sequence (as a C string ends when it reaches the first NUL byte). s then bumps itself to point just after the NUL byte, and we convert that part of the string into a double. That's the rank.
Note that this assumes that there are no spaces before the ':'. If there are, they end up at the end of the name and you need to trim them off. Spaces after the ':' are not a problem, since strtod() skips leading whitespace on its own.
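If the rank file does have spaces around the ':' (the example in the first post has "seqA :0.5"), something along these lines should cope with it. This is a minimal sketch and parse_rank_line is a name I made up for it.

```c
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: split "name : rank" in place. On success, line
   holds just the name (trailing spaces removed), *rank holds the number,
   and 1 is returned. Returns 0 if there is no ':' or no number after it. */
int parse_rank_line(char *line, double *rank)
{
    char *colon = strchr(line, ':');
    char *end, *numend;

    if (colon == NULL)
        return 0;

    /* cut the name off at the colon, then back up over any spaces */
    end = colon;
    *end = '\0';
    while (end > line && isspace((unsigned char)end[-1]))
        *--end = '\0';

    /* strtod() skips leading whitespace, so "  1.5" parses fine */
    *rank = strtod(colon + 1, &numend);
    return numend != colon + 1;  /* false if nothing was converted */
}
```

Passing a second argument to strtod() lets you tell a rank of 0 apart from a line that has no number at all, which the NULL version cannot do.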
Let me know if you have any questions.
DISTRO=Arch
Registered Linux User #388732

