  1. #1
    Just Joined!
    Join Date
    May 2006
    Posts
    14

    C/C++ Fast reading large ASCII data file to array or struct


    Quickest method to read large ASCII file.

    I have large ASCII files. Each line is of fixed length (80 characters). Each line has a number of pieces of data (~1 of varying type, char, int, double, of varying width, but the widths are fixed from line to line.

    Example:
    Code:
    R   1  80541.39366165.3 6.8   2  80552.99366160.7 6.8   3  80564.69366156.1 6.81
    R   4  80576.29366151.5 6.8   5  80587.99366146.9 6.8   6  80599.69366142.4 6.81
    R   7  80611.29366137.8 6.9   8  80622.99366133.2 6.9   9  80634.69366128.7 6.91
    S2604I1-151     11   5229 54323.17S 111244.02E  80284.49366012.1 874.7329 846 7
    The first character denotes the type of data to follow on the rest of the line (in this case S and R).

    The method I have working at the moment is to read the entire file into a char* buffer; this is very fast.
    Code:
    ipf = fopen("myfile", "r");
    
    //determine the size of the file to allocate memory
    fseek(ipf,0,SEEK_END);
    fileSize = ftell(ipf);
    fseek(ipf,0,SEEK_SET);
    
    //now create array to take the input data from file.
    buffer = (char*) malloc(sizeof(char)*fileSize); 
    
    //read content of file to memory
    newFileSize = fread (buffer,1,fileSize,ipf);
    The next step is to work through each line and put the data into the appropriate arrays. If the first character is an S, I extract the characters of the appropriate length, convert them (using the helper functions extract...) to the right data type (int, double, char) and increment the buffer pointer. If the line does not begin with S, I skip to the next line in the buffer.

    Here is an extract of the code.

    Code:
    while (p < newFileSize) {

        if (buffer[0] == 'S') {

            Name[i] = extractChar(buffer+=1,1,12);
            ID1[i] = extractChar(buffer+=15,1,1);
            ID2[i] = extractChar(buffer+=1,1,1);
            Fid[i] = extractChar(buffer+=2,1,6);

            Don[i] = extractInt(buffer+=6,1,2);
            Mon[i] = extractInt(buffer+=2,1,2);
            Sos[i] = extractDoub(buffer+=2,1,5);
            n_s[i] = extractChar(buffer+=5,1,1);

            etc...
            i++;
        } else { // skip to next line

            buffer += 81;
        }

        p = p + 81;
    }
    As I said, the initial read of the file into the buffer is very quick; the slow part is iterating over the buffer to extract the data into the individual arrays. Is there a quicker and/or more efficient way to do this?

    As a benchmark, I have an 85 MB file with about 1.1 million lines, and it takes about 15 seconds to process. This is not too bad, but it would be nice to speed it up if possible.

    Thanks,

    Pete.

  2. #2
    Super Moderator Roxoff's Avatar
    Join Date
    Aug 2005
    Location
    Nottingham, England
    Posts
    3,906
    It looks pretty quick as it is. My suggestions would be to avoid copying strings in memory (extractChar may do that), and not to process the doubles (extractDoub) as double-precision floating point: read each one as a fixed-point number, or even as a string, and convert it only when you need to use it.
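As a rough sketch of the fixed-point idea: the field can be parsed straight out of the buffer into a scaled integer, with no temporary string and no floating-point work. The name parseFixed and its parameters are made up for illustration (they are not part of the code in the question), and the sketch assumes well-formed fields: optional leading blanks, an optional sign, digits, and at most one decimal point.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: read `width` characters starting at `field` and return
// the value scaled by 10^decimals as an integer, e.g. "54323.17" with
// decimals = 2 gives 5432317. Leading blanks are skipped, and digits beyond
// `decimals` places are truncated. Nothing is copied.
std::int64_t parseFixed(const char* field, std::size_t width, int decimals)
{
    assert(field != nullptr && decimals >= 0);

    std::size_t i = 0;
    while (i < width && field[i] == ' ') ++i;        // skip leading blanks

    bool negative = false;
    if (i < width && (field[i] == '-' || field[i] == '+')) {
        negative = (field[i] == '-');
        ++i;
    }

    std::int64_t value = 0;
    int fracDigits = -1;                             // -1: no decimal point seen yet
    for (; i < width; ++i) {
        const char c = field[i];
        if (c == '.' && fracDigits < 0) {
            fracDigits = 0;
        } else if (c >= '0' && c <= '9') {
            if (fracDigits >= 0) {
                if (fracDigits == decimals) continue; // drop extra precision
                ++fracDigits;
            }
            value = value * 10 + (c - '0');
        } else {
            break;                                   // stop at anything unexpected
        }
    }

    // Pad with zeros if the field had fewer fractional digits than requested.
    for (int d = (fracDigits < 0 ? 0 : fracDigits); d < decimals; ++d)
        value *= 10;

    return negative ? -value : value;
}
```

For the sample S line in the question, calling parseFixed on the 8 characters `54323.17` with decimals = 2 gives 5432317, which you can keep as an integer or divide by 100.0 only at the point where a double is really needed.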

    You may even find it easier not to convert any of this file at all, but to have a container C++ class into which the file is loaded. Internally it marks all the lines and prepares to read off values as they're processed, instead of doing all the donkey work when you read the file in. That way, if you have any redundant values in the data, you're only processing the ones that you actually want to use. You could also mark off the string values by setting a pointer to the beginning of the string and inserting a static \0 into the data stream at the end of the string; you would never copy the string value then, only read it off.
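A minimal sketch of such a container, under a few assumptions: the class name LineFile and its member functions are invented for illustration, the file contents are handed in as a std::string, and lines are separated by '\n'. Instead of writing a \0 into the data, this variant returns a std::string_view (C++17), which is a pointer plus a length, so the fields are still never copied.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical container: owns the raw file contents, remembers where each
// line starts, and hands out fields on request without copying them.
class LineFile {
public:
    explicit LineFile(std::string data) : data_(std::move(data))
    {
        // Mark the start of every line once, up front.
        std::size_t pos = 0;
        while (pos < data_.size()) {
            starts_.push_back(pos);
            const std::size_t nl = data_.find('\n', pos);
            if (nl == std::string::npos) break;
            pos = nl + 1;
        }
    }

    std::size_t lineCount() const { return starts_.size(); }

    // View of one whole line, without the trailing newline.
    std::string_view line(std::size_t n) const
    {
        assert(n < starts_.size());
        const std::size_t begin = starts_[n];
        std::size_t end = (n + 1 < starts_.size()) ? starts_[n + 1] : data_.size();
        if (end > begin && data_[end - 1] == '\n') --end;
        return std::string_view(data_).substr(begin, end - begin);
    }

    // First character of a line, i.e. the record type ('S', 'R', ...).
    char type(std::size_t n) const
    {
        const std::string_view l = line(n);
        return l.empty() ? '\0' : l[0];
    }

    // View of `width` characters starting at column `offset` of line `n`.
    // Nothing is converted until the caller decides to convert it.
    std::string_view field(std::size_t n, std::size_t offset, std::size_t width) const
    {
        const std::string_view l = line(n);
        assert(offset <= l.size());
        return l.substr(offset, width);
    }

private:
    std::string data_;
    std::vector<std::size_t> starts_;
};
```

With this shape, skipping the R lines costs one character comparison via type(n), and the S lines only pay for the fields that are actually read. The views stay valid only as long as the LineFile object itself is alive.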

    If you're housing this in a class to contain it, you can set up accessors that read off each datum by line and name, and it could even return pointers to const when you need to access the strings inside it. You may want to consider using a std::vector<unsigned char> to store the data - it's a pretty quick handler to store the data, guarantees its contiguity, and gives you for free some of the quick STL methods for processing it.
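Here is one way the accessors by line and name could look, as a sketch only. RecordStore, FieldSpec and kFields are hypothetical names, and the column offsets in the table are placeholders: the real positions have to come from the definition of the file format. It also assumes every line, including the last one, is exactly 80 characters plus a newline, so that records sit at a fixed stride of 81 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// One entry per named field: where it starts in the line and how wide it is.
struct FieldSpec {
    const char* name;
    std::size_t offset;
    std::size_t width;
};

// Placeholder layout for illustration only, not the real file format.
constexpr FieldSpec kFields[] = {
    {"type", 0, 1},
    {"name", 1, 12},
};

class RecordStore {
public:
    RecordStore(std::vector<unsigned char> bytes, std::size_t stride)
        : bytes_(std::move(bytes)), stride_(stride)
    {
        assert(stride_ > 0);
    }

    // Number of complete records held in the buffer.
    std::size_t size() const { return bytes_.size() / stride_; }

    // Returns a pointer to const into the stored bytes and writes the field
    // length to *width. The field is not copied and not '\0'-terminated.
    // Returns nullptr if the name is not in the table.
    const unsigned char* field(std::size_t line, const char* name,
                               std::size_t* width) const
    {
        assert(line < size() && name != nullptr && width != nullptr);
        for (const FieldSpec& f : kFields) {
            if (std::strcmp(f.name, name) == 0) {
                assert(f.offset + f.width <= stride_);
                *width = f.width;
                return bytes_.data() + line * stride_ + f.offset;
            }
        }
        return nullptr;
    }

private:
    std::vector<unsigned char> bytes_;   // contiguous storage for the whole file
    std::size_t stride_;                 // bytes per record, newline included
};
```

Looking the name up with strcmp on every call is the simple version. If that lookup shows up in profiling, an enum index into the table would avoid the string comparison altogether.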
    Linux user #126863 - see http://linuxcounter.net/
