  1. #1
    Just Joined!
    Join Date: Aug 2010
    Posts: 8

    How do I parse such a complicated file?

    Hi Guys,

    I posted this before, but somehow I guess my post got lost in cyberspace. I'm extremely new to AWK. I have a scenario where I need to extract variables and invoice information into an array. Without further ado, can someone please help point me in the right direction? Here's the input file:

    X, Y, Z Company 50681 08/04/10

    07/01/10 3065 2,782.50 0.00 2,782.50
    07/01/10 3067 984.38 0.00 984.38
    07/01/10 3069 1,007.80 0.00 1,007.80

    4,774.68 0.00 4,774.68


    08/04/10 50681 ******4774.68

    FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----

    X, Y, Z Company
    123 Main St
    Toronto ON M8Y 3H8




    As you can tell, there's quite a bit of textual information and it's quite difficult to grab a dynamic list of invoices. Any ideas?

  2. #2
    Just Joined!
    Join Date: Jul 2010
    Posts: 53

    what are you trying to get from the file - are there multiple invoices in the file and from that you are wanting to get the 'dynamic' list of invoices? are you then wanting to tally up the invoice totals or something?

    if there are multiple invoices in the file, is there any form feed (^L) or marker to help distinguish when one invoice finishes and another begins?

    for each line you can test NF if the number of fields might help narrow down what type of line you're looking at... also you can match a line using regex so perhaps could match the line that appears to have:

    companyname invoice# invoice_date

    if you've matched that line - then are all the following lines invoice detail - until you find the pattern that appears to be invoice total? and then the line that repeats

    invoice_date invoice# ****** total

    with a little more information on what context is available and what you'd want this is reasonably do-able.
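
    A quick way to see what NF offers is to print it next to each line. This is only a sketch: show_field_counts is a made-up helper name, and the sample lines are copied from the first post. On that sample only the detail lines have five fields, while the totals line, the ****** line and the street address all have three, so NF needs a regex test alongside it.

```shell
# Hypothetical helper: print the field count in front of every non-blank
# line read from stdin, so the line types can be compared by eye.
show_field_counts() {
    awk 'NF { print NF ": " $0 }'
}

# Sample lines copied from the original post.
printf '%s\n' \
    'X, Y, Z Company 50681 08/04/10' \
    '07/01/10 3065 2,782.50 0.00 2,782.50' \
    '4,774.68 0.00 4,774.68' \
    '08/04/10 50681 ******4774.68' \
    '123 Main St' | show_field_counts
```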

    if you 'only' wanted the invoice# and amount - then maybe you can select that line if it is the only one that would have "******" - could even use grep -F '******' (the ' ' stop the shell from expanding the asterisks, and -F makes grep treat them as literal characters rather than regex operators) and feed a simpler set of lines into awk.
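
    As a rough sketch of that shortcut, assuming the ****** line always has the invoice number as its second field and the padded amount as its last field (true for the sample, not confirmed for the whole file). list_check_totals is a made-up name. grep is called with -F so the asterisks are matched literally; in a plain regex they are repetition operators and can match far more than intended.

```shell
# Hypothetical helper: read invoice text on stdin and print
# "invoice# amount" for every line that carries the ****** marker.
list_check_totals() {
    grep -F '******' | awk '{
        amount = $NF
        gsub(/\*/, "", amount)    # drop the padding asterisks
        print $2, amount          # invoice number, then amount
    }'
}

# Sample lines copied from the original post.
printf '%s\n' \
    'X, Y, Z Company 50681 08/04/10' \
    '4,774.68 0.00 4,774.68' \
    '08/04/10 50681 ******4774.68' | list_check_totals
```

    On the sample lines this prints the invoice number followed by the amount with the asterisks removed.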

    guess that's not really a 'quick' reply but hope it helps.

  3. #3
    Just Joined!
    Join Date: Apr 2005
    Location: Clinton Township, MI
    Posts: 84

    There are some things you can do.

    You do know, for example, that the company line (X, Y, Z Company ...) has at most commas in its first field, but no /. If you use whitespace as the separator, the date with its / is the last field of that line (the sixth, for this company name). In the detailed invoice lines, however, the first field is the date, the second field (I am guessing) is the invoice number, and so on.

    By searching for formats that you know, you can deduce what you have.

    You may then want to have two things you are looking for: company info., which has no / in the first field, and invoice information, which does have a / in the date.

    Structure your logic on the things you do know. I am assuming that you know how to use numbered fields with Awk, but if not, you really need to do some reading before even attempting to tackle this further. (Note that Perl uses the same techniques to separate fields because some of the syntax in Perl actually comes originally from both Sed and Awk).
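
    A minimal sketch of that rule, assuming dates always look like digits/digits/digits and that detail lines always have five fields (both true for the sample only). classify_by_date_field is a made-up name, and treating the second field as the invoice number follows the guess above.

```shell
# Hypothetical helper: separate company lines from invoice detail lines
# by looking at where the date (the field containing "/") sits.
classify_by_date_field() {
    awk '
    # Invoice detail: the first field is a date and there are five fields.
    $1 ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ && NF == 5 {
        print "invoice line: number " $2 ", amount " $NF
        next
    }
    # Company line: no slash in the first field, a date in the last one.
    $1 !~ /\// && $NF ~ /^[0-9]+\/[0-9]+\/[0-9]+$/ {
        print "company line: invoice " $(NF - 1) ", dated " $NF
    }
    '
}

# Sample lines copied from the original post.
printf '%s\n' \
    'X, Y, Z Company 50681 08/04/10' \
    '07/01/10 3065 2,782.50 0.00 2,782.50' | classify_by_date_field
```

    Lines that fit neither rule (totals, the ****** line, the amount in words, the address) are silently skipped here and would need rules of their own.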

  4. #4
    Just Joined!
    Join Date: Aug 2010
    Posts: 8

    Quote: Originally posted by chaosless
    what are you trying to get from the file - are there multiple invoices in the file and from that you are wanting to get the 'dynamic' list of invoices? are you then wanting to tally up the invoice totals or something?

    I need to grab everything from what I posted. The invoice list, shown here with three entries, is dynamic:

    07/01/10 3065 2,782.50 0.00 2,782.50
    07/01/10 3067 984.38 0.00 984.38
    07/01/10 3069 1,007.80 0.00 1,007.80


    So everything that is listed in my previous message has to be placed in either a variable or an array.


    Variables:
    X, Y, Z Company
    50681
    08/04/10

    Dynamic Array:
    07/01/10 3065 2,782.50 0.00 2,782.50
    07/01/10 3067 984.38 0.00 984.38
    07/01/10 3069 1,007.80 0.00 1,007.80

    More Variables:
    4,774.68
    0.00
    4,774.68
    08/04/10
    50681
    ******4774.68
    FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----
    X, Y, Z Company
    123 Main St
    Toronto
    ON
    M8Y 3H8


    Quote:
    if there are multiple invoices in the file, is there any form feed (^L) or marker to help distinguish when one invoice finishes and another begins?

    No, there are no distinguishing EOR (End of Record) markers. As you mentioned, the only consistent thing is the beginning of each record, which is Company, Inv #, Date. That is what I will use as an EOR marker (as you so correctly indicated).

    Quote:
    with a little more information on what context is available and what you'd want this is reasonably do-able.

    Again, if you could post some AWK code for this, so I can peruse it and educate myself at the same time, it would be greatly appreciated.

    Quote:
    if you 'only' wanted the invoice# and amount - then maybe you can select that line if it is the only one that would have "******" - could even grep for '******' (note they are wrapped in ' ' to avoid shell expansion) and feed a simpler set of lines into awk.

    guess that's not really a 'quick' reply but hope it helps.

    No, I require everything to be dumped into variables and then I will manipulate them as I see fit. (The variable list is indicated above).

    I appreciate everyone helping out with this. It seems as though there's a lot of valuable information and bright individuals ready to help. I honestly didn't expect such detailed responses, or responses this soon!

  5. #5
    Linux Guru Rubberman
    Join Date: Apr 2009
    Location: I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
    Posts: 8,974

    Your file has a pseudo-structure as follows:
    Code:
    <Invoice>
        <Invoice_Header>
            <Company_Name>X, Y, Z Company</Company_Name>
            <Invoice_Number>50681</Invoice_Number>
            <Invoice_Date>08/04/10</Invoice_Date>
        </Invoice_Header>
        <Line_Items>
            <Line_Item>
                <Order_Date>07/01/10</Order_Date>
                <Item_Number>3065</Item_Number>
                <Item_Price>2782.50</Item_Price>
                <Item_Discount>0.00</Item_Discount>
                <Item_Total>2782.50</Item_Total>
            </Line_Item>
            .
            .
            .
        </Line_Items>
        <Sub_Totals>
            <Sub_Total>4774.68</Sub_Total>
            <Total_Discount>0.00</Total_Discount>
            <Final_Sub_Total>4774.68</Final_Sub_Total>
        </Sub_Totals>
        <Invoice_Totals>
            <Invoice_Date>08/04/10</Invoice_Date>
            <Invoice_Number>50681</Invoice_Number>
            <Invoice_Total>4774.68</Invoice_Total>
        </Invoice_Totals>
        <Invoice_Total_Text>FOUR THOUSAND SEVEN...</Invoice_Total_Text>
        <Customer_Address>
            X, Y, Z Company
            123 Main St
            Toronto ON M8Y 3H8
        </Customer_Address>
    </Invoice>
    Since each major section of the invoice is separated by one or more blank lines, determining the structure isn't hard, assuming that all the invoices in your file are consistent in that structure. That being the case, you should be able to extract the data without much difficulty. You might be able to script it with something like perl or python, though personally I would probably write a quick-and-dirty C program to do it.

    FWIW, I used XML in the example above to illustrate how the invoices are logically broken down, as per your original example.
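
    For what it's worth, awk can also read blank-line-separated blocks directly (so this swaps in awk for the perl, python or C mentioned above). Setting RS to the empty string puts awk in paragraph mode, where one or more empty lines end a record. The sketch below assumes every invoice has exactly six blocks in the order of the original example, and that the separator lines are truly empty (no stray spaces). label_blocks is a made-up name and the labels are borrowed from the XML outline.

```shell
# Hypothetical helper: label each line with the logical section it
# belongs to, assuming six blank-line-separated blocks per invoice.
label_blocks() {
    awk '
    BEGIN {
        RS = ""      # paragraph mode: blank lines separate records
        FS = "\n"    # each line of a block becomes one field
        split("Invoice_Header Line_Items Sub_Totals Invoice_Totals Invoice_Total_Text Customer_Address", tag, " ")
    }
    {
        k = (NR - 1) % 6 + 1    # position of this block inside its invoice
        for (i = 1; i <= NF; i++)
            print tag[k] ": " $i
    }
    '
}

# Shortened sample in the layout of the original post.
printf '%s\n' \
    'X, Y, Z Company 50681 08/04/10' '' \
    '07/01/10 3065 2,782.50 0.00 2,782.50' \
    '07/01/10 3067 984.38 0.00 984.38' '' \
    '4,774.68 0.00 4,774.68' '' '' \
    '08/04/10 50681 ******4774.68' '' \
    'FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----' '' \
    'X, Y, Z Company' '123 Main St' | label_blocks
```

    If any invoice is missing a block, the modulo counter drifts and every later label is wrong, so this only holds while the structure is consistent.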
    Sometimes, real fast is almost as good as real time.
    Just remember, Semper Gumbi - always be flexible!

  6. #6
    Just Joined!
    Join Date: Jul 2010
    Posts: 53

    here is a quick awk script that will pick apart the beginning of the invoice and the detail lines - according to what you've posted...

    if saved as f.awk you can process your file (for example f.dat) as:

    awk -f f.awk f.dat

    Code:
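    # only look at lines that contain "text, whitespace, digits, whitespace"
    /[[:alnum:], ]+[[:space:]]+[[:digit:]]+[[:space:]]+/ {
      if(split($NF,d,"/")==3)       # header: last field is a date (mm/dd/yy)
        printf("invoice# %s dated %s\n",$(NF-1),$NF);
      else if(NF==5)                # detail: date, number and three amounts
        printf("detail: %s\n",$0);
    }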
    should get you going...
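
    To go one step further and keep the pieces in variables and an array, something along these lines might work. It is a sketch under assumptions, not a finished parser: dump_invoices and all variable names are made up, a header line is taken to be "anything, all-digit invoice#, date", and each new header closes the previous invoice. The totals line, the amount in words and the address are left out and would need similar rules.

```shell
# Hypothetical helper: collect header fields, detail lines and the
# ****** amount per invoice, then print them as name=value pairs.
dump_invoices() {
    awk '
    # Print what was collected for the current invoice, then reset.
    function show(    i) {
        if (inv_no == "") return
        print "company=" company
        print "invoice_no=" inv_no
        print "invoice_date=" inv_date
        for (i = 1; i <= n; i++) print "detail[" i "]=" detail[i]
        print "check_amount=" check_amount
        inv_no = ""; check_amount = ""; n = 0
    }
    # Header line: it also marks the end of the previous invoice.
    NF >= 3 && $(NF - 1) ~ /^[0-9]+$/ && split($NF, d, "/") == 3 {
        show()
        inv_no = $(NF - 1); inv_date = $NF
        company = $0
        sub(/[ \t]+[^ \t]+[ \t]+[^ \t]+[ \t]*$/, "", company)  # drop last two fields
        sub(/^[ \t]+/, "", company)                            # drop indentation
        next
    }
    # Detail line: five fields, the first one a date.
    NF == 5 && $1 ~ /\// { detail[++n] = $0; next }
    # ****** line: keep the amount without the padding asterisks.
    $NF ~ /^\*+[0-9.,]+$/ { check_amount = $NF; gsub(/\*/, "", check_amount) }
    END { show() }
    '
}

# Sample lines copied from the original post.
printf '%s\n' \
    'X, Y, Z Company 50681 08/04/10' '' \
    '07/01/10 3065 2,782.50 0.00 2,782.50' \
    '07/01/10 3067 984.38 0.00 984.38' '' \
    '08/04/10 50681 ******4774.68' | dump_invoices
```

    Inside the awk program the values stay available in company, inv_no, inv_date, detail[1..n] and check_amount until the next header arrives, so any further processing would go where show() prints them.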
