- 08-09-2010 #1Just Joined!
- Join Date
- Aug 2010
- Posts
- 8
How do I parse such a complicated file?
Hi Guys,
I posted this before, but somehow I guess my post got lost in cyberspace.
I'm extremely new to AWK. I have a scenario where I need to extract variables and invoice information into an array. Without further ado, can someone please help point me in the right direction? Here's the input file:
X, Y, Z Company 50681 08/04/10
07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80
4,774.68 0.00 4,774.68
08/04/10 50681 ******4774.68
FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----
X, Y, Z Company
123 Main St
Toronto ON M8Y 3H8
As you can tell, there's quite a bit of textual information and it's quite difficult to grab a dynamic list of invoices. Any ideas?
- 08-09-2010 #2Just Joined!
- Join Date
- Jul 2010
- Posts
- 53
what are you trying to get from the file - are there multiple invoices in the file and from that you are wanting to get the 'dynamic' list of invoices? are you then wanting to tally up the invoice totals or something?
if there are multiple invoices in the file, is there any form feed (^L) or marker to help distinguish when one invoice finishes and another begins?
for each line you can test NF - the number of fields might help narrow down what type of line you're looking at... you can also match a line using a regex, so perhaps you could match the line that appears to have:
companyname invoice# invoice_date
if you've matched that line - then are all the following lines invoice detail - until you find the pattern that appears to be invoice total? and then the line that repeats
invoice_date invoice# ****** total
with a little more information on what context is available and what you'd want this is reasonably do-able.
if you 'only' wanted the invoice# and amount - then maybe you can select that line if it is the only one that would have "******" - could even grep for '******' (note they are wrapped in ' ' to avoid shell expansion) and feed a simpler set of lines into awk.
guess that's not really a 'quick' reply but hope it helps.
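A minimal sketch of the grep suggestion above, assuming the check line always looks like `date number ******amount` and is the only line containing `******`. The function name `check_totals` is made up for illustration. `grep -F` treats the asterisks as a fixed string, so the pattern does not depend on how the regex engine handles a leading `*`.

```shell
# Hypothetical helper: print "number amount" for every check line,
# i.e. every line containing the literal string ******.
# Reads standard input or a single file.
check_totals() {
    # -F makes grep treat ****** as a fixed string, not a regex.
    grep -F '******' "$@" |
    awk '{
        amount = $NF
        gsub(/\*/, "", amount)   # strip the padding asterisks
        print $2, amount
    }'
}
```

Run as `check_totals f.dat`, where f.dat is whatever the input file is called.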
- 08-10-2010 #3Just Joined!
- Join Date
- Apr 2005
- Location
- Clinton Township, MI
- Posts
- 84
There are some things you can do.
You do know, for example, that the X, Y, Z Company line has commas in its leading fields but no / in them. If you use whitespace as the separator, the only / on that line is in the date, which is the last field ($NF). In the detailed invoice lines, however, the first field is the date, the second field (I am guessing) is the invoice number, and so on.
By searching for formats that you know, you can deduce what you have.
You may then want to have two things you are looking for: company info, which has no / in the first field, and invoice information, which does have a / in the first field (the date).
Structure your logic on the things you do know. I am assuming that you know how to use numbered fields with Awk, but if not, you really need to do some reading before even attempting to tackle this further. (Note that Perl uses the same techniques to separate fields because some of the syntax in Perl actually comes originally from both Sed and Awk).
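The slash test described in this post might look roughly like the sketch below. It assumes the layout of the sample file: a header line ends in a mm/dd/yy date, and a detail line starts with one and has exactly five fields. The function name `classify_lines` and the output labels are invented for illustration. The check line (`08/04/10 50681 ******4774.68`) also starts with a date, which is why the field count is tested as well.

```shell
# Hypothetical helper: label header and detail lines, ignore the rest.
classify_lines() {
    awk '
    # Company header: no "/" in field 1, but the last field is a date.
    $1 !~ /\// && $NF ~ /^[0-9][0-9]\/[0-9][0-9]\/[0-9][0-9]$/ {
        print "header: invoice " $(NF-1) " dated " $NF
        next
    }
    # Detail line: field 1 contains "/" and there are exactly five fields.
    $1 ~ /\// && NF == 5 {
        print "detail: invoice " $2 " amount " $5
    }
    ' "$@"
}
```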
- 08-10-2010 #4Just Joined!
- Join Date
- Aug 2010
- Posts
- 8
I need to grab everything from what I posted, the invoice list, represented here (with three entries), is dynamic:
07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80
So everything that is listed in my previous msg has to be placed in either a variable or an array.
Variables:
X, Y, Z Company
50681
08/04/10
Dynamic Array:
07/01/10 3065 2,782.50 0.00 2,782.50
07/01/10 3067 984.38 0.00 984.38
07/01/10 3069 1,007.80 0.00 1,007.80
More Variables:
4,774.68
0.00
4,774.68
08/04/10
50681
******4774.68
FOUR THOUSAND SEVEN HUNDRED SEVENTY-FOUR AND 68/100----
X, Y, Z Company
123 Main St
Toronto
ON
M8Y 3H8
Quote: if there are multiple invoices in the file, is there any form feed (^L) or marker to help distinguish when one invoice finishes and another begins?
No, there are no distinguishing EOR (End of Record) markers. As you mentioned, the only consistent thing is the beginning of each record, which is Company, Inv #, Date. That is what I will use as an EOR marker (as you so correctly indicated).
Quote: with a little more information on what context is available and what you'd want this is reasonably do-able.
Again, if you can post some AWK code for this, so I can peruse it and educate myself at the same time, it would be greatly appreciated.
Quote: if you 'only' wanted the invoice# and amount - then maybe you can select that line if it is the only one that would have "******" - could even grep for '******' (note they are wrapped in ' ' to avoid shell expansion) and feed a simpler set of lines into awk. guess that's not really a 'quick' reply but hope it helps.
No, I require everything to be dumped into variables and then I will manipulate them as I see fit. (The variable list is indicated above.)
I appreciate everyone helping out with this; it seems as though there's a lot of valuable information and bright individuals ready to help. I honestly didn't expect such detailed responses, or responses this soon!
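One way the variable list in this post could be collected, as a rough sketch only. It assumes every record starts with a "company number date" line (used here as the record marker, since there is no EOR), that detail lines have five fields starting with a date, and that the totals line has three amounts. The name `parse_invoices`, the `DATE` variable and the `name=value` output format are all invented for illustration; the amount-in-words and address lines are skipped to keep it short.

```shell
# Hypothetical helper: dump each record as name=value lines.
# Assumes every record starts with "<company> <number> <mm/dd/yy>".
parse_invoices() {
    awk -v DATE='^[0-9][0-9]/[0-9][0-9]/[0-9][0-9]$' '
    # Print what was collected for the current record, then reset.
    function dump(   i) {
        if (company == "") return
        print "company=" company
        print "number=" number
        print "date=" date
        for (i = 1; i <= n; i++) print "item[" i "]=" item[i]
        print "totals=" totals
        print "check=" check
        company = ""; n = 0; totals = ""; check = ""
    }
    # Record start: the last field is a date but the first is not.
    NF >= 3 && $NF ~ DATE && $1 !~ DATE {
        dump()
        date = $NF; number = $(NF - 1)
        company = $1
        for (i = 2; i <= NF - 2; i++) company = company " " $i
        next
    }
    company == "" { next }                          # nothing before first header
    NF == 5 && $1 ~ DATE { item[++n] = $0; next }   # dynamic invoice list
    NF == 3 && $1 ~ /^[0-9,]+\.[0-9][0-9]$/ { totals = $0; next }
    $NF ~ /^\*+[0-9.,]+$/ { check = $NF; gsub(/\*/, "", check); next }
    END { dump() }
    ' "$@"
}
```

The item[] array grows with however many detail lines a record has, and the next header line (or end of file) triggers the dump, so no explicit end-of-record marker is needed.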
- 08-10-2010 #5Linux Guru
- Join Date
- Apr 2009
- Location
- I can be found either 40 miles west of Chicago, or in a galaxy far, far away.
- Posts
- 8,974
Your file has a pseudo-structure as follows:
Code:
<Invoice>
  <Invoice_Header>
    <Company_Name>X, Y, Z Company</Company_Name>
    <Invoice_Number>50681</Invoice_Number>
    <Invoice_Date>08/04/10</Invoice_Date>
  </Invoice_Header>
  <Line_Items>
    <Line_Item>
      <Order_Date>07/01/10</Order_Date>
      <Item_Number>3065</Item_Number>
      <Item_Price>2782.50</Item_Price>
      <Item_Discount>0.00</Item_Discount>
      <Item_Total>2782.50</Item_Total>
    </Line_Item>
    . . .
  </Line_Items>
  <Sub_Totals>
    <Sub_Total>4774.68</Sub_Total>
    <Total_Discount>0.00</Total_Discount>
    <Final_Sub_Total>4774.68</Final_Sub_Total>
  </Sub_Totals>
  <Invoice_Totals>
    <Invoice_Date>08/04/10</Invoice_Date>
    <Invoice_Number>50681</Invoice_Number>
    <Invoice_Total>4774.68</Invoice_Total>
  </Invoice_Totals>
  <Invoice_Total_Text>FOUR THOUSAND SEVEN...</Invoice_Total_Text>
  <Customer_Address>
    X, Y, Z Company
    123 Main St
    Toronto ON M8Y 3H8
  </Customer_Address>
</Invoice>
Since each major section of the invoice is separated by one or more blank lines, determining the structure isn't hard, assuming that all the invoices in your file are consistent in that structure. That being the case, you should be able to extract the data without much difficulty. You might be able to script it with something like perl or python, though personally I would probably write a quick-and-dirty C program to do it.
FWIW, I used XML in the example above to illustrate how the invoices are logically broken down, as per your original example.
Sometimes, real fast is almost as good as real time.
Just remember, Semper Gumbi - always be flexible!
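Since the original question was about awk, here is a rough sketch of how part of the XML breakdown in this post could be generated with it. It covers only the invoice number, date and line items, and flattens the header wrapper. The function names `invoice_xml`, `tag` and `finish` are made up, the tag names are borrowed from the post, and no XML escaping is done (the company name is left out for that reason). Commas are stripped from the amounts to match the XML shown.

```shell
# Hypothetical helper: emit a simplified <Invoice> element per record.
invoice_xml() {
    awk -v DATE='^[0-9][0-9]/[0-9][0-9]/[0-9][0-9]$' '
    function tag(name, value) { return "<" name ">" value "</" name ">" }
    # Close the invoice currently being written, if there is one.
    function finish() {
        if (in_invoice) { print "  </Line_Items>"; print "</Invoice>" }
        in_invoice = 0
    }
    # Header line: last field is a date, first field is not.
    NF >= 3 && $NF ~ DATE && $1 !~ DATE {
        finish()
        print "<Invoice>"
        print "  " tag("Invoice_Number", $(NF - 1))
        print "  " tag("Invoice_Date", $NF)
        print "  <Line_Items>"
        in_invoice = 1
        next
    }
    # Detail line: five fields, the first one a date.
    in_invoice && NF == 5 && $1 ~ DATE {
        gsub(/,/, "", $3); gsub(/,/, "", $5)   # 2,782.50 becomes 2782.50
        print "    <Line_Item>" tag("Order_Date", $1) tag("Item_Number", $2) \
              tag("Item_Price", $3) tag("Item_Discount", $4) \
              tag("Item_Total", $5) "</Line_Item>"
    }
    END { finish() }
    ' "$@"
}
```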
- 08-10-2010 #6Just Joined!
- Join Date
- Jul 2010
- Posts
- 53
here is a quick awk script that will pick apart the beginning of the invoice and the detail lines - according to what you've posted...
Code:
# select lines that contain: text, whitespace, a number, whitespace
# header line: last field splits into 3 parts on "/" (a date like 08/04/10)
# detail line: exactly five fields
/[[:alnum:], ]+[[:space:]]+[[:digit:]]+[[:space:]]+/ {
    if (split($NF, d, "/") == 3)
        printf("invoice# %s dated %s\n", $(NF-1), $NF);
    else if (NF == 5)
        printf("detail: %s\n", $0);
}
if saved as f.awk you can process your file (for example f.dat) as:
awk -f f.awk f.dat
should get you going...


