- 07-27-2010 #1 Just Joined!
- Join Date
- Jul 2010
- Posts
- 16
Linux scripting doubts
Hi All,
We have almost 45,000 data files created daily by a script. The file names follow the format ODS.POS.<pharmacyid>.<table name>.<timestamp>.dat. There is one data file like this for each pharmacy and each table (around 45,000 in total).
The requirement is to create a control file for each pharmacy id containing the file name, number of rows, table name, timestamp and MD5 value of each of its data files.
The data files look like this:
ODS.POS.ABC89.ADT_LOG.07272010_033303.dat
ODS.POS.AEC12.ADT_LOG.07272010_033303.dat
ODS.POS.ABC78.ADT_LOG.07272010_033303.dat
ODS.POS.AEC12.TR_ITM_CPN_TND.07272010_033303.dat
ODS.POS.AEC13.TR_ITM_CPN_TND.07272010_033303.dat
ODS.POS.ABC89.TR_ITM_CPN_TND.07272010_033303.dat
ODS.POS.ABC78.TR_ITM_CPN_TND.07272010_033303.dat
There should be one control file per pharmacy id. For example:
Control file 1 -> for pharmacy id ABC89
Control file 1 should contain the file name, number of rows, table name, timestamp and MD5 value for the files below.
ODS.POS.ABC89.ADT_LOG.07272010_033303.dat
ODS.POS.ABC89.TR_ITM_CPN_TND.07272010_033303.dat
Control file 2 -> for pharmacy id AEC12
Control file 2 should contain the file name, number of rows, table name, timestamp and MD5 value for the files below.
ODS.POS.AEC12.ADT_LOG.07272010_033303.dat
ODS.POS.AEC12.TR_ITM_CPN_TND.07272010_033303.dat
Can somebody help with this?
Thanks
Maya
- 07-27-2010 #2 Linux User
- Join Date
- Aug 2006
- Posts
- 458
Code: awk -F"." 'NF{print $0 > ($3 ".dat")}' file
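For anyone who wants the grouping step spelled out, here is a rough Python sketch of the same idea: split each name on "." and bucket it by the third field. The function name group_by_pharmacy and the six-field check are my own assumptions, not something from the thread; the sample names are taken from the first post.

```python
from collections import defaultdict


def group_by_pharmacy(names):
    """Bucket data-file names by their third dot-separated field (the pharmacy id)."""
    groups = defaultdict(list)
    for name in names:
        fields = name.split(".")
        # Assumed layout: ODS.POS.<pharmacyid>.<table name>.<timestamp>.dat
        if len(fields) != 6:
            continue  # skip anything that does not match the layout
        groups[fields[2]].append(name)
    return dict(groups)


sample = [
    "ODS.POS.ABC89.ADT_LOG.07272010_033303.dat",
    "ODS.POS.AEC12.ADT_LOG.07272010_033303.dat",
    "ODS.POS.ABC89.TR_ITM_CPN_TND.07272010_033303.dat",
]

for pharmacy_id, files in sorted(group_by_pharmacy(sample).items()):
    print(pharmacy_id, files)
```

Each key of the returned dict is one pharmacy id, so it maps directly onto "one control file per pharmacy".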
- 07-28-2010 #3 Just Joined!
- Join Date
- Feb 2009
- Posts
- 22
Hopefully this helps you meet your objective. This is a Python script that generates a small bash script, which actually executes the commands. The parent Python script does all the string crunching and then calls the child bash script to collect the number of rows, the md5sum and so on, and append them to the control file.
#!/usr/bin/env python
import os, sys

prefix = "ODS.POS."
md5SumBinary = "md5sum"
tmpFile = "tmp.txt"
cmdScript = "./cmdScript.sh"

# write the helper bash script; it is called as:
#   cmdScript.sh <data file> <control file> <table name> <timestamp>
def genCmdScript():
    os.system("echo '#!/bin/bash' > %s" % cmdScript)
    os.system("echo 'echo ======================================== >> $2' >> %s" % cmdScript)
    os.system("echo 'echo file: $1 >> $2' >> %s" % cmdScript)
    os.system("echo 'echo -ne NumRows: >> $2' >> %s" % cmdScript)
    os.system("echo 'cat $1 | wc -l >> $2' >> %s" % cmdScript)
    os.system("echo 'echo Table : $3 >> $2' >> %s" % cmdScript)
    os.system("echo 'echo Time : $4 >> $2' >> %s" % cmdScript)
    os.system("echo 'md5sum $1 >> $2' >> %s" % cmdScript)
    os.system("echo 'echo ---------------------------------------- >> $2' >> %s" % cmdScript)
    os.system("chmod +x %s" % cmdScript)

genCmdScript()

def genCtrlFile(inputFile, ctrlFile, tableName, timeStamp):
    cmd = "%s %s %s %s %s" % (cmdScript, inputFile.rstrip("\n"), ctrlFile, tableName, timeStamp)
    print(cmd)
    os.system(cmd)

# get the list of files in an array
# (the pattern is quoted so that find, not the shell, expands it)
os.system("find %s -name '%s*.dat' > %s" % (sys.argv[1], prefix, tmpFile))
f = open(tmpFile, "r")
files = f.readlines()
f.close()

def dispFileList():
    print("")
    if files:
        for i in range(0, len(files)):
            print(files[i])
    else:
        sys.exit()

# make a unique list of pharmacy ids
pharmacyid = []
for i in range(0, len(files)):
    pharmacyid.append(files[i].split(prefix, 2)[1].split(".", 2)[0])
pharmacyid = list(set(pharmacyid))
#print(pharmacyid)
#dispFileList()

for phId in range(0, len(pharmacyid)):
    # match on "ODS.POS.<id>." so that one id cannot match inside a longer id
    marker = prefix + pharmacyid[phId] + "."
    for file in range(0, len(files)):
        if files[file].find(marker) != -1:
            # the part after the marker is <table name>.<timestamp>.dat
            rest = files[file].split(marker, 2)[1].split(".", 2)
            tableName = rest[0]
            timeStamp = rest[1]
            genCtrlFile(files[file], pharmacyid[phId], tableName, timeStamp)
Just copy the code into a file, say cfgen.py, in any convenient directory. To invoke it, run:
python cfgen.py <path to .dat files>
example: python cfgen.py tmp/
The control files will be placed in the present working directory.
NOTE: it's not thoroughly tested. Please exercise caution when executing it for the first time.
The forum preview may strip the indentation, so I have attached the script as cfgen.py.txt. Just rename it to cfgen.py before executing; running it as cfgen.py.txt should also be fine.
Hope this helps.
- 07-28-2010 #4 Linux User
- Join Date
- Aug 2006
- Posts
- 458
Gosh, if you are using Python, then do everything in Python. There's no need to call system() commands!
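To make that concrete, here is a minimal sketch of doing the whole job in Python with only the standard library (glob, hashlib, os) and no shell calls. The function names, the ".ctl" extension and the comma-separated line layout are assumptions of mine for illustration; adjust them to whatever the control file is really supposed to look like.

```python
import glob
import hashlib
import os


def md5_and_rows(path):
    """Return (md5 hex digest, newline count) for one file, read in chunks."""
    digest = hashlib.md5()
    rows = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
            rows += chunk.count(b"\n")  # counts newlines, like `wc -l`
    return digest.hexdigest(), rows


def write_control_files(src_dir, dest_dir):
    """Write <dest_dir>/<pharmacyid>.ctl with one line per data file.

    Assumed line layout: filename,rows,table,timestamp,md5
    """
    lines = {}
    for path in sorted(glob.glob(os.path.join(src_dir, "ODS.POS.*.dat"))):
        name = os.path.basename(path)
        fields = name.split(".")
        # Assumed layout: ODS.POS.<pharmacyid>.<table name>.<timestamp>.dat
        if len(fields) != 6:
            continue
        pharmacy_id, table, stamp = fields[2], fields[3], fields[4]
        md5, rows = md5_and_rows(path)
        lines.setdefault(pharmacy_id, []).append(
            "%s,%d,%s,%s,%s" % (name, rows, table, stamp, md5))
    for pharmacy_id, entries in lines.items():
        ctl_path = os.path.join(dest_dir, pharmacy_id + ".ctl")
        with open(ctl_path, "w") as out:
            out.write("\n".join(entries) + "\n")
    return sorted(lines)  # the pharmacy ids that got a control file
```

Usage would be something like write_control_files("/path/to/data", "/tmp"). Every data file is read exactly once, which matters with around 45,000 files a day.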
- 07-29-2010 #5 Just Joined!
- Join Date
- Feb 2009
- Posts
- 22
- 07-29-2010 #6 Linux User
- Join Date
- Aug 2006
- Posts
- 458
- 07-30-2010 #7 Just Joined!
- Join Date
- Jul 2010
- Posts
- 16
Thank you. I have not tested this yet. I will test it and get back.
Thanks again.
Maya
- 07-30-2010 #8 Linux Newbie
- Join Date
- Sep 2004
- Location
- UK
- Posts
- 160
The following should work (I have not tested it). It needs to run in the directory where the data files reside; otherwise change the code to cope with running from another location.
Code:
#!/bin/bash
destDir="/tmp"
for i in `ls ODS.POS.* | cut -d "." -f 3 | sort -u`
do
    controlFile="${destDir}/${i}.ctl"
    # create or empty the control file (echo > would leave a blank first line)
    : > ${controlFile}
    for j in `ls ODS.POS.${i}.*`
    do
        # read from stdin so wc prints only the count, not the file name
        rcount=`wc -l < ${j}`
        tableName=`echo ${j} | cut -d '.' -f 4`
        timestamp=`echo ${j} | cut -d '.' -f 5`
        md5sum=`md5sum ${j} | cut -d " " -f 1`
        echo "${j},${rcount},${tableName},${timestamp},${md5sum}" >> ${controlFile}
    done
done
If the timestamp is the actual file timestamp then use stat, e.g.
timestamp=`stat --format=%Y ${j}`
If you want to set the timestamp of the ctl file to be the same then use touch after the inner loop ends, e.g.
touch -d "@${timestamp}" ${controlFile}
Last edited by blinky; 07-30-2010 at 12:44 PM.
In a world without walls and fences, who needs Windows and Gates?
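For those going the Python route suggested earlier in the thread, the stat and touch steps above have standard-library counterparts, os.stat and os.utime. A small sketch, where copy_mtime is a made-up helper name and not part of any script posted here:

```python
import os


def copy_mtime(data_file, control_file):
    """Give control_file the same modification time as data_file."""
    mtime = os.stat(data_file).st_mtime
    # os.utime takes (access time, modification time) in seconds since the
    # epoch; both are set to the same value, as `touch -d "@<seconds>"` does.
    os.utime(control_file, (mtime, mtime))
    return int(mtime)  # whole seconds, comparable to `stat --format=%Y`
```

The return value is only there so the caller can also write the timestamp into the control file if wanted.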
- 08-01-2010 #9 Linux User
- Join Date
- Aug 2006
- Posts
- 458
With bash, you can cut down the use of external tools. Also, use shell expansion for globbing files...
Code:
destDir="/tmp"
for j in ODS.POS.*
do
    # split the file name on "." to get the pharmacy id, table name and timestamp
    OIFS="$IFS"
    IFS="."
    set -- $j
    code=$3
    tableName=$4
    timestamp=$5
    IFS="$OIFS"
    controlFile="${destDir}/${code}.ctl"
    # read from stdin so wc prints only the count
    rcount=$(wc -l < "${j}")
    set -- $(md5sum "${j}")
    md5sum=$1
    # this appends, so remove old .ctl files before a rerun
    echo "${j},${rcount},${tableName},${timestamp},${md5sum}" >> "${controlFile}"
done
- 08-01-2010 #10 Linux Newbie
- Join Date
- Sep 2004
- Location
- UK
- Posts
- 160
With the mods from ghostdog74.
I kept "ls ODS.POS.* | cut -d "." -f 3 | sort -u" in the first for loop, as it gives us the unique pharmacy ids.
Code:
#!/bin/bash
destDir="/tmp"
for pharmacyid in `ls ODS.POS.* | cut -d "." -f 3 | sort -u`
do
    controlFile="${destDir}/${pharmacyid}.ctl"
    # create or empty the control file
    : > ${controlFile}
    for dataFile in ODS.POS.${pharmacyid}.*
    do
        OIFS="$IFS"
        IFS="."
        set -- ${dataFile}
        tableName="${4}"
        timestamp="${5}"
        IFS="$OIFS"
        set -- $(wc -l "${dataFile}")
        rcount="$1"
        set -- $(md5sum "${dataFile}")
        md5sum="$1"
        # append, so earlier lines for this pharmacy are not overwritten
        echo "${dataFile},${rcount},${tableName},${timestamp},${md5sum}" >> ${controlFile}
    done
done


