  1. #1

    Linux scripting doubts


    Hi All,

    We have almost 45,000 data files created daily by a script. The file names follow the format ODS.POS.<pharmacyid>.<table name>.<timestamp>.dat. There is one data file for each pharmacy and each table (around 45,000 in total).

    The requirement is to create a control file for each pharmacy id containing the file name, number of rows, table name, timestamp and MD5 value.


    The data files look like this -


    ODS.POS.ABC89.ADT_LOG.07272010_033303.dat
    ODS.POS.AEC12.ADT_LOG.07272010_033303.dat
    ODS.POS.ABC78.ADT_LOG.07272010_033303.dat

    ODS.POS.AEC12.TR_ITM_CPN_TND.07272010_033303.dat
    ODS.POS.AEC13.TR_ITM_CPN_TND.07272010_033303.dat
    ODS.POS.ABC89.TR_ITM_CPN_TND.07272010_033303.dat
    ODS.POS.ABC78.TR_ITM_CPN_TND.07272010_033303.dat
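
Names like the ones above can be split on the dots to recover the pharmacy id, table name and timestamp. A minimal sketch, assuming none of the fields contains a dot itself; the function name `parse_data_file_name` is only illustrative:

```python
def parse_data_file_name(name):
    """Split 'ODS.POS.<pharmacyid>.<table name>.<timestamp>.dat' into its fields."""
    parts = name.split(".")
    # Expect exactly: ODS, POS, pharmacy id, table name, timestamp, dat
    if len(parts) != 6 or parts[:2] != ["ODS", "POS"] or parts[5] != "dat":
        raise ValueError("unexpected file name: %r" % name)
    pharmacy_id, table_name, timestamp = parts[2:5]
    return pharmacy_id, table_name, timestamp


print(parse_data_file_name("ODS.POS.ABC89.ADT_LOG.07272010_033303.dat"))
# prints ('ABC89', 'ADT_LOG', '07272010_033303')
```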


    The requirement is to create a control file for each pharmacyid

    Control file 1 -> for pharmacy id ABC89

    Control file 1 should contain the file name, number of rows, table name, timestamp and MD5 value for the files below.

    ODS.POS.ABC89.ADT_LOG.07272010_033303.dat
    ODS.POS.ABC89.TR_ITM_CPN_TND.07272010_033303.dat


    Control file 2 -> for pharmacy id AEC12

    Control file 2 should contain the file name, number of rows, table name, timestamp and MD5 value for the files below.

    ODS.POS.AEC12.ADT_LOG.07272010_033303.dat
    ODS.POS.AEC12.TR_ITM_CPN_TND.07272010_033303.dat


    Can somebody help with this?

    Thanks
    Maya

  2. #2

  3. #3
    Hopefully this will help you meet your objective. This is a Python script that generates a small bash script, which is what actually executes the commands. The parent Python script does all the string crunching and then calls the child bash script once per data file to get the number of rows, the md5sum etc. and append them to the control file.

    #!/usr/bin/python
    import os, sys
    prefix="ODS.POS."
    md5SumBinary="md5sum"
    tmpFile="tmp.txt"
    cmdScript="./cmdScript.sh"

    # write the helper bash script; its arguments are
    # $1 = data file, $2 = control file, $3 = table name, $4 = timestamp
    def genCmdScript():
        os.system("echo '#!/bin/bash' > %s"%cmdScript)
        os.system("echo 'echo ======================================== >> $2' >> %s"%cmdScript)
        os.system("echo 'echo file: $1 >> $2' >> %s"%cmdScript)
        os.system("echo 'echo -ne NumRows: >> $2' >> %s"%cmdScript)
        os.system("echo 'cat $1 | wc -l >> $2' >> %s"%cmdScript)
        os.system("echo 'echo Table : $3 >> $2' >> %s"%cmdScript)
        os.system("echo 'echo Time : $4 >> $2' >> %s"%cmdScript)
        os.system("echo 'md5sum $1 >> $2' >> %s"%cmdScript)
        os.system("echo 'echo ---------------------------------------- >> $2' >> %s"%cmdScript)
        os.system ("chmod +x %s"%cmdScript)

    genCmdScript ()

    def genCtrlFile(inputFile , ctrlFile, tableName, timeStamp):
        print ("%s %s %s %s %s"%(cmdScript, inputFile.rstrip("\n") , ctrlFile, tableName, timeStamp))
        os.system ("%s %s %s %s %s"%(cmdScript, inputFile.rstrip("\n") , ctrlFile, tableName, timeStamp))


    # quote the pattern so the shell hands it to find unexpanded
    os.system("find %s -name '%s*.dat' > %s"%(sys.argv[1],prefix,tmpFile))

    f = open (tmpFile,"r")
    files=f.readlines()
    f.close()
    #get the list of files in an array
    def dispFileList():
        print ("")
        if (files):
            for i in range (0,len(files)):
                print (files[i])
        else:
            sys.exit()

    #make a list pharmacy ids unique.
    pharmacyid = []
    for i in range (0,len(files)):
        pharmacyid.append(files[i].split(prefix,2)[1].split(".",2)[0])
    pharmacyid=list(set(pharmacyid))

    #print pharmacyid
    #dispFileList()

    for phId in range (0,len(pharmacyid)):
        for file in range (0,len(files)):
            # match "ODS.POS.<id>." rather than the bare id, so that an id which is
            # a substring of another id or of a table name is not picked up
            if (files[file].find(prefix+pharmacyid[phId]+".") != -1 ):
                # print "id: %s is in file: %s"%(pharmacyid[phId],files[file])
                # print files[file].split(prefix+pharmacyid[phId]+".",2)[1].split(".",2)
                tableName=files[file].split(prefix+pharmacyid[phId]+".",2)[1].split(".",2)[0]
                timeStamp=files[file].split(prefix+pharmacyid[phId]+".",2)[1].split(".",2)[1]
                genCtrlFile( files[file],pharmacyid[phId], tableName, timeStamp)

    Just copy the code into a file, say cfgen.py, in any convenient directory. To invoke it, run the command

    python cfgen.py <path to .dat files>
    example: python cfgen.py tmp/

    The control files will be placed in the present working directory, one per pharmacy id, named after the id.

    NOTE: It's not thoroughly tested. Please exercise caution when executing it for the first time.

    In case the indentation does not survive in the post, the script is also attached as cfgen.py.txt. Rename it to cfgen.py before executing (running cfgen.py.txt directly with python should also be fine).

    Hope this helps.

  5. #4
    Gosh, if you are using Python, then do everything in Python. There's no need to call system() commands!
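
To illustrate the point, here is a rough sketch of the whole job in Python alone, with no calls to the shell. It is untested against the real data; the `.ctl` extension, the comma-separated line layout and the function names are assumptions made for the example, not part of the original requirement.

```python
import glob
import hashlib
import os
from collections import defaultdict


def md5_and_rows(path, chunk_size=1 << 20):
    """Return (md5 hex digest, newline count), reading the file in chunks."""
    digest = hashlib.md5()
    rows = 0
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
            rows += chunk.count(b"\n")  # same notion of a row as `wc -l`
    return digest.hexdigest(), rows


def write_control_files(data_dir, dest_dir):
    """Write one <pharmacyid>.ctl per pharmacy; return the sorted list of ids."""
    by_pharmacy = defaultdict(list)
    for path in sorted(glob.glob(os.path.join(data_dir, "ODS.POS.*.dat"))):
        parts = os.path.basename(path).split(".")
        if len(parts) != 6:
            continue  # skip names that do not match the expected layout
        by_pharmacy[parts[2]].append((path, parts[3], parts[4]))

    for pharmacy_id, entries in by_pharmacy.items():
        ctl_path = os.path.join(dest_dir, pharmacy_id + ".ctl")
        with open(ctl_path, "w") as ctl:
            for path, table_name, timestamp in entries:
                md5_hex, rows = md5_and_rows(path)
                # Assumed layout: file name, rows, table, timestamp, md5
                ctl.write("%s,%d,%s,%s,%s\n" % (
                    os.path.basename(path), rows, table_name, timestamp, md5_hex))
    return sorted(by_pharmacy)
```

Calling `write_control_files("tmp/", ".")` would then write the control files into the current directory. Each data file is read only once, which matters when there are around 45,000 of them.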

  6. #5
    Originally posted by ghostdog74:
    Gosh, if you are using Python, then do everything in Python. There's no need to call system() commands!
    Agreed! Pure Python would have been the ideal way to do it, but here the objective was just to meet the requirement. I was having difficulty executing commands like os.system('cat file|wc -l'), so to save time I went that route.

  7. #6
    Originally posted by kapsikum:
    Agreed! Pure Python would have been the ideal way to do it, but here the objective was just to meet the requirement. I was having difficulty executing commands like os.system('cat file|wc -l'), so to save time I went that route.
    With Python, if the file is not too large to read into memory:

    Code:
    thelength=len(open("file").readlines())
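
For files that are too large to hold in memory, a possible variant is to iterate over the file instead of calling readlines(). This is only a sketch, and `count_rows` is a name made up for the example:

```python
def count_rows(path):
    """Count lines by iterating the file, so only one line is held at a time."""
    with open(path, "rb") as handle:
        return sum(1 for _ in handle)
```

Note that, like the readlines() version, this counts a final line that has no trailing newline, whereas `wc -l` counts only newline characters.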

  8. #7
    Thank you. I have not yet tested this. I will test it and get back to you.

    Thanks again.

    Maya

  9. #8
    The following should work (I have not tested it). It needs to run in the directory where the data files reside; otherwise change the code to cope with running from another location.

    Code:
    #!/bin/bash
    
    destDir="/tmp"
    for i in `ls ODS.POS.* | cut -d "." -f 3 | sort -u`
    do
      controlFile="${destDir}/${i}.ctl"
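      : > ${controlFile}   # start with an empty control file (echo would write a blank first line)
      for j in `ls ODS.POS.${i}.*`
      do
         rcount=`wc -l < ${j}`   # read from stdin so wc prints only the count, not the file name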
         tableName=`echo ${j} | cut -d '.' -f 4`
         timestamp=`echo ${j} | cut -d '.' -f 5`
         md5sum=`md5sum ${j} | cut -d " " -f 1`
         
         echo "${j},${rcount},${tableName},${timestamp},${md5sum}" >> ${controlFile}
      done
    done
    If the timestamp should be the file's actual modification time rather than the one in the file name, then use stat

    eg.
    timestamp=`stat --format=%Y ${j}`

    If you want the ctl file to carry the same timestamp, then use touch after the inner loop ends (while still inside the outer loop)

    eg.
    touch -d "@${timestamp}" ${controlFile}
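
For anyone going the Python route discussed earlier in the thread, roughly the same stat/touch step can be sketched with os.stat and os.utime. The helper name `copy_mtime` is made up for the example:

```python
import os


def copy_mtime(source_path, target_path):
    """Set target's access and modification times to source's modification time."""
    # st_mtime is seconds since the epoch, the value `stat --format=%Y` reports
    # (as a float, so it may carry a fractional part)
    mtime = os.stat(source_path).st_mtime
    os.utime(target_path, (mtime, mtime))
    return mtime
```

As with touch, this has to run after the last write to the control file, otherwise the next write resets the modification time.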

  10. #9
    With bash, you can cut down on the use of external tools. Also, use shell globbing to list the files instead of parsing the output of ls...
    Code:
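    destDir="/tmp"
    # One pass over the data files: each line is appended to the control file of
    # that file's pharmacy, so remove old ${destDir}/*.ctl files before a rerun.
    for j in ODS.POS.*.dat
    do
      [ -e "${j}" ] || continue   # the pattern matched nothing
      OIFS="$IFS"
      IFS="."
      set -- $j                   # split the name on dots into $1..$6
      code=$3
      tableName=$4
      timestamp=$5
      IFS="$OIFS"
      controlFile="${destDir}/${code}.ctl"
      rcount=$(wc -l < "${j}")    # stdin, so only the count is printed
      set -- $(md5sum "${j}")     # md5sum prints "<hash>  <file>"
      md5sum=$1
      echo "${j},${rcount},${tableName},${timestamp},${md5sum}" >> "${controlFile}"
    done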

  11. #10
    With the mods from ghostdog74

    I kept "ls ODS.POS.* | cut -d "." -f 3 | sort -u" in the first for loop, as it gives us the unique pharmacy ids.

    Code:
    #!/bin/bash
    
    destDir="/tmp"
    for pharmacyid in `ls ODS.POS.* | cut -d "." -f 3 | sort -u`
    do
      controlFile="${destDir}/${pharmacyid}.ctl"
      : > ${controlFile}   # start with an empty control file
      for dataFile in ODS.POS.${pharmacyid}.*
      do
        
         OIFS="$IFS"
         IFS="."
           set -- ${dataFile}
           tableName="${4}"
           timestamp="${5}"
         IFS="$OIFS"
    
         set -- $(wc -l "${dataFile}")
         rcount="$1"
         set -- $(md5sum "${dataFile}")
         md5sum="$1"
        
         echo "${dataFile},${rcount},${tableName},${timestamp},${md5sum}" >> ${controlFile}   # append, do not overwrite
      done
    done