  1. #1
    Just Joined!
    Join Date
    Jun 2011
    Posts
    17

    bottle-necking?


    Hi, I have a script that computes a 5-period moving average on a csv data file with over 2 million lines. The script works fine, but I need it to go much faster. A friend of mine called my problem bottle-necking. How can I fix this?

  2. #2
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    With that many records, I would seriously consider a database.
    Scratch that, I would insist on one.

    Not only will you gain structure in the data, but also a way of performing arbitrary queries.
    Plus, databases are well established, can be backed up, etc.


    If you still want to work with csv:
    It depends on what that script is doing and what kind of load it creates.
    Is it IO, network and/or cpu bound?
    Last edited by Irithori; 12-08-2011 at 04:58 PM.
    You must always face the curtain with a bow.
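
One rough way to answer the IO-versus-CPU question is to compare wall-clock time with CPU time for a run. The sketch below assumes bash is installed; `profile_cmd` is a hypothetical helper name, not something from this thread. If "real" is much larger than "user" plus "sys", the command is mostly waiting (disk, network); if they are close, it is CPU-bound. A high "sys" share often points at process creation or system-call overhead.

```shell
# profile_cmd (hypothetical helper): run a command under bash's `time` keyword
# and print only the real/user/sys summary, which bash writes to stderr.
profile_cmd() {
  bash -c 'TIMEFORMAT="real %R  user %U  sys %S"; time "$@" >/dev/null 2>&1' _ "$@"
}
```

Possible usage, with an example file name: `profile_cmd ./myscript.sh data.csv`.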

  3. #3
    Just Joined!
    Join Date
    Jun 2011
    Posts
    17
    here is the script itself..

    PHP Code:
    z=1
    wc=`cat ~/Forex/USDJPY1.csv | wc -l`
    while [ $z -le $wc ]
    do
    x=$z
    y=$(($x + 4))
    comp=0
    while [ $x -le $y ]
    do
    close=`cat ~/Forex/USDJPY1.csv | head -$x | tail -1 | cut -d "," -f 6 | sed -e "s/\.//g"`
    comp=`echo "$comp + $close" | bc`
    diff=`echo "$x - $y" | bc`
    if [ $diff -eq 0 ]
    then
    echo "$comp / 5" | bc
    fi
    x=$(($x + 1))
    done
    z=$(($z + 1))
    done

  5. #4
    Just Joined!
    Join Date
    Jun 2011
    Posts
    17
    bumpage le bumpster

  6. #5
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    Patience is a virtue

    Don't take it the wrong way, but your script is a trainwreck.

    Apart from multiple style violations, hardcoded values and needless calls of cat (to name just a few),
    it has a serious flaw: for the mentioned 2 million line csv file, it will read that csv file 2 million * 5 = 10 MILLION times.
    No wonder it's slow.


    Also, I couldn't figure out what you intend by removing the dots ( sed -e "s/\.//g" ),
    but this might be due to the lack of example data.


    Anyway, the following script is a reimplementation and may serve as a base for your further development:
    - It silently skips all lines that don't match the regex. This is a point for improvement, e.g. by issuing a warning.
    - It has some basic error checking, but there is room for more.
    - It needs to be called with the csv file as its argument.
    - It reads the csv file only once.

    Code:
    #!/usr/bin/env bash
    
    ## Initialize
    average_over=5
    delimiter=","
    field=6
    declare -a collector
    if [ ! -s "$1" ]; then
      echo "Error: Input csv file not found or empty"
      exit 1
    fi
    input_csv=$1
    
    ## Main
    echo "Moving average over $average_over lines:"
    for linevalue in $(cut -d "$delimiter" -f "$field" "$input_csv" | grep -E '^([0-9]*(\.[0-9]*)?|(0*)?\.[0-9]*)$' ); do
      # add to array until $average_over is reached, else pop+shift array
      if [ ${#collector[@]} -lt $average_over ]; then
        collector=("${collector[@]}" "$linevalue")
      else
        unset collector[0]
        collector=("${collector[@]}" "$linevalue")
      fi
    
      if [ ${#collector[@]} -eq $average_over ]; then
        sum=0
        array_max=$(expr ${#collector[@]} - 1)
        for (( i=0; i<=$array_max; i++ )); do
          sum=$(echo "$sum + ${collector[$i]}" | bc)
        done
        echo "scale=4; $sum / ${#collector[@]}" | bc
      fi
    done
    Hmm interesting: The purpose seems to be financial data processing.
    Probably some extra care is needed.
    Last edited by Irithori; 12-12-2011 at 12:17 AM.
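
The script above reads the file once, but it still starts bc several times for every output line, which adds up over 2 million rows. As a possible next step, here is a minimal sketch that keeps the whole computation inside a single awk process, using a ring buffer and a running sum. `moving_average` is a hypothetical helper name, and the number filter is a simplified stand-in for the regex above, not an exact copy of it.

```shell
#!/usr/bin/env bash
# moving_average (hypothetical helper): moving average of one CSV column
# in a single awk pass. Arguments: file [field=6] [window=5]
moving_average() {
  local file=$1 field=${2:-6} window=${3:-5}
  awk -F',' -v f="$field" -v n="$window" '
    $f ~ /^[0-9]*\.?[0-9]+$/ {      # skip rows whose field is not a plain number
      count++
      slot = count % n              # ring-buffer slot that this value overwrites
      sum += $f - buf[slot]         # add the newest value, drop the one leaving the window
      buf[slot] = $f
      if (count >= n) printf "%.4f\n", sum / n
    }' "$file"
}
```

Note that awk uses floating point, so a running sum over millions of rows can drift slightly; for financial data that needs exact decimals, this is an assumption worth checking before relying on it.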

  7. #6
    Just Joined!
    Join Date
    Jun 2011
    Posts
    17
    The script you gave me has an error on line 19: it says it's expecting "fi", but you have an else following that line, followed by the fi. I don't get it. Also, your script will keep me busy for a while until I figure out what everything means once I get it working. I can give you a sample of a few lines from the csv file. It's historical forex data, with the time, open, high, low, close, and volume. The close is the 6th delimited field. I tried using sh "your script" and ./"your script", but I'm not sure how to pass the argument. Here's a sample of the data file:

    PHP Code:
    2005.01.17,05:16,102.060,102.070,102.060,102.070,7
    2005.01.17,05:17,102.060,102.060,102.060,102.060,4
    2005.01.17,05:18,102.070,102.070,102.050,102.060,6
    2005.01.17,05:19,102.050,102.060,102.050,102.060,5
    2005.01.17,05:20,102.050,102.060,102.050,102.050,17
    2005.01.17,05:21,102.050,102.060,102.050,102.060,5
    2005.01.17,05:22,102.060,102.080,102.060,102.070,9
    2005.01.17,05:23,102.070,102.070,102.070,102.070,2
    2005.01.17,05:24,102.060,102.060,102.060,102.060,3
    2005.01.17,05:25,102.050,102.070,102.050,102.060,10
    2005.01.17,05:26,102.070,102.070,102.060,102.060,8
    2005.01.17,05:27,102.060,102.060,102.050,102.050,5
    2005.01.17,05:28,102.050,102.060,102.050,102.060,
    i removed the "." from the number because echo "num / 1" didn't include the decimal. i figure i'll just add the decimal after the calculations take place

    thank you for your help with this, i really appreciate it
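
The "expecting fi" error on line 19 is consistent with running the script through `sh`: line 19 is an array assignment, arrays are a bash feature, and on many systems `sh` is a smaller POSIX shell (dash, for example) that rejects the `(`. This is an assumption about the poster's system, not something confirmed in the thread. A minimal sketch of the difference, assuming bash is installed:

```shell
# The same array syntax the script uses on its line 19.
array_demo='collector=(1 2 3); collector=("${collector[@]}" 4); echo "${#collector[@]}"'

# Under bash this works and prints the number of elements.
bash -c "$array_demo"

# Under a plain POSIX sh such as dash, the same string is a syntax error:
#   sh -c "$array_demo"
```

So the script should be started as `./movingaverage.sh file.csv` (which honours the `#!/usr/bin/env bash` line) or `bash movingaverage.sh file.csv`, with the csv file name as the only argument; the file names here are examples.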

  8. #7
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    Hmm,
    in the morning and with some fresh coffee I see some points to improve on my version, but it does work:
    Code:
    ./movingaverage.sh testfile.csv 
    Moving average over 5 lines:
    102.0600
    102.0580
    102.0600
    102.0620
    102.0620
    102.0640
    102.0640
    102.0600
    102.0580

  9. #8
    Just Joined!
    Join Date
    Jun 2011
    Posts
    17
    PHP Code:
    sudo chmod 777 ~/improved.sh
    ./improved.sh ~/USDJPY1.csv
    Moving average over 5 lines

    and then nothing. it echoes the first line, but it doesn't process the data. i'm not sure what i did wrong, and now i feel like a nuisance. are you sure there's nothing more to it?

  10. #9
    Trusted Penguin Irithori's Avatar
    Join Date
    May 2009
    Location
    Munich
    Posts
    3,440
    Check permissions/users and/or insert some echos, maybe a
    echo $linevalue
    as first command in the for loop.

    But yes, the script actually does work. Copy&paste error?
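
Along the same lines as the echo suggestion, one quick check is to count how many rows survive the script's `cut | grep` filter. If the count is 0, the loop never runs and only the header line is printed. `count_numeric_fields` is a hypothetical helper name; the regex is copied from the script above.

```shell
# count_numeric_fields (hypothetical helper): count rows whose chosen field
# passes the same filter the script uses. Arguments: file [field=6] [delimiter=,]
count_numeric_fields() {
  cut -d "${3:-,}" -f "${2:-6}" "$1" |
    grep -cE '^([0-9]*(\.[0-9]*)?|(0*)?\.[0-9]*)$'
}
```

A result of 0 on a file that clearly has data would suggest a different delimiter or field position than the script assumes.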
