  1. #1
    Just Joined!
    Join Date
    May 2009
    Posts
    9

    Perl grouping multiple lines in one...

    Hello All,

    I would like some advice on Perl. I have a huge file, more than 2 million lines, with multiple values per line. I wrote a regex to extract the values I need, but the same key appears on multiple lines, each with a different value in one column, and I need to group those values onto one line. Here is an example:

    key 1 price 150 warehouse A
    key 1 price 150 warehouse B
    key 1 price 150 warehouse C
    key 2 price 20 warehouse A
    key 2 price 20 warehouse B
    key 2 price 20 warehouse C
    key 3 price 5 warehouse B
    key 3 price 5 warehouse C

    I would like the output to be:
    key 1 price 150 warehouse (A B C)
    key 2 price 20 warehouse (A B C)
    key 3 price 5 warehouse (B C)

    Any advice? I am new to Perl. I know it might be possible to use a hash for the values, but I am not that experienced!

    By the way, the records might not be sorted by key as they are in the example, and it is a 2-million-line file, so memory and performance need to be considered.

    Regards,
    Ocon

  2. #2
    Just Joined!
    Join Date
    Feb 2010
    Posts
    4
    You can do it using a hash of arrays. Read the file line by line and split each line into the key (key 1 price 150 warehouse) and the value (A). Use the key as the hash key and store the value in an array that is the hash value, so you can push further values such as B and C onto it.

    Code:
        my %hash = ( "key 1 price 150 warehouse" => [ "A", "B", "C" ], );  # and so on for each key
    After reading all the lines print the contents of the hash in a formatted way.
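
    The hash approach described above can be sketched in awk as well, run from a shell. This is a minimal sketch and not the poster's code. It assumes the warehouse code is always the last whitespace-separated field on a line, and the function name group_by_key is made up for illustration. Classic awk has no arrays of arrays, so the values for each key are concatenated into a string rather than pushed onto an array. Every unique key is held in memory, and keys are printed in the order they were first seen.

```shell
#!/usr/bin/env bash

# group_by_key (hypothetical name): read lines on stdin, treat the last
# field as the value and everything before it as the key.
group_by_key() {
  awk '
  NF < 2 { next }                        # skip blank or malformed lines
  {
    # Rebuild the key from all fields except the last one;
    # this also squeezes repeated blanks down to one space.
    key = $1
    for (i = 2; i < NF; i++) key = key " " $i

    if (!(key in vals)) {
      order[++n] = key                   # remember first-seen order
      vals[key] = $NF
    } else {
      vals[key] = vals[key] " " $NF      # append the next value
    }
  }
  END {
    for (i = 1; i <= n; i++)
      printf "%s (%s)\n", order[i], vals[order[i]]
  }'
}
```

    It would be called as group_by_key < yourfile. The input does not need to be sorted, at the cost of keeping one entry per unique key in memory.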

  3. #3
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    By sorting the file first, you may be able to use less memory, but you will need to pay the cost of the sort.

    Here is a sample solution that uses an unsorted, ill-formatted version of your data. The basic idea is to collect data for unique groups. The groups will be created by the sort. Some pre- and post- processing is necessary (as it is in almost every task).

    An awk script is shorter than the Perl equivalent because awk does so much of the basic work for you, but the same idea could be implemented in Perl. If the file is sorted, hashes are not necessary, and the data need not all be kept in memory:
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate collection of unique groups.
    
    # Infrastructure details, environment, commands for forum posts. 
    set +o nounset
    LC_ALL=C ; LANG=C ; export LC_ALL LANG
    echo ; echo "Environment: LC_ALL = $LC_ALL, LANG = $LANG"
    echo "(Versions displayed with local utility \"version\")"
    c=$( ps | grep $$ | awk '{print $NF}' )
    version >/dev/null 2>&1 && s=$(_eat $0 $1) || s=""
    [ "$c" = "$s" ] && p="$s" || p="$c"
    version >/dev/null 2>&1 && version "=o" $p sort awk
    set -o nounset
    echo
    
    FILE=${1-data1}
    
    echo " Data file $FILE:"
    cat "$FILE"
    
    echo
    echo " Collection script in awk:"
    echo +++++ begin
    cat collect
    echo +++++ end
    
    # Standardize token separation to 1 space,
    # sort lines,
    # split into 2 fields, separated by "|",
    # collect each unique segment,
    # print,
    # if desired, split fields
    # keep intermediate results for viewing flow of test cases,
    # remove "tee" lines for production.
    
    echo
    echo " Results:"
    sed 's/  */ /g' "$FILE" |
    tee t1 |
    sort |
    tee t2 |
    sed 's/\(key.*warehouse\) \(.*\)$/\1|\2/' |
    tee t3 |
    ./collect
    
    exit 0
    producing:
    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution        : Debian GNU/Linux 5.0 
    GNU bash 3.2.39
    sort (GNU coreutils) 6.10
    GNU Awk 3.1.5
    
     Data file data1:
    key  1 price 150 warehouse A
    key 2 price 20  warehouse A
    key 3 price      5 warehouse B
    key 1 price 150 warehouse B
    key 2   price 20 warehouse B
    key    3 price 5 warehouse C
    key 1 price 150     warehouse C
    key 2 price 20     warehouse C
    
     Collection script in awk:
    +++++ begin
    #!/usr/bin/env bash
    
    # @(#) collect	Demonstrate collection script, awk.
    
    FILE="$1"
    
    # Use nawk or /usr/xpg4/bin/awk on Solaris.
    
    awk '
    BEGIN	{ FS = OFS = "|" ; previous = "" ; line = "" ; first = "true"}
    first == "true" { first = "false" ; previous = $1 ; line = $0 ; next }
    $1 == previous	{ line = line " " $2 ; next }
    		{ print line ; previous = $1 ; line = $0 }
    END	{ print line }
    ' $FILE
    
    exit 0
    +++++ end
    
     Results:
    key 1 price 150 warehouse|A B C
    key 2 price 20 warehouse|A B C
    key 3 price 5 warehouse|B C
    You can look at the intermediate data on files t1, t2, etc., as necessary.
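
    The results above keep "|" between the key and the collected values, while the original request put the values in parentheses. One possible way to do that final "split fields" step is sketched below. It is an assumption layered on top of the script, not part of it, and the function name to_parens is made up for illustration.

```shell
#!/usr/bin/env bash

# to_parens (hypothetical name): turn "key ...|A B C" into "key ... (A B C)".
to_parens() {
  # Replace the first "|" and the rest of the line with " (rest)".
  sed 's/|\(.*\)$/ (\1)/'
}
```

    It would go at the end of the pipeline, after ./collect, as one more stage: ./collect | to_parens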

    Good luck ... cheers, drl