  1. #1
    Just Joined!
    Join Date
    Oct 2007
    Posts
    2

    sed '-n' option not working with large files

    I am hoping someone can give me some info/clues/direction.

    My simple sed command works fine with a small test file. However, when I run it on my 'real' data text files, which are ~100 MB in size and ~2.8 million lines long, the 'no print' option ('-n') seems to have no effect (all lines are printed instead of only matching lines).

    Here are some code snippets:

    Command line:
    Code:
    sed -n -e '/\*/p' < test.txt > out.txt
    Test file 'test.txt':
    Code:
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3

    'out.txt': correct output using the 'test.txt' file (print only lines with annotations flagged by '*'):
    Code:
    137.805	-0.000000746	-0.000033118	12	3	#* Hi
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    However, with the large files 'out.txt' is equivalent to:
    Code:
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    ... in other words, all lines are printed.


    sed is supposed to work with files > 2 GB. Where might a limitation be? Am I missing something obvious? Any clues would be much appreciated.

    [2.6.20-16-386; Kubuntu 7.04; GNU sed version 4.1.5]
    Last edited by gmcauley; 10-11-2007 at 02:40 AM. Reason: Added sed version I am using

  2. #2
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    A few comments.

    1) The command grep would probably be more commonly used for simple searching,

    2) I have come to use syntax like [*] rather than \*, because it's more easily extensible and readable,

    3) sed, grep, et al, process a line at a time, so the number of items in a file should be irrelevant (sed allows a hold space, which has some limitations, that should not come into play here),

    4) you could try one of the other commands, such as those below,

    5) I'd guess that the sed command used in the final output is not the same as in your test -- for example, it looks like the -n might have been left off.
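    Points 2 and 5 can be checked on a tiny sample. The sketch below uses a made-up three-line file (the temp file and variable names are illustrative, not from the thread) and counts output lines for the bracket form, the backslash form, and the same command with -n left off. Without -n, sed prints every line once and each matching line a second time, so the output ends up longer than the input.

```shell
#!/bin/sh
# Illustrative sample: three lines, one of which carries a '*' annotation.
data=$(mktemp)
printf 'x\ny #* Hi\nz\n' > "$data"

# Bracket form with -n: only the matching line is printed.
with_n=$(sed -n -e '/[*]/p' "$data" | wc -l | tr -d ' ')

# Same script without -n: all 3 lines are auto-printed, plus the match again.
without_n=$(sed -e '/[*]/p' "$data" | wc -l | tr -d ' ')

# Backslash form with -n: selects the same line as the bracket form.
escaped=$(sed -n -e '/\*/p' "$data" | wc -l | tr -d ' ')

echo "with -n: $with_n, without -n: $without_n, escaped form: $escaped"
# prints: with -n: 1, without -n: 4, escaped form: 1
```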

    Using your data on file data1:
    Code:
    #!/usr/bin/env sh
    
    # @(#) s1       Demonstrate matching with several commands,
    
    set -o nounset
    echo
    
    debug=":"
    debug="echo"
    
    ## Use local command version for the commands in this demonstration.
    
    echo "(Versions used in this script displayed with local utility "version")"
    version bash grep sed awk
    perl -v | head -2 | tail -1
    
    echo
    
    FILE=${1-data1}
    
    # Quote "$FILE" so that a file name containing spaces is passed as one argument.
    echo " Results with sed:"
    sed -n -e '/[*]/p' "$FILE"
    
    echo
    echo " Results with grep:"
    grep '[*]' "$FILE"
    
    echo
    echo " Results with awk:"
    awk '
    /[*]/
    ' "$FILE"
    
    echo
    echo " Results with perl:"
    perl -n -e 'print if /[*]/;' "$FILE"
    
    exit 0
    producing:
    Code:
    % ./s1
    
    (Versions used in this script displayed with local utility version)
    GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
    grep (GNU grep) 2.5.1
    GNU sed version 4.1.2
    GNU Awk 3.1.4
    This is perl, v5.8.4 built for i386-linux-thread-multi
    
     Results with sed:
    137.805 -0.000000746    -0.000033118    12      3       #* Hi
    137.865 -0.000324724    -0.00012641     5       3       #* R2
    
     Results with grep:
    137.805 -0.000000746    -0.000033118    12      3       #* Hi
    137.865 -0.000324724    -0.00012641     5       3       #* R2
    
     Results with awk:
    137.805 -0.000000746    -0.000033118    12      3       #* Hi
    137.865 -0.000324724    -0.00012641     5       3       #* R2
    
     Results with perl:
    137.805 -0.000000746    -0.000033118    12      3       #* Hi
    137.865 -0.000324724    -0.00012641     5       3       #* R2
    cheers, drl
    Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
    90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
    We look forward to helping you with the challenge of the other 10%.
    ( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )

  3. #3
    Just Joined!
    Join Date
    Aug 2007
    Posts
    37
    I made up a test file of 102 MB and 3.7 million lines, and your sed script worked without any problem. I'm using GNU sed version 4.1.5.

    I agree with drl's remarks. If all you want to do is extract lines with "*" in them you'll find it about 15 times faster using:
    Code:
    fgrep '*' test.txt > out.txt
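    fgrep is the historical name for grep -F (fixed-string matching), and recent GNU grep releases print an obsolescence warning when it is invoked as fgrep. A minimal sketch, on a made-up sample file with illustrative variable names, checking that the fixed-string search selects the same lines as the sed command. It does not measure the speed difference mentioned above.

```shell
#!/bin/sh
# Illustrative sample: three tab-separated lines, two flagged with '*'.
data=$(mktemp)
printf '137.8\t12\t3\n137.805\t12\t3\t#* Hi\n137.865\t5\t3\t#* R2\n' > "$data"

# sed with a regex versus grep with a fixed string.
sed_out=$(sed -n -e '/[*]/p' "$data")
grep_out=$(grep -F '*' "$data")

if [ "$sed_out" = "$grep_out" ]; then
    echo "same lines selected"
else
    echo "outputs differ"
fi
```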

  4. #4
    wje_lf
    Linux Engineer
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192
    drl is right in everything he says, particularly in suggesting that gmcauley made a pilot error by omitting the -n from the sed command.

    But just for kicks and grins, I wanted to investigate whether gmcauley might be onto something in suggesting the unlikely possibility that sed might not work correctly on large files.

    It turns out that sed copes fine with an input file well over 2 GB, to the relief of all of us.

    To demonstrate, I took this input file, calling it 1.txt:
    Code:
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    and ran against it this script:
    Code:
    #!/bin/bash
    
    echo first we produce the long input file
    echo this will take a while
    
    cat 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
        > 2.txt
    
    cat 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
        > 3.txt
    
    cat 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
        > 4.txt
    
    # meow
    
    rm -f 5.txt
    
    echo we count to 20
    
    for (( jndex=1; jndex<=20; jndex++ ))
    do
      cat 4.txt >> 5.txt
      echo $jndex
    done
    
    ls -l 1.txt 5.txt
    
    echo === beginning of the short input file
    cat 1.txt
    echo === end of the short input file
    
    ls -l 1.txt
    echo 1.txt contains $(cat 1.txt | wc -l) lines
    
    ls -l 5.txt
    echo 5.txt contains ... wait a moment ...
    time echo 5.txt contains $(cat 5.txt | wc -l) lines
    
    echo === beginning of the head of the long input file
    head 5.txt
    echo === end of the head of the long input file
    
    echo === beginning of the tail of the long input file
    tail 5.txt
    echo === end of the tail of the long input file
    
    sed -n -e '/\*/p' < 1.txt > 1a.txt
    
    ls -l 1a.txt
    echo 1a.txt contains $(cat 1a.txt | wc -l) lines
    
    echo === beginning of the short output file
    cat 1a.txt
    echo === end of the short output file
    
    echo about to do the long sed run
    echo in a separate console, you can watch 5a.txt grow to 1780000000 bytes
    
    time sed -n -e '/\*/p' < 5.txt > 5a.txt
    
    ls -l 5a.txt
    echo 5a.txt contains ... wait a moment ...
    time echo 5a.txt contains $(cat 5a.txt | wc -l) lines
    
    echo === beginning of the head of the long output file
    head 5a.txt
    echo === end of the head of the long output file
    
    echo === beginning of the tail of the long output file
    tail 5a.txt
    echo === end of the tail of the long output file
    
    sed --version
    uname -a
    getting this output:
    Code:
    bill@tigress:~/1$ t.sh
    first we produce the long input file
    this will take a while
    we count to 20
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    -rw-r--r-- 1 bill tigers        279 Oct 11 06:19 1.txt
    -rw-r--r-- 1 bill tigers 5580000000 Oct 11 06:38 5.txt
    === beginning of the short input file
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    === end of the short input file
    -rw-r--r-- 1 bill tigers 279 Oct 11 06:19 1.txt
    1.txt contains 7 lines
    -rw-r--r-- 1 bill tigers 5580000000 Oct 11 06:38 5.txt
    5.txt contains ... wait a moment ...
    5.txt contains 140000000 lines
    
    real	5m25.510s
    user	0m18.445s
    sys	0m32.274s
    === beginning of the head of the long input file
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    === end of the head of the long input file
    === beginning of the tail of the long input file
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    137.8	0.000019479	-0.000010742	12	3	
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.81	-0.000020588	-0.000055909	12	3	
    137.855	-0.000318466	-0.000115908	5	3	
    137.86	-0.000324037	-0.000123737	5	3	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.87	-0.000324464	-0.000123107	5	3
    === end of the tail of the long input file
    -rw-r--r-- 1 bill tigers 89 Oct 11 06:43 1a.txt
    1a.txt contains 2 lines
    === beginning of the short output file
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    === end of the short output file
    about to do the long sed run
    in a separate console, you can watch 5a.txt grow to 1780000000 bytes
    
    real	10m3.565s
    user	4m51.234s
    sys	0m36.938s
    -rw-r--r-- 1 bill tigers 1780000000 Oct 11 06:53 5a.txt
    5a.txt contains ... wait a moment ...
    5a.txt contains 40000000 lines
    
    real	2m32.442s
    user	0m5.592s
    sys	0m11.021s
    === beginning of the head of the long output file
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    === end of the head of the long output file
    === beginning of the tail of the long output file
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    137.805	-0.000000746	-0.000033118	12	3	#* Hi	
    137.865	-0.000324724	-0.00012641	5	3	#* R2
    === end of the tail of the long output file
    GNU sed version 4.1.5
    Copyright (C) 2003 Free Software Foundation, Inc.
    This is free software; see the source for copying conditions.  There is NO
    warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
    to the extent permitted by law.
    Linux tigress 2.6.20-16-generic #2 SMP Sun Sep 23 19:50:39 UTC 2007 i686 GNU/Linux
    bill@tigress:~/1$
    As you can see, the long input file, 5.txt, more than 5 GB, contains 140,000,000 lines, and the long output file 5a.txt, about 1.8 GB, contains only 40,000,000 lines.

    gmcauley, if you're curious, you can run this script on your own machine, just to banish the heebie-jeebies.
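    For anyone without 5 GB of spare disk, the same check can be done at a much smaller scale. The sketch below is an assumption-laden miniature of the script above, not a replacement for it: the seven sample lines are abbreviated, and the names short, long and copies are illustrative. With 2 flagged lines out of every 7, the output should hold exactly 2/7 of the input lines.

```shell
#!/bin/sh
short=$(mktemp); long=$(mktemp)

# Abbreviated 7-line sample; lines 2 and 6 carry the '*' flag.
printf '%s\n' \
    '137.8 12 3' \
    '137.805 12 3 #* Hi' \
    '137.81 12 3' \
    '137.855 5 3' \
    '137.86 5 3' \
    '137.865 5 3 #* R2' \
    '137.87 5 3' > "$short"

# Concatenate the sample repeatedly to build the longer input.
copies=1000
i=0
while [ "$i" -lt "$copies" ]; do
    cat "$short"
    i=$((i + 1))
done > "$long"

in_lines=$(wc -l < "$long" | tr -d ' ')
out_lines=$(sed -n -e '/\*/p' < "$long" | wc -l | tr -d ' ')
echo "$in_lines lines in, $out_lines lines out"
# prints: 7000 lines in, 2000 lines out
```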

  5. #5
    Just Joined!
    Join Date
    Oct 2007
    Posts
    2

    sed '-n' option not working with large files: SOLVED

    First of all, thank you everybody for your replies. I appreciate your enthusiasm and efforts. I learned a lot, and will incorporate your tips into my future scripts (kind of has a ring to it).

    The trouble turned out to be that the data files I was given contained only \r and no \n characters (from a Mac). Thus, the whole file was apparently being treated as a single line. When I made my test file, I cut and pasted a portion from the big file into a Linux editor that graciously added the \n's, so it behaved as expected but differently from the big file.

    Note below, before substituting \n's for \r's, the output file is the same size as the input file. Afterward, only the flagged lines are present in the much shortened output file:
    Code:
    hal@zzz:~/yyy/test$ ls
    big-orig.txt  test.txt
    hal@zzz:~/yyy/test$ sed -n -e '/[*]/p' big-orig.txt > big-orig-out.txt
    hal@zzz:~/yyy/test$ sed -e 's/\r/\n/g' big-orig.txt > big.txt
    hal@zzz:~/yyy/test$ sed -n -e '/[*]/p' big.txt > big-out.txt
    hal@zzz:~/yyy/test$ ls -l
    total 321764
    -rw-r--r-- 1 hal hal 109709844 2007-10-11 14:34 big-orig-out.txt
    -rw-r--r-- 1 hal hal 109709844 2007-10-11 14:33 big-orig.txt
    -rw-r--r-- 1 hal hal       642 2007-10-11 14:36 big-out.txt
    -rw-r--r-- 1 hal hal 109709844 2007-10-11 14:35 big.txt
    -rw-r--r-- 1 hal hal      1007 2007-10-11 13:46 test.txt
    hal@zzz:~/yyy/test$ cat big-out.txt
    439.965 -0.000079586    -0.000051728    2       1       #* b1
    83.53   -0.000023066    0.000004203     2       4       #* start b2
    218.38  -0.000173061    -0.000017218    4       0       #* start i1
    559.195 0.000027234     -0.000052225    0       4       #* start h1
    194.22  -0.000146901    -0.000017772    3       51.198  #* s1
    476.365 -0.000070626    -0.000035308    2       1       #* s1
    412.5   -0.000023169    -0.00000373     30      1       #* s2
    478.45  0.000067441     0.000031045     2       1       #* s3
    215.02  -0.000165174    0.000000647     2       0       #* s4
    491.7   -0.000069158    -0.00017182     1       0       #* s5
    0       -0.000636219    -0.000049252    0       0       #* stop h1
    112.82  -0.000121451    0.00003465      1       0       #* R1
    593.495 -0.000043104    0.000072116     1       1       #* stop i1
    137.865 -0.000324724    -0.00012641     5       3       #* R2
    hal@zzz:~/yyy/test$
    So, it had nothing to do with the length of the file.
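    The diagnosis above can be confirmed before running sed at all by counting the CR and LF bytes in the file. A minimal sketch, assuming a small made-up CR-only sample in place of the real data; count_eol is a hypothetical helper, not a standard command. It also shows tr as an alternative to the sed substitution for the conversion step.

```shell
#!/bin/sh
# Hypothetical helper: report how many CR and LF bytes a file contains.
count_eol() {
    # tr -cd deletes every byte except the one named; wc -c counts the rest.
    cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
    lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
    echo "CR=$cr LF=$lf"
}

# Tiny CR-only sample (classic Mac line endings), three records.
sample=$(mktemp)
printf '1\t2\r3\t4\t#* Hi\r5\t6\r' > "$sample"

# CR-only input has no LF, so sed treats the whole file as a single line.
count_eol "$sample"          # prints: CR=3 LF=0

# Translate each CR to LF, then filter as usual; only the flagged record remains.
tr '\r' '\n' < "$sample" | sed -n -e '/[*]/p'
```

    Run against a file like big-orig.txt, a result with many CRs and zero LFs would point to the line endings rather than the file size.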

  6. #6
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi, gmcauley.

    It's a cautionary tale for all of us to remember to ask where the data came from, how you got it, etc.

    Thanks for the follow-up explanation -- glad you got it working ... cheers, drl

  7. #7
    wje_lf
    Linux Engineer
    Join Date
    Sep 2007
    Location
    Mariposa
    Posts
    1,192

    a cautionary note about Macintosh text files and ncftp

    Slightly off topic and hijacking the thread, but I guess that's ok if the problem has already been solved, right?

    I got burned once (well, not severely) when using an otherwise excellent ftp client called ncftp.

    ncftp had the undocumented feature that when you were on a Linux system and using ncftp to download a file from elsewhere, and the NAME OF THE FILE ended in .txt, ncftp assumed (without telling you) that the file was a text file and that you wanted to change every <CR><LF> to <LF>. And the way it did this (*** GRUMBLE GRUMBLE ***) was to remove every <CR> from the file. But <CR> is the end of line indication for Macintosh. Which, of course, meant that ncftp would sanitize every end of line indication in a Mac .txt file for your convenience.

    Because of the automatic and undocumented way this happened, I never used ncftp again.

    Grrrrr.
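    The difference between deleting <CR> and translating it is easy to see on two tiny files. This sketch only mimics the effect described above with tr; it makes no claim about how ncftp is implemented, and the sample files and the lines_after helper are made up for illustration.

```shell
#!/bin/sh
# The same two records with DOS (CRLF) and classic Mac (CR-only) line endings.
dos=$(mktemp); mac=$(mktemp)
printf 'a\r\nb\r\n' > "$dos"
printf 'a\rb\r'     > "$mac"

# Hypothetical helper: count LF-terminated lines after running tr on a file.
# usage: lines_after FILE TR_ARGS...
lines_after() {
    f=$1; shift
    tr "$@" < "$f" | wc -l | tr -d ' '
}

# Deleting CR is harmless for CRLF text: the LFs still mark the line ends.
echo "DOS, CR deleted:    $(lines_after "$dos" -d '\r') lines"     # 2 lines
# Deleting CR from CR-only text removes every line end.
echo "Mac, CR deleted:    $(lines_after "$mac" -d '\r') lines"     # 0 lines
# Translating CR to LF keeps the records separate.
echo "Mac, CR translated: $(lines_after "$mac" '\r' '\n') lines"   # 2 lines
```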
