- 10-11-2007 #1 Just Joined!
- Join Date
- Oct 2007
- Posts
- 2
sed '-n' option not working with large files
I am hoping someone can give me some info/clues/direction.
My simple sed command works fine with a small test file. However, when I run it on my 'real' data text files, which are ~100 MB in size and ~2.8 million lines long, the 'no print' option ('-n') seems to have no effect (all lines are printed instead of only matching lines).
Here are some code snippets:
Command line:
Code:
sed -n -e '/\*/p' < test.txt > out.txt
Test file 'test.txt':
Code:
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
'out.txt': correct output using the 'test.txt' file (print only lines with annotations flagged by '*'):
Code:
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
However, with the large files 'out.txt' is equivalent to:
Code:
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
... in other words, all lines are printed.
sed is supposed to work with files > 2 GB. Where might a limitation be? Am I missing something obvious? Any clues would be much much appreciated.
[2.6.20-16-386; Kubuntu 7.04; GNU sed version 4.1.5]
Last edited by gmcauley; 10-11-2007 at 02:40 AM. Reason: Added sed version I am using
- 10-11-2007 #2 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi.
A few comments.
1) The command grep would probably be more commonly used for simple searching,
2) I have come to use syntax like [*] rather than \*, because it's more easily extensible and readable,
3) sed, grep, et al, process a line at a time, so the number of items in a file should be irrelevant (sed allows a hold space, which has some limitations, that should not come into play here),
4) you could try one of the other commands, such as those below,
5) I'd guess that the sed command used in the final output is not the same as in your test -- for example, it looks like the -n might have been left off.
Using your data on file data1:
Code:
#!/usr/bin/env sh
# @(#) s1 Demonstrate matching with several commands,
set -o nounset
echo
debug=":"
debug="echo"
## Use local command version for the commands in this demonstration.
echo "(Versions used in this script displayed with local utility "version")"
version bash grep sed awk
perl -v | head -2 | tail -1
echo
FILE=${1-data1}
echo " Results with sed:"
sed -n -e '/[*]/p' $FILE
echo
echo " Results with grep:"
grep '[*]' $FILE
echo
echo " Results with awk:"
awk ' /[*]/ ' $FILE
echo
echo " Results with perl:"
perl -n -e 'print if /[*]/;' $FILE
exit 0
producing:
Code:
% ./s1
(Versions used in this script displayed with local utility version)
GNU bash, version 2.05b.0(1)-release (i386-pc-linux-gnu)
grep (GNU grep) 2.5.1
GNU sed version 4.1.2
GNU Awk 3.1.4
This is perl, v5.8.4 built for i386-linux-thread-multi
 Results with sed:
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
 Results with grep:
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
 Results with awk:
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
 Results with perl:
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
cheers, drl
Welcome - get the most out of the forum by reading forum basics and guidelines: click here.
90% of questions can be answered by using man pages, Quick Search, Advanced Search, Google search, Wikipedia.
We look forward to helping you with the challenge of the other 10%.
( Mn, 2.6.n, AMD-64 3000+, ASUS A8V Deluxe, 1 GB, SATA + IDE, Matrox G400 AGP )
- 10-11-2007 #3 Just Joined!
- Join Date
- Aug 2007
- Posts
- 37
I made up a test file 102Mb and 3.7 million lines long and your sed script worked without any problem. I'm using GNU sed version 4.1.5.
I agree with drl's remarks. If all you want to do is extract lines with "*" in them you'll find it about 15 times faster using:
Code:
fgrep '*' test.txt > out.txt
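A note for later readers: fgrep is grep's fixed-string mode, and on current systems the same search is spelled grep -F. Below is a minimal sketch, assuming the seven-line sample from post #1 is recreated in a scratch file (the `sample` variable and its temporary path are made up for illustration). It only checks that the fixed-string search picks the same lines as the sed command; it makes no claim about speed.

```shell
#!/bin/sh
# Recreate the sample data from post #1 in a temporary file (hypothetical path).
sample=$(mktemp)
cat > "$sample" <<'EOF'
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
EOF

# Fixed-string search: '*' is taken literally, so no escaping is needed.
grep_out=$(grep -F '*' "$sample")

# The sed command from the original question, for comparison.
sed_out=$(sed -n -e '/\*/p' < "$sample")

if [ "$grep_out" = "$sed_out" ]; then
    echo "grep -F and sed -n select the same lines"
fi

rm -f "$sample"
```

Because the pattern is a plain string, grep -F never has to interpret it as a regular expression, which is the usual reason given for it being quicker on simple searches like this one.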
- 10-11-2007 #4
drl is right in everything he says, particularly in suggesting that gmcauley made a pilot error by omitting the -n from the sed command.
But just for kicks and grins, I wanted to investigate whether gmcauley might be onto something in suggesting the unlikely possibility that sed might not work correctly on large files.
It turns out that sed copes with large files just fine, to the relief of all of us.
To demonstrate, I took this input file, calling it 1.txt:
Code:
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
and ran against it this script:
Code:
#!/bin/bash
echo first we produce the long input file
echo this will take a while
cat 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt 1.txt \
> 2.txt
cat 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt 2.txt \
> 3.txt
cat 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt 3.txt \
> 4.txt
# meow
rm -f 5.txt
echo we count to 20
for (( jndex=1; jndex<=20; jndex++ ))
do
  cat 4.txt >> 5.txt
  echo $jndex
done
ls -l 1.txt 5.txt
echo === beginning of the short input file
cat 1.txt
echo === end of the short input file
ls -l 1.txt
echo 1.txt contains $(cat 1.txt | wc -l) lines
ls -l 5.txt
echo 5.txt contains ... wait a moment ...
time echo 5.txt contains $(cat 5.txt | wc -l) lines
echo === beginning of the head of the long input file
head 5.txt
echo === end of the head of the long input file
echo === beginning of the tail of the long input file
tail 5.txt
echo === end of the tail of the long input file
sed -n -e '/\*/p' < 1.txt > 1a.txt
ls -l 1a.txt
echo 1a.txt contains $(cat 1a.txt | wc -l) lines
echo === beginning of the short output file
cat 1a.txt
echo === end of the short output file
echo about to do the long sed run
echo in a separate console, you can watch 5a.txt grow to 1780000000 bytes
time sed -n -e '/\*/p' < 5.txt > 5a.txt
ls -l 5a.txt
echo 5a.txt contains ... wait a moment ...
time echo 5a.txt contains $(cat 5a.txt | wc -l) lines
echo === beginning of the head of the long output file
head 5a.txt
echo === end of the head of the long output file
echo === beginning of the tail of the long output file
tail 5a.txt
echo === end of the tail of the long output file
sed --version
uname -a
getting this output:
Code:
bill@tigress:~/1$ t.sh
first we produce the long input file
this will take a while
we count to 20
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
-rw-r--r-- 1 bill tigers 279 Oct 11 06:19 1.txt
-rw-r--r-- 1 bill tigers 5580000000 Oct 11 06:38 5.txt
=== beginning of the short input file
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
=== end of the short input file
-rw-r--r-- 1 bill tigers 279 Oct 11 06:19 1.txt
1.txt contains 7 lines
-rw-r--r-- 1 bill tigers 5580000000 Oct 11 06:38 5.txt
5.txt contains ... wait a moment ...
5.txt contains 140000000 lines
real 5m25.510s
user 0m18.445s
sys 0m32.274s
=== beginning of the head of the long input file
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
=== end of the head of the long input file
=== beginning of the tail of the long input file
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
=== end of the tail of the long input file
-rw-r--r-- 1 bill tigers 89 Oct 11 06:43 1a.txt
1a.txt contains 2 lines
=== beginning of the short output file
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
=== end of the short output file
about to do the long sed run
in a separate console, you can watch 5a.txt grow to 1780000000 bytes
real 10m3.565s
user 4m51.234s
sys 0m36.938s
-rw-r--r-- 1 bill tigers 1780000000 Oct 11 06:53 5a.txt
5a.txt contains ... wait a moment ...
5a.txt contains 40000000 lines
real 2m32.442s
user 0m5.592s
sys 0m11.021s
=== beginning of the head of the long output file
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
=== end of the head of the long output file
=== beginning of the tail of the long output file
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.865 -0.000324724 -0.00012641 5 3 #* R2
=== end of the tail of the long output file
GNU sed version 4.1.5
Copyright (C) 2003 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE,
to the extent permitted by law.
Linux tigress 2.6.20-16-generic #2 SMP Sun Sep 23 19:50:39 UTC 2007 i686 GNU/Linux
bill@tigress:~/1$
As you can see, the long input file, 5.txt, more than 5 GB, contains 140,000,000 lines, and the long output file 5a.txt, about 1.8 GB, contains only 40,000,000 lines.
gmcauley, if you're curious, you can run this script on your own machine, just to banish the heebie-jeebies.
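The full run above needs more than 7 GB of free disk and a fair bit of patience. For a quick sanity check, here is a scaled-down sketch of the same idea, assuming 1,000 copies of the sample instead of 20,000,000. The file names big.txt and big-out.txt and the temporary work directory are made up for this example and are not part of the script above.

```shell
#!/bin/sh
# Scaled-down variant of the experiment: 1,000 copies of the 7-line sample.
workdir=$(mktemp -d)

cat > "$workdir/1.txt" <<'EOF'
137.8 0.000019479 -0.000010742 12 3
137.805 -0.000000746 -0.000033118 12 3 #* Hi
137.81 -0.000020588 -0.000055909 12 3
137.855 -0.000318466 -0.000115908 5 3
137.86 -0.000324037 -0.000123737 5 3
137.865 -0.000324724 -0.00012641 5 3 #* R2
137.87 -0.000324464 -0.000123107 5 3
EOF

# Append the sample to an initially empty file 1,000 times.
: > "$workdir/big.txt"
i=0
while [ "$i" -lt 1000 ]; do
    cat "$workdir/1.txt" >> "$workdir/big.txt"
    i=$((i + 1))
done

# Same sed command as in the thread: print only lines containing '*'.
sed -n -e '/\*/p' < "$workdir/big.txt" > "$workdir/big-out.txt"

in_lines=$(wc -l < "$workdir/big.txt" | tr -d ' ')
out_lines=$(wc -l < "$workdir/big-out.txt" | tr -d ' ')
echo "input: $in_lines lines, output: $out_lines lines"

rm -rf "$workdir"
```

Two of every seven input lines carry a '*', so the output should hold two sevenths of the input line count, the same ratio as the 140,000,000 to 40,000,000 seen in the full run.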
- 10-11-2007 #5 Just Joined!
- Join Date
- Oct 2007
- Posts
- 2
sed '-n' option not working with large files: SOLVED
First of all, thank you everybody for your replies. I appreciate your enthusiasm and efforts. I learned a lot, and will incorporate your tips into my future scripts (kind of has a ring to it).
The trouble turned out to be that the data files I was given contained only \r and no \n characters (from a Mac). Thus, the whole file was being treated as a single line, and since that one line contained a '*', sed printed all of it. When I made my test file, I cut and pasted a portion of the big file into a Linux editor that graciously added the \n's, so it behaved as expected but differently from the big file.
Note below, before substituting \n's for \r's, the output file is the same size as the input file. Afterward, only the flagged lines are present in the much shortened output file:
Code:
hal@zzz:~/yyy/test$ ls
big-orig.txt test.txt
hal@zzz:~/yyy/test$ sed -n -e '/[*]/p' big-orig.txt > big-orig-out.txt
hal@zzz:~/yyy/test$ sed -e 's/\r/\n/g' big-orig.txt > big.txt
hal@zzz:~/yyy/test$ sed -n -e '/[*]/p' big.txt > big-out.txt
hal@zzz:~/yyy/test$ ls -l
total 321764
-rw-r--r-- 1 hal hal 109709844 2007-10-11 14:34 big-orig-out.txt
-rw-r--r-- 1 hal hal 109709844 2007-10-11 14:33 big-orig.txt
-rw-r--r-- 1 hal hal 642 2007-10-11 14:36 big-out.txt
-rw-r--r-- 1 hal hal 109709844 2007-10-11 14:35 big.txt
-rw-r--r-- 1 hal hal 1007 2007-10-11 13:46 test.txt
hal@zzz:~/yyy/test$ cat big-out.txt
439.965 -0.000079586 -0.000051728 2 1 #* b1
83.53 -0.000023066 0.000004203 2 4 #* start b2
218.38 -0.000173061 -0.000017218 4 0 #* start i1
559.195 0.000027234 -0.000052225 0 4 #* start h1
194.22 -0.000146901 -0.000017772 3 51.198 #* s1
476.365 -0.000070626 -0.000035308 2 1 #* s1
412.5 -0.000023169 -0.00000373 30 1 #* s2
478.45 0.000067441 0.000031045 2 1 #* s3
215.02 -0.000165174 0.000000647 2 0 #* s4
491.7 -0.000069158 -0.00017182 1 0 #* s5
0 -0.000636219 -0.000049252 0 0 #* stop h1
112.82 -0.000121451 0.00003465 1 0 #* R1
593.495 -0.000043104 0.000072116 1 1 #* stop i1
137.865 -0.000324724 -0.00012641 5 3 #* R2
hal@zzz:~/yyy/test$
So, it had nothing to do with the length of the file.
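For anyone who runs into the same symptom, here is a small sketch of how the problem could be spotted and fixed. It assumes a tiny made-up CR-only file standing in for the real data; the variable names and sample values are invented for illustration. It uses tr for the conversion because \r and \n inside a sed s command, as in the session above, may not be understood by every sed.

```shell
#!/bin/sh
# Build a tiny CR-only file that mimics the old Mac export (contents are made up).
cr_file=$(mktemp)
lf_file=$(mktemp)
printf '1.0 0.1 0.2 1 1\r2.0 0.3 0.4 1 1 #* Hi\r3.0 0.5 0.6 1 1\r' > "$cr_file"

# wc -l counts \n characters, so a CR-only file reports 0 lines.
lines_before=$(wc -l < "$cr_file" | tr -d ' ')

# od -c makes the \r bytes visible when looking at the start of a file.
od -c "$cr_file" | head -n 4

# Translate every \r to \n so line-oriented tools see real lines.
tr '\r' '\n' < "$cr_file" > "$lf_file"
lines_after=$(wc -l < "$lf_file" | tr -d ' ')

# Now the filter from the thread prints only the flagged line.
matches=$(sed -n -e '/[*]/p' "$lf_file")

echo "lines before: $lines_before, after: $lines_after"
echo "matches: $matches"

rm -f "$cr_file" "$lf_file"
```

A line count of 0 on a file that is obviously not empty is the quickest giveaway. Running file on such data typically reports "with CR line terminators" as well.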
- 10-12-2007 #6 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi, gmcauley.
It's a cautionary tale for all of us to remember to ask where the data came from, how you got it, etc.
Thanks for the follow-up explanation -- glad you got it working ... cheers, drl
- 10-12-2007 #7
a cautionary note about Macintosh text files and ncftp
Slightly off topic and hijacking the thread, but I guess that's ok if the problem has already been solved, right?
I got burned once (well, not severely) when using an otherwise excellent ftp client called ncftp.
ncftp had the undocumented feature that when you were on a Linux system and using ncftp to download a file from elsewhere, and the NAME OF THE FILE ended in .txt, ncftp assumed (without telling you) that the file was a text file and that you wanted to change every <CR><LF> to <LF>. And the way it did this (*** GRUMBLE GRUMBLE ***) was to remove every <CR> from the file. But <CR> is the end of line indication for Macintosh. Which, of course, meant that ncftp would sanitize every end of line indication in a Mac .txt file for your convenience.
Because of the automatic and undocumented way this happened, I never used ncftp again.
Grrrrr.
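To make the difference concrete, here is a short sketch that uses tr as a stand-in for the behaviour described above. It is not ncftp's actual code, and the sample strings are made up. Deleting every <CR> is harmless for <CR><LF> input but removes the only line separators from <CR>-only input.

```shell
#!/bin/sh
# Deleting every CR works for CR LF input: two lines stay two lines.
dos_lines=$(printf 'one\r\ntwo\r\n' | tr -d '\r' | wc -l | tr -d ' ')

# The same deletion on CR-only input removes the only line separators.
mac_deleted=$(printf 'one\rtwo\r' | tr -d '\r')

# Translating CR to LF keeps the line structure instead.
mac_lines=$(printf 'one\rtwo\r' | tr '\r' '\n' | wc -l | tr -d ' ')

echo "CR LF input after deleting CR: $dos_lines lines"
echo "CR-only input after deleting CR: $mac_deleted"
echo "CR-only input after CR -> LF: $mac_lines lines"
```

The middle case is the one that bites: the two words run together with nothing left to mark where the line ended.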