- 07-06-2010 #1 Just Joined!
- Join Date: Jul 2010
- Posts: 1
check for word frequency in 1 column
Hi, I have a problem I am having trouble solving.
I have a large tab-delimited text file, about 17 GB. It has only 6 columns. Column 4 is all numbers, ranging from 1-1000. I want to count how many times each number occurred. So the output I want has 2 columns: the first is the number, the second is how many times it occurred.
I tried
head -n 1000 coverage | cut -f 4 | uniq -c
Didn't work for me, the first column returned is not unique.
Can anyone help?
Thanks
- 07-06-2010 #2 Linux User
- Join Date: Nov 2009
- Location: France
- Posts: 292
1. It would help if you posted a sample data file.
2. I don't think it's the job of bash tools to handle this kind of task, and I'm not sure we could easily do it.
3. You should import your huge structured data file into a database and issue a SELECT COUNT query with a GROUP BY clause.
- 07-06-2010 #3 Linux Engineer
- Join Date: Apr 2006
- Location: Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts: 1,117
Hi.
The command uniq does not consider items duplicates unless they are adjacent. Try
Code:
head -n 1000 coverage | cut -f 4 | sort -n | uniq -c
Note: 'uniq' does not detect repeated lines unless they are adjacent.
You may want to sort the input first, or use `sort -u' without `uniq'.
-- excerpt from man uniq, q.v.
cheers, drl
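For the full 17 GB file, sorting every line just to count values may be slow. Here is a minimal sketch of a one-pass alternative with awk. It assumes the file is tab-delimited as described in the first post, and that column 4 holds only about 1000 distinct values, so the tally fits easily in memory. The function name count_col4 is hypothetical and used only for illustration.

```shell
# count_col4 is a hypothetical helper name, not something from the thread.
# Argument 1: path to a tab-delimited file (e.g. the "coverage" file above).
count_col4() {
    # -F'\t' splits each line on tabs; count[$4]++ tallies column 4 in one pass.
    # The END block prints "value<TAB>count" for every distinct value seen.
    # awk's "for (v in count)" order is unspecified, so sort -n orders by value.
    awk -F'\t' '{ count[$4]++ } END { for (v in count) print v "\t" count[v] }' "$1" | sort -n
}

# Usage: count_col4 coverage
```

This prints the number first and its count second, which matches the two-column output asked for in the first post. The sort | uniq -c pipeline above prints the count first instead.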
- 07-07-2010 #4 Linux User
- Join Date: Nov 2009
- Location: France
- Posts: 292
@foxyg
Quote: "Ranging from 1-1000."
This is ambiguous. Just for the sake of clarification, did you mean that the values in column 4 are in that range, or that you wish to sample the first 1000 lines of your file?

