  1. #1
    Just Joined!
    Join Date
    Jul 2010
    Posts
    1

    check for word frequency in 1 column

    Hi, I have a problem I am having trouble solving.

    I have a large tab-delimited text file, about 17 GB. It has only 6 columns. Column 4 contains only numbers, ranging from 1 to 1000. I want to count how many times each number occurs. The output I want has 2 columns: the first is the number, the second is how many times it occurred.

    I tried
    head -n 1000 coverage | cut -f 4 | uniq -c

    That didn't work for me: the numbers in the output are not unique, the same value shows up on several lines.

    Can anyone help?

    Thanks

  2. #2
    Linux User
    Join Date
    Nov 2009
    Location
    France
    Posts
    292
    1. It would help if you post a sample data file.

    2. I don't think bash tools are meant for this kind of task, and I'm not sure we could easily do it with them.

    3. You should import your huge structured data file into a database and issue a SELECT COUNT query with a GROUP BY clause.
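
For comparison, the grouping that a GROUP BY count performs can also be sketched in a single pass with awk, without importing anything. This is only a sketch under assumptions: the function name `count_column` is made up for illustration, the file is assumed to be tab-delimited as described in post #1, and the number of distinct values is assumed to be small (about 1000) so the in-memory table stays tiny.

```shell
# count_column FILE [COL]
# Tally how often each value appears in one column of a tab-delimited file.
# Hypothetical helper, not from the thread; COL defaults to 4.
count_column() {
    awk -F'\t' -v col="${2:-4}" '
        { count[$col]++ }                                   # one counter per distinct value
        END { for (v in count) print v "\t" count[v] }      # value <TAB> occurrences
    ' "$1" | sort -n                                        # order the result by value
}
```

It would be called as `count_column coverage`. Only the small result is sorted, not the 17 GB input, which is the reason to prefer a single pass here.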
    0 + 1 = 1 != 2 <> 3 != 4 ...
    Until the camel can pass though the eye of the needle.

  3. #3
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    The command uniq does not consider items duplicates unless they are adjacent. Try
    Code:
    head -n 1000 coverage | cut -f 4 | sort -n | uniq -c
    cheers, drl
    Note: 'uniq' does not detect repeated lines unless they are adjacent.
    You may want to sort the input first, or use `sort -u' without `uniq'.
    -- excerpt from man uniq, q.v.
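
A small demonstration of the adjacency point above. The `sample` function is a made-up stand-in for what `cut -f 4` would produce; the final awk step is an addition that swaps the columns so the number comes first and the count second, as asked in post #1.

```shell
# Hypothetical stand-in for the output of: cut -f 4 coverage
sample() { printf '%s\n' 3 1 3 2 1 3; }

# Without sorting, equal values are not adjacent, so uniq reports every run separately.
unsorted=$(sample | uniq -c | wc -l | tr -d ' ')

# After sorting, each value forms a single run and gets a single count.
# uniq -c prints "count value"; awk reorders that to "value<TAB>count".
sorted=$(sample | sort -n | uniq -c | awk '{ print $2 "\t" $1 }')

echo "lines without sort: $unsorted"
echo "$sorted"
```

Note that `head -n 1000` limits the count to the first 1000 lines. To cover the whole file, the same pipeline should work without it: `cut -f 4 coverage | sort -n | uniq -c`.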

  4. #4
    Linux User
    Join Date
    Nov 2009
    Location
    France
    Posts
    292
    @foxyg
    Ranging from 1-1000.
    This is ambiguous. Just to clarify: did you mean that the values in column 4 are in that range, or that you wish to sample the first 1000 lines of your file?
