  1. #1
    Just Joined!
    Join Date
    Mar 2010
    Posts
    2

    Removing Duplicate Entries in a File

    I want to remove duplicate lines in a file, based on a partial match - namely, the second field, space delimited.

    uniq will work for the entire row. What will do what I want?

    File format is:

    Line number-spaces-file name-space-primary key

    Basically, if a file name appears two or more times in the file, I want to delete ALL rows containing the file name.

    I also have the option of doing this on an unnumbered file - thus, the partial match would be on the first field. How would this work?

  2. #2
    Linux User
    Join Date
    Nov 2009
    Location
    France
    Posts
    292
    If you process each line with grep and awk or cut, you could count the number of occurrences of each file name. If a name occurs more than once, append it to a temp file. Then process each line of the temp file and delete the corresponding lines from your data file with sed. Quite cumbersome, though! Note that this only works if the file names contain no white space. White space in file names is always evil!

    A sample data file would help.
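
    The steps above can be sketched roughly as follows. This is only a minimal sketch, assuming space-delimited rows with the file name in the second field and no white space in the names. The function name remove_repeated and the temp file are made up for illustration, and the final filtering is done with awk rather than sed so the data file itself is left untouched.

```shell
# remove_repeated FILE: print the rows of FILE whose second field occurs only once.
# The helper name and temp-file handling are illustrative assumptions.
remove_repeated() {
    data=$1
    dups=$(mktemp) || return 1

    # Step 1: count each file name (field 2) and save the repeated ones.
    awk '{ count[$2]++ }
         END { for (name in count) if (count[name] > 1) print name }' "$data" > "$dups"

    # Step 2: load the repeated names, then print only rows not among them.
    awk -v dups="$dups" '
        BEGIN { while ((getline name < dups) > 0) bad[name] }
        !($2 in bad)
    ' "$data"

    rm -f "$dups"
}
```

    Something like remove_repeated /tmp/data > /tmp/data.unique would then write the surviving rows to a new file.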
    0 + 1 = 1 != 2 <> 3 != 4 ...
    Until the camel can pass through the eye of the needle.

  3. #3
    Just Joined!
    Join Date
    Mar 2010
    Posts
    2
    nmset,

    Here is the general format:

    1 f1 a1
    2 f1 a2
    3 f1 a3
    4 f1 a4
    5 f2 a5
    6 f3 a6
    7 f3 a7
    8 f3 a8
    9 f4 a9
    10 f5 a10
    11 f6 a11
    12 f6 a12
    13 f7 a13
    14 f8 a14
    15 f8 a15
    16 f8 a16
    17 f8 a17
    18 f8 a18
    19 f9 a19
    20 f10 a20
    21 f10 a21
    22 f11 a22

    I want to end up with a file containing:

    5 f2 a5
    9 f4 a9
    10 f5 a10
    13 f7 a13
    19 f9 a19
    22 f11 a22

    (or, actually,
    f2 a5
    f4 a9
    f5 a10
    f7 a13
    f9 a19
    f11 a22
    -- that is, without the line numbers)

  4. #4
    Linux User
    Join Date
    Nov 2009
    Location
    France
    Posts
    292
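    This might be what you want.
    It uses sort and uniq -d to find the file names that occur more than once in the second column (uniq -d only detects adjacent duplicates, so the names are sorted first), then deletes every row carrying one of those names.

    Code:
    for unwanted in $(awk '{print $2}' /tmp/data | sort | uniq -d)
    do
     echo "$unwanted"
     # Match the name only as the second field; GNU sed -i edits /tmp/data in place.
     # The name is used as a regular expression, so it should not contain special characters.
     sed -i "/^ *[^ ][^ ]*  *${unwanted} /d" /tmp/data
    done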
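
    As an alternative to calling sed once per duplicate, a two-pass awk sketch can do the same job without modifying the data file. It reads the file twice: once to count each key, once to print the rows whose key was seen exactly once. The function name unique_by_field is made up for illustration; the field number is a parameter, so the same sketch covers the unnumbered file (field 1) as well as the numbered one (field 2).

```shell
# unique_by_field FIELD FILE: print the rows of FILE whose FIELD-th column
# occurs exactly once in the whole file. Hypothetical helper for illustration.
unique_by_field() {
    awk -v f="$1" '
        NR == FNR { count[$f]++; next }   # first pass: count each key
        count[$f] == 1                    # second pass: keep keys seen once
    ' "$2" "$2"
}
```

    For the numbered sample, unique_by_field 2 /tmp/data | awk '{ print $2, $3 }' would also strip the line numbers from the surviving rows.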
