  1. #1
    Just Joined!
    Join Date
    Jul 2011
    Posts
    18

    Looking for deduplication in Linux.


    Hi!
    Right now I am doing deduplication manually with sort -u, but I can't reconstruct the file.
    Do you have any recommendation for a free Linux tool?
    Thanks

  2. #2
    Linux Newbie
    Join Date
    Nov 2012
    Posts
    220
    hi,

    please be more explicit.

    what does your file look like?
    what does sort -u do to it?
    what do you want as output?

    what do you mean by «reconstruct»?

    awk can mimic uniq without needing to sort the file.
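    For example, this common awk idiom prints only the first occurrence of each line, keeping the original order:
    Code:
    # print each line only the first time it is seen
    awk '!seen[$0]++' file > file.unique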

  3. #3
    Just Joined!
    Join Date
    Jul 2011
    Posts
    18
    Hi watael,
    Thanks for your quick response.

    My files are text files.
    Each contains millions of lines.
    Most of the lines are the same length.
    Some of the lines are identical.

    I want to keep only the unique lines within the file, but without losing any information (if I want the original file back, I can restore it).
    I know there are tools for deduplication between files, but I am looking for something that combines dedup within each file and between all files.

    sort -u first sorts the file (in dictionary order) and then removes the duplicate lines.
    It does a good job, but loses the possibility of restoring the original file.

    Thanks

  4. #4
    Linux Newbie mactruck
    Join Date
    Apr 2012
    Location
    City of Salt
    Posts
    185
    Code:
    sort -u file > file.new
    EDIT: this will keep your old file and make a new one without all the dupes.

  5. #5
    Just Joined!
    Join Date
    Jul 2011
    Posts
    18
    this way I just use more storage and don't save any...

  6. #6
    Linux Newbie mactruck
    Join Date
    Apr 2012
    Location
    City of Salt
    Posts
    185
    This assumes that your files end with .txt. I do something very similar to this at work. If you can restore the original file some other way, then this should work.
    Code:
    #!/bin/bash
    # deduplicate every .txt file in the current directory, in place
    for i in *.txt
    do
        sort -u "$i" > "$i.fixed"
        mv "$i.fixed" "$i"
    done
    If you wanted to make this a command you can run from any directory, I would do this.

    script name: fix
    Code:
    #!/bin/bash
    # deduplicate the file given as the first argument, in place
    sort -u "$1" > "$1.fixed"
    mv "$1.fixed" "$1"
    Then you can just type "fix file.txt"

  7. #7
    Just Joined!
    Join Date
    Jul 2011
    Posts
    18
    Sorry mactruck, but I think we are not talking about the same thing.
    I want to create a deduplicated file, but with the option to restore the file to its original size, without saving a copy.

  8. #8
    Linux Newbie
    Join Date
    Nov 2012
    Posts
    220
    you want to sort -u oneFile into an otherFile,
    run a diff on both and keep the diffFile,
    delete the oneFile,
    and, when needed, apply patch to restore the oneFile from the otherFile and the diffFile?
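    A minimal sketch of that round trip, using the same file names:
    Code:
    sort -u oneFile > otherFile         # deduplicated, sorted copy
    diff otherFile oneFile > diffFile   # what patch needs to rebuild the original
    rm oneFile                          # keep only otherFile and diffFile

    # later, when the original is needed again:
    patch -o oneFile otherFile diffFile
    Whether this saves space depends on the data: if most lines are duplicates, the diff itself ends up nearly as large as the original.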

  9. #9
    Just Joined!
    Join Date
    Jul 2011
    Posts
    18
    No,
    My problem is that my file is very big, ~300GB.
    I want to save only the unique rows, but the number of occurrences of each string and the original row number of each string are important for me.

    For example:
    original file:
    AAAAA
    BBBBB
    AAAAA
    VFVDS
    AAAAA
    unique file:
    AAAAA
    BBBBB
    VFVDS
    reconstructed file:
    AAAAA
    BBBBB
    AAAAA
    VFVDS
    AAAAA
    For this I have to remember the number of times AAAAA occurs and where (rows 1, 3, 5).
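    One way to get exactly this with standard tools (a sketch, assuming a hypothetical input file file.txt and output names file.uniq and file.idx): write each distinct line out once, and alongside it an index with one number per original row:
    Code:
    # file.uniq: each distinct line once, in first-seen order
    # file.idx:  for every original row, that line's number in file.uniq
    awk '{ if (!($0 in id)) { id[$0] = ++n; print > "file.uniq" }
           print id[$0] > "file.idx" }' file.txt

    # rebuild the original row order from file.uniq + file.idx
    awk 'NR == FNR { line[FNR] = $0; next } { print line[$1] }' file.uniq file.idx > file.restored
    The occurrence count of each line is simply how often its number appears in file.idx. Note that awk holds all distinct lines in memory, so this is only practical if the 300GB file has a manageable number of unique lines.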

    Thanks

  10. #10
    Linux Newbie
    Join Date
    Nov 2012
    Posts
    220
    it seems like a compression algorithm.
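    A general-purpose compressor already gives both properties (a smaller file and an exact restore) with no extra bookkeeping. For example, with gzip:
    Code:
    gzip file.txt        # replaces file.txt with the smaller file.txt.gz
    gunzip file.txt.gz   # restores file.txt byte for byte
    How well this compares to a hand-rolled unique-lines-plus-index scheme depends on the data, so it is worth a quick test on a slice of the file.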
