  1. #1

    Looking for deduplication in Linux.


    Hi!
    Right now I am doing the deduplication manually with sort -u, but I can't reconstruct the original file.
    Do you have any recommendations for a free Linux tool?
    Thanks

  2. #2
    hi,

    please be more explicit.

    what does your file look like?
    what does sort -u do to it?
    what do you want as output?

    what do you mean by «reconstruct»?

    awk can mimic uniq without the need to sort the file, as in the sketch below.
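
    A minimal illustration of that idea (the file name is just a placeholder):
    Code:
    # print each line only the first time it appears, preserving the original order
    awk '!seen[$0]++' bigfile.txt > bigfile.uniq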

  3. #3
    Hi watael,
    Thanks for your quick response.

    My files are text files.
    Each contains millions of lines.
    Most of the lines are the same length.
    Some of the lines are identical.

    I want to keep only the unique lines within each file, but without losing any information (if I want the original file back, I will restore it).
    I know there are tools for deduplication between files, but I am looking for something that combines dedup within each file and between all files.

    sort -u first sorts the file (in dictionary order) and then removes the duplicate lines.
    It does a good job, but I lose the possibility of restoring the original file.

    Thanks

  5. #4
    mactruck:
    Code:
    sort -u file > file.new
    EDIT: this will keep your old file and make a new one without all the dupes.

  6. #5
    That way I just use more storage and don't save any...

  7. #6
    mactruck:
    This assumes that your files end with .txt. I do something very similar to this at work. If you can restore the original file, then this should work.
    Code:
    #!/usr/bin/bash
    # replace every .txt file in the current directory with its deduplicated (sort -u) version
    for i in *.txt
    do
        sort -u "$i" > "$i.fixed"
        mv "$i.fixed" "$i"
    done
    If you wanted to be able to run it from any directory, I would make a small script like this.

    script name: fix
    Code:
    #!/usr/bin/bash
    # deduplicate the file given as the first argument, in place
    sort -u "$1" > "$1.fixed"
    mv "$1.fixed" "$1"
    Then you can just type "fix file.txt"

  8. #7
    Sorry mactruck, but I think we are not talking about the same thing.
    I want to create a deduplicated file, but with the option to restore the file to its original form, without keeping a full copy.

  9. #8
    you want to sort -u oneFile into an otherFile,
    run a diff on both and keep the diffFile,
    delete the oneFile,
    and, when needed, apply patch to the otherFile to restore the oneFile from the diffFile
    ?
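
    Roughly like this (a sketch only; the file names are placeholders, and the diff can end up nearly as large as the original if the line order differs a lot):
    Code:
    # keep only the deduplicated copy plus a diff that can rebuild the original
    sort -u big.txt > big.uniq
    diff big.uniq big.txt > big.diff
    rm big.txt
    # later, rebuild the original from the deduplicated copy and the diff
    patch -o big.txt big.uniq big.diff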

  10. #9
    No,
    My problem is that my file is very big, ~300 GB.
    I want to save only the unique rows, but the number of occurrences of each string and the original row number of each string are important to me.

    For example:
    original file:
    AAAAA
    BBBBB
    AAAAA
    VFVDS
    AAAAA
    unique file:
    AAAAA
    BBBBB
    VFVDS
    reconstruct file:
    AAAAA
    BBBBB
    AAAAA
    VFVDS
    AAAAA
    For this I have to remember the number of times AAAAA occurs and where (rows 1, 3, 5), as in the sketch below.
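
    One way to sketch that requirement with awk (not a finished tool; the names unique.txt, index.txt and restored.txt are just placeholders, and it assumes the distinct lines fit in memory):
    Code:
    # pass 1: write each distinct line once to unique.txt and, for every original
    # line, its 1-based position in unique.txt to index.txt
    awk '!($0 in id) { id[$0] = ++n; print > "unique.txt" }
         { print id[$0] > "index.txt" }' original.txt

    # later: replay index.txt against unique.txt to rebuild the original line order
    awk 'NR == FNR { line[NR] = $0; next } { print line[$1] }' unique.txt index.txt > restored.txt
    Whether this actually saves space depends on how long the lines are compared to the index entries, since index.txt still holds one number per original line.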

    Thanks

  11. #10
    It seems like what you want is a compression algorithm.
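
    For what it's worth, a standard compressor already does exactly this: it shrinks the file and restores it byte for byte. A minimal sketch (xz is only one choice; gzip or zstd would do the same job):
    Code:
    xz -k big.txt      # writes big.txt.xz and keeps the original (drop -k to replace it)
    xz -d big.txt.xz   # later, restores big.txt exactly as it was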
