- 02-28-2013 #1 Just Joined!
- Join Date
- Jul 2011
- Posts
- 11
Looking for deduplication in Linux.
Hi!
Right now I am deduplicating manually with sort -u, but I can't reconstruct the original file afterwards.
Do you have any recommendation for a free Linux tool?
Thanks
- 02-28-2013 #2 Linux Newbie
- Join Date
- Nov 2012
- Posts
- 134
Hi,
Please be more explicit.
What does your file look like?
What does s(h)ort -u do to it?
What do you want as output?
What do you mean by «reconstruct»?
awk can mimic uniq without the need to sort the file first.
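For readers who have not seen the awk approach mentioned above, here is a minimal sketch of the common idiom. The function name uniq_keep_order is made up for illustration. It prints each line only the first time it appears, so the original order of first occurrences is kept. It holds every distinct line in memory, which may be a problem for very large inputs.

```shell
# uniq_keep_order (hypothetical name): print each input line the first
# time it is seen, keeping the original order. No sorting is needed.
# seen[$0]++ is 0 (false) on the first occurrence of a line, so
# !seen[$0]++ is true exactly once per distinct line.
uniq_keep_order() {
    awk '!seen[$0]++' "$@"
}
```

For example, `uniq_keep_order file > file.new` would write the deduplicated lines to file.new and leave file untouched.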
- 02-28-2013 #3 Just Joined!
- Join Date
- Jul 2011
- Posts
- 11
Hi watael,
Thanks for your quick response.
My files are text files.
Each contains millions of lines.
Most of the lines are the same length.
Some of the lines are identical.
I want to keep only the unique lines within each file, but without losing the information needed to rebuild it (if I want the original file back, I should be able to restore it).
I know there are tools for deduplication between files, but I am looking for something that combines dedup within each file and between all files.
sort -u first sorts the file (dictionary order) and then removes the duplicate lines.
It does a good job, but I lose the possibility to restore the original file.
Thanks
- 02-28-2013 #4
Code:
sort -u file > file.new
EDIT: this will keep your old file and make a new one without all the dupes.
- 02-28-2013 #5 Just Joined!
- Join Date
- Jul 2011
- Posts
- 11
This way I just use more storage instead of saving any...
- 02-28-2013 #6
This assumes that your files end with .txt. I do something very similar to this at work. If you can restore the original file then this should work.
Code:
#!/usr/bin/bash
# Replace every .txt file in the current directory with its sorted, deduplicated version.
for i in *.txt
do
    sort -u "$i" > "$i.fixed" && mv "$i.fixed" "$i"
done
If you wanted to make an alias and be able to run it from any directory, I would do this.
alias name: fix
Code:
#!/usr/bin/bash
# Replace the file given as the first argument with its sorted, deduplicated version.
sort -u "$1" > "$1.fixed" && mv "$1.fixed" "$1"
Then you can just type "fix file.txt"
- 02-28-2013 #7 Just Joined!
- Join Date
- Jul 2011
- Posts
- 11
Sorry mactruck, but I think we are not talking about the same thing.
I want to create a deduplicated file, but with the option to restore the file to its original size, without saving a copy.
- 02-28-2013 #8 Linux Newbie
- Join Date
- Nov 2012
- Posts
- 134
So you want to sort -u oneFile into an otherFile,
run a diff on both, and keep the diffFile,
delete the otherFile,
and, when needed, apply a patch to restore oneFile from the diffFile
?
- 02-28-2013 #9 Just Joined!
- Join Date
- Jul 2011
- Posts
- 11
No,
My problem is that my file is very big, ~300GB.
I want to save only the unique rows, but the number of occurrences of each string and the original row number of each string are important to me.
For example:
original file:
AAAAA
BBBBB
AAAAA
VFVDS
AAAAA
unique file:
AAAAA
BBBBB
VFVDS
reconstructed file:
AAAAA
BBBBB
AAAAA
VFVDS
AAAAA
For this I have to remember the number of times AAAAA occurs and where (rows 1, 3, 5).
Thanks
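The bookkeeping described in this post can be sketched with two small awk-based shell functions. The function names dedup and restore, and the idea of a separate index file, are assumptions made for illustration and are not an existing tool. dedup writes each distinct line once (in first-seen order) to a unique file, and writes one number per original row to an index file, pointing at the matching line of the unique file. restore replays the index to rebuild the original. Note that awk keeps every distinct line in memory, and the index only saves space when lines are noticeably longer than the numbers that replace them, so this may not scale to a ~300GB file as written.

```shell
# dedup ORIGINAL UNIQUE INDEX   (hypothetical helper)
# UNIQUE gets each distinct line once, in order of first appearance.
# INDEX gets one number per original row: the line number in UNIQUE
# that holds that row's content.
dedup() {
    awk -v uniq="$2" -v idx="$3" '
        !($0 in id) { id[$0] = ++n; print > uniq }   # first time this line is seen
        { print id[$0] > idx }                       # remember which unique line this row was
    ' "$1"
}

# restore UNIQUE INDEX   (hypothetical helper)
# Prints the reconstructed original file on standard output.
# Assumes UNIQUE is not empty.
restore() {
    awk '
        NR == FNR { line[FNR] = $0; next }   # first file: load the unique lines
        { print line[$1] }                   # second file: replay the index
    ' "$1" "$2"
}
```

With the five-line example above, `dedup original unique index` would leave AAAAA, BBBBB, VFVDS in unique and 1, 2, 1, 3, 1 in index. The rows where the index says 1 (rows 1, 3 and 5) are exactly where AAAAA occurred, and `restore unique index` prints the original five lines again.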
- 02-28-2013 #10 Linux Newbie
- Join Date
- Nov 2012
- Posts
- 134
That sounds like a compression algorithm.


