- 05-16-2011 #1 Just Joined!
- Join Date
- May 2011
- Posts
- 1
Unix "Join" command help required.
Hi, I'm new to all this, so bear with me.
I have two lists. List A has the old names and some values for each of the names.
e.g.
List A:
snpa 3.2
snpb -2
snpc 0
List B has old names and new names equivalent:
snpa rs1234
snpb rs2345
snpc re3456
Now I have over 50,000 lines in each and want it to end up as
Final list after join.
rs1234 3.2
rs2345 -2
rs3456 0
The code I have been using:
join fileA.txt fileB.txt | awk '{print $3 ,$2}' | sort -k1 > final_list.txt
But all I get is the first few lines rather than the 50k lines I should get.
Why is it working but only for a small amount of the data?
- 05-16-2011 #2 Just Joined!
- Join Date
- Aug 2006
- Posts
- 12
If the field separators are not consistent or the input files are not sorted this can happen.
I would ensure that each line has either a tab or space separator, then use
join -t' ' fileA.txt fileB.txt ...
making sure that either a tab or a space follows the -t option.
It is strictly required that both fileA.txt and fileB.txt are sorted prior to doing the join.
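A minimal sketch of the two points above (consistent separators, sorted input). The sample data and the temporary file names are made up for illustration; only `fileA.txt` and `fileB.txt` come from the question. It rewrites each line with a single space between fields, sorts both files on the first field under the same locale, and only then joins.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data: mixed tabs/spaces and unsorted, on purpose.
printf 'snpa\t3.2\nsnpc   0\nsnpb -2\n' > "$tmp/fileA.txt"
printf 'snpb rs2345\nsnpa\trs1234\nsnpc re3456\n' > "$tmp/fileB.txt"

# Rewrite each line with exactly one space between the two fields,
# then sort on field 1 using the same locale that join will use.
awk '{ print $1, $2 }' "$tmp/fileA.txt" | LC_ALL=C sort -t ' ' -k1,1 > "$tmp/fileA.sorted"
awk '{ print $1, $2 }' "$tmp/fileB.txt" | LC_ALL=C sort -t ' ' -k1,1 > "$tmp/fileB.sorted"

# sort -c exits non-zero if the file is not sorted on the given key.
LC_ALL=C sort -c -t ' ' -k1,1 "$tmp/fileA.sorted" && echo "fileA.sorted is sorted"

# Join on field 1 with an explicit single-space separator, then reorder columns.
LC_ALL=C join -t ' ' "$tmp/fileA.sorted" "$tmp/fileB.sorted" \
    | awk '{ print $3, $2 }' > "$tmp/final_list.txt"
cat "$tmp/final_list.txt"
```

Running `sort -c` on the original, untouched files is a quick way to find out whether sort order is the reason join stops early.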
John
Last edited by John Rodkey; 05-16-2011 at 11:45 PM. Reason: confirmed behavior of join
- 05-16-2011 #3 Just Joined!
- Join Date
- Jan 2011
- Location
- Fairfax, Virginia, USA
- Posts
- 94
I made a quick bash script to generate data like you mentioned and your command line works fine. I even tried to corrupt, unsort and delete lines and things still worked. I write a lot of bash and awk and I never used join(1) before so thanks.
Before I go on, this is the script I used to generate the fake data:
Code:
    #!/bin/bash
    readonly RECORDS=50000
    readonly FIRST_FIELD=$( seq 0 $RECORDS )
    readonly -a SECOND_FIELD=( $( seq $1 $(( $1 + $RECORDS )) ) ) #<-- bash array
    for a in $FIRST_FIELD
    do
        echo $a ${SECOND_FIELD[$a]}
    done
I generated 50k records in two files like this:
Code:
    [brian@bmicek x]$ ./a.sh 20000 > file1.txt
    [brian@bmicek x]$ ./a.sh 400000 > file2.txt
and the data looks like this:
Code:
    [brian@bmicek x]$ head -3 file1.txt
    0 20000
    1 20001
    2 20002
    [...]
    [brian@bmicek x]$ head -3 file2.txt
    0 400000
    1 400001
    2 400002
    [...]
It looks like your data might not be what you think ... try this awk program which should be a little bit more tolerant of bad data:
Code:
    #!/usr/bin/awk -f
    {
        inx = $1 #<-- associative array (text) input
        value = $2
        array[ inx ] = array[ inx ] " " value
    }
    END {
        for ( x in array ) {
            print x, array[ x ]
        }
    }
and (assuming it's named something like a.awk), run it like this:
Code:
    [brian@bmicek x]$ ./a.awk file1.txt file2.txt | sort -g | awk '{ print $2, $3 }'
(Note, I used "sort -g" so my output numerically sorts correctly. You probably want to use "sort" to sort the first field by string)
When I run these two commands:
Code:
    [brian@bmicek x]$ join file1.txt file2.txt | sort -g | awk '{ print $2, $3 }' > out1.txt
    [brian@bmicek x]$ ./a.awk file1.txt file2.txt | sort -g | awk '{ print $2, $3 }' > out2.txt
(Note, you probably want to replace "sort -g" with "sort")
I get the same results as seen with:
Code:
    [brian@bmicek x]$ md5sum out1.txt out2.txt
    c15e52a0f4c8c77efb00511c8d1c2723 out1.txt
    c15e52a0f4c8c77efb00511c8d1c2723 out2.txt
- 05-17-2011 #4 Just Joined!
- Join Date
- Feb 2011
- Posts
- 83
Try some serious script
Hi,
If you can program in other languages (besides awk), you could write a small C++ program that reads each old name and its value from file A, looks up the matching new name in file B, replaces the old name with the new one, and writes the new name with the 'old' value to a new file C.
_____________________
N.B.: With this approach it doesn't matter whether the lists are sorted or not, or whether a name or a value is missing.
Regards
- 05-17-2011 #5
awk can also work on multiple files. A quick Google search gave this link: Combine multiple files using awk
It may not be the whole solution, but it opens a possibility worth investigating.
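One way the multiple-file idea could look, as a sketch only: awk reads `fileB.txt` first and keeps an old-name to new-name lookup in memory, then reads `fileA.txt` and prints the new name beside the value. The sample data (including the unmatched `snpd` line) and the temporary directory are assumptions for illustration. Neither file needs to be sorted, and names without a match are skipped.

```shell
tmp=$(mktemp -d)

# Hypothetical sample data; neither file is sorted, and snpd has no new name.
printf 'snpa 3.2\nsnpd 1.5\nsnpc 0\nsnpb -2\n' > "$tmp/fileA.txt"
printf 'snpb rs2345\nsnpa rs1234\nsnpc re3456\n' > "$tmp/fileB.txt"

# While reading the first file (NR == FNR): remember old name -> new name.
# While reading the second file: print new name and value if the name is known.
awk '
    NR == FNR   { new[$1] = $2; next }
    ($1 in new) { print new[$1], $2 }
' "$tmp/fileB.txt" "$tmp/fileA.txt" > "$tmp/final_list.txt"

cat "$tmp/final_list.txt"
```

The output keeps the line order of `fileA.txt`, so pipe it through `sort` if a sorted final list is wanted.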
- 05-17-2011 #6 Linux Engineer
- Join Date
- Feb 2005
- Posts
- 1,044
Are your two lists sorted? I think join requires them to be.
- 05-18-2011 #7 Just Joined!
- Join Date
- Feb 2011
- Posts
- 83
There is a warning in the help output of <join>:
FILE1 and FILE2 must be sorted on the join fields.
E.g., use `sort -k 1b,1' if `join' has no options.
If the input is not sorted and some lines cannot be joined, a
warning message will be given.
___________________
... and my comment to this is: If some lines cannot be joined forget about <join>.
Even if they are sorted and matching 1:1 <join> still does not do the job properly, for it 'sticks together' the lines and gives output of three values.
If you are serious about doing this, write a C++ program to do the job.
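For anyone who wants to stay with join when some lines cannot be paired, a sketch under assumed data: the `-v` option prints only the unpairable lines from one file, and `-o` selects which fields appear in the output, so the result need not have three values. The file names and sample lines here are hypothetical, and both inputs are already sorted on the first field.

```shell
tmp=$(mktemp -d)

# Hypothetical, already sorted inputs: snpd has no new name, snpc has no value.
printf 'snpa 3.2\nsnpb -2\nsnpd 1.5\n' > "$tmp/a.sorted"
printf 'snpa rs1234\nsnpb rs2345\nsnpc re3456\n' > "$tmp/b.sorted"

# Lines of the first file that have no partner in the second, and vice versa.
LC_ALL=C join -v 1 "$tmp/a.sorted" "$tmp/b.sorted" > "$tmp/only_in_a.txt"
LC_ALL=C join -v 2 "$tmp/a.sorted" "$tmp/b.sorted" > "$tmp/only_in_b.txt"

# Paired lines only, printing field 2 of file 2 then field 2 of file 1.
LC_ALL=C join -o 2.2,1.2 "$tmp/a.sorted" "$tmp/b.sorted" > "$tmp/paired.txt"

cat "$tmp/only_in_a.txt" "$tmp/only_in_b.txt" "$tmp/paired.txt"
```

Comparing the line counts of the three output files against the inputs shows how many names were lost and on which side.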
- 05-19-2011 #8 Linux Engineer
- Join Date
- Apr 2006
- Location
- Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
- Posts
- 1,117
Hi.
Using your data:
Code:
    #!/usr/bin/env bash

    # @(#) s1 Demonstrate join omitting field used for join operation.

    # Utility functions: print-as-echo, print-line-with-visual-space, debug.
    pe() { for i;do printf "%s" "$i";done; printf "\n"; }
    pl() { pe;pe "-----" ;pe "$*"; }
    db() { ( printf " db, ";for i;do printf "%s" "$i";done; printf "\n" ) >&2 ; }
    db() { : ; }
    C=$HOME/bin/context && [ -f $C ] && . $C sort join

    FILE1=${1-data1}
    shift
    FILE2=${1-data2}

    pl " Input data $FILE1:"
    cat $FILE1

    pl " Input data $FILE2:"
    cat $FILE2

    pl " Results from join of 2 sorted files, omitting output join field:"
    join -o 2.2 1.2 <( sort $FILE1 ) <( sort $FILE2 )

    exit 0
producing:
Code:
    % ./s1

    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution : Debian GNU/Linux 5.0.8 (lenny)
    GNU bash 3.2.39
    sort (GNU coreutils) 6.10
    join (GNU coreutils) 6.10

    -----
     Input data data1:
    snpa 3.2
    snpc 0
    snpb -2

    -----
     Input data data2:
    snpb rs2345
    snpa rs1234
    snpc re3456

    -----
     Results from join of 2 sorted files, omitting output join field:
    rs1234 3.2
    rs2345 -2
    re3456 0
See info coreutils join, man sort, man bash for details ... cheers, drl