  1. #1
    qnc
    Just Joined!
    Join Date
    May 2011
    Posts
    1

    Unix "Join" command help required.

    Hi, I'm new to all this, so bear with me.

    I have two lists. List A has the old names and some values for each of the names

    e.g.
    List A:
    snpa 3.2
    snpb -2
    snpc 0


    List B has the old names and their new equivalents:
    snpa rs1234
    snpb rs2345
    snpc re3456

    Now I have over 50,000 lines in each and want it to end up as

    Final list after join.
    rs1234 3.2
    rs2345 -2
    re3456 0

    The code I have been using:

    join fileA.txt fileB.txt | awk '{print $3, $2}' | sort -k1 > final_list.txt

    But all I get is the first few lines rather than the 50k lines I should get.

    Why is it working but only for a small amount of the data?

  2. #2
    Just Joined!
    Join Date
    Aug 2006
    Posts
    12
    This can happen if the field separators are not consistent or the input files are not sorted.
    I would make sure that every line uses the same single separator (either a tab or a space), then use
    join -t' ' fileA.txt fileB.txt ...
    with that same separator character following the -t option.

    Both fileA.txt and fileB.txt must be sorted on the join field before doing the join.

    John
    Last edited by John Rodkey; 05-16-2011 at 11:45 PM. Reason: confirmed behavior of join
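
A minimal sketch of the advice above, assuming single-space-separated files with the join key in the first field. The sample data is the three lines from the first post, deliberately shuffled, and the scratch directory and the `.sorted` file names are only for illustration. Both inputs are sorted on the key under the same locale that join runs in:

```shell
#!/bin/sh
# Illustrative sample data in a scratch directory (not the real 50k-line files).
tmp=$(mktemp -d)
printf '%s\n' 'snpc 0' 'snpa 3.2' 'snpb -2' > "$tmp/fileA.txt"
printf '%s\n' 'snpb rs2345' 'snpc re3456' 'snpa rs1234' > "$tmp/fileB.txt"

# Sort both files on field 1 only, using the same collation join will use.
LC_ALL=C sort -t ' ' -k 1,1 "$tmp/fileA.txt" > "$tmp/fileA.sorted"
LC_ALL=C sort -t ' ' -k 1,1 "$tmp/fileB.txt" > "$tmp/fileB.sorted"

# Join on field 1 with an explicit single-space separator.
joined=$(LC_ALL=C join -t ' ' "$tmp/fileA.sorted" "$tmp/fileB.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

If the real files mix tabs and spaces, they would need to be normalized to one separator first; otherwise the -t value will not match every line.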

  3. #3
    Just Joined!
    Join Date
    Jan 2011
    Location
    Fairfax, Virginia, USA
    Posts
    94
    I made a quick bash script to generate data like you mentioned, and your command line works fine. I even tried to corrupt, unsort and delete lines, and things still worked. I write a lot of bash and awk but had never used join(1) before, so thanks.

    Before I go on, this is the script I used to generate the fake data:
    Code:
    #!/bin/bash
    # Usage: ./a.sh START
    # Prints lines "i START+i" for i = 0..RECORDS.
    readonly RECORDS=50000
    readonly FIRST_FIELD=$( seq 0 $RECORDS )
    readonly -a SECOND_FIELD=( $( seq "$1" $(( $1 + RECORDS )) ) )  #<-- bash array
    for a in $FIRST_FIELD
    do
       echo "$a ${SECOND_FIELD[$a]}"
    done
    I generated 50k records in two files like this:
    Code:
    [brian@bmicek x]$ ./a.sh 20000 > file1.txt
    [brian@bmicek x]$ ./a.sh 400000 > file2.txt
    and the data looks like this:
    Code:
    [brian@bmicek x]$ head -3 file1.txt 
    0 20000
    1 20001
    2 20002
    [...]
    [brian@bmicek x]$ head -3 file2.txt 
    0 400000
    1 400001
    2 400002
    [...]
    It looks like your data might not be what you think ... try this awk program which should be a little bit more tolerant of bad data:
    Code:
    #!/usr/bin/awk -f
    {
      inx = $1  #<-- associative array (text) input
      value = $2
      array[ inx ] = array[ inx ] " " value
    }
    END {
      for  ( x in array ) {
         print x, array[ x ]
      }
    }
    and (assuming it's named something like a.awk), run it like this:
    Code:
    [brian@bmicek x]$ ./a.awk file1.txt file2.txt | sort -g | awk '{ print $2, $3 }'
    (Note, I used "sort -g" so my output numerically sorts correctly. You probably want to use "sort" to sort the first field by string)

    When I run these two commands:
    Code:
    [brian@bmicek x]$ join file1.txt file2.txt  | sort -g | awk '{ print $2, $3 }' > out1.txt
    [brian@bmicek x]$ ./a.awk file1.txt file2.txt | sort -g | awk '{ print $2, $3 }' > out2.txt
    (Note, you probably want to replace "sort -g" with "sort")
    I get the same results as seen with:
    Code:
    [brian@bmicek x]$ md5sum out1.txt out2.txt 
    c15e52a0f4c8c77efb00511c8d1c2723  out1.txt
    c15e52a0f4c8c77efb00511c8d1c2723  out2.txt

  4. #4
    Just Joined!
    Join Date
    Feb 2011
    Posts
    83

    Try some serious script

    Hi,
    If you can program in other languages (besides Awk), you could write a C++ program that reads each old name and its value from file A, finds the matching new name in file B, and writes the new name with the 'old' value to a new file C.
    _____________________
    N.B.: With this approach it doesn't matter whether the lists are sorted or not, or whether a name or a value is missing.

    Regards

  5. #5
    Kloschüssel
    Linux Enthusiast
    Join Date
    Oct 2005
    Location
    Italy
    Posts
    717
    awk can also work on multiple files. A quick Google search gave this link: Combine multiple files using awk

    It's surely not the complete solution, but it opens up a possibility worth investigating.
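
One way the multiple-file idea could look, as a sketch rather than a tested solution for the real data: awk reads the name mapping (list B) into an associative array first, then rewrites list A from it. This assumes whitespace-separated fields, a mapping small enough to fit in memory, and the file names from the first post. The extra name snpx and the unmatched.txt file are made up here to show what happens to a name with no equivalent. Neither input needs to be sorted.

```shell
#!/bin/sh
# Illustrative sample data; snpx is a hypothetical name with no new equivalent.
tmp=$(mktemp -d)
printf '%s\n' 'snpc 0' 'snpa 3.2' 'snpx 7' 'snpb -2' > "$tmp/fileA.txt"
printf '%s\n' 'snpb rs2345' 'snpc re3456' 'snpa rs1234' > "$tmp/fileB.txt"

result=$(awk '
    # First file on the command line (list B): remember old name -> new name.
    NR == FNR { new[$1] = $2; next }
    # Second file (list A): print the new name with the value when one exists.
    $1 in new { print new[$1], $2; next }
    # Otherwise report the old name on stderr instead of dropping it silently.
    { print "no new name for " $1 > "/dev/stderr" }
' "$tmp/fileB.txt" "$tmp/fileA.txt" 2> "$tmp/unmatched.txt" | LC_ALL=C sort)

unmatched=$(cat "$tmp/unmatched.txt")
printf '%s\n' "$result"
printf '%s\n' "$unmatched" >&2

rm -rf "$tmp"
```

The final sort is only there to give a stable output order; it can be dropped if the order of list A should be kept.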

  6. #6
    scm
    Linux Engineer
    Join Date
    Feb 2005
    Posts
    1,044
    Are your two lists sorted? I think join requires them to be.

  7. #7
    Just Joined!
    Join Date
    Feb 2011
    Posts
    83
    There is a warning in the --help output of <join>:

    FILE1 and FILE2 must be sorted on the join fields.
    E.g., use `sort -k 1b,1' if `join' has no options.
    If the input is not sorted and some lines cannot be joined, a
    warning message will be given.
    ___________________
    ... and my comment to this is: if some lines cannot be joined, forget about <join>.

    Even if they are sorted and match 1:1, <join> by default still does not give the output you want, because it 'sticks together' the lines and prints three fields.

    If you are serious about doing this, write a C++ program to do the job.
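
On the two points above (lines that cannot be joined, and the three-field output), join has options worth knowing: -v prints the unpairable lines of one file, and -o selects which fields are printed. A small sketch, assuming the file layout from the first post; the extra name snpx is made up to force an unpaired line, and the `.sorted` file names are illustrative:

```shell
#!/bin/sh
# Illustrative sample data, sorted on the join field as the help text suggests.
tmp=$(mktemp -d)
printf '%s\n' 'snpa 3.2' 'snpx 7' 'snpb -2' 'snpc 0' \
    | LC_ALL=C sort -k 1b,1 > "$tmp/fileA.sorted"
printf '%s\n' 'snpc re3456' 'snpa rs1234' 'snpb rs2345' \
    | LC_ALL=C sort -k 1b,1 > "$tmp/fileB.sorted"

# -v 1: print only the lines of file 1 that have no partner in file 2.
unpaired=$(LC_ALL=C join -v 1 "$tmp/fileA.sorted" "$tmp/fileB.sorted")

# -o 2.2,1.2: print field 2 of file 2 (new name), then field 2 of file 1 (value).
paired=$(LC_ALL=C join -o 2.2,1.2 "$tmp/fileA.sorted" "$tmp/fileB.sorted")

printf 'unpaired: %s\n' "$unpaired"
printf '%s\n' "$paired"

rm -rf "$tmp"
```

Comparing the line counts of the paired and unpaired output against the input files is a quick way to see where the missing lines went.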

  8. #8
    drl
    Linux Engineer
    Join Date
    Apr 2006
    Location
    Saint Paul, MN, USA / CentOS, Debian, Solaris, SuSE
    Posts
    1,117
    Hi.

    Using your data:
    Code:
    #!/usr/bin/env bash
    
    # @(#) s1	Demonstrate join omitting field used for join operation.
    
    # Utility functions: print-as-echo, print-line-with-visual-space, debug.
    pe() { for i;do printf "%s" "$i";done; printf "\n"; }
    pl() { pe;pe "-----" ;pe "$*"; }
    db() { ( printf " db, ";for i;do printf "%s" "$i";done; printf "\n" ) >&2 ; }
    db() { : ; }
    C=$HOME/bin/context && [ -f $C ] && . $C sort join
    
    FILE1=${1-data1}
    shift
    FILE2=${1-data2}
    
    pl " Input data $FILE1:"
    cat $FILE1
    
    pl " Input data $FILE2:"
    cat $FILE2
    
    pl " Results from join of 2 sorted files, omitting output join field:"
    join -o 2.2,1.2 <( sort "$FILE1" ) <( sort "$FILE2" )
    
    exit 0
    producing:
    Code:
    % ./s1
    
    Environment: LC_ALL = C, LANG = C
    (Versions displayed with local utility "version")
    OS, ker|rel, machine: Linux, 2.6.26-2-amd64, x86_64
    Distribution        : Debian GNU/Linux 5.0.8 (lenny) 
    GNU bash 3.2.39
    sort (GNU coreutils) 6.10
    join (GNU coreutils) 6.10
    
    -----
     Input data data1:
    snpa 3.2
    snpc 0
    snpb -2
    
    -----
     Input data data2:
    snpb rs2345
    snpa rs1234
    snpc re3456
    
    -----
     Results from join of 2 sorted files, omitting output join field:
    rs1234 3.2
    rs2345 -2
    re3456 0
    See info coreutils join, man sort, man bash for details ... cheers, drl
