  1. #1

    Need help searching for values in file then adding to line


    I'm currently trying to organize data for some bio research, but I'm not sure how to compare a value against the values in a file. I have two arrays: one contains NM numbers and can be referenced as NM[#]; the other contains symbols, SYM[#]. I also have a file that contains an NM number on every other line, with irrelevant information between the NM lines (which still needs to stay in the file). What I need to do is match every NM[#] in my array to the NM number in the file and append :SYM[#] to the end of that line. The problem is that each NM number in the file is preceded by a > symbol at the start of the line (which needs to stay there). So, for example, I have an array NM that looks like:

    {NM_23948375 NM_03948274 NM_39482746 NM_20475839} #except there are about 2 thousand values

    and SYM:

    {fj48g9sk 2idjf8a0s ajsie9rt skdjie8t} #same amount of values as NM

    and the file looks like:


    I need to take the first NM number in my NM array and compare it to every other line in the file without the > in front. Then, when that line in the file is found, I need to add :SYM, where SYM is the same order as the NM number from the array. So take the first NM number, find the line, add the first symbol. Then the second NM number, match it, add second symbol, and so on, for a final product that looks like:


    I feel like the process should be relatively simple, I'm just completely new at this and was looking for any help. I'm not really even sure how to start.

    Here's what I have (forgive all syntax errors, everything I want to do is in there, I just need help translating it to code, file to be edited is called file.fa, I can also take it as an argument and refer to it as $1 if that's easier):

    for ((i=0; i<$(wc -l file.fa)/2; i++))
      for ((j=0; j<$(wc -l file.fa)/2; j++))
        if ($NM[i] = $fileline[2*j+1)]) #without the >
          sed '(2*(j+1)s/.*/>$NM[i]:$SYM[i]/
    I also have access to perl if that makes things easier. Also, if this is all possible by just using the command line, that'd be simpler for me.

    Sorry for the long post and any help is appreciated!

  2. #2
    Irithori (Trusted Penguin, joined May 2009)
    Hi and welcome

    If I understood it correctly, then the following Ruby snippet should do it.
    #!/usr/bin/env ruby
    NM     = ["NM_23948375","NM_03948274","NM_39482746","NM_20475839"]
    SYM    = ["fj48g9sk","2idjf8a0s","ajsie9rt","skdjie8t"]
    output = File.open("output", "w")
    File.open("datafile", "r") do |datafile|
      while (line = datafile.gets)
        line.chomp!  # drop the newline so the NM lookup below can match
        if ( line =~ /^>NM_[0-9]+.*/ )
          # keep only the NM number from the header line
          line.sub!(/^>(NM_[0-9]+).*/, '\1')
          i = NM.index("#{line}")
          unless i.nil?
            line = ">#{line}:#{SYM[i]}"
          else
            line = ">#{line}"
          end
        end
        output.puts "#{line}"
      end
    end
    output.close
    It will be faster than the sed approach in your post, as the datafile is read only once.
    I am a bit unhappy with the two regex lines. They can potentially be merged into one.
    There is also hardcoding (datafile, output, NM and SYM),
    and error checking is missing.
    E.g., a different number of elements in NM and SYM is unhandled,
    and so are duplicates.

    So treat it as a skeleton
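
    As a minimal sketch of how the caveats above could be addressed (not a definitive implementation): the version below uses a single regex per line, replaces the repeated NM.index search with a hash lookup, and refuses to run when NM and SYM differ in length or NM contains duplicates. The helper names build_lookup, annotate_line and annotate_file are hypothetical, and the sketch assumes header lines start with ">NM_" followed by digits, as in the snippet above.

```ruby
#!/usr/bin/env ruby

# Build an NM => SYM lookup table (hypothetical helper).
# Raises on the two cases left unhandled above: arrays of different
# length, and duplicate NM numbers.
def build_lookup(nm, sym)
  unless nm.size == sym.size
    raise ArgumentError, "NM has #{nm.size} entries, SYM has #{sym.size}"
  end
  dupes = nm.group_by { |id| id }.select { |_, ids| ids.size > 1 }.keys
  raise ArgumentError, "duplicate NM numbers: #{dupes.join(', ')}" unless dupes.empty?
  nm.zip(sym).to_h
end

# Rewrite one line (hypothetical helper).
# A header with a known NM number becomes ">NM_xxx:symbol"; a header with
# an unknown NM number is reduced to ">NM_xxx" (the same behaviour as the
# snippet above). Any other line is returned unchanged, minus its newline.
def annotate_line(line, lookup)
  line = line.chomp
  match = /\A>(NM_[0-9]+)/.match(line)
  return line unless match

  id = match[1]
  lookup.key?(id) ? ">#{id}:#{lookup[id]}" : ">#{id}"
end

# Read in_path once, line by line, and write the result to out_path.
def annotate_file(in_path, out_path, lookup)
  File.open(out_path, "w") do |out|
    File.foreach(in_path) { |line| out.puts annotate_line(line, lookup) }
  end
end

# Only runs when called as a script with exactly two arguments,
# e.g. an input path and an output path.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  nm  = ["NM_23948375", "NM_03948274", "NM_39482746", "NM_20475839"]
  sym = ["fj48g9sk", "2idjf8a0s", "ajsie9rt", "skdjie8t"]
  annotate_file(ARGV[0], ARGV[1], build_lookup(nm, sym))
end
```

    The file names are taken from the command line instead of being hardcoded; NM and SYM are still inline here and would need to be loaded from wherever the real two thousand values live.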
