  1. #1

    Strange Perl problem


    I'm using Perl 5.8.7 on Cygwin 1.5.18 and recently I ran into a strange problem:

    I have file1 whose contents are along the lines of:
    ...
    XXX 10
    YYY 12
    ZZZ 17
    ...

    I have file2 whose contents are along the lines of:
    ...
    XXX
    AAA
    ZZZ
    DDD
    ...

    I have a perl script which reads each line from file2, attempts a match in file1, then extracts the second field:
    Code:
    ...
    my $file1 = ## path to file1 ##
    my $file2 = ## path to file2 ##
    open IN, "<$file2" or die "blabblahblah"
    while (<IN>) {
       chomp;
       my $d = $_; print "d=zzz${d}zzz\n";
       my $left = `grep $d $file1`; print "left=aaa${left}aaa\n";
       chomp $left; print "left=bbb${left}bbb\n";
       my $v = "-";
       if ($left ne "") { $v = (split)[1], $left; }
       print "v=ccc${v}ccc\n";
       ...
    }
    ...
    My debug print statements output something totally unexpected (I'm only going to show one attempted match for XXX below; the rest are similar):
    d=zzzXXXzzz
    left=aaaXXX 10
    aaa
    bbbt=bbbXXX 10
    v=cccccc


    Calling chomp on the newline-terminated string returned by grep totally messed up that string (as seen in the debug output). Subsequently, (split)[0] on that string returns XXX as expected (not shown here), but (split)[1] on that string returns an empty string (instead of 10 as expected). Does anyone know what is going on here or how to fix it? Thanks in advance.

  2. #2
    Linux Guru Cabhan's Avatar
    Join Date
    Jan 2005
    Location
    Seattle, WA, USA
    Posts
    3,252
    I'm very confused with how you're going about this...

    Code:
    #!/usr/bin/perl
    
    open FILE1, "< /path/to/file1" or die "$0: Error\n";
    open FILE2, "< /path/to/file2" or die "$0: Error\n";
    
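    while(my $lookfor = <FILE2>)
    {
        chomp $lookfor;
        # Rewind file1 so each key from file2 scans it from the top;
        # without this, only the first lookup sees the whole file.
        seek FILE1, 0, 0;
        while(<FILE1>)
        {
            chomp;
            # \Q...\E keeps any regex metacharacters in the key literal.
            next unless /\Q$lookfor\E\s+(\d+)/;
    
            print "$1\n";
            last;
        }
    }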
    In this case, the 'print "$1\n"' will print the number that matched on the line. Looking at what you did, I don't see why it didn't work, but you're doing something weird there, I know that much.


    AH HA! I figured out what you did wrong.
    Code:
    if ($left ne "") { $v = (split)[1], $left; }
    Here, you're running split with default arguments: $_ on whitespace. You want to be splitting $left on whitespace, so that line should actually read:
    Code:
    if ($left ne "") { $v = (split /\s+/, $left)[1]; }
    See, you were splitting $_, which in this case is "XXX". There is, of course, no whitespace, and you were trying to access the second element, which is undef.
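
    To make the difference concrete, here is a minimal sketch using the sample values from the thread. The variable names $from_default and $from_left are made up for illustration; it only assumes that $_ holds the line from file2 and $left holds the grep result.

```perl
use strict;
use warnings;

local $_ = "XXX";       # current line from file2, as inside the while loop
my $left = "XXX 10";    # what grep returned, after a successful chomp

# Bare split works on $_, so $left is never consulted here.
my $from_default = (split)[1];

# Passing $left explicitly yields its second whitespace-separated field.
my $from_left = (split ' ', $left)[1];

print "default:  ", (defined $from_default ? $from_default : "undef"), "\n";
print "explicit: $from_left\n";
```

    The first line reports undef because "XXX" has only one field; the second reports 10.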

  3. #3
    Thanks Cabhan, you caught the split error (which causes $v to be blank in the printout). My careless mistake.

    After a bit of experimentation, I found the cause of the chomp problem. $left is a Windows-terminated string, i.e. it ends in \r\n, since $file1 is a Windows file. chomp removed only the \n, leaving the \r intact, which causes the wraparound shown in the printout. The dangling \r also causes other string comparison problems. So beware when using chomp on non-Unix files.

    Is it too much to ask for a chomp that works correctly on all three types of files: Unix, Windows, Mac? Or does one already exist?
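
    A small sketch of what happens, assuming $left arrives with a CRLF ending as described above. The leftover carriage return is why the terminal showed "bbbt=bbbXXX 10": after printing "left=bbbXXX 10" the cursor jumps back to column 0 and the trailing "bbb" overwrites "lef".

```perl
use strict;
use warnings;

my $left = "XXX 10\r\n";        # a line as read from a Windows-format file

chomp(my $chomped = $left);     # default $/ is "\n", so only "\n" is removed

my $ends_in_cr = $chomped =~ /\r\z/     ? 1 : 0;
my $matches    = $chomped eq "XXX 10"   ? 1 : 0;

print "length=", length($chomped),
      " ends_in_cr=$ends_in_cr matches=$matches\n";
```

    The string is still 7 characters long and no longer compares equal to "XXX 10", which is the comparison problem mentioned above.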

  5. #4
    Linux Guru Cabhan's Avatar
    The issue with chomp is that since you're using Cygwin, you're running a Unix-style build of Perl, which treats only "\n" as the line terminator. A native Windows version (such as ActivePerl) should handle CRLF files correctly.

    So yeah.

  6. #5
    Cabhan, thanks for the advice. But you're of the same mind as this user on Linux Questions, who pointed out to me:

    Chomp removes the input record separator (special variable $/), which by default is "\n". Set it to whatever you want and chomp will remove it. Alternatively, you can use the regex s/\s+$//, which will remove any and all whitespace characters (including carriage returns and line feeds) from the end of the string.

    And this is my response:

    1) Setting $/

    This means the script will only work correctly for one specific type of file (Unix, Windows, or Mac). Certainly, this alternative will not work if you don't know ahead of time which type of file your script will have to deal with. Also, this definitely won't work if your script needs to work with more than one type of file.

    2) Using regex s/\s+$// instead of chomp

    This of course will work with all types of files. I can even define my own custom chomp around this regex if calling chomp is more convenient. However, this highlights the problem of having to define (or redefine) common functions in Perl just to have my scripts work correctly cross-platform. If there are a dozen more like chomp, do I have to redefine them all for every single one of my scripts? Wouldn't it be better if the Perl language were implemented with cross-platform use in mind instead of shifting this burden to its programmers?
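
    For comparison, here is a rough sketch of both options applied to one line of each style. The helper names strip_eol and visible are hypothetical, and the regex is narrower than s/\s+$// on purpose: it removes a single line terminator and leaves other trailing whitespace alone. It avoids \R so that it also runs on Perl 5.8.

```perl
use strict;
use warnings;

my @raw = ("unix\n", "windows\r\n", "oldmac\r");

# Option 1: set $/ to CRLF. chomp then removes "\r\n" and nothing else.
my @with_crlf_sep = do {
    local $/ = "\r\n";
    map { my $copy = $_; chomp $copy; $copy } @raw;
};

# Option 2 (hypothetical helper): strip one terminator of any common style.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/(?:\r\n|\r|\n)\z//;
    return $line;
}
my @with_regex = map { strip_eol($_) } @raw;

# Make leftover control characters visible when printing.
sub visible {
    my ($s) = @_;
    $s =~ s/\r/\\r/g;
    $s =~ s/\n/\\n/g;
    return $s;
}

print "option 1: ", join(" | ", map { visible($_) } @with_crlf_sep), "\n";
print "option 2: ", join(" | ", map { visible($_) } @with_regex), "\n";
```

    Option 1 cleans only the Windows line, while option 2 cleans all three. A helper like this could live in one small module shared between scripts rather than being redefined in each.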

  7. #6
    Linux Guru Cabhan's Avatar
    I understand what you're saying, but I do need to ask if you've tried ActivePerl, and if it does it correctly.

    Because Windows and Unix-based systems use different line terminators, the implementation does need to be somewhat OS-based. If you use a Perl built for a Unix-like environment, it expects your lines to end in LF. You are using such a build (Cygwin's) on a Windows system, and therefore are getting mixed results.

    When I worked in a Bioinformatics lab last summer as an intern, one of my jobs was processing data files with Perl scripts. And I had this problem too: the data files were created on Windows and uploaded to the Linux server where I did my work. So yeah, I had to make the extra step of removing the CR. While a little annoying, it's not really that much work.

    You could, I suppose, do something like this:

    Code:
    $/ = "\n" if $^O eq 'linux' or $^O eq 'darwin';   # 'darwin' is Mac OS X
    $/ = "\r\n" if $^O eq 'MSWin32';
    That may work. Note that $^O reports the OS perl was built for, and under Cygwin it is 'cygwin', so neither line would fire there and $/ would stay at its default "\n".
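
    Another possibility, when the script reads the file itself instead of shelling out to grep, is to open it through the PerlIO :crlf layer, which translates "\r\n" to "\n" on input regardless of the OS. This is only a sketch: it assumes a Perl built with PerlIO (the default since 5.8), it writes its own temporary sample file, and it does not handle old Mac files that end in a bare "\r".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write a small Windows-format sample file (raw bytes, CRLF endings).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "XXX 10\r\nYYY 12\r\n";
close $out or die "close failed: $!";

# Read it back through the :crlf layer, which turns "\r\n" into "\n".
open my $in, '<:crlf', $path or die "cannot open $path: $!";
my @lines = <$in>;
close $in;

chomp @lines;    # plain chomp is enough now; no "\r" is left behind

print join("|", @lines), "\n";
```

    With the layer in place, ordinary chomp leaves "XXX 10" and "YYY 12" with no trailing carriage return.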
