  1. #1
    Just Joined!
    Join Date
    Jun 2006
    Location
    Carlisle, MA
    Posts
    13

    Regular expression / regex substitution on Unicode text

    I have a large text file encoded in Unicode that
    I need to convert to CSV. In general, I know how
    to do this by regular expression substitutions using
    sed or Perl, but one problem I am having is that I
    need to put a quotation mark at the end of each line
    to protect the last field. The usual regex substitution ...
    s/$/"/
    ... works fine for 7-bit ASCII text, but when I run this
    on my Unicode text file, the double quotation mark
    appears at the BEGINNING of the FOLLOWING
    line, not at the end of the line on which it's supposed
    to appear.
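
One way to reproduce this symptom, under the assumption that the file uses CRLF (`\r\n`) line endings, which the post does not state: `$` matches just before the `\n`, so the quote lands between the `\r` and the `\n`. A viewer that treats `\r` as a line break would then show the quote at the start of the next line. The sketch uses Python rather than Perl, purely to make the resulting characters visible; `append_quote_naive` is a hypothetical helper name.

```python
import re

def append_quote_naive(line: str) -> str:
    """Rough analogue of s/$/"/ applied to a single input line.

    "$" matches just before a trailing "\n" (or at the very end if there
    is none); count=1 keeps only that first match.
    """
    return re.sub(r'$', '"', line, count=1)

# LF-only line: the quote ends up where it was intended.
print(repr(append_quote_naive('a,b\n')))    # 'a,b"\n'

# CRLF line (assumed): the quote lands after the "\r", before the "\n".
print(repr(append_quote_naive('a,b\r\n')))  # 'a,b\r"\n'
```

If the CRLF assumption holds, this would be ordinary `$` behaviour on `\r\n` input rather than anything specific to Unicode.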
    Does anyone know of a special library function
    intended for this purpose, a Perl pragma, etc., that
    would accomplish this easily? This should be a
    trivial problem.
    Thanks in advance for any suggestions.
    Tom

  2. #2
    Just Joined!
    Join Date
    Jun 2006
    Location
    Carlisle, MA
    Posts
    13
    As I mentioned, the regex ...
    s/$/"/
    ... puts the `"' at the beginning of the following line,
    and piping through dos2unix doesn't matter one
    way or the other. Using `\n' instead of `$' doesn't
    make any difference.
    However, I discovered an interesting fact: The regex ...
    s/\r/"/ # use `\r' instead of `$' or `\n'
    ... gives the expected result, and does so without
    piping through dos2unix making any difference!
    I also found that a C program using a wchar_t
    declaration behaves similarly. That is, a character
    that appears as if it should be output BEFORE the
    EOL actually appears after it, if it is matched as '\n',
    but if it is matched as '\r' then it is output as
    expected.
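
A hedged sketch of a more general approach: decode the text first, insert the quote before the whole line terminator (`\r\n` or `\n`), then re-encode. The post only says "Unicode", so treating the file as UTF-16 little-endian is an assumption, and `quote_line_ends` is a hypothetical helper name. Python is used here for illustration instead of Perl or C.

```python
import re

def quote_line_ends(data: bytes, encoding: str = "utf-16-le") -> bytes:
    """Insert '"' before every line terminator in encoded text.

    The encoding default is an assumption; pass the file's real encoding.
    A final line with no terminator is left unchanged.
    """
    text = data.decode(encoding)
    # Treat "\r\n" as one unit so the quote goes before the "\r", not after it.
    quoted = re.sub(r'(\r?\n)', r'"\1', text)
    return quoted.encode(encoding)

# In UTF-16-LE every character here occupies two bytes, so "\r" and "\n"
# are not adjacent bytes and byte-oriented tools may not see a CRLF pair.
print('a\r\n'.encode('utf-16-le'))  # b'a\x00\r\x00\n\x00'

sample = 'x,"y z\r\nu,"v w\r\n'.encode('utf-16-le')
print(repr(quote_line_ends(sample).decode('utf-16-le')))
# 'x,"y z"\r\nu,"v w"\r\n'
```

Unlike replacing `\r` with the quote, this keeps the original line terminators intact and also works on LF-only input.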
    My immediate problem is solved; however, I wonder
    whether this shows a bug in Perl or in its
    regular expression engine ...
    This seems to be a clear case where Unicode
    text is handled differently than non-Unicode text.
    Any opinions?
    Tom
