- 02-02-2010 #1 Just Joined!
- Join Date: Jun 2006
- Location: Carlisle, MA
- Posts: 13
Regular expression / regex substitution on Unicode text
I have a large text file encoded in Unicode that
I need to convert to CSV. In general, I know how
to do this by regular expression substitutions using
sed or Perl, but one problem I am having is that I
need to put a quotation mark at the end of each line
to protect the last field. The usual regex substitution ...
s/$/"/
... works fine for 7-bit ASCII text, but when I run this
on my Unicode text file, the double quotation mark
appears at the BEGINNING of the FOLLOWING
line, not at the end of the line on which it's supposed
to appear.
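
One possible explanation, offered as a sketch rather than a diagnosis: if the
file uses DOS-style CRLF line endings, `$` matches just before the `\n` but
after the `\r`, so the inserted quote lands between the two. A viewer that
treats a bare `\r` as a line break would then show the quote at the start of
the next line. The Python snippet below imitates the Perl substitution;
`quote_at_dollar` is a hypothetical helper name and the sample lines are
made up for illustration.

```python
import re

def quote_at_dollar(line: str) -> str:
    # count=1 imitates Perl's s/$/"/ without /g: a single match,
    # taken just before the final "\n" of the line.
    return re.sub(r'$', '"', line, count=1)

unix_line = 'x,y\n'    # LF only
dos_line = 'x,y\r\n'   # CRLF (assumed to be what the file contains)

print(repr(quote_at_dollar(unix_line)))  # quote sits right after the text
print(repr(quote_at_dollar(dos_line)))   # quote sits after the "\r"
```

If the second result matches what a hex dump of the real output shows, the
line endings rather than the regex engine would be the likely cause.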
Does anyone know of a special library function
intended for this purpose, a Perl pragma, etc., that
would accomplish this easily? This should be a
trivial problem.
Thanks in advance for any suggestions.
Tom
- 02-03-2010 #2 Just Joined!
As I mentioned, the regex ...
s/$/"/
... puts the `"' at the beginning of the following line,
and piping through dos2unix doesn't matter one
way or the other. Using `\n' instead of `$' doesn't
make any difference.
However, I discovered an interesting fact: The regex ...
s/\r/"/ # use `\r' instead of `$' or `\n'
... gives the expected result, and does so without
piping through dos2unix making any difference!
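
A guess at why `\r` works and why dos2unix changes nothing, assuming (the post
does not say so) that "Unicode" here means UTF-16LE with CRLF line endings: in
that encoding every character occupies two bytes, so the CR and LF bytes are
never adjacent and a byte-oriented CRLF converter would not recognise them.
Replacing the single CR byte with a quote byte happens to keep the two-byte
alignment intact. The sample line below is invented.

```python
# Assumption: the file is UTF-16LE with CRLF line endings.
line = 'x,y\r\n'
raw = line.encode('utf-16-le')

# CR is stored as 0D 00 and LF as 0A 00, so the byte pair 0D 0A never occurs.
has_ascii_crlf = b'\r\n' in raw

# Swapping the CR byte for a quote byte (0x22) leaves each code unit 2 bytes wide.
patched = raw.replace(b'\r', b'"')

print(raw.hex())
print(has_ascii_crlf)
print(repr(patched.decode('utf-16-le')))
```

Note that a byte-level replacement like this would also alter any other
character whose encoding contains a 0x0D byte (U+010D, for example), so
decoding the text before substituting is the safer route.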
I also found that a C program using a wchar_t
declaration behaves similarly. That is, a character
that appears as if it should be output BEFORE the
EOL actually appears after it, if it is matched as '\n',
but if it is matched as '\r' then it is output as
expected.
My immediate problem is solved; however, I wonder
whether this shows a bug in Perl or in its
regular expression engine ...
This seems to be a clear case where Unicode
text is handled differently than non-Unicode text.
Any opinions?
Tom
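
For anyone hitting the same thing, a minimal sketch of a decode-first approach:
convert the bytes to characters, do the substitution on the decoded text, then
encode again. Done this way the regex sees the same `\r` and `\n` characters
whatever the encoding. The function name `close_last_field` and the default
encoding are assumptions, not anything stated in the thread.

```python
import re

def close_last_field(raw: bytes, encoding: str = "utf-16-le") -> bytes:
    """Append a closing quote to every line and normalise CRLF to LF."""
    text = raw.decode(encoding)
    # Match either LF or CRLF and write back a quote plus a plain LF.
    quoted = re.sub(r'\r?\n', '"\n', text)
    return quoted.encode(encoding)

sample = 'a,"b\r\nc,"d\r\n'.encode('utf-16-le')   # invented sample data
print(repr(close_last_field(sample).decode('utf-16-le')))
```

The same function works on UTF-8 or plain ASCII input by passing a different
`encoding` argument.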

