Perl, UTF-8, and binmode on filehandles
Greg Sabino Mullane
February 21, 2012
I recently ran into a Perl quirk involving UTF-8, standard filehandles, and the built-in Perl die() and warn() functions. Someone reported a bug in the check_postgres program in which the French output was displaying incorrectly. That is, when the locale was set to fr_FR, the French accented characters generated by the program were coming out as “byte soup” instead of proper UTF-8. Some other languages, English and Japanese among them, seemed to be fine. For example:
    ## English: "sorry, too many clients already"
    ## Japanese: "現在クライアント数が多すぎます"
    ## French expected: "désolé, trop de clients sont déjà connectés"
    ## French actual: "d�sol�, trop de clients sont d�j� connect�s"
That last line should be very familiar to anyone who has struggled with Unicode on a command line, with those question marks on an inverted background. Our problem was that the output of the script looked like the last line, rather than the one before it. The Japanese output, despite being chock full of Unicode, does not have the same problem! More on that later.
I was able to duplicate the problem easily enough by setting my locale to fr_FR and having check_postgres output a message with some non-ASCII characters in it. However, as noted above, some languages were fine and some were not.
Before going any further, I should point out that this Perl script did have a use utf8; at the top of it, as it should. This does not dictate how things will be read in or output, but merely tells Perl that the source code itself contains UTF-8 characters. Now to the quirky parts.
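To make that distinction concrete, here is a minimal sketch (not taken from check_postgres) that assumes the file itself is saved as UTF-8. It only measures the length of the same one-character literal with and without the pragma; the variable names are made up for illustration. Nothing about input or output encoding changes between the two cases.

```perl
use strict;
use warnings;

my ($with, $without);

{
    use utf8;                # source literals are decoded as UTF-8 characters
    $with = length 'µ';      # one character: U+00B5
}
{
    no utf8;                 # source literals are treated as raw bytes
    $without = length 'µ';   # two bytes: 0xC2 0xB5 (assumes a UTF-8 source file)
}

print "length with use utf8:    $with\n";      # 1
print "length without use utf8: $without\n";   # 2
```

Either way, the pragma only affects how Perl reads the source text; what happens when the string is printed is a separate matter.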
I normally test my Perl scripts on the fly by adding a quick series of debugging statements using warn() or die(). Both go to stderr, so it is easy to separate your debugging statements from the normal output of the code. However, when I output the non-ASCII message in question immediately after it was defined in the script, it showed a normal, expected UTF-8 string. So I started tracking things through the code, to see if there was some point at which the apparently normal UTF-8 string got turned back into byte soup. It never did; I finally realized that although print was outputting byte soup, both warn() and die() were outputting UTF-8! Here’s a sample script to better demonstrate the problem:
    #!perl

    use strict;
    use warnings;
    use utf8;

    my $msg = 'This is a micro symbol: µ';

    print "print = $msg\n";
    warn "warn = $msg\n";
    die "die = $msg\n";
Now let’s run it and see what happens:
    print = This is a micro symbol: �
    warn = This is a micro symbol: µ
    die = This is a micro symbol: µ
So we’ve found one Perl quirk: the output of print() and the output of warn() differ, as warn() manages to correctly output the string as UTF-8. Perhaps it is just that the stdout and stderr filehandles are using different encodings? Let’s take a look by expanding the script and explicitly printing to both stdout and stderr. We’ll also add some other Unicode characters, to emulate the difference between French and Japanese above:
    #!perl

    use strict;
    use warnings;
    use utf8;

    my $msg = 'This is a micro symbol: µ';
    my $alert = 'The radioactive snowmen come in peace: ☢ ☃☃☃ ☮';

    print STDOUT "print to STDOUT = $msg\n";
    print STDOUT "print to STDOUT = $alert\n";

    print STDERR "print to STDERR = $msg\n";
    print STDERR "print to STDERR = $alert\n";

    warn "warn = $msg\n";
    warn "warn = $alert\n";
(Note: if you do not see small literal snowmen characters in the above script, you need to get a better browser or RSS reader!)
    print to STDOUT = This is a micro symbol: �
    Wide character in print at utf12 line 11.
    print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
    print to STDERR = This is a micro symbol: �
    Wide character in print at utf12 line 14.
    print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
    warn = This is a micro symbol: µ
    warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
There are a number of things to note here. First, the stderr filehandle has the same problem as the stdout filehandle. So, while warn() and die() send things to stderr, there is some magic happening behind the scenes such that sending a string to them is not the same as sending it to stderr ourselves via a print statement. That is a good thing overall, as it would be even stranger for stdout and stderr to have different encoding layers! The solution to this is simple enough: just force stdout to have the proper encoding by use of the binmode function:
binmode STDOUT, ':utf8';
Indeed, the one line above solved the original poster’s problem; applying it to our test script shows that the stdout filehandle now outputs things correctly, unlike the stderr filehandle:
    print to STDOUT = This is a micro symbol: µ
    print to STDOUT = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
    print to STDERR = This is a micro symbol: �
    Wide character in print at utf12 line 16.
    print to STDERR = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
    warn = This is a micro symbol: µ
    warn = The radioactive snowmen come in peace: ☢ ☃☃☃ ☮
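The stderr filehandle still shows byte soup in that output, and it can be given the same treatment. The following is a small sketch rather than the actual check_postgres change: it loops over both handles and applies the same ':utf8' layer to each, so that explicit prints to either one behave alike.

```perl
use strict;
use warnings;
use utf8;

# Apply the same UTF-8 layer to both standard output handles,
# so that print behaves the same way on each of them.
for my $handle (\*STDOUT, \*STDERR) {
    binmode $handle, ':utf8' or die "binmode failed: $!";
}

my $msg = 'This is a micro symbol: µ';

print STDOUT "print to STDOUT = $msg\n";
print STDERR "print to STDERR = $msg\n";
```

Setting the layer once, near the top of the script, keeps every later print consistent without having to think about each string individually.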
The next thing to notice is that the snowmen alert message is displayed properly everywhere. Why is this? The answer is that the micro symbol (and the accented French characters) have code points below 256, so as far as Perl is concerned they can still be written as single Latin-1 bytes. In the absence of any explicit guidance, Perl makes a best guess as to how a string should be output. In the case of the French and “micro” strings, it guessed wrong: each character was output as a single Latin-1 byte, which is not valid UTF-8, so the terminal showed a replacement character. In the case of the Japanese and “snowmen” strings, it still guessed wrong, but those characters have code points far too large for a single byte, which left no doubt that we had left ASCII-land and were exploring the land of Unicode. In other words, even though Perl was still not treating the filehandle as UTF-8, there is no single-byte equivalent to fall back on, so Perl writes out its internal UTF-8 representation and the characters appear as one would expect. Note, however, that Perl still emits a “Wide character” warning, for it recognizes that something is probably wrong. The warnings go away when we use binmode to force the encoding layer to UTF-8.
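One way to see what is actually being written is to capture the bytes rather than trusting the terminal. The sketch below prints to an in-memory filehandle with and without a ':utf8' layer and dumps the result as hex. The bytes_written helper is made up for this illustration, and the “Wide character” warning is deliberately silenced inside it.

```perl
use strict;
use warnings;
use utf8;

# Hypothetical helper: print $text to an in-memory filehandle opened with
# the given layer ('' for none) and return the bytes written, as hex.
sub bytes_written {
    my ($text, $layer) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "Cannot open in-memory filehandle: $!";
    {
        no warnings 'utf8';    # hide "Wide character in print" for this demo
        print {$fh} $text;
    }
    close $fh;
    return unpack 'H*', $buffer;
}

printf "micro,   no layer: %s\n", bytes_written('µ', '');        # b5
printf "micro,   :utf8:    %s\n", bytes_written('µ', ':utf8');   # c2b5
printf "snowman, no layer: %s\n", bytes_written('☃', '');        # e29883
printf "snowman, :utf8:    %s\n", bytes_written('☃', ':utf8');   # e29883
```

Without a layer, the micro symbol goes out as the lone byte b5, which is not valid UTF-8, while the snowman comes out as the same three bytes either way. That matches what the terminal showed above.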
The correct solution when dealing with UTF-8 is to be explicit and not let Perl make any guesses. Solutions to this vary, but the combination of use utf8; and binmode STDOUT, ':utf8'; did the job here. While I was able to duplicate the problem right away, the combination of Perl making inconsistent guesses and the odd behavior of warn() and die() turned this from a quick fix into a slightly longer investigation. Yes, Unicode and Perl have given me quite a few gray hairs over the years, but I always feel better when I look at how other languages handle Unicode. :)
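As one example of the other solutions alluded to above, Perl's standard open pragma can set the layers in a single line. This is a sketch of an alternative, not what was applied to check_postgres: with the :std subpragma, the :utf8 layer is applied to STDIN, STDOUT, and STDERR, and becomes the default for filehandles opened later in the same lexical scope.

```perl
use strict;
use warnings;
use utf8;                  # the source code itself contains UTF-8
use open qw(:std :utf8);   # UTF-8 layer on STDIN, STDOUT, STDERR, and later open() calls

my $msg   = 'This is a micro symbol: µ';
my $alert = 'The radioactive snowmen come in peace: ☢ ☃☃☃ ☮';

print STDOUT "print to STDOUT = $msg\n";
print STDOUT "print to STDOUT = $alert\n";
print STDERR "print to STDERR = $msg\n";
print STDERR "print to STDERR = $alert\n";
```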