run_cmd && utf8 streams

Zsbán Ambrus ambrus at math.bme.hu
Fri Jul 6 12:54:28 CEST 2012


On 7/6/12, Marc Lehmann <schmorp at schmorp.de> wrote:
> As for incremental decoding, perl unfortunately doesn't have an interface
> for this kind of thing (Encode cannot incrementally encode or decode).

Perl's Encode module can actually handle incremental decoding of utf-8.
Example code showing how to use Encode this way is at the end of this
mail.

There are some caveats.  Encode still might not be able to
incrementally decode encodings that are more stateful than utf-8 or
utf-16, that is, encodings with shift characters.  The code for using
Encode this way is a bit ugly.  You might not be able to detect
invalid input immediately, but if you get only valid utf-8 input, then
as much of it as possible will be decoded at each step.  If your utf-8
input has byte order marks, as some programs insist on adding, you may
have to strip them by hand if whatever is eating the decoded text
doesn't like them.  And of course, all this might not work in very old
versions of perl.
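
To illustrate the byte order mark caveat, here is a minimal sketch of
stripping a leading BOM while decoding chunk by chunk.  The helper name
make_utf8_decoder is made up for this example and is not part of
Encode; the sketch assumes the BOM can only appear at the very start of
the stream and that the input is utf-8.

```perl
use strict;
use warnings;
use Encode ();

# make_utf8_decoder is a hypothetical helper name, not part of Encode.
# It returns a closure that decodes successive byte chunks and drops a
# single leading byte order mark (U+FEFF) from the start of the stream.
sub make_utf8_decoder {
    my $buf = "";        # bytes received but not decoded yet
    my $at_start = 1;    # true until the first character has been seen
    return sub {
        my ($bytes) = @_;
        $buf .= $bytes;
        # FB_QUIET: return what decodes cleanly, leave the rest in $buf.
        my $chars = Encode::decode("utf-8", $buf, Encode::FB_QUIET);
        if ($at_start && length $chars) {
            $chars =~ s/\A\x{FEFF}//;
            $at_start = 0;
        }
        return $chars;
    };
}
```

Each call to the returned closure hands back whatever new characters
became available, which may be the empty string if the chunk ended in
the middle of a character.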

> The perlio :encoding layer comes closest (it employs some hacks in more
> recent perls), but still cannot decode multibyte data incrementally.

My problem with the encoding layer is that it still has some bugs even
in recent perl versions.  Some of these bugs I apparently can't work
around without throwing away the layer completely.  It will probably
work for decoding utf-8 output, but be careful.

Ambrus

------

#!perl
use warnings; use strict;
use Encode;

# Assume you read these chunks of utf-8 encoded input.  Several chunk
# boundaries fall in the middle of a multi-byte character.
my @read = (
	"\xc3\x81rvzt\xc5",
	"\xb1r\xc5\x91 t\xc3",
	"\xbck\xc3\xb6rf",
	"\xc3\xbar\xc3\xb3g\xc3",
	"\xa9p,\n",
);

my $terminal_encoding = "utf-8";

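# Bytes received but not yet decoded, such as the first half of a
# multi-byte character.
my $buf = "";
for my $chunknr (0 .. $#read) {
	$buf .= $read[$chunknr];
	# With FB_QUIET, decode returns what it could decode and leaves the
	# undecoded remainder in $buf.
	my $charstr = decode("utf-8", $buf, Encode::FB_QUIET);
	# Show non-printable and non-ASCII characters as \x{...} escapes.
	(my $quoted = $charstr) =~ s/([^ -~])/sprintf("\\x{%02x}",ord($1))/ge;
	my $reencoded = encode($terminal_encoding, $charstr);
	print qq(at chunk number $chunknr, read and decoded input
'$reencoded' = "$quoted".\n);
	# An incomplete character is only a few bytes long, so a longer
	# remainder means decoding stopped at invalid input.
	if (8 < length($buf)) {
		die "invalid utf-8 encoded input near chunk number $chunknr";
	}
}
# Anything still undecoded at end of input is truncated or invalid.
if (length($buf)) {
	die "invalid utf-8 encoded input near end of file";
}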

__END__
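
The loop above only notices invalid input once the undecoded remainder
grows past a length threshold.  As a rough sketch of a stricter check,
one could test whether the remainder still looks like the beginning of
a utf-8 character.  The function name is made up for this example, and
the pattern is an approximation: it does not reject every overlong or
surrogate prefix, so some invalid input is still only caught when the
next chunk arrives.

```perl
use strict;
use warnings;

# is_incomplete_utf8_prefix is a hypothetical helper, not part of Encode.
# It returns 1 if $bytes could still become one valid utf-8 character
# once more input arrives, and 0 otherwise (including for the empty
# string, which has nothing pending).
sub is_incomplete_utf8_prefix {
    my ($bytes) = @_;
    return $bytes =~ /\A(?:
          [\xC2-\xDF]                  # lead byte of a 2-byte character
        | [\xE0-\xEF] [\x80-\xBF]?     # start of a 3-byte character
        | [\xF0-\xF4] [\x80-\xBF]{0,2} # start of a 4-byte character
    )\z/x ? 1 : 0;
}
```

In the loop one would then die as soon as $buf is non-empty and this
function returns false, instead of waiting for the buffer to exceed 8
bytes.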


