BOM results from the CPAN

Audio-TagLib

Encode prepends a Byte Order Mark (BOM) when performing an encode
of type (if that's the right word) UTF16. The BOM tells the receiving
program (or library) whether the following bytes are ordered in
big-endian (aka network) or little-endian order. In addition to prepending
the BOM, Encode will re-arrange the bytes. How it knows what the going-in
order is, is not known.
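
A minimal sketch of the behavior described above, assuming only the core Encode module. The sample string and variable names are illustrative and are not taken from Audio-TagLib. It may also help with the question of the going-in order: the input to encode() is a string of characters, which has no byte order of its own, so the order is chosen only when the bytes are produced.

```perl
use strict;
use warnings;
use Encode qw(encode);

# A Perl character string; byte order only exists once it is encoded.
my $text = "\x{6211}\x{7684}";

# Render a byte string as space-separated hex pairs.
sub hex_bytes { return join ' ', unpack('(H2)*', $_[0]) }

my $with_bom = hex_bytes(encode('UTF-16',   $text));  # BOM + big-endian
my $be       = hex_bytes(encode('UTF-16BE', $text));  # no BOM
my $le       = hex_bytes(encode('UTF-16LE', $text));  # no BOM

print "UTF-16:   $with_bom\n";   # fe ff 62 11 76 84
print "UTF-16BE: $be\n";         # 62 11 76 84
print "UTF-16LE: $le\n";         # 11 62 84 76
```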

http://en.wikipedia.org/wiki/Byte_order_mark

There appears to be some confusion about what TagLib does or expects.

t/TagLib_String.t goes to some effort to deal with the "fact" that TagLib
appears to want UTF-16LE, whereas the construction of the MPEG header in
include/MPEG_Header.pm appears to be attempting to construct a header in
UTF-16BE.

README.byteorder

There's a lengthy explanation of byte order in perlfunc, under pack.
Cf. also perlport.

From pack(): Basically, Intel and VAX CPUs are little-endian, while everybody else,
including Motorola m68k/88k, PPC, Sparc, HP PA, Power, and Cray, are big-endian.
Alpha and MIPS can be either: Digital/Compaq uses (well, used) them in little-endian mode,
but SGI/Cray uses them in big-endian mode.
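
As a rough illustration of the pack() passage above, the following sketch checks the byte order of the machine it runs on. It assumes only core Perl; the variable names are made up for the example.

```perl
use strict;
use warnings;
use Config;

# pack 'S' writes an unsigned 16-bit value in the machine's native byte order.
my $native = unpack('H*', pack('S', 0x0102));
my $order  = $native eq '0201' ? 'little-endian'
           : $native eq '0102' ? 'big-endian'
           :                     'unknown';

# 'n' (network, big-endian) and 'v' (VAX, little-endian) are fixed on every platform.
my $be = unpack('H*', pack('n', 0x0102));   # always "0102"
my $le = unpack('H*', pack('v', 0x0102));   # always "0201"

print "native order: $order (Config byteorder: $Config{byteorder})\n";
print "pack n: $be, pack v: $le\n";
```

On an Intel system this should report little-endian, while the 'n' and 'v' results stay the same everywhere.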

This may be important. From Encode::Unicode,
When BE or LE is omitted during encode(), it returns a BE-encoded string with
BOM prepended. So when you want to encode a whole text file, make sure you encode()
the whole text at once, not line by line or each line, not file,
will have a BOM prepended.
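
The Encode::Unicode warning quoted above can be checked with a short sketch. The two sample lines and the helper count_boms() are hypothetical and exist only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

my @lines = ("one\n", "two\n");

# Encoding each line separately: every chunk gets its own BOM.
my $per_line = join '', map { my $l = $_; encode('UTF-16', $l) } @lines;

# Encoding the whole text at once: a single BOM at the start.
my $whole = encode('UTF-16', join('', @lines));

# Count 16-bit units equal to the big-endian BOM (bytes FE FF).
sub count_boms {
    my @hits = grep { $_ eq 'feff' } unpack('(H4)*', $_[0]);
    return scalar @hits;
}

my $n_per_line = count_boms($per_line);
my $n_whole    = count_boms($whole);

print "per line: $n_per_line BOM(s)\n";   # 2
print "at once:  $n_whole BOM(s)\n";      # 1
```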

There appears to be some confusion as to whether Perl and TagLib have built-in assumptions
about byte order. As noted above, Perl manages to figure it out per system. The
same is true of TagLib, as follows.

Verified: SYSTEM_BYTEORDER == 1 => little_endian
    (by putting a diagnostic in ConfigureChecks.cmake)

In ConfigureChecks.cmake, SYSTEM_BYTEORDER is computed. In the case of my Intel system,
I've verified that this computes to little_endian. This is translated to the function

lib/Audio/TagLib/String.pm

  use Audio::TagLib::String;
  
  my $i = Audio::TagLib::String->new("blah blah blah");
  print $i->toCString(), "\n"; # got "blah blah blah"

=head1 DESCRIPTION

This is an implicitly shared wide string. For storage it uses
Audio::TagLib::wstring, but as this is an I<implementation detail> this of
course could change. Strings are stored internally as
UTF-16BE. (Without the BOM (Byte Order Mark))

The use of implicit sharing means that copying a string is cheap; the
only cost comes into play when the copy is modified. Prior to that
the string just has a pointer to the data of the parent String. This
also makes this class suitable as a function return type.

In addition to adding implicit sharing, this class keeps track of four
possible encodings, which are the four supported by the ID3v2
standard.

t/TagLib_String.t

    diag("can_ok failed");

my $i = Audio::TagLib::String->new();
my $s_latin1 = Audio::TagLib::String->new(Audio::TagLib::String->new("string test 1"));
is($s_latin1->to8Bit(), "string test 1")						                    or
	diag("method new(ascii) failed");
is(Audio::TagLib::String->new(Audio::TagLib::ByteVector->new("STRING TEST 2"))->to8Bit(), "STRING TEST 2") or 
	diag("method new(ByteVector) failed");

# These are needed for fixing UTF16. Cf. also http://en.wikipedia.org/wiki/Byte_order_mark 
my $BOM_LE = 0xfffe;
my $BOM_BE = 0xfeff;

# An arbitrary selection of non-ASCII Latin-1 characters
my $gb2312 			= chr(0316). # Î  LATIN CAPITAL LETTER I WITH CIRCUMFLEX
                      chr(0322). # Ò  LATIN CAPITAL LETTER O WITH GRAVE
                      chr(0265). # µ  MICRO SIGN
                      chr(0304); # Ä  LATIN CAPITAL LETTER A WITH DIAERESIS
# The same thing in a different representation
my $utf8_hardcode 	= "\x{6211}\x{7684}";
# Decode the GB2312 bytes into Perl characters; the result should equal $utf8_hardcode
my $utf8 			= decode("GB2312", $gb2312);
# Various representations
# These encodings affect byte order only. There is NO BOM prepended
my $utf16be 		= encode("UTF16BE", $utf8);
my $utf16le 		= encode("UTF16LE", $utf8);
my $utf16 			= encode("UTF16", $utf8);
# $utf16 has been encoded as big-endian (aka network order) with a BE BOM prepended,
# even though we may be executing on a little-endian system.
# This is the defined behavior for Encode::Unicode.
my $s_utf8 = Audio::TagLib::String->new($utf8);

is($s_utf8->to8Bit("true"), $utf8_hardcode)					                    or 
	diag("method new(utf8) failed");

is(Audio::TagLib::String->new($utf8_hardcode)->to8Bit("true"),$utf8_hardcode)   or
    diag("method new(utf8) failed");

t/TagLib_String.t

is($s_utf8->data("UTF16LE")->data(), $utf16le) 					                or
	diag("method data(utf16le) failed");

# 29
# What this test is checking is whether TagLib encodes $utf8 in the same way as
# Encode, which it does not. Note that neither encoding choice depends on system endianness.
# $utf16 is $utf8 data BE-encoded (see above comment re Encode).
# Comment in TagLib ByteVector String::data(Type t) for t == UTF16:
# // Assume that if we're doing UTF16 and not UTF16BE that we want little
# // endian encoding.  (Byte Order Mark)
# We test the above assertion by constructing a LE-encoded utf16 with a LE BOM
my $utf16le_with_BOM = "  $utf16le";
vec($utf16le_with_BOM, 0, 16) =  $BOM_LE;
is($s_utf8->data("UTF16")->data(), $utf16le_with_BOM) 						    or 
	diag("method data(utf16) did not execute as expected");

cmp_ok(Audio::TagLib::String->new("a")->toInt(), "==", oct("a")) 		        or
	diag("method toInt() failed");

is(Audio::TagLib::String->new("   blanks   ")->stripWhiteSpace()->to8Bit(), "blanks") or
    diag("method stripWhiteSpace() failed");

is($s_latin1->getChar(1), "t") 									                or 
	diag("method getChar(i) failed");
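
The test above builds $utf16le_with_BOM by reserving two bytes and overwriting them with vec(). The following standalone sketch, which assumes only the core Encode module and uses illustrative variable names, suggests why the value 0xfffe produces the little-endian BOM bytes FF FE: for widths of 16 bits or more, vec() stores the value in big-endian order regardless of the platform.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $utf16le = encode('UTF-16LE', "\x{6211}\x{7684}");

# Reserve two placeholder bytes, then overwrite them with the BOM value.
my $with_bom = "  $utf16le";
vec($with_bom, 0, 16) = 0xfffe;   # stored big-endian, i.e. bytes FF FE

# The same byte string built directly, without vec.
my $direct = "\xff\xfe" . $utf16le;

my $hex = join ' ', unpack('(H2)*', $with_bom);
print "$hex\n";                                      # ff fe 11 62 84 76
print $with_bom eq $direct ? "same\n" : "different\n";
```

Writing the two BOM bytes as a literal prefix gives the same result and may be easier to read than the vec() form.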
