BOM results from the CPAN

Pod-L10N

view release on metacpan or search on metacpan

corpus/perlpodspec-copy.pod view on Meta::CPAN

should emit a warning and may abort parsing the document
altogether.

A document having more than one "=encoding" line should be
considered an error.  Pod processors may silently tolerate this if
the not-first "=encoding" lines are just duplicates of the
first one (e.g., if there's a "=encoding utf8" line, and later on
another "=encoding utf8" line).  But Pod processors should complain if
there are contradictory "=encoding" lines in the same document
(e.g., if there is a "=encoding utf8" early in the document and
"=encoding big5" later).  Pod processors that recognize BOMs
may also complain if they see an "=encoding" line
that contradicts the BOM (e.g., if a document with a UTF-16LE
BOM has an "=encoding shiftjis" line).

=back

If a Pod processor sees any command other than the ones listed
above (like "=head", or "=haed1", or "=stuff", or "=cuttlefish",
or "=w123"), that processor must by default treat this as an
error.  It must not process the paragraph beginning with that
command, must by default warn of this as an error, and may
abort the parse.  A Pod parser may allow a way for particular
applications to add to the above list of known commands, and to

corpus/perlpodspec-copy.pod view on Meta::CPAN

Future versions of this specification may specify
how Pod can accept other encodings.  Presumably treatment of other
encodings in Pod parsing would be as in XML parsing: whatever the
encoding declared by a particular Pod file, content is to be
stored in memory as Unicode characters.

=item *

The well known Unicode Byte Order Marks are as follows:  if the
file begins with the two literal byte values 0xFE 0xFF, this is
the BOM for big-endian UTF-16.  If the file begins with the two
literal byte value 0xFF 0xFE, this is the BOM for little-endian
UTF-16.  If the file begins with the three literal byte values
0xEF 0xBB 0xBF, this is the BOM for UTF-8.

=for comment
 use bytes; print map sprintf(" 0x%02X", ord $_), split '', "\x{feff}";
 0xEF 0xBB 0xBF

=for comment
 If toke.c is modified to support UTF-32, add mention of those here.

=item *

A naive but sufficient heuristic for testing the first highbit
byte-sequence in a BOM-less file (whether in code or in Pod!), to see
whether that sequence is valid as UTF-8 (RFC 2279) is to check whether
that the first byte in the sequence is in the range 0xC0 - 0xFD
I<and> whether the next byte is in the range
0x80 - 0xBF.  If so, the parser may conclude that this file is in
UTF-8, and all highbit sequences in the file should be assumed to
be UTF-8.  Otherwise the parser should treat the file as being
in Latin-1.  In the unlikely circumstance that the first highbit
sequence in a truly non-UTF-8 file happens to appear to be UTF-8, one
can cater to our heuristic (as well as any more intelligent heuristic)
by prefacing that line with a comment line containing a highbit

( run in 0.598 second using v1.01-cache-2.11-cpan-131fc08a04b )