Perl6-Pugs
docs/notes/unicode_draft

Good Ideas:

1. The lexical syntax of programs (i.e. POD and literal strings)
   should be declared separately from the runtime treatment of strings.

    # This is like putting binmode() on the source file itself
    # at parse time and should never propagate into runtime
    =!lang ja       (this is new)
    =!encoding Big5
    say "String" # Japanese but encoded in Big5

   "locale" makes a fine runtime :lang/:encoding choice (i.e.
   read from LC_* or CP settings) but makes no sense at parse time.

   Just as XML always knows its declared encoding and its xml:lang,
   Perl string literals and documentation should always know their
   encoding/lang information for correct presentation by e.g. pod2html
   (which would depend on lang to render CJK fonts correctly).

   The "lang" should never be inferred from encoding -- it makes no
   sense because lang usage shifts with time: People are writing
   Traditional Chinese in GBK all the time now.

2. BOM sniffing of .pl files is good, but the set it currently knows is
    (UTF16[LB]E, UTF8+BOM, ASCII(really latin* as default))
   It should be:
    (UTF32[LB]E, UTF16[LB]E, UTF8(default))
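
   A minimal sketch of such a sniffer, in Python for illustration only.
   The name sniff_source is hypothetical, and accepting a UTF-8 BOM in
   addition to the BOM-less default is an assumption.  The one subtle
   point is ordering: the UTF-32LE mark (FF FE 00 00) begins with the
   UTF-16LE mark (FF FE), so the 4-byte marks have to be tried first.

```python
import codecs

# Longest marks first: the UTF-32LE BOM starts with the UTF-16LE BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF8, "utf-8"),      # assumption: tolerate a UTF-8 BOM too
]


def sniff_source(raw: bytes):
    """Return (encoding, payload without BOM) for a source file's bytes."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            return encoding, raw[len(bom):]
    return "utf-8", raw              # no BOM: UTF-8 is the default


encoding, payload = sniff_source(codecs.BOM_UTF32_LE + "say 1".encode("utf-32-le"))
print(encoding, payload.decode(encoding))
```

   A UTF-16LE file whose first character is U+0000 would be misread as
   UTF-32LE by this ordering; that ambiguity is inherent to BOM sniffing.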

3. Per-handle stackable IO layers make sense.
   But they should allow introspection into the different kinds of layer:
        # storage layers (:mmap)
        # textual transformation layers (:encoding, :crlf)
        # format (MIME Header, HTTP Header, XML)
        # semantic (:language(ja))
        $fh.layers.pop;
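
   One way to picture the introspection, sketched in Python.  The Layer
   and Handle classes, the of_kind method and the four kind names are all
   hypothetical; they only model the categories listed above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Layer:
    kind: str   # "storage", "text", "format" or "semantic"
    name: str   # e.g. "mmap", "encoding(Big5)"


@dataclass
class Handle:
    layers: list = field(default_factory=list)

    def push(self, kind, name):
        self.layers.append(Layer(kind, name))

    def of_kind(self, kind):
        """Introspection: names of the layers in one category, bottom first."""
        return [layer.name for layer in self.layers if layer.kind == kind]


fh = Handle()
fh.push("storage", "mmap")
fh.push("text", "encoding(Big5)")
fh.push("text", "crlf")
fh.push("semantic", "language(ja)")

top = fh.layers.pop()            # mirrors $fh.layers.pop
print(top.name)                  # language(ja)
print(fh.of_kind("text"))        # ['encoding(Big5)', 'crlf']
```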

4. Per-string "SvUTF8" tag to denote strings from buffers.
   But buffers are currently too limited: they cannot be used
   as C buffers ($buf[3..10] should transparently work).

5. String offsets allow COWing:
    use Devel::Peek;
    my $string = "This is a large string";
    my $substr = substr($string, 0, 5);
    Dump($substr); # reuse the PV inside the string with SvCUR
    $string = "Hello, Kitty"; # $substr is COW'ed at this point
                              # and still just "This "

   except it works even better now because the fragment it
   refers to doesn't need to be invalidated if other fragments change.

6. Use UCM tables that allow bidirectional fallbacks via |1 |3
   to maintain PUA fallback for round-trippability.  By default,
   Perl 6 should always use the same fallback semantics as Perl 5's
   Encode.pm, instead of dying horribly like iconv.
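
   For readers unfamiliar with the flags: in a UCM table |0 marks a
   round-trip mapping, |1 a fallback used only when encoding (Unicode to
   bytes), and |3 a reverse fallback used only when decoding (bytes to
   Unicode); |2 is the substitution mapping.  The Python sketch below
   builds the two directional tables from such lines.  The function
   load_ucm and the three sample mappings are made up for illustration
   and are not taken from any real table.

```python
import re

UCM_LINE = re.compile(
    r"<U([0-9A-Fa-f]{4,6})>\s+((?:\\x[0-9A-Fa-f]{2})+)\s+\|([0-3])"
)


def load_ucm(lines):
    """Build (encode, decode) dicts from UCM mapping lines."""
    encode, decode = {}, {}
    for line in lines:
        m = UCM_LINE.match(line.strip())
        if not m:
            continue                       # headers, comments, blank lines
        char = chr(int(m.group(1), 16))
        raw = bytes(int(h, 16) for h in re.findall(r"\\x([0-9A-Fa-f]{2})", m.group(2)))
        flag = m.group(3)
        if flag in ("0", "1"):             # usable when encoding
            encode[char] = raw
        if flag in ("0", "3"):             # usable when decoding
            decode[raw] = char
    return encode, decode


SAMPLE = [                                 # hypothetical mappings
    r"<U0041> \x41 |0",                    # round-trip
    r"<UFF21> \x41 |1",                    # fullwidth A falls back to 0x41
    r"<UE000> \x80 |3",                    # byte 0x80 decodes into the PUA
]
encode, decode = load_ucm(SAMPLE)
print(encode["\uff21"], decode[b"\x80"] == "\ue000")
```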

=====================================================================

Bad Ideas:

1. Autopromotion of buffers to strings via ${^ENCODING}
   $str = $buf.read(:encoding<Big5>);   # Not Encode::decode()
   $str = $buf; # should simply die
   $buf = $str.write(...);
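
   Python 3 happens to draw the same line between buffers and strings,
   so it can serve as an illustration of the intended behaviour (it is
   not a statement about how Perl 6 will spell it).  Every crossing
   names its encoding, and mixing the two types is an error rather than
   a silent promotion.

```python
buf = "中文".encode("big5")          # a byte buffer, e.g. as read from a file

text = buf.decode("big5")            # explicit: the caller names the encoding

try:
    text + buf                       # mixing a string with a buffer
    promoted = True
except TypeError:
    promoted = False                 # refused: no automatic promotion

out = text.encode("utf-8")           # writing back out is explicit as well
print(text, promoted, len(buf), len(out))
```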

2. Fixed contiguous memory chunk for Strings is a bad idea
   (it works well for buffers)

        ### encoding != charset
        ###              -> semantics of sort() etc
        ###              -> P6 should just have one charset = UCS
        ###              -> to sort via stroke count etc, use Unicode Collation
        ###              -> to sort as bytes, well just use a buffer!
        # If this is just "cat"ing two Japanese files into one file
        # $fhA is :encoding(shiftjis)
        # $fhB is :encoding(euc-jp)
        $huge_string_a = slurp($fhA);   # *MUST* be unicode string here
        $huge_string_b = slurp($fhB);   # *MUST* be unicode string here
        my $string = "$huge_string_a$huge_string_b"; # huge malloc
        print $string; # but if the output encoding is shiftjis
                       # then it just transcodes the euc-jp part

    $string should just have buffer fragments and how to "view"
    these buffers, namely:

            $string should be a sparse array
                - Pointer --> $huge_string_a
                    - at offset ...
                - Pointer --> $huge_string_b
                    - at offset ...

    The IO actions like .slurp and .read should do validation
    (because we need the offset info anyway), by "preparing" the
    string to know how many "highest desirable units" are
    in there (i.e. it defaults to counting chars but not graphemes).
    This can be done very quickly for most DBCS with a one-pass scan
    (and for latin* with an O(1) scan).  Then the buffer is tagged with
    the length info and original encoding -- if it's used as a
    unicode string for processing, decoding is done on the fly
    for the "buffer fragment" inside the string, i.e. taking
    substr($string, 0, 1) as chars should not decode $huge_string_b.
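
    The fragment idea can be sketched in Python as follows.  Fragment and
    FragmentString are hypothetical names, and the "validation pass" here
    is a stand-in: it decodes once to count code points and throws the
    text away, where a real implementation would scan the bytes directly.

```python
class Fragment:
    """One buffer plus the metadata recorded when it was read."""

    def __init__(self, raw, encoding):
        self.raw = raw
        self.encoding = encoding
        # Stand-in for the one-pass validation scan: checks the bytes and
        # records the length in code points, then discards the text.
        self.chars = len(raw.decode(encoding))
        self._text = None

    def text(self):
        if self._text is None:             # decode on first use only
            self._text = self.raw.decode(self.encoding)
        return self._text

    @property
    def is_decoded(self):
        return self._text is not None


class FragmentString:
    def __init__(self, *fragments):
        self.fragments = list(fragments)

    def __len__(self):
        return sum(f.chars for f in self.fragments)

    def __add__(self, other):
        # Concatenation copies pointers, not the underlying buffers.
        return FragmentString(*self.fragments, *other.fragments)

    def substr(self, start, length):
        out, end, offset = [], start + length, 0
        for f in self.fragments:
            lo, hi = max(start, offset), min(end, offset + f.chars)
            if lo < hi:                    # only overlapping fragments decode
                out.append(f.text()[lo - offset:hi - offset])
            offset += f.chars
        return "".join(out)

    def write(self, encoding):
        # Fragments already in the target encoding are copied verbatim.
        return b"".join(
            f.raw if f.encoding == encoding else f.text().encode(encoding)
            for f in self.fragments
        )


a = FragmentString(Fragment("こんにちは".encode("shift_jis"), "shift_jis"))
b = FragmentString(Fragment("世界".encode("euc_jp"), "euc_jp"))
s = a + b
print(len(s), s.substr(0, 1), s.fragments[1].is_decoded)
```

    Taking the first character touches only the first fragment, and
    writing the whole string out as shiftjis transcodes only the euc-jp
    part.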

    But you should _never_ be able to treat the string as one
    single buffer because there may be multiple ones underneath --
    the text semantics is encapsulated as one "Unicode text" string
    even though the underlying storage may be transcoded/cached
    into fixed ASCII/UTF16/UTF32 representations.

        utf8::downgrade()   # evil! should never allow this!
                            # esp. the utfX is not utf8 anyway
                            # so user should never see utfX
                            # (i.e. the internal representation)

    # Consequently:
    piconv -f eucjp -t eucjp # this should be very fast with one-pass validation

3. In Perl5, transcoding always goes through UCS (in-memory with UTF/X),
   which is unnecessary for e.g. eucjp->shiftjis.  So transcoders
   should be able to convert directly, provided that they _always_
   produce the same output as if the data had round-tripped through UCS
   for all valid inputs.  This should be exposed to the user of from_to(),
   and also used during lazy transcoding in outputs.
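
   As an illustration, the two-byte JIS X 0208 part of eucjp->shiftjis is
   pure arithmetic on the row/cell bytes.  The Python sketch below uses
   hypothetical function names and is deliberately partial: half-width
   kana (0x8E) and JIS X 0212 (0x8F) sequences are rejected instead of
   converted, and unassigned cells are not detected, so it does not yet
   meet the "all valid inputs" condition stated above.  The round-trip
   function is the reference it would have to agree with.

```python
def eucjp_to_sjis_direct(data: bytes) -> bytes:
    """Convert ASCII and JIS X 0208 text without going through Unicode."""
    out = bytearray()
    i = 0
    while i < len(data):
        c1 = data[i]
        if c1 < 0x80:                                  # ASCII passes through
            out.append(c1)
            i += 1
        elif (0xA1 <= c1 <= 0xFE and i + 1 < len(data)
              and 0xA1 <= data[i + 1] <= 0xFE):
            j1, j2 = c1 - 0x80, data[i + 1] - 0x80     # JIS row and cell bytes
            s1 = (j1 + 1) // 2 + 0x70                  # two rows per lead byte
            if s1 >= 0xA0:
                s1 += 0x40                             # skip 0xA0..0xDF
            if j1 % 2:                                 # odd row: low half
                s2 = j2 + 0x1F
                if s2 >= 0x7F:
                    s2 += 1                            # skip 0x7F
            else:                                      # even row: high half
                s2 = j2 + 0x7E
            out += bytes((s1, s2))
            i += 2
        else:
            raise ValueError(f"byte 0x{c1:02X} at offset {i} is not handled directly")
    return bytes(out)


def eucjp_to_sjis_via_ucs(data: bytes) -> bytes:
    """Reference behaviour: decode to Unicode, then encode again."""
    return data.decode("euc_jp").encode("shift_jis")


sample = "Perl 6 は日本語のテキストも扱う".encode("euc_jp")
print(eucjp_to_sjis_direct(sample) == eucjp_to_sjis_via_ucs(sample))
```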

4.  # 0 and 1 below are "codepoint" in perl5
    #                or "byte" if $string is Buf
    substr($string, 0, 1)   # take the first codepoint

    # All string manipulation takes unit adverbs, which
    # default to the lexical scope's settings
    # The "$string" no longer has a say on how it should be
    # viewed -- the action takes a view that suits its purpose
    $string.substr(0, 1, :bytes)
    $string.index(5, :bytes)

    The 0 and 1 should be of the "position" type; if you write them
    out as literals, they respond to the lexical setting of char unit:
        .bytes      # pretend strings are buffers
        .codepoints # same as perl5 - not terribly useful
                    #  - basically unsigned integers with 21 bits
        .characters # this should be the default:
                    #  - COMBINING MARKS
                    #  - BOM (and other zero-width assertions)
        .graphemes  # visual rendering - includes metadata like
                    #  - LANGUAGE TAG blocks
                    #  - VARIATION SELECTOR
                    #  - LTR/RTL SELECTOR
                    #  - Act as pre-decomposed forms (for canonical decomposition)

    A lexical pragma determines what 0 and 1 mean, but you can also
    construct them explicitly with "pos(0, :byte)" or "character(1)"
    (XXX the syntax needs work)
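
    To make the difference between units concrete, here is a rough Python
    sketch in which the same offsets select different text depending on
    the unit.  The functions units and substr are hypothetical, and the
    "characters" view is only an approximation (a base code point plus
    any following combining marks); the grapheme level described above is
    not modelled.

```python
import unicodedata


def units(text, unit, encoding="utf-8"):
    """Split text into a list of units of the requested kind."""
    if unit == "bytes":
        return [bytes([b]) for b in text.encode(encoding)]
    if unit == "codepoints":
        return list(text)
    if unit == "characters":
        out = []
        for ch in text:
            if out and unicodedata.combining(ch):
                out[-1] += ch              # attach the mark to its base
            else:
                out.append(ch)
        return out
    raise ValueError(f"unknown unit: {unit}")


def substr(text, start, length, unit="characters"):
    chunk = units(text, unit)[start:start + length]
    return b"".join(chunk) if unit == "bytes" else "".join(chunk)


s = "e\u0301te\u0301"                      # "été" with combining accents
print(len(units(s, "bytes")), len(units(s, "codepoints")), len(units(s, "characters")))  # 7 5 3
```

    With this input, substr(s, 0, 1) keeps the accent with its letter,
    while the codepoints and bytes views return a bare "e".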

5. Treating Str as SvLV makes no sense at all.  Currently P5 has
   one single form of treating Str as SvLV, namely lvalue substr:

    $string = "Hello, World!";
    my $lv = \substr($string, 1, 4); # a shallow LV reference to the Scalar
    $string = "Hi, Kitty!";
    $$lv = "oho";
    print $string; # "Hohoitty!";

   This should totally die, then Str becomes immutable and can assume
   value semantics i.e. shared/COWed across threads, etc.

   It may make sense to make buffers mutable (but not resizable), but
   strings should always be "constant" and it's the scalar container
   that changes -- exactly like integers.  This also enables shared
   string tables a la Ruby, so "str".WHICH and "str".WHICH would
   always compare the same.

   In other words, you can ask a string to write itself into a
   buffer with some encoding/lang/format/blah combination, and
   if it happens to agree with how it's internally constructed,
   it could be an O(0) operation (and a transcoding otherwise),
   but the user may _never_ ask a string "what encoding are you
   using" or any other question pertaining to the internal Buf
   layouts.

6. UTFX (UTF8+offset cache) should die as an internal representation
   because it allows no sane interaction with the C world, so I
   think an internal fixed-width Unicode representation should be
   preferred:
        - UTF/0bit      # uninitialized null buffer with a length
                        # - alloc(NULL, 100000000000);
                        # - not malloc()ed until populated
        - UTF/7bit      # ASCII
        - UTF/8bit      # LATIN1 (just internal, never exposed)
        - UTF/16bit     # UCS2
        - UTF/32bit     # UCS4

   to allow O(1) random access to codepoints, and to
   allow chars/graphemes to refer to codepoint units instead of
   raw UTF-X length (which had to be invalidated after each
   destructive operation).  If you insert a UTF/8bit into the
   middle of a UTF/16bit (as in 4-arg substr or s///), then 
   the 8bit should be promoted to 16bit (very fast too) without
   invalidating any length caches.
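
   A small Python model of the scheme, under stated assumptions: the
   class FixedStr and its methods are hypothetical, the 7bit and 8bit
   cases are collapsed into one 1-byte width, and code units are stored
   little-endian purely for the sake of the example.

```python
def storage_width(text):
    """Narrowest fixed width in bytes (1, 2 or 4) that fits every code point."""
    top = max(map(ord, text), default=0)
    return 1 if top < 0x100 else 2 if top < 0x10000 else 4


class FixedStr:
    def __init__(self, text, width=None):
        self.width = width or storage_width(text)
        self.buf = b"".join(ord(c).to_bytes(self.width, "little") for c in text)

    def __len__(self):
        return len(self.buf) // self.width         # O(1), never invalidated

    def codepoint(self, i):
        w = self.width                             # O(1) random access
        return int.from_bytes(self.buf[i * w:(i + 1) * w], "little")

    def __str__(self):
        return "".join(chr(self.codepoint(i)) for i in range(len(self)))

    def widened(self, width):
        return self if width == self.width else FixedStr(str(self), width)

    def insert(self, pos, other):
        """Splice other in at code point pos, promoting the narrower side."""
        width = max(self.width, other.width)
        a, b = self.widened(width), other.widened(width)
        out = FixedStr("", width)
        cut = pos * width
        out.buf = a.buf[:cut] + b.buf + a.buf[cut:]
        return out


wide = FixedStr("日本語")                           # needs 16-bit units
narrow = FixedStr("abc")                           # fits in 8-bit units
joined = wide.insert(1, narrow)
print(joined.width, len(joined), str(joined))
```

   Because every unit has the same width, the length is just the buffer
   size divided by the width, and promotion is a unit-by-unit widening
   that leaves all code point offsets unchanged.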

   We can safely do this now because utf8::downgrade is no
   longer exposed to the user, so the C land may have macros
   like CHARS which returns a (char*), or W_CHARS returning
   the 32-bit w_char, or whatever the native C library wants.