Perl6-Pugs
docs/notes/unicode_draft
Good Ideas:
1. The lexical syntax of programs (i.e. POD and literal strings)
should be declared separately from the runtime treatment of strings.
# This is like putting binmode() on the source file itself
# at parse time and should never propagate into runtime
=!lang ja (this is new)
=!encoding Big5
say "String" # Japanese but encoded in Big5
"locale" makes a fine runtime :lang/:encoding choice (i.e.
read from LC_* or CP settings) but makes no sense at parse time.
Just like how XML always knows about its xml:encoding and xml:lang,
Perl string literals and documentation should always know their
encoding/lang information for correct presentation of e.g. pod2html
(which would depend on lang to render CJK fonts correctly).
The "lang" should never be inferred from encoding -- it makes no
sense because lang usage shifts with time: People are writing
Trad.Chinese in GBK all the time now.
2. BOM sniffing of .pl files is good, but the set it currently knows is
(UTF16[LB]E, UTF8+BOM, ASCII(really latin* as default))
it should be:
(UTF32[LB]E, UTF16[LB]E, UTF8(default))
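A minimal Python sketch of the proposed sniffing order, assuming the sniffer sees the first four bytes of the source file. The names sniff_source_encoding and decode_source are made up for illustration. UTF-32 has to be tested before UTF-16 because the UTF-32LE BOM begins with the UTF-16LE BOM.

```python
import codecs

# Longest BOMs first: the UTF-32LE BOM (FF FE 00 00) starts with the
# UTF-16LE BOM (FF FE), so UTF-32 must be tested before UTF-16.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_source_encoding(head: bytes):
    """Return (encoding, bom_length) for the first bytes of a source file."""
    for bom, name in BOMS:
        if head.startswith(bom):
            return name, len(bom)
    return "utf-8", 0  # no BOM: UTF-8 is the proposed default


def decode_source(raw: bytes) -> str:
    """Strip the BOM, if any, and decode the rest of the file."""
    encoding, skip = sniff_source_encoding(raw[:4])
    return raw[skip:].decode(encoding)
```

A UTF-16LE file whose first character after the BOM is U+0000 would be taken for UTF-32LE; that ambiguity is inherent in BOM sniffing and not specific to this sketch.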
3. Per-handle stackable IO layers make sense.
But it should allow introspection into different layer-chunks:
# storage layers (:mmap)
# textual transformation layers (:encoding, :crlf)
# format (MIME Header, HTTP Header, XML)
# semantic (:language(ja))
$fh.layers.pop;
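One way to picture the introspection asked for here is a layer stack whose entries carry their kind. The Python sketch below is purely illustrative: Layer, Handle and layers_of are hypothetical names, and the four kinds are taken from the list above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Layer:
    kind: str  # "storage", "text", "format" or "semantic"
    name: str  # e.g. "mmap", "encoding(Big5)", "crlf"


@dataclass
class Handle:
    layers: list = field(default_factory=list)

    def push(self, kind, name):
        self.layers.append(Layer(kind, name))

    def layers_of(self, kind):
        """Introspect one layer-chunk without disturbing the stack."""
        return [layer.name for layer in self.layers if layer.kind == kind]


fh = Handle()
fh.push("storage", "mmap")
fh.push("text", "encoding(Big5)")
fh.push("text", "crlf")
fh.push("semantic", "language(ja)")

top = fh.layers.pop()  # mirrors $fh.layers.pop
```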
4. Per-string "SvUTF8" tag to denote strings from buffers
But buffers are currently too limited - they cannot be used
as C buffers ($buf[3..10] should transparently work)
5. String offsets allow COWing:
use Devel::Peek;
my $string = "This is a large string";
my $substr = substr($string, 0, 5);
Dump($substr); # reuse the PV inside the string with SvCUR
$string = "Hello, Kitty"; # $substr is COW'ed at this point
# and still just "This "
except it works even better now because the fragment it
refers to doesn't need to be invalidated if other fragments change.
6. Use UCM tables that allow bidirectional fallbacks via |1 |3
to maintain PUA fallback for round-trippability. By default,
Perl 6 should always use the same fallback semantics as Perl5's
Encode.pm, instead of dying horribly like iconv.
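To make the |1 and |3 flags concrete, here is a Python sketch over a made-up single-byte charset. The table entries are hypothetical and not taken from any real UCM file. Flag 0 is a round-trip mapping, 1 is a fallback used only when encoding, and 3 is a fallback used only when decoding. Unmappable input is substituted rather than raised, in the spirit of Encode.pm's default behaviour.

```python
# Hypothetical table in the spirit of a UCM file: (codepoint, byte, flag).
TABLE = [
    (0x0041, 0x41, 0),  # A           <-> 0x41
    (0xFF21, 0x41, 1),  # fullwidth A  -> 0x41 (encode-only fallback)
    (0x00E9, 0xE9, 0),  # e-acute     <-> 0xE9
    (0x00E9, 0x8E, 3),  # 0x8E         -> e-acute (decode-only fallback)
    (0xE000, 0x80, 0),  # PUA U+E000  <-> 0x80, keeps a vendor byte round-trippable
]

ENCODE = {}
DECODE = {}
for cp, byte, flag in TABLE:
    if flag in (0, 1):
        ENCODE[chr(cp)] = bytes([byte])
    if flag in (0, 3):
        DECODE[byte] = chr(cp)

SUBCHAR = b"?"           # substitution byte for unmappable characters
REPLACEMENT = "\ufffd"   # U+FFFD for unmappable bytes


def encode(text: str) -> bytes:
    return b"".join(ENCODE.get(ch, SUBCHAR) for ch in text)


def decode(raw: bytes) -> str:
    return "".join(DECODE.get(b, REPLACEMENT) for b in raw)
```

With this table the PUA entry keeps byte 0x80 round-trippable in both directions, while the fullwidth A and the 0x8E byte are accepted on input but never produced on output.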
=====================================================================
Bad Ideas:
1. Autopromotion of buffers to strings via ${^ENCODING}
$str = $buf.read(:encoding<Big5>); # Not Encode::decode()
$str = $buf; # should simply die
$buf = $str.write(...);
2. Fixed contiguous memory chunk for Strings is a bad idea
(it works well for buffers)
### encoding != charset
### -> semantics of sort() etc
### -> P6 should just have one charset = UCS
### -> to sort via stroke count etc, use Unicode Collation
### -> to sort as bytes, well just use a buffer!
# If this is just "cat"ing two files into one file
# $fhA is :encoding(shiftjis)
# $fhB is :encoding(euc-jp)
$huge_string_a = slurp($fhA); # *MUST* be unicode string here
$huge_string_b = slurp($fhB); # *MUST* be unicode string here
my $string = "$huge_string_a$huge_string_b"; # huge malloc
print $string; # but if the output encoding is shiftjis
# then it just transcodes the euc-jp part
$string should just have buffer fragments and how to "view"
these buffers, namely
$string should be a sparse array
- Pointer --> $huge_string_a
- at offset ...
- Pointer --> $huge_string_b
- at offset ...
The IO actions like .slurp and .read should do validation
(because we need the offset info anyway), by "preparing" the
string to know how many "highest desirable units" are
in there (i.e. defaults to count up to chars but not graphemes)
this can be done very quickly for most DBCS with a one-pass scan
(and latin* with an O(1) scan). Then the buffer is tagged with
the length info and original encoding -- if it's used as a
unicode string for processing, decoding is done on the fly
for the "buffer fragment" inside the string, i.e. to substr
($string, 0, 1) as chars should not decode $huge_string_b.
But you should _never_ be able to treat the string as one
single buffer because there may be multiple ones underneath --
the text semantics is encapsulated as one "Unicode text" string
even though the underlying storage may be transcoded/cached
into fixed ASCII/UTF16/UTF32 representations.
utf8::downgrade() # evil! should never allow this!
# esp. the utfX is not utf8 anyway
# so user should never see utfX
# (i.e. the internal representation)
# Consequently:
piconv -f eucjp -t eucjp # this should be very fast with one-pass validation
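The fragment-based string described in this item can be sketched in Python as follows. Fragment and FragmentString are hypothetical names. Python has no validate-only codec call, so the one-pass scan at read time is simulated by decoding and keeping only the length. The decoded counter exists only to show which fragments were touched.

```python
from bisect import bisect_right


class Fragment:
    """A raw buffer tagged with its encoding and its length in codepoints."""

    def __init__(self, raw: bytes, encoding: str):
        self.raw = raw
        self.encoding = encoding
        # Stand-in for the one-pass validation scan done by slurp/read.
        self.length = len(raw.decode(encoding))
        self.decoded = 0  # number of on-demand decodes, for illustration

    def text(self) -> str:
        self.decoded += 1
        return self.raw.decode(self.encoding)


class FragmentString:
    """A string made of buffer fragments plus their starting offsets."""

    def __init__(self, *fragments):
        self.fragments = list(fragments)
        self.starts = []
        total = 0
        for frag in self.fragments:
            self.starts.append(total)
            total += frag.length
        self.length = total

    def __add__(self, other):
        # Concatenation shares the fragments: no copy, no transcoding.
        return FragmentString(*self.fragments, *other.fragments)

    def substr(self, offset, count):
        """Decode only the fragments that overlap the requested range."""
        end = min(offset + count, self.length)
        i = bisect_right(self.starts, offset) - 1
        out = []
        while offset < end:
            frag = self.fragments[i]
            local = offset - self.starts[i]
            take = min(end - offset, frag.length - local)
            out.append(frag.text()[local:local + take])
            offset += take
            i += 1
        return "".join(out)

    def write(self, encoding):
        """Emit bytes; fragments already in the target encoding pass through."""
        return b"".join(
            frag.raw if frag.encoding == encoding
            else frag.text().encode(encoding)
            for frag in self.fragments
        )
```

Taking the first character of a two-fragment string leaves the second fragment undecoded, and writing the string out as shiftjis only transcodes the euc-jp fragment.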
3. In Perl5, transcoding always goes through UCS (in-memory with UTF/X),
which is unnecessary for e.g. eucjp->shiftjis. So transcoders
should be able to convert directly, provided that they will _always_
produce the same result as if the data had round-tripped through UCS
for all valid inputs. This should be exposed to the user of from_to(),
and also used during lazy transcoding in outputs.
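As a sketch of such a direct transcoder, the Python below converts EUC-JP to Shift_JIS arithmetically for ASCII and JIS X 0208 only; half-width kana (0x8E) and JIS X 0212 (0x8F) are rejected to keep it short. The function names are illustrative. via_ucs is the reference path through Unicode that the direct path must agree with.

```python
def eucjp_to_sjis(raw: bytes) -> bytes:
    """Direct EUC-JP -> Shift_JIS for ASCII and JIS X 0208 only."""
    out = bytearray()
    i = 0
    while i < len(raw):
        c1 = raw[i]
        if c1 < 0x80:  # ASCII passes through unchanged
            out.append(c1)
            i += 1
            continue
        if not 0xA1 <= c1 <= 0xFE or i + 1 >= len(raw):
            raise ValueError(f"unsupported or truncated byte at offset {i}")
        c2 = raw[i + 1]
        if not 0xA1 <= c2 <= 0xFE:
            raise ValueError(f"invalid trail byte at offset {i + 1}")
        # EUC-JP is JIS X 0208 with the high bit set on both bytes.
        j1, j2 = c1 - 0x80, c2 - 0x80
        # Two JIS rows share one Shift_JIS lead byte.
        s1 = ((j1 - 0x21) >> 1) + 0x81
        if s1 > 0x9F:
            s1 += 0x40  # skip the half-width kana range
        if j1 & 1:  # odd row: lower half of the trail-byte range
            s2 = j2 + 0x1F
            if s2 >= 0x7F:
                s2 += 1  # 0x7F is never a trail byte
        else:  # even row: upper half
            s2 = j2 + 0x7E
        out += bytes((s1, s2))
        i += 2
    return bytes(out)


def via_ucs(raw: bytes) -> bytes:
    """Reference path: round trip through Unicode."""
    return raw.decode("euc_jp").encode("shift_jis")
```

The direct path is only acceptable where it matches via_ucs, so a real implementation would check the two against each other over the whole valid input range rather than over a sample.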
4. # 0 and 1 below are "codepoint" in perl5
# or "byte" if $string is Buf
substr($string, 0, 1) # take the first codepoint
# All the string manipulation takes unit adverbs, which
# defaults to the lexical scope's settings
# The "$string" no longer has a say on how it should be
# viewed -- the action takes a view that suits its purpose
$string.substr(0, 1, :bytes)
$string.index(5, :bytes)
The 0 and 1 should be of the "position" type. If you write them
out as literals, they respond to the lexical setting of char unit:
.bytes # pretend strings are buffers
.codepoints # same as perl5 - not terribly useful
# - basically unsigned integers with 21 bits
.characters # this should be the default:
# - COMBINING MARKS
# - BOM (and other zero-width assertions)
.graphemes # visual rendering - includes metadata like
# - LANGUAGE TAG blocks
# - VARIATION SELECTOR
# - LTR/RTL SELECTOR
# - Act as pre-decomposed forms (for canonical decomposition)
Lexical pragma determines what 0 and 1 mean, but you can also
construct them explicitly with "pos(0, :byte)" or "character(1)"
(XXX the syntax needs work)
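The effect of the unit adverb can be sketched in Python. This is an approximation under stated assumptions: the byte view assumes UTF-8 storage, and "characters" are approximated by attaching combining marks to the preceding codepoint, which is far short of full grapheme handling. split_units and substr are illustrative names.

```python
import unicodedata


def split_units(text: str, unit: str):
    """Split text into the requested units."""
    if unit == "bytes":
        # Assumes UTF-8 as the byte representation for this sketch.
        return [bytes([b]) for b in text.encode("utf-8")]
    if unit == "codepoints":
        return list(text)
    if unit == "characters":
        groups = []
        for ch in text:
            # A combining mark joins the character before it.
            if groups and unicodedata.combining(ch):
                groups[-1] += ch
            else:
                groups.append(ch)
        return groups
    raise ValueError(f"unknown unit: {unit}")


def substr(text: str, offset: int, count: int, unit: str = "characters"):
    """The action picks the view; the string itself has no say."""
    picked = split_units(text, unit)[offset:offset + count]
    return b"".join(picked) if unit == "bytes" else "".join(picked)
```

For the string "e" followed by U+0301 COMBINING ACUTE ACCENT and "x", the same substr(0, 1) call yields the accented letter under characters, a bare "e" under codepoints, and a single byte under bytes.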
5. Treating Str as SvLV makes no sense at all. Currently P5 has
one single form of treating Str as SvLV, namely lvalue substr:
$string = "Hello, World!";
my $lv = \substr($string, 1, 4); # a shallow LV reference to the Scalar
$string = "Hi, Kitty!";
$$lv = "oho";
print $string; # "Hohoitty!";
This should totally die, then Str becomes immutable and can assume
value semantics i.e. shared/COWed across threads, etc.
It may make sense to make buffers mutable (but not resizable), but
strings should always be "constant" and it's the scalar container
that changes -- exactly like integers. This also enables shared
string tables ala Ruby, so "str".WHICH and "str".WHICH would
always compare equal.
In other words, you can ask a string to write itself into a
buffer with some encoding/lang/format/blah combination, and
if it happens to agree with how it's internally constructed,
it could be an O(0) operation (and a transcoding otherwise),
but the user may _never_ ask a string "what encoding are you
using" or any other question pertaining to the internal Buf
layouts.
6. UTFX (UTF8+offset cache) should die as an internal representation
because it allows no sane interaction with the C world, so I
think a fixed-width Unicode representation should be preferred
internally:
- UTF/0bit # uninitialized null buffer with a length
# - alloc(NULL, 100000000000);
# - not malloc()ed until populated
- UTF/7bit # ASCII
- UTF/8bit # LATIN1 (just internal, never exposed)
- UTF/16bit # UCS2
- UTF/32bit # UCS4
to allow O(1) random access to codepoints, and to
allow chars/graphemes to refer to codepoint units instead of
raw UTF-X length (which had to be invalidated after each
destructive operation). If you insert a UTF/8bit into the
middle of a UTF/16bit (as in 4-arg substr or s///), then
the 8bit should be promoted to 16bit (very fast too) without
invalidating any length caches.
We can safely do this now because utf8::downgrade is no
longer exposed to the user, so the C land may have macros
like CHARS which returns a (char*), or W_CHARS returning
the 32-bit w_char, or whatever the native C library wants
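A Python sketch of the fixed-width idea, assuming the narrowest width that fits every codepoint is chosen at construction time. FixedStr and its methods are hypothetical names. 7-bit and 8-bit share one storage type here, and the 32-bit case uses an array type that is at least 32 bits wide.

```python
from array import array

# Logical width in bits -> array typecode used for storage.
TYPECODE = {7: "B", 8: "B", 16: "H", 32: "L"}


def width_for(codepoints):
    """Narrowest representation that holds every codepoint."""
    top = max(codepoints, default=0)
    if top < 0x80:
        return 7
    if top < 0x100:
        return 8
    if top < 0x10000:
        return 16
    return 32


class FixedStr:
    def __init__(self, codepoints=()):
        cps = list(codepoints)
        self.width = width_for(cps)
        self.units = array(TYPECODE[self.width], cps)

    @classmethod
    def from_text(cls, text):
        return cls(map(ord, text))

    def __len__(self):
        return len(self.units)  # always in codepoints, whatever the width

    def codepoint(self, index):
        return self.units[index]  # O(1) random access

    def spliced(self, offset, other):
        """Return a new string with other inserted; the narrower side is widened."""
        width = max(self.width, other.width)
        code = TYPECODE[width]
        units = array(code, self.units.tolist())
        units[offset:offset] = array(code, other.units.tolist())
        result = FixedStr()
        result.width, result.units = width, units
        return result

    def __str__(self):
        return "".join(map(chr, self.units))
```

Because every unit is one codepoint at every width, promoting an 8-bit operand to 16 bits changes the storage size but not the offsets or lengths that callers hold.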