Alt-CWB-ambs
lib/CWB/CEQL/Parser.pm
so no substitutions can be made).
=cut
sub encodeEntities {
  my ($self, $s) = @_;
  my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );
  $s =~ s/([<>&"])/$entity{$1}/ge; # unsafe characters => entities
  $s =~ s/[ \t]+/ /g; # normalise whitespace (but not line breaks)
  $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g; # remove other control characters except LF and CR
  if (Encode::is_utf8($s)) {
    # non-ASCII characters => hexadecimal numeric character references
    $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
  }
  return $s;
}
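The escaping steps in C<encodeEntities> can be tried in isolation. The following is a minimal sketch, assuming a plain function instead of a method; the name C<encode_entities_sketch> and the sample input are hypothetical and not part of the module.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical standalone copy of the escaping logic (not the module's API).
sub encode_entities_sketch {
    my ($s) = @_;
    my %entity = ('<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;');
    $s =~ s/([<>&"])/$entity{$1}/ge;           # unsafe characters => entities
    $s =~ s/[ \t]+/ /g;                        # collapse runs of blanks and tabs
    $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g;  # drop control characters except LF and CR
    if (Encode::is_utf8($s)) {
        # non-ASCII characters => hexadecimal numeric character references
        $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
    }
    return $s;
}

# \x{20AC} (euro sign) forces Perl's internal UTF-8 flag on the string.
my $out = encode_entities_sketch(qq{a <b>  &\t"x\x{20AC}"});
print "$out\n";
```

Because the markup characters are escaped before the numeric references are generated, the ampersands introduced in the last step are not escaped a second time.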
=back
=head2 Internal structure of CWB::CEQL::Parser objects
A DPP parser object (i.e. an object that belongs to B<CWB::CEQL::Parser> or
lib/CWB/CEQL/String.pm
print "42 $op 0\n"; # prints "42 >= 0"
if ($op->type eq "Operator") { ... }
$string = new CWB::CEQL::String "my string", "String";
$string .= " is beautiful"; # changes string, but not its type
$string->value("another string"); # $string = "..."; would replace with ordinary string
print $string->value, "\n"; # access string value explicitly
$string->attribute("charset", "ascii"); # declare and/or set user-defined attribute
if ($string->attribute("charset") eq "utf8") { ... }
$new_string = $string->copy; # $new_string = $string; would point to same object
=head1 DESCRIPTION
B<** TODO **>
Note: automatic conversion to a number in numerical expressions usually does not work; use the C<value()> method explicitly in this case.
lib/CWB/Encoder.pm
sub info {
  my ($self, $info) = @_;
  $self->{INFO} = $info;
}
=item $enc->charset($code);
Set corpus character set (as a corpus property in the registry entry).
So far, only C<latin1> is fully supported. Other valid character sets are
C<latin2>, ..., C<latin9>, and C<utf8> (which will be supported by future
releases of the CWB). Any other I<$code> will raise a warning.
=cut
sub charset {
  my ($self, $charset) = @_;
  carp "CWB::Encoder: character set $charset not supported by CWB (latin1, ..., latin9, utf8).\n"
    unless $charset =~ /^(latin[1-9]|utf8)$/;
  $self->{CHARSET} = $charset;
}
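The pattern used in C<charset> can be checked on its own. This is a small sketch, assuming a hypothetical helper C<is_known_charset> that only reuses the regular expression shown above; it is not part of CWB::Encoder.

```perl
use strict;
use warnings;

# Hypothetical helper reusing the pattern from charset():
# true for latin1 .. latin9 and utf8, false for anything else.
sub is_known_charset {
    my ($code) = @_;
    return $code =~ /^(latin[1-9]|utf8)$/ ? 1 : 0;
}

for my $code (qw(latin1 latin9 utf8 latin10 UTF-8 ascii)) {
    printf "%-8s %s\n", $code, is_known_charset($code) ? "accepted" : "would warn";
}
```

Note that the method shown above only warns with C<carp>: the value is stored in C<< $self->{CHARSET} >> even when it does not match the pattern.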
=item $enc->language($code);
Set corpus language (as an informational corpus property in the
registry entry). Use of a two-letter ISO code (C<de>, C<en>, C<fr>,
...) is recommended; any other format will raise a warning.
=cut
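The body of C<language> is not shown in this excerpt. As a rough illustration of the documented behaviour, the sketch below assumes that "two-letter ISO code" means exactly two lowercase ASCII letters; the helper name C<looks_like_iso_language> and the pattern are assumptions, not the module's confirmed code.

```perl
use strict;
use warnings;

# Hypothetical check; the pattern /^[a-z]{2}$/ is an assumption and may
# differ from what CWB::Encoder actually uses.
sub looks_like_iso_language {
    my ($code) = @_;
    return $code =~ /^[a-z]{2}$/ ? 1 : 0;
}

for my $code (qw(de en fr deu EN english)) {
    printf "%-8s %s\n", $code, looks_like_iso_language($code) ? "accepted" : "would warn";
}
```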