utf8 results from the CPAN

utf8

Alt-CWB-ambs

view release on metacpan or search on metacpan

lib/CWB/CEQL/Parser.pm view on Meta::CPAN

so no substitutions can be made).

=cut

sub encodeEntities {
  my ($self, $s) = @_;
  my %entity = ( '<' => '&lt;', '>' => '&gt;', '&' => '&amp;', '"' => '&quot;' );
  $s =~ s/([<>&"])/$entity{$1}/ge;  # unsafe characters => entities
  $s =~ s/[ \t]+/ /g;               # normalise whitespace (but not line breaks)
  $s =~ s/[\x00-\x09\x0b\x0c\x0e-\x1f]+//g; # remove other control characters except LF and CR
  if (Encode::is_utf8($s)) {
    $s =~ s/([^\x00-\x7f])/sprintf "&#x%X;", ord($1)/ge;
  }
  return $s;
}

=back

=head2 Internal structure of CWB::CEQL::Parser objects

A DPP parser object (i.e. an object that belongs to B<CWB::CEQL::Parser> or

lib/CWB/CEQL/String.pm view on Meta::CPAN


  print "42 $op 0\n"; # prints "42 >= 0"
  if ($op->type eq "Operator") { ... }

  $string = new CWB::CEQL::String "my string", "String";
  $string .= " is beautiful";       # changes string, but not its type
  $string->value("another string"); # $string = "..."; would replace with ordinary string
  print $string->value, "\n";       # access string value explicitly

  $string->attribute("charset", "ascii"); # declare and/or set user-defined attribute
  if ($string->attribute("charset") eq "utf8") { ... }

  $new_string = $string->copy;      # $new_string = $string; would point to same object

=head1

=head1 DESCRIPTION

B<** TODO **>

Note: automatic conversion to number in numerical expression does usually not work -- use value() method explicitly in this case

lib/CWB/Encoder.pm view on Meta::CPAN


sub info {
  my ($self, $info) = @_;
  $self->{INFO} = $info;
}

=item $enc->charset($code);

Set corpus character set (as a corpus property in the registry entry).
So far, only C<latin1> is fully supported. Other valid character sets are
C<latin2>, ..., C<latin9>, and C<utf8> (which will be supported by future
releases of the CWB). Any other I<$code> will raise a warning.

=cut

sub charset {
  my ($self, $charset) = @_;
  carp "CWB::Encoder: character set $charset not supported by CWB (latin1, ..., latin9, utf8).\n"
    unless $charset =~ /^(latin[1-9]|utf8)$/;
  $self->{CHARSET} = $charset;
}

=item $enc->language($code);

Set corpus language (as an informational corpus property in the
registry entry). Use of a two-letter ISO code (C<de>, C<en>, C<fr>,
...) is recommended, and any other formats will raise a warning.

=cut

( run in 0.853 second using v1.01-cache-2.11-cpan-5f4f29bf90f )