HTML-Parser

 view release on metacpan or  search on metacpan

Changes  view on Meta::CPAN

  * Convert to Dist::Zilla
  * Build all prereqs from our cpanfile
  * Go through all test files and:
    * perltidy
    * Use strict/warnings
    * Get rid of two-arg open
    * Get rid of BAREWORD filehandles
    * Fix the eval pattern used
    * Only use -w where we catch $SIG{__WARN__}
    * Fix encoding problems
    * use utf8 where we have unicode in the source
  * Fix a typo here and there
  * perltidy all of the example apps in eg/
  * Add comments explaining the apps in eg/ (GH#13 Thanks, Salvatore Bonaccorso)
  * Print out UTF-8 encoded data where sensible in eg/

3.73     2020-08-24
  * Cleaned up this changes log.
  * Added a .mailmap file to organize contributions accurately.
  * Ensure all versions are equal and on the current version
  * Add the .mailmap to the MANIFEST
  * Change the META information to point to the new GH repository
  * Add a .perltidyrc to use going forward
  * Add hctype.h and pfunc.h to the dist as static files and stop asking
    for them to be built on the user's end.
  * Remove t/pod.t from userland testing
  * Remove t/pod-coverage.t from userland testing
  * Clean up the MANIFEST
  * Start testing via GitHub Actions/Workflows
  * Protect active parser from being freed (PR 13, RT #115034)

3.72     2016-01-19
  * Avoid more clang casting warnings
  * Remove trailing whitespace
  * Ensure entities expand to utf8 sequences under 'utf8_mode' [RT#99755]
  * typo fixes (David Steinbrunner)
  * Silence clang warning (Jacques Germishuys)
  * const+static-ing (bulk88)

3.71     2013-05-09
  * Transform ':' in headers to '-' [RT#80524]

3.70     2013-03-28
  * Fix for cross-compiling with Buildroot (François Perrad)
  * Comment typo fix
  * Fix Issue #3 / RT #84144: HTML::Entities::decode_entities() needs
    to call SV_CHECK_THINKFIRST() before checking READONLY flag (Yves Orton)

3.69     2011-10-15
  * Documentation fix; encode_utf8 mixup [RT#71151]
  * Make it clearer that there are 2 (actually 3) options for handing "UTF-8 garbage"
  * Github is the official repo
  * Can't be bothered to try to fix the failures that occur on perl-5.6
  * fix to TokeParser to correctly handle option configuration (Barbie)
  * Aesthetic change: remove extra ; (Jon Jensen)
  * Trim surrounding whitespace from extracted URLs. (Ville Skyttä)

3.68     2010-09-01
  * Declare the encoding of the POD to be utf8

3.67     2010-08-17
  * bleadperl 2154eca7 breaks HTML::Parser 3.66 [RT#60368] (Nicholas Clark)

3.66     2010-07-09
  * Fix entity decoding in utf8_mode for the title header

3.65     2010-04-04
  * Eliminate buggy entities_decode_old
  * Fixed endianness typo [RT#50811] (Salvatore Bonaccorso)
  * Documentation Fixes. (Ville Skyttä)

3.64     2009-10-25
  * Convert files to UTF-8
  * Don't allow decode_entities() to generate illegal Unicode chars
  * Copyright 2009
  * Remove rendundant (repeated) test
  * Make parse_file() method use 3-arg open [RT#49434]

3.63     2009-10-22
  * Take more care to prepare the char range for encode_entities [RT#50170]
  * decode_entities confused by trailing incomplete entity

3.62     2009-08-13

Changes  view on Meta::CPAN

  * Add support for HTML 5 <meta charset> and new HEAD elements. (Ville Skyttä)
  * Short description of the htextsub example (Damyan Ivanov)
  * Suppress warning when encode_entities is called with undef [RT#27567] (Mike South)
  * HTML::Parser doesn't compile with perl 5.8.0. (Zefram)

3.59     2008-11-24
  * Restore perl-5.6 compatibility for HTML::HeadParser.
  * Improved META.yml

3.58     2008-11-17
  * Suppress "Parsing of undecoded UTF-8 will give garbage" warning
     with attr_encoded [RT#29089]
  * HTML::HeadParser:
       - Recognize the Unicode BOM in utf8_mode as well [RT#27522]
       - Avoid ending up with '/' keys attribute in Link headers.

3.57     2008-11-16
  * The <iframe> element content is now parsed in literal mode.
  * Parsing of <script> and <style> content ends on the first end tag
     even when that tag was in a quoted string.  That seems to be the
     behaviour of all modern browsers.
  * Implement backquote() attribute as requested by Alex Kapranoff.
  * Test and documentation tweaks from Alex Kapranoff.

Changes  view on Meta::CPAN


3.45     2005-01-06
  * Fix stack memory leak caused by missing PUTBACK.  Only
     code that used $p->parse(\&cb) form was affected.
     Fix provided by Gurusamy Sarathy <gsar@sophos.com>.

3.44     2004-12-28
  * Fix confusion about nested quotes in <script> and <style> text.

3.43     2004-12-06
  * The SvUTF8 flag was not propagated correctly when replacing
     unterminated entities.
  * Fixed test failure because of missing binmode on Windows.

3.42     2004-12-04
  * Avoid sv_catpvn_utf8_upgrade() as that macro was not
     available in perl-5.8.0.
     Patch by Reed Russell <Russell.Reed@acxiom.com>.
  * Add casts to suppress compilation warnings for char/U8
     mismatches.
  * HTML::HeadParser will always push new header values.
     This make sure we never loose old header values.

3.41     2004-11-30
  * Fix unresolved symbol error with perl-5.005.

3.40     2004-11-29
  * Make utf8_mode only available on perl-5.8 or better.  It produced
     garbage with older versions of perl.
  * Emit warning if entities are decoded and something in the first
     chunk looks like hi-bit UTF-8.  Previously this warning was only
     triggered for documents with BOM.

3.39_92     2004-11-23
  * More documentation of the Unicode issues.  Moved around HTML::Parser
     documentation a bit.
  * New boolean option; $p->utf8_mode to allow parsing of raw  UTF-8.
  * Documented that HTML::Entities::decode_entities() can take multiple
     arguments.
  * Unterminated entities are now decoded in text (compatibility
     with MSIE misfeature).
  * Document HTML::Entities::_decode_entities(); this variation of the
     decode_entities() function has been available for a long time, but
     have not been documented until now.
  * HTML::Entities::_decode_entities() can now be told to try to
     expand unterminated entities.
  * Simplified Makefile.PL

Changes  view on Meta::CPAN

     consider end element if the corresponding end tag is found
     inside such a string.

3.39_90     2004-11-17
  * The <title> element is now parsed in literal mode, which
     means that other tags are not recognized until </title> has
     been seen.
  * Unicode support for perl-5.8 and better.
  * Decoding Unicode entities always enabled; no longer a compile
    time option.
  * Propagation of UTF8 state on strings.
    Patch contributed by John Gardiner Myers <jgmyers@proofpoint.com>.
  * Calculate offsets and lengths in chars for Unicode strings.
  * Fixed link typo in the HTML::TokeParser documentation.

3.38     2004-11-11
  * New boolean option; $p->closing_plaintext
     Contributed by Alex Kapranoff <alex@kapranoff.ru>

3.37     2004-11-10
  * Improved handling of HTML encoded surrogate pairs and illegally
     encoded Unicode; <http://rt.cpan.org/Ticket/Display.html?id=7785>.
     Patch by John Gardiner Myers <jgmyers@proofpoint.com>.
  * Avoid generating bad UTF8 strings when decoding entities
     representing chars beyond #255 in 8-bit strings.  Such bad
     UTF8 sometimes made perl-5.8.5 and older segfault.
  * Undocument v2 style subclassing in synopsis section.
  * Internal cleanup: Make 'gcc -Wall' happier.
  * Avoid modification of PVs during parsing of attrspec.
    Another patch by John Gardiner Myers.

3.36     2004-04-01
  * Improved MSIE/Mozilla compatibility.  If the same attribute
     name repeats for a start tag, use the first value instead
     of the last.  Patch by Nick Duffek <html-parser@duffek.com>.
     <http://rt.cpan.org/Ticket/Display.html?id=5472>

Parser.xs  view on Meta::CPAN

#endif

    pstate2->strict_comment = pstate->strict_comment;
    pstate2->strict_names = pstate->strict_names;
    pstate2->strict_end = pstate->strict_end;
    pstate2->xml_mode = pstate->xml_mode;
    pstate2->unbroken_text = pstate->unbroken_text;
    pstate2->attr_encoded = pstate->attr_encoded;
    pstate2->case_sensitive = pstate->case_sensitive;
    pstate2->closing_plaintext = pstate->closing_plaintext;
    pstate2->utf8_mode = pstate->utf8_mode;
    pstate2->empty_element_tags = pstate->empty_element_tags;
    pstate2->xml_pic = pstate->xml_pic;
    pstate2->backquote = pstate->backquote;

    pstate2->bool_attr_val =
	SvREFCNT_inc(sv_dup(pstate->bool_attr_val, params));
    for (i = 0; i < EVENT_COUNT; i++) {
	pstate2->handlers[i].cb =
	    SvREFCNT_inc(sv_dup(pstate->handlers[i].cb, params));
	pstate2->handlers[i].argspec =

Parser.xs  view on Meta::CPAN

    ALIAS:
	HTML::Parser::strict_comment = 1
	HTML::Parser::strict_names = 2
        HTML::Parser::xml_mode = 3
	HTML::Parser::unbroken_text = 4
        HTML::Parser::marked_sections = 5
        HTML::Parser::attr_encoded = 6
        HTML::Parser::case_sensitive = 7
	HTML::Parser::strict_end = 8
	HTML::Parser::closing_plaintext = 9
        HTML::Parser::utf8_mode = 10
        HTML::Parser::empty_element_tags = 11
        HTML::Parser::xml_pic = 12
	HTML::Parser::backquote = 13
    PREINIT:
	bool *attr;
    CODE:
        switch (ix) {
	case  1: attr = &pstate->strict_comment;       break;
	case  2: attr = &pstate->strict_names;         break;
	case  3: attr = &pstate->xml_mode;             break;

Parser.xs  view on Meta::CPAN

        case  5:
#ifdef MARKED_SECTION
		 attr = &pstate->marked_sections;      break;
#else
	         croak("marked sections not supported"); break;
#endif
	case  6: attr = &pstate->attr_encoded;         break;
	case  7: attr = &pstate->case_sensitive;       break;
	case  8: attr = &pstate->strict_end;           break;
	case  9: attr = &pstate->closing_plaintext;    break;
        case 10: attr = &pstate->utf8_mode;            break;
	case 11: attr = &pstate->empty_element_tags;   break;
        case 12: attr = &pstate->xml_pic;              break;
	case 13: attr = &pstate->backquote;            break;
	default:
	    croak("Unknown boolean attribute (%d)", (int)ix);
        }
	RETVAL = boolSV(*attr);
	if (items > 1)
	    *attr = SvTRUE(ST(1));
    OUTPUT:

Parser.xs  view on Meta::CPAN

            entities_hv = 0;
        }
#ifdef SV_CHECK_THINKFIRST
        SV_CHECK_THINKFIRST(string);
#endif
	if (SvREADONLY(string))
	    croak("Can't inline decode readonly string in _decode_entities()");
	decode_entities(aTHX_ string, entities_hv, expand_prefix);

bool
_probably_utf8_chunk(string)
    SV* string
    PREINIT:
        STRLEN len;
        char *s;
    CODE:
        sv_utf8_downgrade(string, 0);
	s = SvPV(string, len);
        RETVAL = probably_utf8_chunk(aTHX_ s, len);
    OUTPUT:
        RETVAL

int
UNICODE_SUPPORT()
    PROTOTYPE:
    CODE:
       RETVAL = 1;
    OUTPUT:
       RETVAL

README  view on Meta::CPAN

        $p->parse($chunk2);

        # ...
        # signal end of document
        $p->eof;

        # Parse directly from file
        $p->parse_file("foo.html");

        # or
        open(my $fh, "<:utf8", "foo.html") || die;
        $p->parse_file($fh);

DESCRIPTION
    Objects of the "HTML::Parser" class will recognize markup and separate
    it from plain text (alias data content) in HTML documents. As different
    kinds of markup and text are recognized, the corresponding event
    handlers are invoked.

    "HTML::Parser" is not a generic SGML parser. We have tried to make it
    able to deal with the HTML that is actually "out there", and it normally

README  view on Meta::CPAN

        that make it hard to do transformations on the text. When this
        attribute is enabled, blocks of text are always reported in one
        piece. This will delay the text event until the following (non-text)
        event has been recognized by the parser.

        Note that the "offset" argspec will give you the offset of the first
        segment of text and "length" is the combined length of the segments.
        Since there might be ignored tags in between, these numbers can't be
        used to directly index in the original document file.

    $p->utf8_mode
    $p->utf8_mode( $bool )
        Enable this option when parsing raw undecoded UTF-8. This tells the
        parser that the entities expanded for strings reported by "attr",
        @attr and "dtext" should be expanded as decoded UTF-8 so they end up
        compatible with the surrounding text.

        If "utf8_mode" is enabled then it is an error to pass strings
        containing characters with code above 255 to the parse() method, and
        the parse() method will croak if you try.

        Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when
        UTF-8 encoded. The character can also be represented by the entity
        "&hearts;" or "&#x2665". If we feed the parser:

          $p->parse("\xE2\x99\xA5&hearts;");

        then "dtext" will be reported as "\xE2\x99\xA5\x{2665}" without
        "utf8_mode" enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
        The later string is what you want.

        This option is only available with perl-5.8 or better.

    $p->xml_mode
    $p->xml_mode( $bool )
        Enabling this attribute changes the parser to allow some XML
        constructs. This enables the behaviour controlled by individually by
        the "case_sensitive", "empty_element_tags", "strict_names" and
        "xml_pic" attributes and also suppresses special treatment of

README  view on Meta::CPAN


        The parser will make sure that it does not break a word or a
        sequence of whitespace between two text events.

  Unicode
    "HTML::Parser" can parse Unicode strings when running under perl-5.8 or
    better. If Unicode is passed to $p->parse() then chunks of Unicode will
    be reported to the handlers. The offset and length argspecs will also
    report their position in terms of characters.

    It is safe to parse raw undecoded UTF-8 if you either avoid decoding
    entities and make sure to not use *argspecs* that do, or enable the
    "utf8_mode" for the parser. Parsing of undecoded UTF-8 might be useful
    when parsing from a file where you need the reported offsets and lengths
    to match the byte offsets in the file.

    If a filename is passed to $p->parse_file() then the file will be read
    in binary mode. This will be fine if the file contains only ASCII or
    Latin-1 characters. If the file contains UTF-8 encoded text then care
    must be taken when decoding entities as described in the previous
    paragraph, but better is to open the file with the UTF-8 layer so that
    it is decoded properly:

       open(my $fh, "<:utf8", "index.html") || die "...: $!";
       $p->parse_file($fh);

    If the file contains text encoded in a charset besides ASCII, Latin-1 or
    UTF-8 then decoding will always be needed.

VERSION 2 COMPATIBILITY
    When an "HTML::Parser" object is constructed with no arguments, a set of
    handlers is automatically provided that is compatible with the old
    HTML::Parser version 2 callback methods.

    This is equivalent to the following method calls:

        $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");
        $p->handler(end     => "end",     "self, tagname, text");

README  view on Meta::CPAN

    Unterminated literal string in argspec
        (F) The terminating quote character for a literal was not found.

    Bad argspec (%s)
        (F) Only identifier names, literals, spaces and commas are allowed
        in argspecs.

    Missing comma separator in argspec
        (F) Identifiers in an argspec must be separated with ",".

    Parsing of undecoded UTF-8 will give garbage when decoding entities
        (W) The first chunk parsed appears to contain undecoded UTF-8 and
        one or more argspecs that decode entities are used for the callback
        handlers.

        The result of decoding will be a mix of encoded and decoded
        characters for any entities that expand to characters with code
        above 127. This is not a good thing.

        The recommended solution is to apply Encode::decode_utf8() on the
        data before feeding it to the $p->parse(). For $p->parse_file() pass
        a file that has been opened in ":utf8" mode.

        The alternative solution is to enable the "utf8_mode" and not decode
        before passing strings to $p->parse(). The parser can process raw
        undecoded UTF-8 sanely if the "utf8_mode" is enabled, or if the
        "attr", @attr or "dtext" argspecs are avoided.

    Parsing string decoded with wrong endian selection
        (W) The first character in the document is U+FFFE. This is not a
        legal Unicode character but a byte swapped "BOM". The result of
        parsing will likely be garbage.

    Parsing of undecoded UTF-32
        (W) The parser found the Unicode UTF-32 "BOM" signature at the start
        of the document. The result of parsing will likely be garbage.

    Parsing of undecoded UTF-16
        (W) The parser found the Unicode UTF-16 "BOM" signature at the start
        of the document. The result of parsing will likely be garbage.

SEE ALSO
    HTML::Entities, HTML::PullParser, HTML::TokeParser, HTML::HeadParser,
    HTML::LinkExtor, HTML::Form

    HTML::TreeBuilder (part of the *HTML-Tree* distribution)

    <http://www.w3.org/TR/html4/>

eg/hanchors  view on Meta::CPAN

}

sub img_handler {
    my ($self, $tag, $attr) = @_;
    return unless $tag eq "img";
    push(@{$self->handler("text")}, $attr->{alt} || "[IMG]");
}

sub a_end_handler {
    my ($self, $tag) = @_;
    my $text = encode('utf8', join("", @{$self->handler("text")}));
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "T $text\n";

    $self->handler("text",  undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end",   undef);
}

eg/htext  view on Meta::CPAN

my %inside;

sub tag {
    my ($tag, $num) = @_;
    $inside{$tag} += $num;
    print " ";    # not for all tags
}

sub text {
    return if $inside{script} || $inside{style};
    print encode('utf8', $_[0]);
}

HTML::Parser->new(
    api_version => 3,
    handlers    => [
        start => [\&tag,  "tagname, '+1'"],
        end   => [\&tag,  "tagname, '-1'"],
        text  => [\&text, "dtext"],
    ],
    marked_sections => 1,

hparser.c  view on Meta::CPAN

 *     parse_process()             - deals with process instructions <?...>
 *     parse_null()                - deals with anything else        <....>
 *
 *     report_event() - called whenever any of the parse*() routines
 *                      has recongnized something.
 */

static void
report_event(PSTATE* p_state,
	     event_id_t event,
	     char *beg, char *end, U32 utf8,
	     token_pos_t *tokens, int num_tokens,
	     SV* self
	    )
{
    struct p_handler *h;
    dTHX;
    dSP;
    AV *array;
    STRLEN my_na;
    char *argspec;
    char *s;
    STRLEN offset;
    STRLEN line;
    STRLEN column;

    #define CHR_DIST(a,b) (utf8 ? utf8_distance((U8*)(a),(U8*)(b)) : (a) - (b))

    /* some events might still fire after a handler has signaled eof
     * so suppress them here.
     */
    if (p_state->eof)
	return;

    /* capture offsets */
    offset = p_state->offset;
    line = p_state->line;

hparser.c  view on Meta::CPAN

#endif

    /* tag filters */
    if (p_state->ignore_tags || p_state->report_tags || p_state->ignore_elements) {

	if (event == E_START || event == E_END) {
	    SV* tagname = p_state->tmp;

	    assert(num_tokens >= 1);
	    sv_setpvn(tagname, tokens[0].beg, tokens[0].end - tokens[0].beg);
	    if (utf8)
		SvUTF8_on(tagname);
	    else
		SvUTF8_off(tagname);
	    if (!CASE_SENSITIVE(p_state))
		sv_lower(aTHX_ tagname);

	    if (p_state->ignoring_element) {
		if (sv_eq(p_state->ignoring_element, tagname)) {
		    if (event == E_START)
			p_state->ignore_depth++;
		    else if (--p_state->ignore_depth == 0) {
			SvREFCNT_dec(p_state->ignoring_element);
			p_state->ignoring_element = 0;

hparser.c  view on Meta::CPAN

		goto INIT_PEND_TEXT;
	    }
	}
	else {
	INIT_PEND_TEXT:
	    p_state->pend_text_offset = offset;
	    p_state->pend_text_line = line;
	    p_state->pend_text_column = column;
	    p_state->pend_text_is_cdata = p_state->is_cdata;
	    sv_setpvs(p_state->pend_text, "");
	    if (!utf8)
		SvUTF8_off(p_state->pend_text);
	}
	if (utf8 && !SvUTF8(p_state->pend_text))
	    sv_utf8_upgrade(p_state->pend_text);
	if (utf8 || !SvUTF8(p_state->pend_text)) {
	    sv_catpvn(p_state->pend_text, beg, end - beg);
	}
	else {
	    SV *tmp = newSVpvn(beg, end - beg);
	    sv_utf8_upgrade(tmp);
	    sv_catsv(p_state->pend_text, tmp);
	    SvREFCNT_dec(tmp);
	}
	return;
    }
    else if (p_state->pend_text && SvOK(p_state->pend_text)) {
	flush_pending_text(p_state, self);
	SPAGAIN;
    }

hparser.c  view on Meta::CPAN


	case ARG_TOKENS:
	    if (num_tokens >= 1) {
		AV* av = newAV();
		SV* prev_token = &PL_sv_undef;
		int i;
		av_extend(av, num_tokens);
		for (i = 0; i < num_tokens; i++) {
		    if (tokens[i].beg) {
			prev_token = newSVpvn(tokens[i].beg, tokens[i].end-tokens[i].beg);
			if (utf8)
			    SvUTF8_on(prev_token);
			av_push(av, prev_token);
		    }
		    else { /* boolean */
			av_push(av, p_state->bool_attr_val
				? newSVsv(p_state->bool_attr_val)
				: newSVsv(prev_token));
		    }
		}
		arg = sv_2mortal(newRV_noinc((SV*)av));
	    }

hparser.c  view on Meta::CPAN

	    break;

	case ARG_TOKEN0:
	case ARG_TAGNAME:
	    /* fall through */

	case ARG_TAG:
	    if (num_tokens >= 1) {
		arg = sv_2mortal(newSVpvn(tokens[0].beg,
					  tokens[0].end - tokens[0].beg));
		if (utf8)
		    SvUTF8_on(arg);
		if (!CASE_SENSITIVE(p_state) && argcode != ARG_TOKEN0)
		    sv_lower(aTHX_ arg);
		if (argcode == ARG_TAG && event != E_START) {
		    char *e_type = "!##/#?#";
		    sv_insert(arg, 0, 0, &e_type[event], 1);
		}
	    }
	    break;

	case ARG_ATTR:

hparser.c  view on Meta::CPAN

		    hv = 0;
#endif
		    push_arg = 0;  /* deal with argument pushing here */
		}

		for (i = 1; i < num_tokens; i += 2) {
		    SV* attrname = newSVpvn(tokens[i].beg,
					    tokens[i].end-tokens[i].beg);
		    SV* attrval;

		    if (utf8)
			SvUTF8_on(attrname);
		    if (tokens[i+1].beg) {
			char *beg = tokens[i+1].beg;
			STRLEN len = tokens[i+1].end - beg;
			if (*beg == '"' || *beg == '\'' || (*beg == '`' && p_state->backquote)) {
			    assert(len >= 2 && *beg == beg[len-1]);
			    beg++; len -= 2;
			}
			attrval = newSVpvn(beg, len);
			if (utf8)
			    SvUTF8_on(attrval);
			if (!p_state->attr_encoded) {
			    if (p_state->utf8_mode) {
				sv_utf8_decode(attrval);
                                sv_utf8_upgrade(attrval);
                            }
			    decode_entities(aTHX_ attrval, p_state->entity2char, 0);
			    if (p_state->utf8_mode)
				SvUTF8_off(attrval);
			}
		    }
		    else { /* boolean */
			if (p_state->bool_attr_val)
			    attrval = newSVsv(p_state->bool_attr_val);
			else
			    attrval = newSVsv(attrname);
		    }

		    if (!CASE_SENSITIVE(p_state))

hparser.c  view on Meta::CPAN

	    }
	    break;

	case ARG_ATTRSEQ:       /* (v2 compatibility stuff) */
	    if (event == E_START) {
		AV* av = newAV();
		int i;
		for (i = 1; i < num_tokens; i += 2) {
		    SV* attrname = newSVpvn(tokens[i].beg,
					    tokens[i].end-tokens[i].beg);
		    if (utf8)
			SvUTF8_on(attrname);
		    if (!CASE_SENSITIVE(p_state))
			sv_lower(aTHX_ attrname);
		    av_push(av, attrname);
		}
		arg = sv_2mortal(newRV_noinc((SV*)av));
	    }
	    break;

	case ARG_TEXT:
	    arg = sv_2mortal(newSVpvn(beg, end - beg));
	    if (utf8)
		SvUTF8_on(arg);
	    break;

	case ARG_DTEXT:
	    if (event == E_TEXT) {
		arg = sv_2mortal(newSVpvn(beg, end - beg));
		if (utf8)
		    SvUTF8_on(arg);
		if (!p_state->is_cdata) {
		    if (p_state->utf8_mode) {
			sv_utf8_decode(arg);
                        sv_utf8_upgrade(arg);
                    }
		    decode_entities(aTHX_ arg, p_state->entity2char, 1);
		    if (p_state->utf8_mode)
			SvUTF8_off(arg);
		}
	    }
	    break;

	case ARG_IS_CDATA:
	    if (event == E_TEXT) {
		arg = boolSV(p_state->is_cdata);
	    }
	    break;

hparser.c  view on Meta::CPAN


	case ARG_EVENT:
	    assert(event >= 0 && event < EVENT_COUNT);
	    arg = sv_2mortal(newSVpv(event_id_str[event], 0));
	    break;

	case ARG_LITERAL:
	{
	    int len = (unsigned char)s[1];
	    arg = sv_2mortal(newSVpvn(s+2, len));
	    if (SvUTF8(h->argspec))
		SvUTF8_on(arg);
	    s += len + 1;
	}
	break;

	case ARG_UNDEF:
	    arg = sv_mortalcopy(&PL_sv_undef);
	    break;

	default:
	    arg = sv_2mortal(newSVpvf("Bad argspec %d", *s));

hparser.c  view on Meta::CPAN

	LEAVE;
    }
    if (p_state->skipped_text)
	SvCUR_set(p_state->skipped_text, 0);
    return;

IGNORE_EVENT:
    if (p_state->skipped_text) {
	if (event != E_TEXT && p_state->pend_text && SvOK(p_state->pend_text))
	    flush_pending_text(p_state, self);
	if (utf8 && !SvUTF8(p_state->skipped_text))
	    sv_utf8_upgrade(p_state->skipped_text);
	if (utf8 || !SvUTF8(p_state->skipped_text)) {
	    sv_catpvn(p_state->skipped_text, beg, end - beg);
	}
	else {
	    SV *tmp = newSVpvn(beg, end - beg);
	    sv_utf8_upgrade(tmp);
	    sv_catsv(p_state->skipped_text, tmp);
	    SvREFCNT_dec(tmp);
	}
    }
#undef CHR_DIST
    return;
}


EXTERN SV*
argspec_compile(SV* src, PSTATE* p_state)
{
    dTHX;
    SV* argspec = newSVpvs("");
    STRLEN len;
    char *s = SvPV(src, len);
    char *end = s + len;

    if (SvUTF8(src))
	SvUTF8_on(argspec);

    while (isHSPACE(*s))
	s++;

    if (*s == '@') {
	/* try to deal with '@{ ... }' wrapping */
	char *tmp = s + 1;
	while (isHSPACE(*tmp))
	    tmp++;
	if (*tmp == '{') {

hparser.c  view on Meta::CPAN


    p_state->unbroken_text = 0;
    p_state->pend_text     = 0;
    p_state->is_cdata      = p_state->pend_text_is_cdata;
    p_state->offset        = p_state->pend_text_offset;
    p_state->line          = p_state->pend_text_line;
    p_state->column        = p_state->pend_text_column;

    report_event(p_state, E_TEXT,
		 SvPVX(old_pend_text), SvEND(old_pend_text),
		 SvUTF8(old_pend_text), 0, 0, self);
    SvOK_off(old_pend_text);

    p_state->unbroken_text = old_unbroken_text;
    p_state->pend_text     = old_pend_text;
    p_state->is_cdata      = old_is_cdata;
    p_state->offset        = old_offset;
    p_state->line          = old_line;
    p_state->column        = old_column;
}

hparser.c  view on Meta::CPAN

	    else if (!quote && (prev == ' ' || prev == '=')) {
		quote = *s;
	    }
	}
	prev = *s++;
    }
    return end;
}

static char*
parse_comment(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg;

    if (p_state->strict_comment) {
	dTOKENS(4);
	char *start_com = s;  /* also used to signal inside/outside */

	while (1) {
	    /* try to locate "--" */
	FIND_DASH_DASH:

hparser.c  view on Meta::CPAN

		return beg;
	    }

	    if (*s == '>') {
		s++;
		if (start_com)
		    goto FIND_DASH_DASH;

		/* we are done recognizing all comments, make callbacks */
		report_event(p_state, E_COMMENT,
			     beg - 4, s, utf8,
			     tokens, num_tokens,
			     self);
		FREE_TOKENS;

		return s;
	    }

	    s++;
	    if (s == end) {
		FREE_TOKENS;

hparser.c  view on Meta::CPAN

    }
    else if (p_state->no_dash_dash_comment_end) {
	token_pos_t token;
        token.beg = beg;
        /* a lone '>' signals end-of-comment */
	while (s < end && *s != '>')
	    s++;
	token.end = s;
	if (s < end) {
	    s++;
	    report_event(p_state, E_COMMENT, beg-4, s, utf8, &token, 1, self);
	    return s;
	}
	else {
	    return beg;
	}
    }
    else { /* non-strict comment */
	token_pos_t token;
	token.beg = beg;
	/* try to locate /--\s*>/ which signals end-of-comment */

hparser.c  view on Meta::CPAN

	token.end = s;
	if (s < end) {
	    s++;
	    if (*s == '-') {
		s++;
		while (isHSPACE(*s))
		    s++;
		if (*s == '>') {
		    s++;
		    /* yup */
		    report_event(p_state, E_COMMENT, beg-4, s, utf8, &token, 1, self);
		    return s;
		}
	    }
	    if (s < end) {
		s = token.end + 1;
		goto LOCATE_END;
	    }
	}

	if (s == end)

hparser.c  view on Meta::CPAN

	    }
	}
    }
    /* printf("MS %d\n", p_state->ms); */
    p_state->is_cdata = (p_state->ms == MS_CDATA);
    return;
}


static char*
parse_marked_section(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    dTHX;
    char *s;
    AV* tokens = 0;

    if (!p_state->marked_sections)
	return 0;

    assert(beg[0] == '<');
    assert(beg[1] == '!');

hparser.c  view on Meta::CPAN

	    s++;
	name_end = s;
	while (isHSPACE(*s))
	    s++;
	if (s == end)
	    goto PREMATURE;

	if (!tokens)
	    tokens = newAV();
	name = newSVpvn(name_start, name_end - name_start);
	if (utf8)
	    SvUTF8_on(name);
	av_push(tokens, sv_lower(aTHX_ name));
    }
    if (*s == '-') {
	s++;
	if (*s == '-') {
	    /* comment */
	    s++;
	    while (1) {
		while (s < end && *s != '-')
		    s++;

hparser.c  view on Meta::CPAN


	if (!tokens) {
	    tokens = newAV();
	    av_push(tokens, newSVpvs("include"));
	}

	if (!p_state->ms_stack)
	    p_state->ms_stack = newAV();
	av_push(p_state->ms_stack, newRV_noinc((SV*)tokens));
	marked_section_update(p_state);
	report_event(p_state, E_NONE, beg, s, utf8, 0, 0, self);
	return s;
    }

FAIL:
    SvREFCNT_dec(tokens);
    return 0; /* not yet implemented */

PREMATURE:
    SvREFCNT_dec(tokens);
    return beg;
}
#endif


static char*
parse_decl(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg + 2;

    if (*s == '-') {
	/* comment? */

	char *tmp;
	s++;
	if (s == end)
	    return beg;

	if (*s != '-')
	    goto DECL_FAIL;  /* nope, illegal */

	/* yes, two dashes seen */
	s++;

	tmp = parse_comment(p_state, s, end, utf8, self);
	return (tmp == s) ? beg : tmp;
    }

#ifdef MARKED_SECTION
    if (*s == '[') {
	/* marked section */
	char *tmp;
	tmp = parse_marked_section(p_state, beg, end, utf8, self);
	if (!tmp)
	    goto DECL_FAIL;
	return tmp;
    }
#endif

    if (*s == '>') {
	/* make <!> into empty comment <SGML Handbook 36:32> */
	token_pos_t token;
	token.beg = s;
	token.end = s;
	s++;
	report_event(p_state, E_COMMENT, beg, s, utf8, &token, 1, self);
	return s;
    }

    if (isALPHA(*s)) {
	dTOKENS(8);
	char *decl_id = s;
	STRLEN decl_id_len;

	s++;
	/* declaration */

hparser.c  view on Meta::CPAN

	    }
	    else {
		break;
	    }
	}

	if (s == end)
	    goto PREMATURE;
	if (*s == '>') {
	    s++;
	    report_event(p_state, E_DECLARATION, beg, s, utf8, tokens, num_tokens, self);
	    FREE_TOKENS;
	    return s;
	}

    FAIL:
	FREE_TOKENS;
	goto DECL_FAIL;

    PREMATURE:
	FREE_TOKENS;

hparser.c  view on Meta::CPAN

	return 0;

    /* consider everything up to the first '>' a comment */
    while (s < end && *s != '>')
	s++;
    if (s < end) {
	token_pos_t token;
	token.beg = beg + 2;
	token.end = s;
	s++;
	report_event(p_state, E_COMMENT, beg, s, utf8, &token, 1, self);
	return s;
    }
    else {
	return beg;
    }
}


static char*
parse_start(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg;
    int empty_tag = 0;
    dTOKENS(16);

    hctype_t tag_name_char;
    hctype_t attr_name_first, attr_name_char;

    if (STRICT_NAMES(p_state)) {
	attr_name_first = HCTYPE_NAME_FIRST;

hparser.c  view on Meta::CPAN

    if (ALLOW_EMPTY_TAG(p_state) && *s == '/') {
	s++;
	if (s == end)
	    goto PREMATURE;
	empty_tag = 1;
    }

    if (*s == '>') {
	s++;
	/* done */
	report_event(p_state, E_START, beg, s, utf8, tokens, num_tokens, self);
	if (empty_tag) {
	    report_event(p_state, E_END, s, s, utf8, tokens, 1, self);
	}
	else if (!p_state->xml_mode) {
	    /* find out if this start tag should put us into literal_mode
	     */
	    int i;
	    int tag_len = tokens[0].end - tokens[0].beg;

	    for (i = 0; literal_mode_elem[i].len; i++) {
		if (tag_len == literal_mode_elem[i].len) {
		    /* try to match it */

hparser.c  view on Meta::CPAN

    FREE_TOKENS;
    return 0;

PREMATURE:
    FREE_TOKENS;
    return beg;
}


static char*
parse_end(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg+2;
    hctype_t name_first, name_char;

    if (STRICT_NAMES(p_state)) {
	name_first = HCTYPE_NAME_FIRST;
	name_char  = HCTYPE_NAME_CHAR;
    }
    else {
	name_first = name_char = HCTYPE_NOT_SPACE_GT;

hparser.c  view on Meta::CPAN

	    while (isHSPACE(*s))
		s++;
	}
	else {
	    s = skip_until_gt(s, end);
	}
	if (s < end) {
	    if (*s == '>') {
		s++;
		/* a complete end tag has been recognized */
		report_event(p_state, E_END, beg, s, utf8, &tagname, 1, self);
		return s;
	    }
	}
	else {
	    return beg;
	}
    }
    else if (!p_state->strict_comment) {
	s = skip_until_gt(s, end);
	if (s < end) {
	    token_pos_t token;
	    token.beg = beg + 2;
	    token.end = s;
	    s++;
	    report_event(p_state, E_COMMENT, beg, s, utf8, &token, 1, self);
	    return s;
	}
	else {
	    return beg;
	}
    }
    return 0;
}


static char*
parse_process(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg + 2;  /* skip '<?' */
    /* processing instruction */
    token_pos_t token_pos;
    token_pos.beg = s;

    while (s < end) {
	if (*s == '>') {
	    token_pos.end = s;
	    s++;

	    if (p_state->xml_mode || p_state->xml_pic) {
		/* XML processing instructions are ended by "?>" */
		if (s - beg < 4 || s[-2] != '?')
		    continue;
		token_pos.end = s - 2;
	    }

	    /* a complete processing instruction seen */
	    report_event(p_state, E_PROCESS, beg, s, utf8,
			 &token_pos, 1, self);
	    return s;
	}
	s++;
    }
    return beg;  /* could not find end */
}


#ifdef USE_PFUNC
static char*
parse_null(PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    return 0;
}



#include "pfunc.h"                   /* declares the parsefunc[] */
#endif /* USE_PFUNC */

static char*
parse_buf(pTHX_ PSTATE* p_state, char *beg, char *end, U32 utf8, SV* self)
{
    char *s = beg;
    char *t = beg;
    char *new_pos;

    while (!p_state->eof) {
	/*
	 * At the start of this loop we will always be ready for eating text
	 * or a new tag.  We will never be inside some tag.  The 't' points
	 * to where we started and the 's' is advanced as we go.

hparser.c  view on Meta::CPAN

		    /* matched it all */
		    token_pos_t end_token;
		    end_token.beg = end_text + 2;
		    end_token.end = s;

		    while (isHSPACE(*s))
			s++;
		    if (*s == '>') {
			s++;
			if (t != end_text)
			    report_event(p_state, E_TEXT, t, end_text, utf8,
					 0, 0, self);
			report_event(p_state, E_END,  end_text, s, utf8,
				     &end_token, 1, self);
			p_state->literal_mode = 0;
			p_state->is_cdata = 0;
			t = s;
		    }
		}
	    }
	}

#ifdef MARKED_SECTION
	while (p_state->ms == MS_CDATA || p_state->ms == MS_RCDATA) {
	    while (s < end && *s != ']')
		s++;
	    if (*s == ']') {
		char *end_text = s;
		s++;
		if (*s == ']' && *(s + 1) == '>') {
		    s += 2;
		    /* marked section end */
		    if (t != end_text)
			report_event(p_state, E_TEXT, t, end_text, utf8,
				     0, 0, self);
		    report_event(p_state, E_NONE, end_text, s, utf8, 0, 0, self);
		    t = s;
		    SvREFCNT_dec(av_pop(p_state->ms_stack));
		    marked_section_update(p_state);
		    continue;
		}
	    }
	    if (s == end) {
		s = t;
		goto DONE;
	    }

hparser.c  view on Meta::CPAN

	/* first we try to match as much text as possible */
	while (s < end && *s != '<') {
#ifdef MARKED_SECTION
	    if (p_state->ms && *s == ']') {
		char *end_text = s;
		s++;
		if (*s == ']') {
		    s++;
		    if (*s == '>') {
			s++;
			report_event(p_state, E_TEXT, t, end_text, utf8,
				     0, 0, self);
			report_event(p_state, E_NONE, end_text, s, utf8,
				     0, 0, self);
			t = s;
			SvREFCNT_dec(av_pop(p_state->ms_stack));
			marked_section_update(p_state);
			continue;
		    }
		}
	    }
#endif
	    s++;
	}
	if (s != t) {
	    if (*s == '<') {
		report_event(p_state, E_TEXT, t, s, utf8, 0, 0, self);
		t = s;
	    }
	    else {
		s--;
		if (isHSPACE(*s)) {
		    /* wait with white space at end */
		    while (s >= t && isHSPACE(*s))
			s--;
		}
		else {
		    /* might be a chopped up entities/words */
		    while (s >= t && !isHSPACE(*s))
			s--;
		    while (s >= t && isHSPACE(*s))
			s--;
		}
		s++;
		if (s != t)
		    report_event(p_state, E_TEXT, t, s, utf8, 0, 0, self);
		break;
	    }
	}

	if (end - s < 3)
	    break;

	/* next char is known to be '<' and pointed to by 't' as well as 's' */
	s++;

#ifdef USE_PFUNC
	new_pos = parsefunc[(unsigned char)*s](p_state, t, end, utf8, self);
#else
	if (isHNAME_FIRST(*s))
	    new_pos = parse_start(p_state, t, end, utf8, self);
	else if (*s == '/')
	    new_pos = parse_end(p_state, t, end, utf8, self);
	else if (*s == '!')
	    new_pos = parse_decl(p_state, t, end, utf8, self);
	else if (*s == '?')
	    new_pos = parse_process(p_state, t, end, utf8, self);
	else
	    new_pos = 0;
#endif /* USE_PFUNC */

	if (new_pos) {
	    if (new_pos == t) {
		/* no progress, need more data to know what it is */
		s = t;
		break;
	    }

hparser.c  view on Meta::CPAN


}

EXTERN void
parse(pTHX_
      PSTATE* p_state,
      SV* chunk,
      SV* self)
{
    char *s, *beg, *end;
    U32 utf8 = 0;
    STRLEN len;

    if (!p_state->start_document) {
	char dummy[1];
	report_event(p_state, E_START_DOCUMENT, dummy, dummy, 0, 0, 0, self);
	p_state->start_document = 1;
    }

    if (!chunk) {
	/* eof */
	char empty[1];
	if (p_state->buf && SvOK(p_state->buf)) {
	    /* flush it */
	    s = SvPV(p_state->buf, len);
	    end = s + len;
	    utf8 = SvUTF8(p_state->buf);
	    assert(len);

	    while (s < end) {
		if (p_state->literal_mode) {
		    if (strEQ(p_state->literal_mode, "plaintext") ||
			strEQ(p_state->literal_mode, "xmp") ||
			strEQ(p_state->literal_mode, "iframe") ||
			strEQ(p_state->literal_mode, "textarea"))
		    {
			/* rest is considered text */

hparser.c  view on Meta::CPAN

			token_pos_t t;
			char dummy;
			t.beg = p_state->literal_mode;
			t.end = p_state->literal_mode + strlen(p_state->literal_mode);
			report_event(p_state, E_END, &dummy, &dummy, 0, &t, 1, self);
		    }
		    else {
			p_state->pending_end_tag = p_state->literal_mode;
		    }
		    p_state->literal_mode = 0;
		    s = parse_buf(aTHX_ p_state, s, end, utf8, self);
		    continue;
		}

		if (!p_state->strict_comment && !p_state->no_dash_dash_comment_end && *s == '<') {
		    p_state->no_dash_dash_comment_end = 1;
		    s = parse_buf(aTHX_ p_state, s, end, utf8, self);
		    continue;
		}

		if (!p_state->strict_comment && *s == '<') {
		    char *s1 = s + 1;
		    if (s1 == end || isHNAME_FIRST(*s1) || *s1 == '/' || *s1 == '!' || *s1 == '?') {
			/* some kind of unterminated markup.  Report rest as as comment */
			token_pos_t token;
			token.beg = s + 1;
			token.end = end;
			report_event(p_state, E_COMMENT, s, end, utf8, &token, 1, self);
			s = end;
		    }
		}

		break;
	    }

	    if (s < end) {
		/* report rest as text */
		report_event(p_state, E_TEXT, s, end, utf8, 0, 0, self);
	    }

	    SvREFCNT_dec(p_state->buf);
	    p_state->buf = 0;
	}
	if (p_state->pend_text && SvOK(p_state->pend_text))
	    flush_pending_text(p_state, self);

	if (p_state->ignoring_element) {
	    /* document not balanced */

hparser.c  view on Meta::CPAN

	p_state->offset = 0;
	if (p_state->line)
	    p_state->line = 1;
	p_state->column = 0;
	p_state->start_document = 0;
	p_state->literal_mode = 0;
	p_state->is_cdata = 0;
	return;
    }

    if (p_state->utf8_mode)
	sv_utf8_downgrade(chunk, 0);

    if (p_state->buf && SvOK(p_state->buf)) {
	sv_catsv(p_state->buf, chunk);
	beg = SvPV(p_state->buf, len);
	utf8 = SvUTF8(p_state->buf);
    }
    else {
	beg = SvPV(chunk, len);
	utf8 = SvUTF8(chunk);
	if (p_state->offset == 0 && DOWARN) {
	    /* Print warnings if we find unexpected Unicode BOM forms */
	    if (p_state->argspec_entity_decode &&
		!(p_state->attr_encoded && p_state->argspec_entity_decode == ARG_ATTR) &&
		!p_state->utf8_mode && (
                 (!utf8 && len >= 3 && strnEQ(beg, "\xEF\xBB\xBF", 3)) ||
		 (utf8 && len >= 6 && strnEQ(beg, "\xC3\xAF\xC2\xBB\xC2\xBF", 6)) ||
		 (!utf8 && probably_utf8_chunk(aTHX_ beg, len))
		)
	       )
	    {
		warn("Parsing of undecoded UTF-8 will give garbage when decoding entities");
	    }
	    if (utf8 && len >= 2 && strnEQ(beg, "\xFF\xFE", 2)) {
		warn("Parsing string decoded with wrong endianness");
	    }
	    if (!utf8 && len >= 4 &&
		(strnEQ(beg, "\x00\x00\xFE\xFF", 4) ||
		 strnEQ(beg, "\xFE\xFF\x00\x00", 4))
		)
	    {
		warn("Parsing of undecoded UTF-32");
	    }
	    else if (!utf8 && len >= 2 &&
		     (strnEQ(beg, "\xFE\xFF", 2) || strnEQ(beg, "\xFF\xFE", 2))
		)
	    {
		warn("Parsing of undecoded UTF-16");
	    }
	}
    }

    if (!len)
	return; /* nothing to do */

    end = beg + len;
    s = parse_buf(aTHX_ p_state, beg, end, utf8, self);

    if (s == end || p_state->eof) {
	if (p_state->buf) {
	    SvOK_off(p_state->buf);
	}
    }
    else {
	/* need to keep rest in buffer */
	if (p_state->buf) {
	    /* chop off some chars at the beginning */
	    if (SvOK(p_state->buf)) {
		sv_chop(p_state->buf, s);
	    }
	    else {
		sv_setpvn(p_state->buf, s, end - s);
		if (utf8)
		    SvUTF8_on(p_state->buf);
		else
		    SvUTF8_off(p_state->buf);
	    }
	}
	else {
	    p_state->buf = newSVpv(s, end - s);
	    if (utf8)
		SvUTF8_on(p_state->buf);
	}
    }
    return;
}

hparser.h  view on Meta::CPAN


    /* various boolean configuration attributes */
    bool strict_comment;
    bool strict_names;
    bool strict_end;
    bool xml_mode;
    bool unbroken_text;
    bool attr_encoded;
    bool case_sensitive;
    bool closing_plaintext;
    bool utf8_mode;
    bool empty_element_tags;
    bool xml_pic;
    bool backquote;

    /* other configuration stuff */
    SV* bool_attr_val;
    struct p_handler handlers[EVENT_COUNT];
    int argspec_entity_decode;

    /* filters */

lib/HTML/Entities.pm  view on Meta::CPAN

package HTML::Entities;

=encoding utf8

=head1 NAME

HTML::Entities - Encode or decode strings with HTML entities

=head1 SYNOPSIS

 use HTML::Entities;

 $a = "V&aring;re norske tegn b&oslash;r &#230res";

lib/HTML/HeadParser.pm  view on Meta::CPAN


=head1 DESCRIPTION

The C<HTML::HeadParser> is a specialized (and lightweight)
C<HTML::Parser> that will only parse the E<lt>HEAD>...E<lt>/HEAD>
section of an HTML document.  The parse() method
will return a FALSE value as soon as some E<lt>BODY> element or body
text are found, and should not be called again after this.

Note that the C<HTML::HeadParser> might get confused if raw undecoded
UTF-8 is passed to the parse() method.  Make sure the strings are
properly decoded before passing them on.

The C<HTML::HeadParser> keeps a reference to a header object, and the
parser will update this header object as the various elements of the
E<lt>HEAD> section of the HTML document are recognized.  The following
header fields are affected:

=over 4

=item Content-Base:

lib/HTML/HeadParser.pm  view on Meta::CPAN

{
    my $self = shift;
    my $tag  = $self->{'tag'};
    my $text = $self->{'text'};
    $text =~ s/^\s+//;
    $text =~ s/\s+$//;
    $text =~ s/\s+/ /g;
    print "FLUSH $tag => '$text'\n"  if $DEBUG;
    if ($tag eq 'title') {
	my $decoded;
	$decoded = utf8::decode($text) if $self->utf8_mode && defined &utf8::decode;
	HTML::Entities::decode($text);
	utf8::encode($text) if $decoded;
	$self->{'header'}->push_header(Title => $text);
    }
    $self->{'tag'} = $self->{'text'} = '';
}

# This is an quote from the HTML3.2 DTD which shows which elements
# that might be present in a <HEAD>...</HEAD>.  Also note that the
# <HEAD> tags themselves might be missing:
#
# <!ENTITY % head.content "TITLE & ISINDEX? & BASE? & STYLE? &

lib/HTML/HeadParser.pm  view on Meta::CPAN

    $self->flush_text if $self->{'tag'};
    $self->eof if $tag eq 'head';
}

sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    unless ($self->{first_chunk}) {
	# drop Unicode BOM if found
	if ($self->utf8_mode) {
	    $text =~ s/^\xEF\xBB\xBF//;
	}
	else {
	    $text =~ s/^\x{FEFF}//;
	}
	$self->{first_chunk}++;
    }
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/) {
	# Normal text means start of body
        $self->eof;
	return;
    }
    return if $tag ne 'title';
    $self->{'text'} .= $text;
}

BEGIN {
    *utf8_mode = sub { 1 } unless HTML::Entities::UNICODE_SUPPORT;
}

1;

__END__

=back

=head1 EXAMPLE

lib/HTML/Parser.pm  view on Meta::CPAN

    $p->parse($chunk2);

    # ...
    # signal end of document
    $p->eof;

    # Parse directly from file
    $p->parse_file("foo.html");

    # or
    open(my $fh, "<:utf8", "foo.html") || die;
    $p->parse_file($fh);

=head1 DESCRIPTION

Objects of the C<HTML::Parser> class will recognize markup and
separate it from plain text (alias data content) in HTML
documents.  As different kinds of markup and text are recognized, the
corresponding event handlers are invoked.

C<HTML::Parser> is not a generic SGML parser.  We have tried to

lib/HTML/Parser.pm  view on Meta::CPAN

make it hard to do transformations on the text. When this attribute is
enabled, blocks of text are always reported in one piece.  This will
delay the text event until the following (non-text) event has been
recognized by the parser.

Note that the C<offset> argspec will give you the offset of the first
segment of text and C<length> is the combined length of the segments.
Since there might be ignored tags in between, these numbers can't be
used to directly index in the original document file.

=item $p->utf8_mode

=item $p->utf8_mode( $bool )

Enable this option when parsing raw undecoded UTF-8.  This tells the
parser that the entities expanded for strings reported by C<attr>,
C<@attr> and C<dtext> should be expanded as decoded UTF-8 so they end
up compatible with the surrounding text.

If C<utf8_mode> is enabled then it is an error to pass strings
containing characters with code above 255 to the parse() method, and
the parse() method will croak if you try.

Example: The Unicode character "\x{2665}" is "\xE2\x99\xA5" when UTF-8
encoded.  The character can also be represented by the entity
"&hearts;" or "&#x2665".  If we feed the parser:

  $p->parse("\xE2\x99\xA5&hearts;");

then C<dtext> will be reported as "\xE2\x99\xA5\x{2665}" without
C<utf8_mode> enabled, but as "\xE2\x99\xA5\xE2\x99\xA5" when enabled.
The later string is what you want.

This option is only available with perl-5.8 or better.

=item $p->xml_mode

=item $p->xml_mode( $bool )

Enabling this attribute changes the parser to allow some XML
constructs.  This enables the behaviour controlled by individually by

lib/HTML/Parser.pm  view on Meta::CPAN


=back

=head2 Unicode

C<HTML::Parser> can parse Unicode strings when running under
perl-5.8 or better.  If Unicode is passed to $p->parse() then chunks
of Unicode will be reported to the handlers.  The offset and length
argspecs will also report their position in terms of characters.

It is safe to parse raw undecoded UTF-8 if you either avoid decoding
entities and make sure to not use I<argspecs> that do, or enable the
C<utf8_mode> for the parser.  Parsing of undecoded UTF-8 might be
useful when parsing from a file where you need the reported offsets
and lengths to match the byte offsets in the file.

If a filename is passed to $p->parse_file() then the file will be read
in binary mode.  This will be fine if the file contains only ASCII or
Latin-1 characters.  If the file contains UTF-8 encoded text then care
must be taken when decoding entities as described in the previous
paragraph, but better is to open the file with the UTF-8 layer so that
it is decoded properly:

   open(my $fh, "<:utf8", "index.html") || die "...: $!";
   $p->parse_file($fh);

If the file contains text encoded in a charset besides ASCII, Latin-1
or UTF-8 then decoding will always be needed.

=head1 VERSION 2 COMPATIBILITY

When an C<HTML::Parser> object is constructed with no arguments, a set
of handlers is automatically provided that is compatible with the old
HTML::Parser version 2 callback methods.

This is equivalent to the following method calls:

    $p->handler(start   => "start",   "self, tagname, attr, attrseq, text");

lib/HTML/Parser.pm  view on Meta::CPAN


=item Bad argspec (%s)

(F) Only identifier names, literals, spaces and commas
are allowed in argspecs.

=item Missing comma separator in argspec

(F) Identifiers in an argspec must be separated with ",".

=item Parsing of undecoded UTF-8 will give garbage when decoding entities

(W) The first chunk parsed appears to contain undecoded UTF-8 and one
or more argspecs that decode entities are used for the callback
handlers.

The result of decoding will be a mix of encoded and decoded characters
for any entities that expand to characters with code above 127.  This
is not a good thing.

The recommended solution is to apply Encode::decode_utf8() on the data before
feeding it to the $p->parse().  For $p->parse_file() pass a file that has been
opened in ":utf8" mode.

The alternative solution is to enable the C<utf8_mode> and not decode before
passing strings to $p->parse().  The parser can process raw undecoded UTF-8
sanely if the C<utf8_mode> is enabled, or if the C<attr>, C<@attr> or C<dtext>
argspecs are avoided.

=item Parsing string decoded with wrong endian selection

(W) The first character in the document is U+FFFE.  This is not a
legal Unicode character but a byte swapped C<BOM>.  The result of parsing
will likely be garbage.

=item Parsing of undecoded UTF-32

(W) The parser found the Unicode UTF-32 C<BOM> signature at the start
of the document.  The result of parsing will likely be garbage.

=item Parsing of undecoded UTF-16

(W) The parser found the Unicode UTF-16 C<BOM> signature at the start of
the document.  The result of parsing will likely be garbage.

=back

=head1 SEE ALSO

L<HTML::Entities>, L<HTML::PullParser>, L<HTML::TokeParser>, L<HTML::HeadParser>,
L<HTML::LinkExtor>, L<HTML::Form>

L<HTML::TreeBuilder> (part of the I<HTML-Tree> distribution)

lib/HTML/TokeParser.pm  view on Meta::CPAN

it will be a filehandle of some kind.  The stream will be read() until
EOF, but not closed.

A newly constructed C<HTML::TokeParser> differ from its base classes
by having the C<unbroken_text> attribute enabled by default. See
L<HTML::Parser> for a description of this and other attributes that
influence how the document is parsed. It is often a good idea to enable
C<empty_element_tags> behaviour.

Note that the parsing result will likely not be valid if raw undecoded
UTF-8 is used as a source.  When parsing UTF-8 encoded files turn
on UTF-8 decoding:

   open(my $fh, "<:utf8", "index.html") || die "Can't open 'index.html': $!";
   my $p = HTML::TokeParser->new( $fh );
   # ...

If a $filename is passed to the constructor the file will be opened in
raw mode and the parsing result will only be valid if its content is
Latin-1 or pure ASCII.

If parsing from an UTF-8 encoded string buffer decode it first:

   utf8::decode($document);
   my $p = HTML::TokeParser->new( \$document );
   # ...

=item $p->get_token

This method will return the next I<token> found in the HTML document,
or C<undef> at the end of the document.  The token is returned as an
array reference.  The first element of the array will be a string
denoting the type of this token: "S" for start tag, "E" for end tag,
"T" for text, "C" for comment, "D" for declaration, and "PI" for

mkpfunc  view on Meta::CPAN

#!/usr/bin/perl

($progname = $0) =~ s,.*/,,;

print "/* This file is autogenerated by $progname */\n";

print "typedef char*(*PFUNC)(PSTATE*, char *beg, char *end, U32 utf8, SV* self);\n";
print "static PFUNC parsefunc[] = {\n";

for my $c (0..255) {
    local $_ = chr($c);
    my $func = "null";
    if (/^[A-Za-z]$/) {
	$func = "start";
    }
    elsif ($_ eq "/") {
	$func = "end";

pfunc.h  view on Meta::CPAN

/* This file is autogenerated by mkpfunc */
typedef char*(*PFUNC)(PSTATE*, char *beg, char *end, U32 utf8, SV* self);
static PFUNC parsefunc[] = {
    parse_null,     /*   0 */
    parse_null,     /*   1 */
    parse_null,     /*   2 */
    parse_null,     /*   3 */
    parse_null,     /*   4 */
    parse_null,     /*   5 */
    parse_null,     /*   6 */
    parse_null,     /*   7 */
    parse_null,     /*   8 */

ppport.h  view on Meta::CPAN

#endif
#ifndef PERL_MAGIC_uvar_elem
#define PERL_MAGIC_uvar_elem 'u'
#endif
#ifndef PERL_MAGIC_vstring
#define PERL_MAGIC_vstring 'V'
#endif
#ifndef PERL_MAGIC_vec
#define PERL_MAGIC_vec 'v'
#endif
#ifndef PERL_MAGIC_utf8
#define PERL_MAGIC_utf8 'w'
#endif
#ifndef PERL_MAGIC_substr
#define PERL_MAGIC_substr 'x'
#endif
#ifndef PERL_MAGIC_defelem
#define PERL_MAGIC_defelem 'y'
#endif
#ifndef PERL_MAGIC_glob
#define PERL_MAGIC_glob '*'
#endif

ppport.h  view on Meta::CPAN

#define inRANGE(c, l, u) \
( (sizeof(c) == sizeof(U8)) ? withinCOUNT(((U8) (c)), (l), ((u) - (l))) \
: (sizeof(c) == sizeof(U32)) ? withinCOUNT(((U32) (c)), (l), ((u) - (l))) \
: (withinCOUNT(((WIDEST_UTYPE) (c)), (l), ((u) - (l)))))
#endif
#undef FITS_IN_8_BITS
#ifndef FITS_IN_8_BITS
#define FITS_IN_8_BITS(c) ( (sizeof(c) == 1) \
|| !(((WIDEST_UTYPE)((c) | 0)) & ~0xFF))
#endif
#define D_PPP_IS_GENERIC_UTF8_SAFE(s, e, macro) \
(((e) - (s)) <= 0 \
? 0 \
: UTF8_IS_INVARIANT((s)[0]) \
? is ## macro ## _L1((s)[0]) \
: (((e) - (s)) < UTF8SKIP(s)) \
? 0 \
: UTF8_IS_DOWNGRADEABLE_START((s)[0]) \
\
? is ## macro ## _L1((WIDEST_UTYPE) LATIN1_TO_NATIVE( \
UTF8_ACCUMULATE(NATIVE_UTF8_TO_I8((s)[0]) \
& UTF_START_MASK(2), \
(s)[1]))) \
: is ## macro ## _utf8(s))
#define D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, macro) \
(((e) - (s)) <= 0 \
? 0 \
: UTF8_IS_INVARIANT((s)[0]) \
? is ## macro ## _LC((s)[0]) \
: (((e) - (s)) < UTF8SKIP(s)) \
? 0 \
: UTF8_IS_DOWNGRADEABLE_START((s)[0]) \
\
? is ## macro ## _LC((WIDEST_UTYPE) LATIN1_TO_NATIVE( \
UTF8_ACCUMULATE(NATIVE_UTF8_TO_I8((s)[0]) \
& UTF_START_MASK(2), \
(s)[1]))) \
: is ## macro ## _utf8(s))
#define D_PPP_IS_GENERIC_LC_UTF8_SAFE_BROKEN(s, e, macro) \
(((e) - (s)) <= 0 \
? 0 \
: UTF8_IS_INVARIANT((s)[0]) \
? is ## macro ## _LC((s)[0]) \
: (((e) - (s)) < UTF8SKIP(s)) \
? 0 \
: UTF8_IS_DOWNGRADEABLE_START((s)[0]) \
\
? is ## macro ## _LC((WIDEST_UTYPE) LATIN1_TO_NATIVE( \
UTF8_ACCUMULATE(NATIVE_UTF8_TO_I8((s)[0]) \
& UTF_START_MASK(2), \
(s)[1]))) \
: is ## macro ## _utf8_safe(s, e))
#ifndef SvRX
#define SvRX(rv) (SvROK((rv)) ? (SvMAGICAL(SvRV((rv))) ? (mg_find(SvRV((rv)), PERL_MAGIC_qr) ? mg_find(SvRV((rv)), PERL_MAGIC_qr)->mg_obj : NULL) : NULL) : NULL)
#endif
#ifndef SvRXOK
#define SvRXOK(sv) (!!SvRX(sv))
#endif
#ifndef PERL_UNUSED_DECL
#ifdef HASATTRIBUTE
#if (defined(__GNUC__) && defined(__cplusplus)) || defined(__INTEL_COMPILER)
#define PERL_UNUSED_DECL

ppport.h  view on Meta::CPAN

#endif
#ifndef isUPPER_A
#define isUPPER_A(c) isUPPER(c)
#endif
#ifndef isWORDCHAR_A
#define isWORDCHAR_A(c) isWORDCHAR(c)
#endif
#ifndef isXDIGIT_A
#define isXDIGIT_A(c) isXDIGIT(c)
#endif
#ifndef isASCII_utf8_safe
#define isASCII_utf8_safe(s,e) (((e) - (s)) <= 0 ? 0 : isASCII(*(s)))
#endif
#ifndef isASCII_uvchr
#define isASCII_uvchr(c) (FITS_IN_8_BITS(c) ? isASCII_L1(c) : 0)
#endif
#if (PERL_BCDVERSION >= 0x5006000)
#ifdef isALPHA_uni
#define D_PPP_is_ctype(upper, lower, c) \
(FITS_IN_8_BITS(c) \
? is ## upper ## _L1(c) \
: is ## upper ## _uni((UV) (c)))

ppport.h  view on Meta::CPAN

#ifndef isUPPER_uvchr
#define isUPPER_uvchr(c) D_PPP_is_ctype(UPPER, upper, c)
#endif
#ifndef isXDIGIT_uvchr
#define isXDIGIT_uvchr(c) D_PPP_is_ctype(XDIGIT, xdigit, c)
#endif
#ifndef isWORDCHAR_uvchr
#define isWORDCHAR_uvchr(c) (FITS_IN_8_BITS(c) \
? isWORDCHAR_L1(c) : isALPHANUMERIC_uvchr(c))
#endif
#ifndef isALPHA_utf8_safe
#define isALPHA_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, ALPHA)
#endif
#ifdef isALPHANUMERIC_utf8
#ifndef isALPHANUMERIC_utf8_safe
#define isALPHANUMERIC_utf8_safe(s,e) \
D_PPP_IS_GENERIC_UTF8_SAFE(s, e, ALPHANUMERIC)
#endif
#else
#ifndef isALPHANUMERIC_utf8_safe
#define isALPHANUMERIC_utf8_safe(s,e) \
(isALPHA_utf8_safe(s,e) || isDIGIT_utf8_safe(s,e))
#endif
#endif
#if 'A' == 65
#ifndef isBLANK_utf8_safe
#define isBLANK_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ?  \
( ( 0x09 == ((const U8*)s)[0] || 0x20 == ((const U8*)s)[0] ) ? 1 \
: ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) ? \
( ( 0xC2 == ((const U8*)s)[0] ) ? \
( ( 0xA0 == ((const U8*)s)[1] ) ? 2 : 0 ) \
: ( 0xE1 == ((const U8*)s)[0] ) ? \
( ( ( 0x9A == ((const U8*)s)[1] ) && ( 0x80 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( 0xE2 == ((const U8*)s)[0] ) ? \
( ( 0x80 == ((const U8*)s)[1] ) ? \
( ( inRANGE(((const U8*)s)[2], 0x80, 0x8A ) || 0xAF == ((const U8*)s)[2] ) ? 3 : 0 )\
: ( ( 0x81 == ((const U8*)s)[1] ) && ( 0x9F == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( ( ( 0xE3 == ((const U8*)s)[0] ) && ( 0x80 == ((const U8*)s)[1] ) ) && ( 0x80 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: 0 ) \
: 0 )
#endif
#elif 'A' == 193 && '^' == 95
#ifndef isBLANK_utf8_safe
#define isBLANK_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ? \
( ( 0x05 == ((const U8*)s)[0] || 0x40 == ((const U8*)s)[0] ) ? 1 \
: ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) ? \
( ( 0x80 == ((const U8*)s)[0] ) ? \
( ( 0x41 == ((const U8*)s)[1] ) ? 2 : 0 ) \
: ( 0xBC == ((const U8*)s)[0] ) ? \
( ( ( 0x63 == ((const U8*)s)[1] ) && ( 0x41 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( 0xCA == ((const U8*)s)[0] ) ? \
( ( 0x41 == ((const U8*)s)[1] ) ? \
( ( inRANGE(((const U8*)s)[2], 0x41, 0x4A ) || 0x51 == ((const U8*)s)[2] ) ? 3 : 0 )\
: ( 0x42 == ((const U8*)s)[1] ) ? \
( ( 0x56 == ((const U8*)s)[2] ) ? 3 : 0 ) \
: ( ( 0x43 == ((const U8*)s)[1] ) && ( 0x73 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( ( ( 0xCE == ((const U8*)s)[0] ) && ( 0x41 == ((const U8*)s)[1] ) ) && ( 0x41 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: 0 ) \
: 0 )
#endif
#elif 'A' == 193 && '^' == 176
#ifndef isBLANK_utf8_safe
#define isBLANK_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ? \
( ( 0x05 == ((const U8*)s)[0] || 0x40 == ((const U8*)s)[0] ) ? 1 \
: ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) ? \
( ( 0x78 == ((const U8*)s)[0] ) ? \
( ( 0x41 == ((const U8*)s)[1] ) ? 2 : 0 ) \
: ( 0xBD == ((const U8*)s)[0] ) ? \
( ( ( 0x62 == ((const U8*)s)[1] ) && ( 0x41 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( 0xCA == ((const U8*)s)[0] ) ? \
( ( 0x41 == ((const U8*)s)[1] ) ? \
( ( inRANGE(((const U8*)s)[2], 0x41, 0x4A ) || 0x51 == ((const U8*)s)[2] ) ? 3 : 0 )\
: ( 0x42 == ((const U8*)s)[1] ) ? \
( ( 0x56 == ((const U8*)s)[2] ) ? 3 : 0 ) \
: ( ( 0x43 == ((const U8*)s)[1] ) && ( 0x72 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: ( ( ( 0xCE == ((const U8*)s)[0] ) && ( 0x41 == ((const U8*)s)[1] ) ) && ( 0x41 == ((const U8*)s)[2] ) ) ? 3 : 0 )\
: 0 ) \
: 0 )
#endif
#else
#error Unknown character set
#endif
#ifndef isCNTRL_utf8_safe
#define isCNTRL_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, CNTRL)
#endif
#ifndef isDIGIT_utf8_safe
#define isDIGIT_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, DIGIT)
#endif
#ifndef isGRAPH_utf8_safe
#define isGRAPH_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, GRAPH)
#endif
#ifdef isIDCONT_utf8
#ifndef isIDCONT_utf8_safe
#define isIDCONT_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, IDCONT)
#endif
#else
#ifndef isIDCONT_utf8_safe
#define isIDCONT_utf8_safe(s,e) isWORDCHAR_utf8_safe(s,e)
#endif
#endif
#ifndef isIDFIRST_utf8_safe
#define isIDFIRST_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, IDFIRST)
#endif
#ifndef isLOWER_utf8_safe
#define isLOWER_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, LOWER)
#endif
#ifndef isPRINT_utf8_safe
#define isPRINT_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, PRINT)
#endif
#undef isPSXSPC_utf8_safe
#ifndef isPSXSPC_utf8_safe
#define isPSXSPC_utf8_safe(s,e) isSPACE_utf8_safe(s,e)
#endif
#ifndef isPUNCT_utf8_safe
#define isPUNCT_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, PUNCT)
#endif
#ifndef isSPACE_utf8_safe
#define isSPACE_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, SPACE)
#endif
#ifndef isUPPER_utf8_safe
#define isUPPER_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, UPPER)
#endif
#ifdef isWORDCHAR_utf8
#ifndef isWORDCHAR_utf8_safe
#define isWORDCHAR_utf8_safe(s,e) D_PPP_IS_GENERIC_UTF8_SAFE(s, e, WORDCHAR)
#endif
#else
#ifndef isWORDCHAR_utf8_safe
#define isWORDCHAR_utf8_safe(s,e) \
(isALPHANUMERIC_utf8_safe(s,e) || (*(s)) == '_')
#endif
#endif
#if 'A' == 65
#ifndef isXDIGIT_utf8_safe
#define isXDIGIT_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ? \
( ( inRANGE(((const U8*)s)[0], 0x30, 0x39 ) || inRANGE(((const U8*)s)[0], 0x41, 0x46 ) || inRANGE(((const U8*)s)[0], 0x61, 0x66 ) ) ? 1\
: ( ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) && ( 0xEF == ((const U8*)s)[0] ) ) ? ( ( 0xBC == ((const U8*)s)[1] ) ?\
( ( inRANGE(((const U8*)s)[2], 0x90, 0x99 ) || inRANGE(((const U8*)s)[2], 0xA1, 0xA6 ) ) ? 3 : 0 )\
: ( ( 0xBD == ((const U8*)s)[1] ) && ( inRANGE(((const U8*)s)[2], 0x81, 0x86 ) ) ) ? 3 : 0 ) : 0 )\
: 0 )
#endif
#elif 'A' == 193 && '^' == 95
#ifndef isXDIGIT_utf8_safe
#define isXDIGIT_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ? \
( ( inRANGE(((const U8*)s)[0], 0x81, 0x86 ) || inRANGE(((const U8*)s)[0], 0xC1, 0xC6 ) || inRANGE(((const U8*)s)[0], 0xF0, 0xF9 ) ) ? 1\
: ( ( ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) && ( 0xDD == ((const U8*)s)[0] ) ) && ( 0x73 == ((const U8*)s)[1] ) ) ? ( ( 0x67 == ((const U8*)s)[2] ) ?\
( ( inRANGE(((const U8*)s)[3], 0x57, 0x59 ) || inRANGE(((const U8*)s)[3], 0x62, 0x68 ) ) ? 4 : 0 )\
: ( ( inRANGE(((const U8*)s)[2], 0x68, 0x69 ) ) && ( inRANGE(((const U8*)s)[3], 0x42, 0x47 ) ) ) ? 4 : 0 ) : 0 )\
: 0 )
#endif
#elif 'A' == 193 && '^' == 176
#ifndef isXDIGIT_utf8_safe
#define isXDIGIT_utf8_safe(s,e) \
( ( LIKELY((e) > (s)) ) ? \
( ( inRANGE(((const U8*)s)[0], 0x81, 0x86 ) || inRANGE(((const U8*)s)[0], 0xC1, 0xC6 ) || inRANGE(((const U8*)s)[0], 0xF0, 0xF9 ) ) ? 1\
: ( ( ( LIKELY(((e) - (s)) >= UTF8SKIP(s)) ) && ( 0xDD == ((const U8*)s)[0] ) ) && ( 0x72 == ((const U8*)s)[1] ) ) ? ( ( 0x66 == ((const U8*)s)[2] ) ?\
( ( inRANGE(((const U8*)s)[3], 0x57, 0x59 ) || 0x5F == ((const U8*)s)[3] || inRANGE(((const U8*)s)[3], 0x62, 0x67 ) ) ? 4 : 0 )\
: ( ( inRANGE(((const U8*)s)[2], 0x67, 0x68 ) ) && ( inRANGE(((const U8*)s)[3], 0x42, 0x47 ) ) ) ? 4 : 0 ) : 0 )\
: 0 )
#endif
#else
#error Unknown character set
#endif
#ifndef isALPHA_LC_utf8_safe
#define isALPHA_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, ALPHA)
#endif
#ifdef isALPHANUMERIC_utf8
#ifndef isALPHANUMERIC_LC_utf8_safe
#define isALPHANUMERIC_LC_utf8_safe(s,e) \
D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, ALPHANUMERIC)
#endif
#else
#ifndef isALPHANUMERIC_LC_utf8_safe
#define isALPHANUMERIC_LC_utf8_safe(s,e) \
(isALPHA_LC_utf8_safe(s,e) || isDIGIT_LC_utf8_safe(s,e))
#endif
#endif
#ifndef isBLANK_LC_utf8_safe
#define isBLANK_LC_utf8_safe(s,e) \
D_PPP_IS_GENERIC_LC_UTF8_SAFE_BROKEN(s, e, BLANK)
#endif
#ifndef isCNTRL_LC_utf8_safe
#define isCNTRL_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, CNTRL)
#endif
#ifndef isDIGIT_LC_utf8_safe
#define isDIGIT_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, DIGIT)
#endif
#ifndef isGRAPH_LC_utf8_safe
#define isGRAPH_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, GRAPH)
#endif
#ifdef isIDCONT_utf8
#ifndef isIDCONT_LC_utf8_safe
#define isIDCONT_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, IDCONT)
#endif
#else
#ifndef isIDCONT_LC_utf8_safe
#define isIDCONT_LC_utf8_safe(s,e) isWORDCHAR_LC_utf8_safe(s,e)
#endif
#endif
#ifndef isIDFIRST_LC_utf8_safe
#define isIDFIRST_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, IDFIRST)
#endif
#ifndef isLOWER_LC_utf8_safe
#define isLOWER_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, LOWER)
#endif
#ifndef isPRINT_LC_utf8_safe
#define isPRINT_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, PRINT)
#endif
#undef isPSXSPC_LC_utf8_safe
#ifndef isPSXSPC_LC_utf8_safe
#define isPSXSPC_LC_utf8_safe(s,e) isSPACE_LC_utf8_safe(s,e)
#endif
#ifndef isPUNCT_LC_utf8_safe
#define isPUNCT_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, PUNCT)
#endif
#ifndef isSPACE_LC_utf8_safe
#define isSPACE_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, SPACE)
#endif
#ifndef isUPPER_LC_utf8_safe
#define isUPPER_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, UPPER)
#endif
#ifdef isWORDCHAR_utf8
#ifndef isWORDCHAR_LC_utf8_safe
#define isWORDCHAR_LC_utf8_safe(s,e) D_PPP_IS_GENERIC_LC_UTF8_SAFE(s, e, WORDCHAR)
#endif
#else
#ifndef isWORDCHAR_LC_utf8_safe
#define isWORDCHAR_LC_utf8_safe(s,e) \
(isALPHANUMERIC_LC_utf8_safe(s,e) || (*(s)) == '_')
#endif
#endif
#ifndef isXDIGIT_LC_utf8_safe
#define isXDIGIT_LC_utf8_safe(s,e) \
D_PPP_IS_GENERIC_LC_UTF8_SAFE_BROKEN(s, e, XDIGIT)
#endif
#endif
#define D_PPP_TOO_SHORT_MSG "Malformed UTF-8 character starting with:" \
" \\x%02x (too short; %d bytes available, need" \
" %d)\n"
#if (PERL_BCDVERSION >= 0x5007003)
#ifndef toLOWER_uvchr
#define toLOWER_uvchr(c, s, l) UNI_TO_NATIVE(to_uni_lower(NATIVE_TO_UNI(c), s, l))
#endif
#ifndef toUPPER_uvchr
#define toUPPER_uvchr(c, s, l) UNI_TO_NATIVE(to_uni_upper(NATIVE_TO_UNI(c), s, l))
#endif
#ifndef toTITLE_uvchr
#define toTITLE_uvchr(c, s, l) UNI_TO_NATIVE(to_uni_title(NATIVE_TO_UNI(c), s, l))
#endif
#ifndef toFOLD_uvchr
#define toFOLD_uvchr(c, s, l) UNI_TO_NATIVE(to_uni_fold( NATIVE_TO_UNI(c), s, l))
#endif
#if (PERL_BCDVERSION != 0x5015006)
#if defined toLOWER_utf8
#define D_PPP_TO_LOWER_CALLEE(s,r,l) toLOWER_utf8(s,r,l)
#else
#define D_PPP_TO_LOWER_CALLEE(s,r,l) to_utf8_lower(s,r,l)
#endif
#if defined toTITLE_utf8
#define D_PPP_TO_TITLE_CALLEE(s,r,l) toTITLE_utf8(s,r,l)
#else
#define D_PPP_TO_TITLE_CALLEE(s,r,l) to_utf8_title(s,r,l)
#endif
#if defined toUPPER_utf8
#define D_PPP_TO_UPPER_CALLEE(s,r,l) toUPPER_utf8(s,r,l)
#else
#define D_PPP_TO_UPPER_CALLEE(s,r,l) to_utf8_upper(s,r,l)
#endif
#if defined toFOLD_utf8
#define D_PPP_TO_FOLD_CALLEE(s,r,l) toFOLD_utf8(s,r,l)
#else
#define D_PPP_TO_FOLD_CALLEE(s,r,l) to_utf8_fold(s,r,l)
#endif
#else
#define D_PPP_TO_LOWER_CALLEE(s,r,l) \
Perl__to_utf8_lower_flags(aTHX_ s, r, l, 0, NULL)
#define D_PPP_TO_TITLE_CALLEE(s,r,l) \
Perl__to_utf8_title_flags(aTHX_ s, r, l, 0, NULL)
#define D_PPP_TO_UPPER_CALLEE(s,r,l) \
Perl__to_utf8_upper_flags(aTHX_ s, r, l, 0, NULL)
#define D_PPP_TO_FOLD_CALLEE(s,r,l) \
Perl__to_utf8_fold_flags(aTHX_ s, r, l, FOLD_FLAGS_FULL, NULL)
#endif
#define D_PPP_GENERIC_MULTI_ARG_TO(name, s, e,r,l) \
(((((e) - (s)) <= 0) \
\
? (croak("Attempting case change on zero length string"), \
0)  \
: ((e) - (s)) < UTF8SKIP(s)) \
? (croak(D_PPP_TOO_SHORT_MSG, \
s[0], (int) ((e) - (s)), (int) UTF8SKIP(s)), \
0) \
: D_PPP_TO_ ## name ## _CALLEE(s,r,l))
#ifndef toUPPER_utf8_safe
#define toUPPER_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_MULTI_ARG_TO(UPPER,s,e,r,l)
#endif
#ifndef toLOWER_utf8_safe
#define toLOWER_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_MULTI_ARG_TO(LOWER,s,e,r,l)
#endif
#ifndef toTITLE_utf8_safe
#define toTITLE_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_MULTI_ARG_TO(TITLE,s,e,r,l)
#endif
#ifndef toFOLD_utf8_safe
#define toFOLD_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_MULTI_ARG_TO(FOLD,s,e,r,l)
#endif
#elif (PERL_BCDVERSION >= 0x5006000)
#ifdef uvchr_to_utf8
#define D_PPP_UV_TO_UTF8 uvchr_to_utf8
#else
#define D_PPP_UV_TO_UTF8 uv_to_utf8
#endif
#define D_PPP_GENERIC_SINGLE_ARG_TO_UVCHR(name, c, s, l) \
(*(l) = (D_PPP_UV_TO_UTF8(s, \
UNI_TO_NATIVE(to_uni_ ## name(NATIVE_TO_UNI(c)))) - (s)), \
UNI_TO_NATIVE(to_uni_ ## name(NATIVE_TO_UNI(c))))
#ifndef toLOWER_uvchr
#define toLOWER_uvchr(c, s, l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UVCHR(lower, c, s, l)
#endif
#ifndef toUPPER_uvchr
#define toUPPER_uvchr(c, s, l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UVCHR(upper, c, s, l)
#endif
#ifndef toTITLE_uvchr
#define toTITLE_uvchr(c, s, l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UVCHR(title, c, s, l)
#endif
#ifndef toFOLD_uvchr
#define toFOLD_uvchr(c, s, l) toLOWER_uvchr(c, s, l)
#endif
#define D_PPP_GENERIC_SINGLE_ARG_TO_UTF8(name, s, e, r, l) \
(((((e) - (s)) <= 0) \
? (croak("Attempting case change on zero length string"), \
0)  \
: ((e) - (s)) < UTF8SKIP(s)) \
? (croak(D_PPP_TOO_SHORT_MSG, \
s[0], (int) ((e) - (s)), (int) UTF8SKIP(s)), \
0) \
\
: D_PPP_UV_TO_UTF8(r, to_utf8_ ## name(s)), \
\
*(l) = UTF8SKIP(r), to_utf8_ ## name(r))
#ifndef toUPPER_utf8_safe
#define toUPPER_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UTF8(upper, s, e, r, l)
#endif
#ifndef toLOWER_utf8_safe
#define toLOWER_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UTF8(lower, s, e, r, l)
#endif
#ifndef toTITLE_utf8_safe
#define toTITLE_utf8_safe(s,e,r,l) \
D_PPP_GENERIC_SINGLE_ARG_TO_UTF8(title, s, e, r, l)
#endif
#ifndef toFOLD_utf8_safe
#define toFOLD_utf8_safe(s,e,r,l) toLOWER_utf8_safe(s,e,r,l)
#endif
#endif
#if (PERL_BCDVERSION >= 0x5008000)
#ifndef HeUTF8
#define HeUTF8(he) ((HeKLEN(he) == HEf_SVKEY) ? \
SvUTF8(HeKEY_sv(he)) : \
(U32)HeKUTF8(he))
#endif
#endif
#ifndef C_ARRAY_LENGTH
#define C_ARRAY_LENGTH(a) (sizeof(a)/sizeof((a)[0]))
#endif
#ifndef C_ARRAY_END
#define C_ARRAY_END(a) ((a) + C_ARRAY_LENGTH(a))
#endif
#ifndef LIKELY
#define LIKELY(x) (x)

ppport.h  view on Meta::CPAN

sv_vsetpvfn(sv, pat, strlen(pat), args, Null(SV**), 0, Null(bool*)); \
SvSETMAGIC(sv); \
} STMT_END
#endif
#ifndef sv_2pv_nolen
#define sv_2pv_nolen(sv) SvPV_nolen(sv)
#endif
#ifdef SvPVbyte
#if (PERL_BCDVERSION < 0x5007000)
#ifndef sv_2pvbyte
#define sv_2pvbyte(sv, lp) (sv_utf8_downgrade((sv), 0), SvPV((sv), *(lp)))
#endif
#undef SvPVbyte
#define SvPVbyte(sv, lp) \
((SvFLAGS(sv) & (SVf_POK|SVf_UTF8)) == (SVf_POK) \
? ((lp = SvCUR(sv)), SvPVX(sv)) : sv_2pvbyte(sv, &lp))
#endif
#else
#define SvPVbyte SvPV
#define sv_2pvbyte sv_2pv
#endif
#ifndef sv_2pvbyte_nolen
#define sv_2pvbyte_nolen(sv) sv_2pv_nolen(sv)
#endif
#ifndef SV_IMMEDIATE_UNREF
#define SV_IMMEDIATE_UNREF 0
#endif
#ifndef SV_GMAGIC
#define SV_GMAGIC 0
#endif
#ifndef SV_COW_DROP_PV
#define SV_COW_DROP_PV 0
#endif
#ifndef SV_UTF8_NO_ENCODING
#define SV_UTF8_NO_ENCODING 0
#endif
#ifndef SV_CONST_RETURN
#define SV_CONST_RETURN 0
#endif
#ifndef SV_MUTABLE_RETURN
#define SV_MUTABLE_RETURN 0
#endif
#ifndef SV_SMAGIC
#define SV_SMAGIC 0
#endif

ppport.h  view on Meta::CPAN

#endif
#ifndef WARN_UNINITIALIZED
#define WARN_UNINITIALIZED 41
#endif
#ifndef WARN_UNPACK
#define WARN_UNPACK 42
#endif
#ifndef WARN_UNTIE
#define WARN_UNTIE 43
#endif
#ifndef WARN_UTF8
#define WARN_UTF8 44
#endif
#ifndef WARN_VOID
#define WARN_VOID 45
#endif
#ifndef WARN_ASSERTIONS
#define WARN_ASSERTIONS 46
#endif
#ifndef packWARN
#define packWARN(a) (a)
#endif

ppport.h  view on Meta::CPAN

#ifdef NEED_mess_sv
#define NEED_mess
#endif
#ifdef NEED_mess
#define NEED_mess_nocontext
#define NEED_vmess
#endif
#ifndef croak_sv
#if (PERL_BCDVERSION >= 0x5007003) || ( (PERL_BCDVERSION >= 0x5006001) && (PERL_BCDVERSION < 0x5007000) )
#if ( (PERL_BCDVERSION >= 0x5008000) && (PERL_BCDVERSION < 0x5008009) ) || ( (PERL_BCDVERSION >= 0x5009000) && (PERL_BCDVERSION < 0x5010001) )
#define D_PPP_FIX_UTF8_ERRSV_FOR_SV(sv) \
STMT_START { \
SV *_errsv = ERRSV; \
SvFLAGS(_errsv) = (SvFLAGS(_errsv) & ~SVf_UTF8) | \
(SvFLAGS(sv) & SVf_UTF8); \
} STMT_END
#else
#define D_PPP_FIX_UTF8_ERRSV_FOR_SV(sv) STMT_START {} STMT_END
#endif
#define croak_sv(sv) \
STMT_START { \
SV *_sv = (sv); \
if (SvROK(_sv)) { \
sv_setsv(ERRSV, _sv); \
croak(NULL); \
} else { \
D_PPP_FIX_UTF8_ERRSV_FOR_SV(_sv); \
croak("%" SVf, SVfARG(_sv)); \
} \
} STMT_END
#elif (PERL_BCDVERSION >= 0x5004000)
#define croak_sv(sv) croak("%" SVf, SVfARG(sv))
#else
#define croak_sv(sv) croak("%s", SvPV_nolen(sv))
#endif
#endif
#ifndef die_sv

ppport.h  view on Meta::CPAN

#if (PERL_BCDVERSION < 0x5006000)
#define D_PPP_CONSTPV_ARG(x) ((char *) (x))
#else
#define D_PPP_CONSTPV_ARG(x) (x)
#endif
#ifndef newSVpvn
#define newSVpvn(data,len) ((data) \
? ((len) ? newSVpv((data), (len)) : newSVpv("", 0)) \
: newSV(0))
#endif
#ifndef newSVpvn_utf8
#define newSVpvn_utf8(s, len, u) newSVpvn_flags((s), (len), (u) ? SVf_UTF8 : 0)
#endif
#ifndef SVf_UTF8
#define SVf_UTF8 0
#endif
#ifndef newSVpvn_flags
#if defined(PERL_USE_GCC_BRACE_GROUPS)
#define newSVpvn_flags(s, len, flags) \
({ \
SV * sv = newSVpvn(D_PPP_CONSTPV_ARG(s), (len)); \
SvFLAGS(sv) |= ((flags) & SVf_UTF8); \
if ((flags) & SVs_TEMP) sv = sv_2mortal(sv); \
sv; \
})
#else
PERL_STATIC_INLINE SV* D_PPP_newSVpvn_flags(const char *const s, const STRLEN len, const U32 flags)
{
dTHX;
SV * sv = newSVpvn(s, len);
SvFLAGS(sv) |= (flags & SVf_UTF8);
if (flags & SVs_TEMP) return sv_2mortal(sv);
return sv;
}
#define newSVpvn_flags(s, len, flags) D_PPP_newSVpvn_flags((s), (len), (flags))
#endif
#endif
#ifndef SV_NOSTEAL
#define SV_NOSTEAL 16
#endif
#if ( (PERL_BCDVERSION >= 0x5007003) && (PERL_BCDVERSION < 0x5008007) ) || ( (PERL_BCDVERSION >= 0x5009000) && (PERL_BCDVERSION < 0x5009002) )

ppport.h  view on Meta::CPAN

const MGVTBL* const virt = mg->mg_virtual;
if (mg->mg_type == type && virt == vtbl) {
*mgp = mg->mg_moremagic;
if (virt && virt->svt_free)
virt->svt_free(aTHX_ sv, mg);
if (mg->mg_ptr && mg->mg_type != PERL_MAGIC_regex_global) {
if (mg->mg_len > 0)
Safefree(mg->mg_ptr);
else if (mg->mg_len == HEf_SVKEY)
SvREFCNT_dec(MUTABLE_SV(mg->mg_ptr));
else if (mg->mg_type == PERL_MAGIC_utf8)
Safefree(mg->mg_ptr);
}
if (mg->mg_flags & MGf_REFCOUNTED)
SvREFCNT_dec(mg->mg_obj);
Safefree(mg);
}
else
mgp = &mg->mg_moremagic;
}
if (SvMAGIC(sv)) {

ppport.h  view on Meta::CPAN

length = strlen(src);
if (size > 0) {
copy = (length >= size) ? size - 1 : length;
memcpy(dst, src, copy);
dst[copy] = '\0';
}
return length;
}
#endif
#endif
#ifdef SVf_UTF8
#ifndef SvUTF8
#define SvUTF8(sv) (SvFLAGS(sv) & SVf_UTF8)
#endif
#endif
#if (PERL_BCDVERSION == 0x5019001)
#undef UTF8f
#endif
#ifdef SVf_UTF8
#ifndef UTF8f
#define UTF8f SVf
#endif
#ifndef UTF8fARG
#define UTF8fARG(u,l,p) newSVpvn_flags((p), (l), ((u) ? SVf_UTF8 : 0) | SVs_TEMP)
#endif
#endif
#define D_PPP_MIN(a,b) (((a) <= (b)) ? (a) : (b))
#ifndef UNICODE_REPLACEMENT
#define UNICODE_REPLACEMENT 0xFFFD
#endif
#ifdef UTF8_MAXLEN
#ifndef UTF8_MAXBYTES
#define UTF8_MAXBYTES UTF8_MAXLEN
#endif
#endif
#ifndef UTF_START_MARK
#define UTF_START_MARK(len) \
(((len) > 7) ? 0xFF : (0xFF & (0xFE << (7-(len)))))
#endif
#if (PERL_BCDVERSION < 0x5018000)
#undef UTF8_MAXBYTES_CASE
#endif
#if 'A' == 65
#define D_PPP_BYTE_INFO_BITS 6
#ifndef UTF8_MAXBYTES_CASE
#define UTF8_MAXBYTES_CASE 13
#endif
#else
#define D_PPP_BYTE_INFO_BITS 5
#ifndef UTF8_MAXBYTES_CASE
#define UTF8_MAXBYTES_CASE 15
#endif
#endif
#ifndef UTF_ACCUMULATION_SHIFT
#define UTF_ACCUMULATION_SHIFT D_PPP_BYTE_INFO_BITS
#endif
#ifdef NATIVE_TO_UTF
#ifndef NATIVE_UTF8_TO_I8
#define NATIVE_UTF8_TO_I8(c) NATIVE_TO_UTF(c)
#endif
#else
#ifndef NATIVE_UTF8_TO_I8
#define NATIVE_UTF8_TO_I8(c) (c)
#endif
#endif
#ifdef UTF_TO_NATIVE
#ifndef I8_TO_NATIVE_UTF8
#define I8_TO_NATIVE_UTF8(c) UTF_TO_NATIVE(c)
#endif
#else
#ifndef I8_TO_NATIVE_UTF8
#define I8_TO_NATIVE_UTF8(c) (c)
#endif
#endif
#ifndef UTF_START_MASK
#define UTF_START_MASK(len) \
(((len) >= 7) ? 0x00 : (0x1F >> ((len)-2)))
#endif
#ifndef UTF_IS_CONTINUATION_MASK
#define UTF_IS_CONTINUATION_MASK \
((U8) (0xFF << UTF_ACCUMULATION_SHIFT))
#endif
#ifndef UTF_CONTINUATION_MARK
#define UTF_CONTINUATION_MARK \
(UTF_IS_CONTINUATION_MASK & 0xB0)
#endif
#ifndef UTF_MIN_START_BYTE
#define UTF_MIN_START_BYTE \
((UTF_CONTINUATION_MARK >> UTF_ACCUMULATION_SHIFT) | UTF_START_MARK(2))
#endif
#ifndef UTF_MIN_ABOVE_LATIN1_BYTE
#define UTF_MIN_ABOVE_LATIN1_BYTE \
((0x100 >> UTF_ACCUMULATION_SHIFT) | UTF_START_MARK(2))
#endif
#if (PERL_BCDVERSION < 0x5007000)
#undef UTF8_IS_DOWNGRADEABLE_START
#endif
#ifndef UTF8_IS_DOWNGRADEABLE_START
#define UTF8_IS_DOWNGRADEABLE_START(c) \
inRANGE(NATIVE_UTF8_TO_I8(c), \
UTF_MIN_START_BYTE, UTF_MIN_ABOVE_LATIN1_BYTE - 1)
#endif
#ifndef UTF_CONTINUATION_MASK
#define UTF_CONTINUATION_MASK \
((U8) ((1U << UTF_ACCUMULATION_SHIFT) - 1))
#endif
#ifndef UTF8_ACCUMULATE
#define UTF8_ACCUMULATE(base, added) \
(((base) << UTF_ACCUMULATION_SHIFT) \
| ((NATIVE_UTF8_TO_I8(added)) \
& UTF_CONTINUATION_MASK))
#endif
#ifndef UTF8_ALLOW_ANYUV
#define UTF8_ALLOW_ANYUV 0
#endif
#ifndef UTF8_ALLOW_EMPTY
#define UTF8_ALLOW_EMPTY 0x0001
#endif
#ifndef UTF8_ALLOW_CONTINUATION
#define UTF8_ALLOW_CONTINUATION 0x0002
#endif
#ifndef UTF8_ALLOW_NON_CONTINUATION
#define UTF8_ALLOW_NON_CONTINUATION 0x0004
#endif
#ifndef UTF8_ALLOW_SHORT
#define UTF8_ALLOW_SHORT 0x0008
#endif
#ifndef UTF8_ALLOW_LONG
#define UTF8_ALLOW_LONG 0x0010
#endif
#ifndef UTF8_ALLOW_OVERFLOW
#define UTF8_ALLOW_OVERFLOW 0x0080
#endif
#ifndef UTF8_ALLOW_ANY
#define UTF8_ALLOW_ANY ( UTF8_ALLOW_CONTINUATION \
|UTF8_ALLOW_NON_CONTINUATION \
|UTF8_ALLOW_SHORT \
|UTF8_ALLOW_LONG \
|UTF8_ALLOW_OVERFLOW)
#endif
#if defined UTF8SKIP
#undef UTF8_SAFE_SKIP
#undef UTF8_CHK_SKIP
#ifndef UTF8_SAFE_SKIP
#define UTF8_SAFE_SKIP(s, e) ( \
((((e) - (s)) <= 0) \
? 0 \
: D_PPP_MIN(((e) - (s)), UTF8SKIP(s))))
#endif
#ifndef UTF8_CHK_SKIP
#define UTF8_CHK_SKIP(s) \
(s[0] == '\0' ? 1 : ((U8) D_PPP_MIN(my_strnlen((char *) (s), UTF8SKIP(s)), \
UTF8SKIP(s))))
#endif
#ifndef UTF8_SKIP
#define UTF8_SKIP(s) UTF8SKIP(s)
#endif
#endif
#if 'A' == 65
#ifndef UTF8_IS_INVARIANT
#define UTF8_IS_INVARIANT(c) isASCII(c)
#endif
#else
#ifndef UTF8_IS_INVARIANT
#define UTF8_IS_INVARIANT(c) (isASCII(c) || isCNTRL_L1(c))
#endif
#endif
#ifndef UVCHR_IS_INVARIANT
#define UVCHR_IS_INVARIANT(c) UTF8_IS_INVARIANT(c)
#endif
#ifdef UVCHR_IS_INVARIANT
#if 'A' != 65 || UVSIZE < 8
#define D_PPP_UVCHR_SKIP_UPPER(c) 7
#else
#define D_PPP_UVCHR_SKIP_UPPER(c) \
(((WIDEST_UTYPE) (c)) < \
(((WIDEST_UTYPE) 1) << (6 * D_PPP_BYTE_INFO_BITS)) ? 7 : 13)
#endif
#ifndef UVCHR_SKIP

ppport.h  view on Meta::CPAN

(WIDEST_UTYPE) (c) < ( 8 * (1U << (3 * D_PPP_BYTE_INFO_BITS))) ? 4 : \
(WIDEST_UTYPE) (c) < ( 4 * (1U << (4 * D_PPP_BYTE_INFO_BITS))) ? 5 : \
(WIDEST_UTYPE) (c) < ( 2 * (1U << (5 * D_PPP_BYTE_INFO_BITS))) ? 6 : \
D_PPP_UVCHR_SKIP_UPPER(c)
#endif
#endif
#ifdef is_ascii_string
#ifndef is_invariant_string
#define is_invariant_string(s,l) is_ascii_string(s,l)
#endif
#ifndef is_utf8_invariant_string
#define is_utf8_invariant_string(s,l) is_ascii_string(s,l)
#endif
#endif
#ifdef ibcmp_utf8
#ifndef foldEQ_utf8
#define foldEQ_utf8(s1,pe1,l1,u1,s2,pe2,l2,u2) \
cBOOL(! ibcmp_utf8(s1,pe1,l1,u1,s2,pe2,l2,u2))
#endif
#endif
#if defined(is_utf8_string) && defined(UTF8SKIP)
#ifndef isUTF8_CHAR
#define isUTF8_CHAR(s, e) ( \
(e) <= (s) || ! is_utf8_string(s, UTF8_SAFE_SKIP(s, e)) \
? 0 \
: UTF8SKIP(s))
#endif
#endif
#if 'A' == 65
#ifndef BOM_UTF8
#define BOM_UTF8 "\xEF\xBB\xBF"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xEF\xBF\xBD"
#endif
#elif '^' == 95
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x73\x66\x73"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x73\x73\x71"
#endif
#elif '^' == 176
#ifndef BOM_UTF8
#define BOM_UTF8 "\xDD\x72\x65\x72"
#endif
#ifndef REPLACEMENT_CHARACTER_UTF8
#define REPLACEMENT_CHARACTER_UTF8 "\xDD\x72\x72\x70"
#endif
#else
#error Unknown character set
#endif
#if (PERL_BCDVERSION < 0x5035010)
#undef utf8_to_uvchr_buf
#endif
#if (PERL_BCDVERSION >= 0x5006001) && ! defined(utf8_to_uvchr_buf)
#if defined(utf8n_to_uvchr) || defined(utf8_to_uvchr) || defined(utf8_to_uv)
#if defined(utf8n_to_uvchr)
#define D_PPP_utf8_to_uvchr_buf_callee utf8n_to_uvchr
#elif  \
defined(utf8_to_uv) && defined(utf8_to_uv_simple)
#define D_PPP_utf8_to_uvchr_buf_callee utf8_to_uv
#elif defined(utf8_to_uvchr)
#define D_PPP_utf8_to_uvchr_buf_callee(s, curlen, retlen, flags) \
utf8_to_uvchr((U8 *)(s), (retlen))
#else
#define D_PPP_utf8_to_uvchr_buf_callee(s, curlen, retlen, flags) \
utf8_to_uv((U8 *)(s), (retlen))
#endif
#endif
#if defined(NEED_utf8_to_uvchr_buf)
static UV DPPP_(my_utf8_to_uvchr_buf)(pTHX_ const U8 * s, const U8 * send, STRLEN * retlen);
static
#else
extern UV DPPP_(my_utf8_to_uvchr_buf)(pTHX_ const U8 * s, const U8 * send, STRLEN * retlen);
#endif
#if defined(NEED_utf8_to_uvchr_buf) || defined(NEED_utf8_to_uvchr_buf_GLOBAL)
#ifdef utf8_to_uvchr_buf
#undef utf8_to_uvchr_buf
#endif
#define utf8_to_uvchr_buf(a,b,c) DPPP_(my_utf8_to_uvchr_buf)(aTHX_ a,b,c)
#define Perl_utf8_to_uvchr_buf DPPP_(my_utf8_to_uvchr_buf)
UV
DPPP_(my_utf8_to_uvchr_buf)(pTHX_ const U8 *s, const U8 *send, STRLEN *retlen)
{
#if (PERL_BCDVERSION >= 0x5031004)
#if (PERL_BCDVERSION != 0x5035009)
if (send <= s) s = send = (U8 *) "?";
return Perl_utf8_to_uvchr_buf_helper(aTHX_ s, send, retlen);
#else
if (send > s) return Perl_utf8_to_uvchr_buf_helper(aTHX_ s, send, retlen);
if (! ckWARN_d(WARN_UTF8)) {
if (retlen) *retlen = 0;
return UNICODE_REPLACEMENT;
}
else {
s = send = (U8 *) "?";
(void) Perl__utf8n_to_uvchr_msgs_helper(s, 0, NULL, 0, NULL, NULL);
if (retlen) *retlen = (STRLEN) -1;
return 0;
}
#endif
#else
UV ret;
STRLEN curlen;
bool overflows = 0;
const U8 *cur_s = s;
const bool do_warnings = ckWARN_d(WARN_UTF8);
#if (PERL_BCDVERSION < 0x5026000) && ! defined(EBCDIC)
STRLEN overflow_length = 0;
#endif
if (send > s) {
curlen = send - s;
}
else {
assert(0);
curlen = 0;
if (! do_warnings) {

ppport.h  view on Meta::CPAN

}
}
if (UNLIKELY(overflows)) {
ret = 0;
if (! do_warnings && retlen) {
*retlen = overflow_length;
}
}
else
#endif
ret = D_PPP_utf8_to_uvchr_buf_callee(
(U8 *)
s, curlen, retlen, (UTF8_ALLOW_ANYUV
& ~(UTF8_ALLOW_LONG|UTF8_ALLOW_EMPTY)));
#if (PERL_BCDVERSION >= 0x5026000) && (PERL_BCDVERSION < 0x5028000)
if (UNLIKELY(ret > IV_MAX)) {
overflows = 1;
}
#endif
if (UNLIKELY(overflows)) {
if (! do_warnings) {
if (retlen) {
*retlen = D_PPP_MIN(*retlen, UTF8SKIP(s));
*retlen = D_PPP_MIN(*retlen, curlen);
}
return UNICODE_REPLACEMENT;
}
else {
Perl_warner(aTHX_ packWARN(WARN_UTF8),
"Malformed UTF-8 character (overflow at 0x%" UVxf
", byte 0x%02x, after start byte 0x%02x)",
ret, *cur_s, *s);
if (retlen) {
*retlen = (STRLEN) -1;
}
return 0;
}
}
if (UNLIKELY(ret == 0 && (curlen == 0 || *s != '\0'))) {
if (do_warnings) {
if (retlen) {
*retlen = (STRLEN) -1;
}
}
else {
ret = D_PPP_utf8_to_uvchr_buf_callee(
(U8 *)
s, curlen, retlen, UTF8_ALLOW_ANY);
ret = UNICODE_REPLACEMENT;
#if (PERL_BCDVERSION < 0x5016000)
if (retlen && (IV) *retlen >= 0) {
unsigned int i = 1;
*retlen = D_PPP_MIN(*retlen, curlen);
*retlen = D_PPP_MIN(*retlen, UTF8SKIP(s));
do {
#ifdef UTF8_IS_CONTINUATION
if (! UTF8_IS_CONTINUATION(s[i]))
#else
if (s[i] < 0x80 || s[i] > 0xBF)
#endif
{
*retlen = i;
break;
}
} while (++i < *retlen);
}
#endif
}
}
return ret;
#endif
}
#endif
#endif
#if defined(UTF8SKIP) && defined(utf8_to_uvchr_buf)
#undef utf8_to_uvchr
#ifndef utf8_to_uvchr
#define utf8_to_uvchr(s, lp) \
((*(s) == '\0') \
? utf8_to_uvchr_buf(s,((s)+1), lp)  \
: utf8_to_uvchr_buf(s, (s) + UTF8_CHK_SKIP(s), (lp)))
#endif
#endif
#ifdef sv_len_utf8
#if (PERL_BCDVERSION >= 0x5017005)
#ifndef sv_len_utf8_nomg
#if defined(PERL_USE_GCC_BRACE_GROUPS)
#define sv_len_utf8_nomg(sv) \
({ \
SV *sv_ = (sv); \
sv_len_utf8(!SvGMAGICAL(sv_) \
? sv_ \
: sv_mortalcopy_flags(sv_, SV_NOSTEAL)); \
})
#else
PERL_STATIC_INLINE STRLEN D_PPP_sv_len_utf8_nomg(SV * sv)
{
dTHX;
if (SvGMAGICAL(sv))
return sv_len_utf8(sv_mortalcopy_flags(sv,
SV_NOSTEAL));
else return sv_len_utf8(sv);
}
#define sv_len_utf8_nomg(sv) D_PPP_sv_len_utf8_nomg(sv)
#endif
#endif
#else
#undef sv_len_utf8
#if defined(PERL_USE_GCC_BRACE_GROUPS)
#define sv_len_utf8_nomg(sv) \
({ \
SV *sv2 = (sv); \
STRLEN len; \
if (SvUTF8(sv2)) { \
if (SvGMAGICAL(sv2)) \
len = Perl_sv_len_utf8(aTHX_ \
sv_mortalcopy_flags(sv2, \
SV_NOSTEAL));\
else \
len = Perl_sv_len_utf8(aTHX_ sv2); \
} \
else SvPV_nomg(sv2, len); \
len; \
})
#define sv_len_utf8(sv) ({ SV *_sv1 = (sv); \
SvGETMAGIC(_sv1); \
sv_len_utf8_nomg(_sv1); \
})
#else
PERL_STATIC_INLINE STRLEN D_PPP_sv_len_utf8_nomg(SV * sv)
{
dTHX;
STRLEN len;
if (SvUTF8(sv)) {
if (SvGMAGICAL(sv))
len = Perl_sv_len_utf8(aTHX_
sv_mortalcopy_flags(sv,
SV_NOSTEAL));
else
len = Perl_sv_len_utf8(aTHX_ sv);
}
else SvPV_nomg(sv, len);
return len;
}
#define sv_len_utf8_nomg(sv) D_PPP_sv_len_utf8_nomg(sv)
PERL_STATIC_INLINE STRLEN D_PPP_sv_len_utf8(SV * sv)
{
dTHX;
SvGETMAGIC(sv);
return sv_len_utf8_nomg(sv);
}
#define sv_len_utf8(sv) D_PPP_sv_len_utf8(sv)
#endif
#endif
#endif
#ifndef PERL_PV_ESCAPE_QUOTE
#define PERL_PV_ESCAPE_QUOTE 0x0001
#endif
#ifndef PERL_PV_PRETTY_QUOTE
#define PERL_PV_PRETTY_QUOTE PERL_PV_ESCAPE_QUOTE
#endif
#ifndef PERL_PV_PRETTY_ELLIPSES

ppport.h  view on Meta::CPAN

DPPP_(my_pv_escape)(pTHX_ SV *dsv, char const * const str,
const STRLEN count, const STRLEN max,
STRLEN * const escaped, const U32 flags)
{
const char esc = flags & PERL_PV_ESCAPE_RE ? '%' : '\\';
const char dq = flags & PERL_PV_ESCAPE_QUOTE ? '"' : esc;
char octbuf[32] = "%123456789ABCDF";
STRLEN wrote = 0;
STRLEN chsize = 0;
STRLEN readsize = 1;
#if defined(is_utf8_string) && defined(utf8_to_uvchr_buf)
bool isuni = flags & PERL_PV_ESCAPE_UNI ? 1 : 0;
#endif
const char *pv = str;
const char * const end = pv + count;
octbuf[0] = esc;
if (!(flags & PERL_PV_ESCAPE_NOCLEAR))
sv_setpvs(dsv, "");
#if defined(is_utf8_string) && defined(utf8_to_uvchr_buf)
if ((flags & PERL_PV_ESCAPE_UNI_DETECT) && is_utf8_string((U8*)pv, count))
isuni = 1;
#endif
for (; pv < end && (!max || wrote < max) ; pv += readsize) {
const UV u =
#if defined(is_utf8_string) && defined(utf8_to_uvchr_buf)
isuni ? utf8_to_uvchr_buf((U8*)pv, end, &readsize) :
#endif
(U8)*pv;
const U8 c = (U8)u & 0xFF;
if (u > 255 || (flags & PERL_PV_ESCAPE_ALL)) {
if (flags & PERL_PV_ESCAPE_FIRSTCHAR)
chsize = my_snprintf(octbuf, sizeof octbuf,
"%" UVxf, u);
else
chsize = my_snprintf(octbuf, sizeof octbuf,
"%cx{%" UVxf "}", esc, u);

t/cases.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More;

my @result;
{

    package P;
    use strict;
    use warnings;

t/dtext.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More tests => 2;

my $dtext = "";
my $text  = "";

sub append {
    $dtext .= shift;
    $text  .= shift;

t/entities.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Entities qw(decode_entities encode_entities encode_entities_numeric);
use Test::More tests => 31;

my $x = "V&aring;re norske tegn b&oslash;r &#230res";

decode_entities($x);

is($x, "Våre norske tegn bør æres");

t/entities2.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Entities qw(_decode_entities);
use Test::More tests => 9;

{
    local $@;
    my $error;

    #<<<  do not let perltidy touch this
    $error = $@ || 'Error' unless eval {

t/filter-methods.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More tests => 12;

my $p = HTML::Parser->new(api_version => 3, ignore_tags => [qw(b i em tt)],);
$p->ignore_elements("script");
$p->unbroken_text(1);

$p->handler(default => [], "event, text");
$p->parse(<<"EOT")->eof;

t/headparser.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::HeadParser ();
use Test::More tests => 17;

{

    package H;
    use strict;
    use warnings;
    sub new { bless {}, shift; }

t/headparser.t  view on Meta::CPAN


SKIP: {
    # Test that the Unicode BOM does not confuse us?
    $p = HTML::HeadParser->new(H->new);
    ok($p->parse("\x{FEFF}\n<title>Hi <foo></title>"));
    $p->eof;

    is($p->header("title"), "Hi <foo>");

    $p = HTML::HeadParser->new(H->new);
    $p->utf8_mode(1);
    $p->parse(
        <<"EOT"); # example from http://rt.cpan.org/Ticket/Display.html?id=27522
\xEF\xBB\xBF<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
 <head>
 <title>
Parkinson's disease</title>
 <meta name="Keywords" content="brain,disease,dopamine,drug,levodopa,parkinson,patients,symptoms,,Medications, Medications">
 </meta>
 \t

t/headparser.t  view on Meta::CPAN

\t </head>
 <body>
EOT
    $p->eof;

    is($p->header("title"), "Parkinson's disease");
    is($p->header("link")->[0],
        '<../../css/ummAdam.css>; rel="stylesheet"; type="text/css"');

    $p = HTML::HeadParser->new(H->new);
    $p->utf8_mode(1);
    $p->parse(<<"EOT");    # example from http://www.mjw.com.pl/
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\r
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="pl" lang="pl"> \r
\r
<head profile="http://gmpg.org/xfn/11">\r
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />\r
\r
<title> ko\xC5\x84c\xC3\xB3wki kolekcji, outlet, hurtownia odzie\xC5\xBCy Warszawa &#8211; MJW</title>\r
<link rel="shortcut icon" href="favicon.ico" type="image/x-icon" />\r

EOT
    $p->eof;
    is($p->header("title"),
        "ko\xC5\x84c\xC3\xB3wki kolekcji, outlet, hurtownia odzie\xC5\xBCy Warszawa \xE2\x80\x93 MJW"
    );
}

t/marked-sect.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More tests => 14;

my $tag;
my $text;

my $p = HTML::Parser->new(
    start_h => [sub { $tag = shift }, "tagname"],
    text_h  => [sub { $text .= shift }, "dtext"],

t/parser.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use IO::File     ();
use Test::More tests => 7;

my $HTML = <<'HTML';

<!DOCTYPE HTML>

<body>

t/stack-realloc-eof.t  view on Meta::CPAN

use warnings;

use HTML::Parser ();
use Test::More tests => 2;

# HTML-Parser core dumps on this because
# of missing SPAGAIN calls in parse() XS code.  It was not prepared for
# the stack to get realloced.

my $em = <<'EOF';
<html><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /></head><body style='font-size: 10pt; font-family: Verdana,Geneva,sans-serif'>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>
<p><span style="font-size: 10.0pt; font-family: 'Verdana','sans-serif';">cat</span></p>

t/uentities.t  view on Meta::CPAN

use strict;
use warnings;
use utf8;

use HTML::Entities qw(decode_entities encode_entities);
use Test::More tests => 26;

# Test Unicode entities

SKIP: {
    skip "Unicode entities not selected", 26
        if !&HTML::Entities::UNICODE_SUPPORT;

t/unicode-bom.t  view on Meta::CPAN

#!perl -w

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More tests => 2;

my @parsed;
my $p
    = HTML::Parser->new(api_version => 3, start_h => [\@parsed, 'tag, attr'],);

my @warn;
$SIG{__WARN__} = sub {

t/unicode-bom.t  view on Meta::CPAN

$p->eof;

$p->parse("\xFE\xFF\0\0<head>Hi there</head>");
$p->eof;

for (@warn) {
    s/line (\d+)/line ##/g;
}

is(join("", @warn), <<EOT);
Parsing of undecoded UTF-8 will give garbage when decoding entities at $0 line ##.
Parsing of undecoded UTF-8 will give garbage when decoding entities at $0 line ##.
Parsing of undecoded UTF-16 at $0 line ##.
Parsing of undecoded UTF-16 at $0 line ##.
Parsing of undecoded UTF-32 at $0 line ##.
Parsing of undecoded UTF-32 at $0 line ##.
EOT

@warn = ();

$p = HTML::Parser->new(api_version => 3, start_h => [\@parsed, 'tag'],);

$p->parse("\xEF\xBB\xBF<head>Hi there</head>");
$p->eof;
ok(!@warn);

t/unicode.t  view on Meta::CPAN

#!perl -w

use strict;
use warnings;
use utf8;

use HTML::Parser ();
use Test::More tests => 107;

my @warn;
$SIG{__WARN__} = sub {
    push(@warn, $_[0]);
};

my @parsed;

t/unicode.t  view on Meta::CPAN

$p->parse($doc)->eof;

#use Data::Dump; Data::Dump::dump(@parsed);

is(@parsed,       9);
is($parsed[0][0], "start_document");

is($parsed[1][0], "start");
is($parsed[1][1], "<title>");
SKIP: {
    skip "no utf8::is_utf8", 1 if !defined(&utf8::is_utf8);
    ok(utf8::is_utf8($parsed[1][1]), "is_utf8");
}
is($parsed[1][3], 0);
is($parsed[1][4], 7);

is($parsed[2][0],      "text");
is(ord($parsed[2][1]), 0x263A);
is($parsed[2][2],      chr(0x263A));
is($parsed[2][3],      7);
is($parsed[2][4],      1);
is($parsed[2][5],      8);

t/unicode.t  view on Meta::CPAN

is($parsed[7][0], "text");
is($parsed[7][1], "\x{0420}");
is($parsed[7][2], "\x{0420}");

is($parsed[8][0], "end_document");
is($parsed[8][3], length($doc));
is($parsed[8][5], length($doc));
is($parsed[8][6], length($doc));
is(@warn,         0);

# Try to parse it as an UTF8 encoded string
utf8::encode($doc);
is(length($doc), 51);

@parsed = ();
$p->parse($doc)->eof;

#use Data::Dump; Data::Dump::dump(@parsed);

is(@parsed,       9);
is($parsed[0][0], "start_document");

is($parsed[1][0], "start");
is($parsed[1][1], "<title>");
SKIP: {
    skip "no utf8::is_utf8", 1 if !defined(&utf8::is_utf8);
    ok(!utf8::is_utf8($parsed[1][1]), "!is_utf8");
}
is($parsed[1][3], 0);
is($parsed[1][4], 7);

is($parsed[2][0],      "text");
is(ord($parsed[2][1]), 226);
is($parsed[2][1],      "\xE2\x98\xBA");
is($parsed[2][2],      "\xE2\x98\xBA");
is($parsed[2][3],      7);
is($parsed[2][4],      3);

t/unicode.t  view on Meta::CPAN

is($parsed[5][1], "Smile &#x263a");
is($parsed[5][2], "Smile \x{263A}");

is($parsed[8][0], "end_document");
is($parsed[8][3], length($doc));
is($parsed[8][5], length($doc));
is($parsed[8][6], length($doc));

is(@warn, 1);
like($warn[0],
    qr/^Parsing of undecoded UTF-8 will give garbage when decoding entities/);

my $file = "test-$$.html";
open(my $fh, ">:utf8", $file) || die;
print $fh <<EOT;
\x{FEFF}
<title>\x{263A} Love! </title>
<h1 id=&hearts;\x{2665}>&hearts; Love \x{2665}<h1>
EOT
close($fh) || die;

@warn   = ();
@parsed = ();
$p->parse_file($file);
is(@parsed,           "11");
is($parsed[6][0],     "start");
is($parsed[6][8]{id}, "\x{2665}\xE2\x99\xA5");
is($parsed[7][0],     "text");
is($parsed[7][1],     "&hearts; Love \xE2\x99\xA5");
is($parsed[7][2],     "\x{2665} Love \xE2\x99\xA5");    # expected garbage
is($parsed[10][3],    -s $file);
is(@warn,             1);
like($warn[0],
    qr/^Parsing of undecoded UTF-8 will give garbage when decoding entities/);

@warn   = ();
@parsed = ();
open($fh, "<:raw:utf8", $file) || die;
$p->parse_file($fh);
is(@parsed,           "11");
is($parsed[6][0],     "start");
is($parsed[6][8]{id}, "\x{2665}\x{2665}");
is($parsed[7][0],     "text");
is($parsed[7][1],     "&hearts; Love \x{2665}");
is($parsed[7][2],     "\x{2665} Love \x{2665}");
is($parsed[10][3], (-s $file) - 2 * 4);
is(@warn, 0);

@warn   = ();
@parsed = ();
open($fh, "<:raw", $file) || die;
$p->utf8_mode(1);
$p->parse_file($fh);
is(@parsed,           "11");
is($parsed[6][0],     "start");
is($parsed[6][8]{id}, "\xE2\x99\xA5\xE2\x99\xA5");
is($parsed[7][0],     "text");
is($parsed[7][1],     "&hearts; Love \xE2\x99\xA5");
is($parsed[7][2],     "\xE2\x99\xA5 Love \xE2\x99\xA5");
is($parsed[10][3],    -s $file);
is(@warn,             0);

unlink($file);

@parsed = ();
$p->parse(q(<a href="a=1&lang=2&times=3">foo</a>))->eof;
is(@parsed,             "5");
is($parsed[1][0],       "start");
is($parsed[1][8]{href}, "a=1&lang=2\xC3\x97=3");

ok(!HTML::Entities::_probably_utf8_chunk(""));
ok(!HTML::Entities::_probably_utf8_chunk("f"));
ok(HTML::Entities::_probably_utf8_chunk("f\xE2\x99\xA5"));
ok(HTML::Entities::_probably_utf8_chunk("f\xE2\x99\xA5o"));
ok(HTML::Entities::_probably_utf8_chunk("f\xE2\x99\xA5o\xE2"));
ok(HTML::Entities::_probably_utf8_chunk("f\xE2\x99\xA5o\xE2\x99"));
ok(!HTML::Entities::_probably_utf8_chunk("f\xE2"));
ok(!HTML::Entities::_probably_utf8_chunk("f\xE2\x99"));

$p = HTML::Parser->new(
    api_version  => 3,
    default_h    => [\@parsed, 'event, text, tag, attr'],
    attr_encoded => 1,
);

@warn   = ();
@parsed = ();

$p->parse($doc)->eof;

ok(!@warn);
is(@parsed, 9);

@parsed = ();
$p      = HTML::Parser->new(
    api_version   => 3,
    utf8_mode     => 1,
    unbroken_text => 1,
    default_h     => [\@parsed, 'event,dtext'],
);

$p->parse(
    "<p>R\xC3\xA9ductions jusqu'&agrave; -70%.<p>R&eacute;ductions jusqu'&agrave; -70%."
);
$p->eof;

is($parsed[2][1], "R\xC3\xA9ductions jusqu'\xC3\xA0 -70%.");

util.c  view on Meta::CPAN

decode_entities(pTHX_ SV* sv, HV* entity2char, bool expand_prefix)
{
    STRLEN len;
    char *s = SvPV_force(sv, len);
    char *t = s;
    char *end = s + len;
    char *ent_start;

    char *repl;
    STRLEN repl_len;
    char buf[UTF8_MAXLEN];
    int repl_utf8;
    int high_surrogate = 0;

#if defined(__GNUC__)
    /* gcc -Wall reports this variable as possibly used uninitialized */
    repl_utf8 = 0;
#endif

    while (s < end) {
	assert(t <= s);

	if ((*t++ = *s++) != '&')
	    continue;

	ent_start = s;
	repl = 0;

util.c  view on Meta::CPAN

		    if (num > 0x10FFFF) {
			/* overflow */
			ok = 0;
			break;
		    }
		    s++;
		    ok = 1;
		}
	    }
	    if (num && ok) {
		if (!SvUTF8(sv) && num <= 255) {
		    buf[0] = (char) num;
		    repl = buf;
		    repl_len = 1;
		    repl_utf8 = 0;
		}
		else if (num == 0xFFFE || num == 0xFFFF) {
		    /* illegal */
		}
		else {
		    char *tmp;
		    if ((num & 0xFFFFFC00) == 0xDC00) {  /* low-surrogate */
			if (high_surrogate != 0) {
			    t -= 3; /* Back up past 0xFFFD */
			    num = ((high_surrogate - 0xD800) << 10) +

util.c  view on Meta::CPAN

			high_surrogate = 0;
			/* otherwise invalid? */
			if ((num >= 0xFDD0 && num <= 0xFDEF) ||
			    ((num & 0xFFFE) == 0xFFFE) ||
			    num > 0x10FFFF)
			{
			    num = 0xFFFD;
			}
		    }

		    tmp = (char*)uvuni_to_utf8((U8*)buf, num);
		    repl = buf;
		    repl_len = tmp - buf;
		    repl_utf8 = 1;
		}
	    }
	}
	else {
	    char *ent_name = s;
	    while (s < end && isALNUM(*s))
		s++;
	    if (ent_name != s && entity2char) {
		SV** svp;
		if (              (svp = hv_fetch(entity2char, ent_name, s - ent_name, 0)) ||
		    (*s == ';' && (svp = hv_fetch(entity2char, ent_name, s - ent_name + 1, 0)))
		   )
		{
		    repl = SvPV(*svp, repl_len);
		    repl_utf8 = SvUTF8(*svp);
		}
		else if (expand_prefix) {
		    char *ss = s - 1;
		    while (ss > ent_name) {
			svp = hv_fetch(entity2char, ent_name, ss - ent_name, 0);
			if (svp) {
			    repl = SvPV(*svp, repl_len);
			    repl_utf8 = SvUTF8(*svp);
			    s = ss;
			    break;
			}
			ss--;
		    }
		}
	    }
	    high_surrogate = 0;
	}

	if (repl) {
	    char *repl_allocated = 0;
	    if (s < end && *s == ';')
		s++;
	    t--;  /* '&' already copied, undo it */

	    if (*s != '&') {
		high_surrogate = 0;
	    }

	    if (!SvUTF8(sv) && repl_utf8) {
		/* need to upgrade sv before we continue */
		STRLEN before_gap_len = t - SvPVX(sv);
		char *before_gap = (char*)bytes_to_utf8((U8*)SvPVX(sv), &before_gap_len);
		STRLEN after_gap_len = end - s;
		char *after_gap = (char*)bytes_to_utf8((U8*)s, &after_gap_len);

		sv_setpvn(sv, before_gap, before_gap_len);
		sv_catpvn(sv, after_gap, after_gap_len);
		SvUTF8_on(sv);

		Safefree(before_gap);
		Safefree(after_gap);

		s = t = SvPVX(sv) + before_gap_len;
		end = SvPVX(sv) + before_gap_len + after_gap_len;
	    }
	    else if (SvUTF8(sv) && !repl_utf8) {
		repl = (char*)bytes_to_utf8((U8*)repl, &repl_len);
		repl_allocated = repl;
	    }

	    if (t + repl_len > s) {
		/* need to grow the string */
		grow_gap(aTHX_ sv, repl_len - (s - t), &t, &s, &end);
	    }

	    /* copy replacement string into string */
	    while (repl_len--)

util.c  view on Meta::CPAN

    SvCUR_set(sv, t - SvPVX(sv));

    return sv;
}

static bool
has_hibit(char *s, char *e)
{
    while (s < e) {
	U8 ch = *s++;
	if (!UTF8_IS_INVARIANT(ch)) {
	    return 1;
	}
    }
    return 0;
}


EXTERN bool
probably_utf8_chunk(pTHX_ char *s, STRLEN len)
{
    char *e = s + len;
    STRLEN clen;

    /* ignore partial utf8 char at end of buffer */
    while (s < e && UTF8_IS_CONTINUATION((U8)*(e - 1)))
	e--;
    if (s < e && UTF8_IS_START((U8)*(e - 1)))
	e--;
    clen = len - (e - s);
    if (clen && UTF8SKIP(e) == clen) {
	/* all promised continuation bytes are present */
	e = s + len;
    }

    if (!has_hibit(s, e))
	return 0;

    return is_utf8_string((U8*)s, e - s);
}

xt/release/changes_has_content.t  view on Meta::CPAN

use Test::More tests => 2;

note 'Checking Changes';
my $changes_file = 'Changes';
my $newver = '3.83';
my $trial_token = '-TRIAL';
my $encoding = 'UTF-8';

SKIP: {
    ok(-e $changes_file, "$changes_file file exists")
        or skip 'Changes is missing', 1;

    ok(_get_changes($newver), "$changes_file has content for $newver");
}

done_testing;



( run in 0.500 second using v1.01-cache-2.11-cpan-1dc43b0fbd2 )