XHTML results from the CPAN

Daizu

 * Article loading:
    - Articles are loaded (using an article loader plugin) whenever they
      are updated in the database working copies, and the results are
      stored in the database.  This makes publishing much faster, and a
      lot of code simpler, because for example the true article title
      (which may be supplied by the plugin if the user hasn't supplied
      one) is available in the 'wc_file' table.  r503
    - A permalink URL for articles ('article_pages_url') is stored when
      an article is loaded, so that it can be used by index pages or
      whatever to link to the article's first page.  r518
    - The XHTML article loading has been separated out into a plugin.  This
      means you can now provide an alternative plugin for 'text/html' files
      if you don't like my one, but you have to load at least one article
      loader plugin in the config file.  r502
    - The content from an article loader now has XInclude processing done,
      not just for XHTML content.  Other loader plugins could make use of
      that if they want to provide a file inclusion feature.  XInclude is
      now restricted to 'daizu:' URLs, since there may be security issues
      with other URL schemes.  r505
    - Various pieces of code which previously got article metadata and
      permalink URLs from a Daizu::File object now get it directly from
      the 'wc_file' table when that would save a query.  r513, r518

 * Output files are closed properly to check for last-minute errors.  r496

 * Several things are now done inside a database transaction where they

config.xml view on Meta::CPAN

       search engines).
    -->
  <generator class="Daizu::Gen">
   <xml-sitemap />
  </generator>

 <!-- End of configuration specific to 'example.com'. -->
 </config>

 <!-- You'll need at least one plugin to load articles from
      files.  This one is for files which have XHTML fragments
      as their content.  -->
 <plugin class="Daizu::Plugin::XHTMLArticle" />

 <!-- Enable the syntax-highlighting plugin, which is supplied
      with Daizu CMS.  -->
 <plugin class="Daizu::Plugin::SyntaxHighlight" />

 <!-- This plugin adds convenient 'anchors' to headings, so
      that you can link to specific sections of pages.  -->
 <plugin class="Daizu::Plugin::HeaderAnchor" />

lib/Daizu.pm view on Meta::CPAN

Value: I</etc/daizu/config.xml>

=item $Daizu::CONFIG_NS

The URI used as an XML namespace for the elements in the config file.

Value: L<http://www.daizucms.org/ns/config/>

=item $Daizu::HTML_EXTENSION_NS

The URI used as an XML namespace for special elements in XHTML content.

Value: L<http://www.daizucms.org/ns/html-extension/>

=item $Daizu::HIDING_FILENAMES

A list of file and directory names which prevent any publication of
files with one of the names, or anything inside a directory so named.
Separated by '|' so that the whole string can be included in Perl
and PostgreSQL regular expressions.
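
As a rough illustration of why the '|' separator is convenient, the sketch below interpolates such a string into a Perl regular expression and tests every component of a path. The value C<_hide|_template> and the helper name C<is_hidden_path> are assumptions made for this example; they are not taken from the Daizu source.

```perl
use strict;
use warnings;

# Hypothetical value; the real list lives in $Daizu::HIDING_FILENAMES.
my $HIDING_FILENAMES = '_hide|_template';

# Returns 1 if any component of $path is exactly one of the hiding names,
# so that both the named file/directory and anything inside it are caught.
sub is_hidden_path
{
    my ($path) = @_;
    return $path =~ m{(?:^|/)(?:$HIDING_FILENAMES)(?:/|$)} ? 1 : 0;
}

for my $path ('blog/_template/page.tt', 'blog/2006/article.html') {
    print "$path: ", (is_hidden_path($path) ? 'hidden' : 'visible'), "\n";
}
```

Because the alternation is wrapped in a non-capturing group, the same string could be dropped into a larger pattern without changing its meaning.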

lib/Daizu.pm view on Meta::CPAN

return false to indicate that it can't handle the file.

The hash returned can contain the following values:

=over

=item content

Required.  All the other values are optional.

This should be an XHTML DOM of the article's content, as it will be published.
It should be an L<XML::LibXML::Document> object, with a root element called
C<body> in the XHTML namespace.  It can contain extension elements to be
processed by article filter plugins.  It can contain XInclude elements,
which will be processed by the
L<expand_xinclude() function|Daizu::Util/expand_xinclude($db, $doc, $wc_id, $path)>.
Entity references should not be present.

=item title

The title to use for the article.  If this is present and not undef then
it will override the value of the C<dc:title> property.

lib/Daizu.pm view on Meta::CPAN


sub add_article_loader
{
    my ($self, $mime_type, $path, $object, $method) = @_;
    push @{$self->{article_loaders}{$mime_type}{$path}}, [ $object => $method ];
}
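
The following sketch shows the data structure that C<add_article_loader()> builds up. The C<My::Registry> and C<My::Plugin> packages are hypothetical stand-ins for the real Daizu object and a plugin; only the body of C<add_article_loader()> is copied from the excerpt above.

```perl
use strict;
use warnings;

package My::Registry;   # hypothetical stand-in for the Daizu object

sub new { return bless { article_loaders => {} }, shift }

# Same body as add_article_loader() in the excerpt above.
sub add_article_loader
{
    my ($self, $mime_type, $path, $object, $method) = @_;
    push @{$self->{article_loaders}{$mime_type}{$path}}, [ $object => $method ];
}

package main;

my $cms = My::Registry->new;
my $plugin = bless {}, 'My::Plugin';    # hypothetical plugin object

# Register one loader for two MIME types, for all paths ('').
$cms->add_article_loader($_, '', $plugin => 'load_article')
    for qw( text/html application/xhtml+xml );

# Each MIME type now maps the path '' to a list of [object, method] pairs.
for my $type (sort keys %{$cms->{article_loaders}}) {
    my $count = @{ $cms->{article_loaders}{$type}{''} };
    print "$type: $count loader(s) registered for path ''\n";
}
```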

=item $cms-E<gt>add_html_dom_filter($path, $object, $method)

Plugins can use this to register a method which will be called whenever
an XHTML file is being published.  C<$method> (a method name) will be
called on C<$object>, and will be passed C<$cms>, a L<Daizu::File> object
for the file being filtered, and an XML DOM
of the source, as an L<XML::LibXML::Document> object.  The plugin method
should return a reference to a hash containing a C<content> value which
is the filtered content, either a completely new copy of the DOM
or the same value it was passed (which it might have modified in place).

The returned hash can also contain an C<extra_urls> array, in the same
way as an article loader, if the filter adds additional URLs for the file.
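
A minimal sketch of such a filter method is shown below. The package name C<My::Plugin::ExternalLinkClass>, the method name C<filter> and the idea of marking external links are all invented for illustration; only the argument list and the shape of the return value follow the description above. In a real plugin the method would presumably be registered from C<register()> with C<add_html_dom_filter()>.

```perl
use strict;
use warnings;
use XML::LibXML;

my $XHTML_NS = 'http://www.w3.org/1999/xhtml';

package My::Plugin::ExternalLinkClass;   # hypothetical plugin

# Filter method: receives $cms, the file object and the DOM, and returns
# a hash reference with the filtered DOM under 'content'.
sub filter
{
    my ($self, $cms, $file, $doc) = @_;
    my $xpc = XML::LibXML::XPathContext->new($doc);
    $xpc->registerNs(x => $XHTML_NS);
    for my $link ($xpc->findnodes('//x:a[starts-with(@href, "http")]')) {
        $link->setAttribute(class => 'external');
    }
    return { content => $doc };   # same DOM, modified in place
}

package main;

my $doc = XML::LibXML->load_xml(string =>
    qq{<body xmlns="$XHTML_NS"><p><a href="http://example.org/">x</a></p></body>});

# $cms and $file are not needed by this sketch, so undef is passed for both.
my $result = My::Plugin::ExternalLinkClass->filter(undef, undef, $doc);
print $result->{content}->documentElement->toString, "\n";
```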

lib/Daizu/Feed.pm view on Meta::CPAN


If the article's content contains a 'fold' (indicated with a C<daizu:fold>
element) or a page break, then only the content before the fold or first page
break is included in the feed.  If there is any more content in the full
article then a text link to the article's URL is included after the extract
to make it more obvious that only part of the article is shown.  If there
is no fold or page break then the full article is included in the feed, as
for the C<content> type feeds described below.

For Atom feeds: the extract of the article content is provided as raw
XHTML in an C<atom:content> element.

For RSS feeds: the extract of the article content is provided in a
C<content:encoded> element.  The C<description> element will still carry
the description or extract as described above.

=item content

The full content of the article is included in the feed, even if the article
has page breaks.  Any C<daizu:fold> elements or C<daizu:page> elements in
the article's content will be ignored (and will not appear in the feed).

For Atom feeds: the article content is provided as raw XHTML in an
C<atom:content> element.

For RSS feeds: the article content is provided in a C<content:encoded>
element.  The C<description> element will still carry the description or
extract as described above.

=back

=head1 METHODS

lib/Daizu/File.pm view on Meta::CPAN


The first 'article' page URL is the one which should be used when linking
to an article, unless you have some special reason to link to a particular
page or an alternative URL for the same file.  For example, this is
the URL which will be included in blog feeds and navigation menus.
To get at it conveniently, see the L<permalink()|/$file-E<gt>permalink> method.

=item *

There may be additional URLs for supplementary resources generated by
plugins, although by default a simple article written in XHTML won't
have any 'extra' URLs.  These URLs are the ones supplied by the article
loader plugin as C<extra_urls>, and stored in the database in the
C<wc_article_extra_url> table.  One example of an 'extra'
URL is a POD file (Perl documentation, like this document itself)
published with the L<Daizu::Plugin::PodArticle> plugin.  If the filename
of the POD file ends in '.pm', then this plugin will add an extra
URL for the original source code, since that might be of interest
to programmers reading API documentation.

=back

lib/Daizu/File.pm view on Meta::CPAN

}

sub _update_loaded_article_in_db_txn
{
    my ($self) = @_;
    my $cms = $self->{cms};

    my $mime_type = $self->{content_type};
    if (!defined $mime_type) {
        # Articles must have a mime type, but allow a default based on file
        # extension for the built-in XHTML format.
        croak "article in file '$self->{path}' has no mime type specified"
            unless $self->{name} =~ /\.html?$/i;
        $mime_type = 'text/html';
    }
    $mime_type =~ m!^(.+?)/!
        or croak "bad article mime type '$mime_type' in file '$self->{path}'";
    my $mime_type_family = "$1/*";

    # Search through applicable MIME type patterns.
    my $file_path = $self->{path};
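
The MIME type handling in the excerpt above can be tried out on its own. The sketch below repeats the same defaulting and pattern-derivation logic in a standalone function; the function name C<mime_type_patterns> and the choice to return the exact type followed by the C<type/*> family are assumptions for this example, since the rest of the original method is not shown.

```perl
use strict;
use warnings;
use Carp qw( croak );

# Returns the exact MIME type followed by its family pattern ('type/*').
# A missing type is only tolerated for files named *.htm or *.html.
sub mime_type_patterns
{
    my ($name, $mime_type) = @_;
    if (!defined $mime_type) {
        croak "article in file '$name' has no mime type specified"
            unless $name =~ /\.html?$/i;
        $mime_type = 'text/html';
    }
    $mime_type =~ m!^(.+?)/!
        or croak "bad article mime type '$mime_type' in file '$name'";
    return ($mime_type, "$1/*");
}

print join(', ', mime_type_patterns('index.html', undef)), "\n";
print join(', ', mime_type_patterns('Daizu.pm', 'text/x-perl')), "\n";
```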

lib/Daizu/HTML.pm view on Meta::CPAN

use XML::LibXML;
use HTML::Tagset;
use URI;
use Encode qw( encode );
use Carp qw( croak );
use Carp::Assert qw( assert DEBUG );
use Daizu::Util qw( trim );

=head1 NAME

Daizu::HTML - functions for handling HTML and XHTML content

=head1 FUNCTIONS

The following functions are available for export from this module.
None of them are exported by default.

=over

=item dom_body_to_html4($doc, [$start_node], [$end_node])

Given an L<XML::LibXML::Document> object for an XHTML document fragment,
whose root element should be C<body>, returns a string representation of
the content in S<HTML 4> format.

C<$start_node> and C<$end_node> are both independently optional.
If either is present then only part of the document will be presented
in the HTML output.  Both must be either C<undef> or a node from the
root (C<body>) element of the document.  C<$start_node> should be the first
node to be shown in the output, or C<undef> to start from the beginning.
C<$end_node> should be the node I<after> the last node to be output,
or C<undef> to end at the end of the document.
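
The C<$start_node>/C<$end_node> pair describes a half-open range over the children of C<body>: the start node is included and the end node is not. The sketch below walks such a range with plain L<XML::LibXML> calls. The helper name C<nodes_in_range> is hypothetical and this is not the module's own implementation; it only illustrates the selection rule, using a C<daizu:fold> element as the boundary.

```perl
use strict;
use warnings;
use XML::LibXML;

my $XHTML_NS = 'http://www.w3.org/1999/xhtml';
my $DAIZU_NS = 'http://www.daizucms.org/ns/html-extension/';

# Collect the children of $body from $start_node (or the first child if
# undef) up to, but not including, $end_node (or the end if undef).
sub nodes_in_range
{
    my ($body, $start_node, $end_node) = @_;
    my @nodes;
    my $node = defined $start_node ? $start_node : $body->firstChild;
    while (defined $node) {
        last if defined $end_node && $node->isSameNode($end_node);
        push @nodes, $node;
        $node = $node->nextSibling;
    }
    return @nodes;
}

my $doc = XML::LibXML->load_xml(string =>
    qq{<body xmlns="$XHTML_NS" xmlns:daizu="$DAIZU_NS">}
  . q{<p>Intro</p><daizu:fold/><p>Rest</p></body>});
my $body = $doc->documentElement;
my ($fold) = $body->getChildrenByTagNameNS($DAIZU_NS, 'fold');

my $before = join '', map { $_->textContent } nodes_in_range($body, undef, $fold);
my $after  = join '', map { $_->textContent }
                 nodes_in_range($body, $fold->nextSibling, undef);
print "before fold: $before\n";
print "after fold: $after\n";
```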

lib/Daizu/HTML.pm view on Meta::CPAN

#       XML::LibXML::XML_DTD_NODE = 14
#       XML::LibXML::XML_ELEMENT_DECL = 15
#       XML::LibXML::XML_ATTRIBUTE_DECL = 16
#       XML::LibXML::XML_ENTITY_DECL = 17
#       XML::LibXML::XML_NAMESPACE_DECL = 18
#       XML::LibXML::XML_DOCB_DOCUMENT_NODE = 21
}

=item dom_body_to_text($doc)

Given an XHTML body (as an L<XML::LibXML::Document> object in the usual
format) return a plain text version of the content, with some markup
translated into text formatting in a limited way to make it reasonably
readable.

=cut

sub dom_body_to_text
{
    my ($doc) = @_;
    my $text = '';

lib/Daizu/HTML.pm view on Meta::CPAN


Return a new version of the article content in C<$doc>, without the bits of
markup which aren't relevant or might be unwelcome in feed content,
such as C<script> elements and C<style> attributes.  Also remove C<span>
elements because they're not needed when there's no custom styling,
and Bloglines currently turns them into invalid HTML.  Also remove
C<class> attributes in case they cause some unexpected styling to be
applied.

In addition, any elements in the Daizu HTML extension namespace are
removed.  Elements in other non-XHTML namespaces will cause this function
to fail.  They shouldn't be there by the time the content is being output
anyway.

Both C<$doc> and the return value are L<XML::LibXML::Document> objects
of the kind returned by
L<the article_doc() method in Daizu::File|Daizu::File/$file-E<gt>article_doc>.
The original DOM in C<$doc> is not altered.  The return value is a
completely independent copy.

=cut

lib/Daizu/HTML.pm view on Meta::CPAN

#       XML::LibXML::XML_DTD_NODE = 14
#       XML::LibXML::XML_ELEMENT_DECL = 15
#       XML::LibXML::XML_ATTRIBUTE_DECL = 16
#       XML::LibXML::XML_ENTITY_DECL = 17
#       XML::LibXML::XML_NAMESPACE_DECL = 18
#       XML::LibXML::XML_DOCB_DOCUMENT_NODE = 21
}

=item absolutify_links($doc, $base_url)

Given an XHTML document (as an L<XML::LibXML::Document> object), find
all the attributes in the markup which are relative URLs and turn them
into absolute URLs relative to C<$base_url>.  This can be used to prepare
content from an article to be published in a different place with a different
URL, such as in an RSS feed or on an index page, while ensuring that any
links or embedded files continue to work.

The document's elements must be in the XHTML namespace, or they will be
ignored.

TODO - some of this could be refactored with the link replacing stuff
in Daizu::Preview to be more thorough.  For now though it just works on
'a href' and 'img src', since that will catch almost all cases.
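
The core of the operation, resolving one relative URL against a base, can be sketched with the L<URI> module that this file already loads. The helper name C<absolutify> and the example URLs are assumptions for illustration; the real function applies this kind of resolution to attributes found in the DOM.

```perl
use strict;
use warnings;
use URI;

# Resolve a single (possibly relative) URL against $base_url and return
# it as a string.  URLs which are already absolute come back unchanged.
sub absolutify
{
    my ($url, $base_url) = @_;
    return URI->new_abs($url, $base_url)->as_string;
}

my $base_url = 'http://example.com/blog/2006/fish-fingers/';
for my $url ('photo.jpg', '../index.html', '/about/', 'http://other.example/x') {
    print "$url => ", absolutify($url, $base_url), "\n";
}
```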

=cut

sub absolutify_links
{

lib/Daizu/Plugin/PodArticle.pm view on Meta::CPAN

# containing the name and version number of my POD translator.

=head1 NAME

Daizu::Plugin::PodArticle - a plugin for publishing Perl POD documentation on websites

=head1 DESCRIPTION

This plugin adds the ability for Daizu CMS to load content from POD files
(or Perl code containing POD documentation).  Once this module has parsed
the file it provides Daizu with the content in XHTML format (as a DOM
structure), and from then on it can be treated as a normal article.

With this module loaded it should be possible to publish Perl documentation
simply by adding the files containing POD to the repository, marking them
as being articles like any other, and giving them a C<svn:mime-type>
property with the value 'text/x-perl'.

=head1 CONFIGURATION

To turn on this plugin, include the following in your Daizu CMS configuration

lib/Daizu/Plugin/PodArticle.pm view on Meta::CPAN

enabled too.

Each of these C<=for> commands will only affect a single indented
block (whichever one is found next).  Blank lines in blocks won't
break them up; the syntax highlighting will last up until the next
thing which isn't indented (a command or a normal paragraph).

=item The fold

You can get the same effect as the special C<daizu:fold> element gives
in XHTML articles using the following markup:

=for syntax-highlight pod

    =for daizu-fold

This is not likely to be useful unless you're writing blog articles
in POD, in which case the content above the fold will be shown in
index pages (and possibly feeds, depending on how they're configured).

=item Page breaks

You can get the same effect as the special C<daizu:page> element gives
in XHTML articles using the following markup:

=for syntax-highlight pod

    =for daizu-page

Occurrences of this will separate pages of content, allowing a long
document to be split into multiple pages for web publication.

=back

lib/Daizu/Plugin/PodArticle.pm view on Meta::CPAN

        extra_urls => \@extra_url,
        extra_templates => \@extra_template,
    };
}

=back

=head1 Daizu::Plugin::PodArticle::Parser

This class is the subclass of L<Pod::Parser> used for parsing POD documents
into XHTML DOM documents.  It overrides the methods
L<command()|Pod::Parser/command()>,
L<textblock()|Pod::Parser/textblock()>, and
L<verbatim()|Pod::Parser/verbatim()>.

=cut

package Daizu::Plugin::PodArticle::Parser;
use base 'Pod::Parser';

use XML::LibXML;

lib/Daizu/Plugin/SyntaxHighlight.pm view on Meta::CPAN


use Text::VimColor;
use Daizu;

=head1 NAME

Daizu::Plugin::SyntaxHighlight - a plugin for syntax-highlighting code samples in HTML pages

=head1 DESCRIPTION

This plugin filters XHTML content, expanding any
C<daizu:syntax-highlight> elements by passing their contents through
the L<Text::VimColor> module, which is required for it to work.  The source
of your articles can contain markup like this:

=for syntax-highlight xml

    <daizu:syntax-highlight filetype="perl">
    # A piece of Perl code which will be syntax highlighted.
    my $foo = 'bar';
    </daizu:syntax-highlight>

The C<daizu> prefix should be bound to the
L<Daizu HTML extension namespace|Daizu/$Daizu::HTML_EXTENSION_NS>
(which is done automatically for the content of XHTML articles).  The output
will be an HTML C<pre> element, containing text and C<span> elements with
appropriate classes.

Extra whitespace at the start or end of the content is trimmed off.

If you want to highlight a larger amount of code, put it in a separate
file and use XInclude to insert it into the C<syntax-highlight> element.
For example:

=for syntax-highlight xml

lib/Daizu/Plugin/XHTMLArticle.pm view on Meta::CPAN

sub register
{
    my ($class, $cms, $whole_config, $plugin_config, $path) = @_;
    my $self = bless {}, $class;
    $cms->add_article_loader($_, '', $self => 'load_article')
        for qw( text/html application/xhtml+xml );
}

=item $self-E<gt>load_article($cms, $file)

Does the actual parsing of the XHTML content of C<$file> (which should
be a L<Daizu::File> object), and returns the appropriate content as an XHTML
DOM of the file.

Never rejects a file, and therefore always returns true.

=cut

sub load_article
{
    my ($self, $cms, $file) = @_;

lib/Daizu/Preview.pm view on Meta::CPAN

    $parser->parse($html);
    $parser->eof;
}

sub _start_h
{
    my ($cms, $wc_id, $base_url, $fh, $in_style, $tagname, $attr) = @_;

    ++$$in_style if $tagname eq 'style';

    delete $attr->{'/'};      # to cope with XHTML empty elements

    # The keys are sorted to allow for testing.
    my $attrtext = join ' ', map {
        "$_=\"" . html_escape_attr(exists $HTML_URL_ATTR{"$tagname:$_"}
            ? adjust_link_for_preview($cms, $wc_id, $base_url, $attr->{$_},
                                       $HTML_URL_ATTR{"$tagname:$_"})
            : $attr->{$_}) . '"';
    } sort keys %$attr;

    print $fh ($attrtext ? "<$tagname $attrtext>" : "<$tagname>");
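
The attribute handling above can be isolated into a small standalone sketch. Here C<escape_attr> is a simplified stand-in for C<html_escape_attr()>, whose real definition is not shown in this excerpt, and the URL rewriting done by C<adjust_link_for_preview()> is left out entirely. Sorting the keys gives a deterministic attribute order, which is what makes the output testable.

```perl
use strict;
use warnings;

# Simplified stand-in for html_escape_attr(); an assumption for this sketch.
sub escape_attr
{
    my ($value) = @_;
    $value =~ s/&/&amp;/g;
    $value =~ s/</&lt;/g;
    $value =~ s/>/&gt;/g;
    $value =~ s/"/&quot;/g;
    return $value;
}

# Build the text of a start tag from a tag name and an attribute hash.
sub start_tag
{
    my ($tagname, $attr) = @_;
    my %attr = %$attr;          # copy, so the caller's hash is untouched
    delete $attr{'/'};          # to cope with XHTML empty elements
    my $attrtext = join ' ',
        map { "$_=\"" . escape_attr($attr{$_}) . '"' } sort keys %attr;
    return $attrtext ? "<$tagname $attrtext>" : "<$tagname>";
}

print start_tag('img', { src => 'a.png', alt => 'A "quoted" title', '/' => '/' }), "\n";
print start_tag('br', { '/' => '/' }), "\n";
```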

lib/Daizu/xml/xhtml-special.ent view on Meta::CPAN

<!-- Special characters for XHTML -->

<!-- Character entity set. Typical invocation:
     <!ENTITY % HTMLspecial PUBLIC
        "-//W3C//ENTITIES Special for XHTML//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-special.ent">
     %HTMLspecial;
-->

<!-- Portions (C) International Organization for Standardization 1986:
     Permission to copy in any form is granted for use with

lib/Daizu/xml/xhtml-symbol.ent view on Meta::CPAN

<!-- Mathematical, Greek and Symbolic characters for XHTML -->

<!-- Character entity set. Typical invocation:
     <!ENTITY % HTMLsymbol PUBLIC
        "-//W3C//ENTITIES Symbols for XHTML//EN"
        "http://www.w3.org/TR/xhtml1/DTD/xhtml-symbol.ent">
     %HTMLsymbol;
-->

<!-- Portions (C) International Organization for Standardization 1986:
     Permission to copy in any form is granted for use with

t/24file.t view on Meta::CPAN


    # There should be three paragraphs and a daizu:fold element, and the
    # only other nodes at the top level should be text (newlines).
    my $node = $body->firstChild;
    my $pos = 1;
    while (defined $node) {
        if ($pos == 1 || $pos == 3 || $pos == 7) {
            isa_ok($node, 'XML::LibXML::Element', "article_doc: $pos: element");
            is($node->localname, 'p', "article_doc: $pos: <p>");
            is($node->namespaceURI, 'http://www.w3.org/1999/xhtml',
               "article_doc: $pos: XHTML namespace");
        }
        elsif ($pos == 5) {
            isa_ok($node, 'XML::LibXML::Element', "article_doc: $pos: element");
            is($node->localname, 'fold', "article_doc: $pos: <fold>");
            is($node->namespaceURI, $Daizu::HTML_EXTENSION_NS,
               "article_doc: $pos: Daizu HTML extension namespace");
        }
        else {
            assert($pos <= 8);
            isa_ok($node, 'XML::LibXML::Text', "article_doc: $pos: text");

t/24file.t view on Meta::CPAN

    my $text = $para[2]->textContent;
    is($text, "It also has some UTF-8 stuff:\x{A0}\x{201C}\x{2014}\x{201D}",
       'article_doc: article 2, UTF-8 characters preserved');

    # Make sure the filtering has been done for the <daizu:syntax-highlight/>
    # element.  It should have been replaced by a <pre> element.
    $doc = $file_5->article_doc;
    my (@pre) = $doc->documentElement->getChildrenByTagName('pre');
    is(scalar @pre, 1, 'article_doc: article 5, syntax highlighting done');
    is($pre[0]->namespaceURI, 'http://www.w3.org/1999/xhtml',
       'article_doc: article 5, new <pre> element in XHTML namespace');
    $text = $pre[0]->textContent;
    like($text, qr/syntax coloured external file/,
       'article_doc: article 5, highlighting on text from XIncluded file');
}

# article_body
{
    my $body = $file_1->article_body;
    isa_ok($body, 'XML::LibXML::Element', 'article_body: is element');
    is($body->localname, 'body', 'article_body: is <body>');
    is($body->namespaceURI, 'http://www.w3.org/1999/xhtml',
       'article_body: XHTML namespace');
}

# article_content_html4
{
    is($file_2->article_content_html4,
       "<p>Blog article 2</p>\n\n" .
       "<p>This one has three pages but no fold mark, so the first" .
       " page break\012should be treated like a fold.</p>\n\n" .
       enc("<!-- Unicode text: \x{8A9E} -->\n" .
           "<p title=\"Some \x{2018}UTF-8\x{2019} text\">" .

t/80feeds.t view on Meta::CPAN

    (@elem) = $entry_elem->getChildrenByTagName('content');
    is(scalar @elem, 1, 'atom: one entry <content>');
    is($elem[0]->getAttributeNS('http://www.w3.org/XML/1998/namespace', 'base'),
       $exp_url, 'atom: entry <content> xml:base');
    my $content_elem = $elem[0];
    is($content_elem->getAttribute('type'), 'xhtml',
       'atom: entry <content> type');
    (@elem) = $content_elem->getChildrenByTagName('div');
    is(scalar @elem, 1, 'atom: one entry <content><div>');
    is($elem[0]->namespaceURI, 'http://www.w3.org/1999/xhtml',
       'atom: entry <content><div> in XHTML namespace');

    # Article 5 has a syntax highlighted bit in, but for the feed
    # content the <span> elements and 'class' attribute should be
    # removed, so as not to confuse aggregators.
    if ($exp_title eq 'Article 5') {
        (@elem) = $content_elem->getElementsByTagName('span');
        is(scalar @elem, 0, 'atom: entry <content> has no <span> elements');
        (@elem) = $content_elem->getElementsByTagName('pre');
        SKIP: {
            is(scalar @elem, 1, 'atom: entry <content> has a <pre> element');

t/data/15html/text-expected.txt view on Meta::CPAN

A short paragraph.

A long paragraph: Given an XHTML document (as an
L<XML::LibXML::Document> object), find all the attributes in the markup
which are relative URLs and turn them into absolute URLs relative to
C<$base_url>. This can be used to prepare content from an article to be
published in a different place with a different URL, such as in an RSS
feed or on an index page, while ensuring that any links or embedded
files continue to work.

extra whitespace around it

Bullet list:

t/data/15html/text-input.html view on Meta::CPAN

<body xmlns="http://www.w3.org/1999/xhtml"
      xmlns:daizu="http://www.daizucms.org/ns/html-extension/"
      xmlns:xi="http://www.w3.org/2001/XInclude"
      xml:base="daizu:///foo.com/blog/2006/fish-fingers/article-1.html"
><p>A short paragraph.</p>


<p>A long paragraph:

Given an XHTML document (as an L&lt;XML::LibXML::Document&gt; object), find all the attributes in the markup which are relative URLs and turn them
into absolute URLs relative to C&lt;$base_url&gt;.  This can be used to prepare
content from an article to be published in a different place with a different
URL, such as in an RSS feed or on an index page, while ensuring that any links or embedded files continue to work.

</p>


	  <p>extra whitespace around it</p>  

<p>Bullet list:</p>
