Alvis-Convert


bin/html2alvis


  Options:

    --html-ext                 filename extension that identifies HTML files
    --meta-ext                 filename extension that identifies meta files
    --out-dir                  output directory
    --N-per-out-dir            number of records per output directory
    --meta-encoding            the encoding of the meta files
    --html-encoding            the encoding of all HTML files
    --html-encoding-from-meta  take the encoding of the HTML files from
                               the meta files (attribute 'detected-charset')
    --[no]original             include original document?
    --help                     brief help message
    --man                      full documentation
    --[no]warnings             warnings output flag
    
=head1 OPTIONS
    
=over 8

=item B<--html-ext>

lib/Alvis/Document/Meta.pm

	    $self->{attr}{url}=$value;
	}
	elsif ($name=~/^\s*date\s*$/isgo)
	{
	    $self->{attr}{date}=$value;
	}
	elsif ($name=~/^\s*title\s*$/isgo)
	{
	    $self->{attr}{title}=$value;
	}
	elsif ($name=~/^\s*detected\s*\-\s*charset\s*$/isgo)
	{
	    $self->{attr}{detectedCharSet}=$value;
	}
	elsif ($name=~/^\s*Meta\-\s*(.*)$/isgo)
	{
	    my $metafield=$1;

	    $metafield=lc($metafield);
	    if (exists($MetaMap{$metafield}))
	    {

lib/Alvis/Document/Meta.pm


See the source for the exact mapping from HTML header fields to
Dublin Core (DC).

Syntax of the meta information file, one feature per line:

       <feature name>\t<feature value>\n

"Special" field names are
      url   
      title
      date
      detected-charset 
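
The line syntax above can be illustrated with a minimal sketch. It is not the module's own parser: the helper name C<parse_meta_text> and the sample input are hypothetical, and the matching is simplified (the module's regular expressions also tolerate whitespace around the hyphen in C<detected-charset>). The key C<detectedCharSet> follows the attribute name used in the source excerpt.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Alvis::Document::Meta: splits text in the
# "<feature name>\t<feature value>\n" format into special and other fields.
sub parse_meta_text {
    my ($text) = @_;

    # Special field names mapped to the attribute keys they are stored under.
    my %special_key = (
        'url'              => 'url',
        'title'            => 'title',
        'date'             => 'date',
        'detected-charset' => 'detectedCharSet',
    );

    my (%attr, @other);
    for my $line (split /\n/, $text) {
        next if $line =~ /^\s*$/;                 # skip blank lines
        my ($name, $value) = split /\t/, $line, 2;
        next unless defined $value;               # skip lines without a tab
        (my $key = lc $name) =~ s/^\s+|\s+$//g;   # case-insensitive, trimmed
        if (exists $special_key{$key}) {
            $attr{ $special_key{$key} } = $value;
        }
        else {
            push @other, [$key, $value];          # left for later mapping
        }
    }
    return (\%attr, \@other);
}

# Hypothetical sample input.
my $text = "url\thttp://www.example.org/\n"
         . "title\tAn example\n"
         . "detected-charset\tutf-8\n"
         . "Meta-Author\tJane Doe\n";

my ($attr, $other) = parse_meta_text($text);
print "$_ = $attr->{$_}\n" for sort keys %$attr;
print "$_->[0] = $_->[1]\n" for @$other;
```

Limiting the split to two fields keeps any further tab characters inside the value, which seems the safer reading of the format.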

=head1 METHODS

=head2 new()

Options:

    text    The text of a meta information file.

=head2 parse($meta)

Maps the features to the Dublin Core set (dc:title etc.). 

"Special" field names are
      url   
      title
      date
      detected-charset 

=head2 get_dcs()

Returns all features mapped to Dublin Core as a list of
[<name>,<value>] pairs:
([<name>,<value>],[<name>,<value>],...)
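
A list in this shape can be consumed as sketched below. The sample data and the helper C<group_pairs> are hypothetical, and the sketch assumes that a name may occur more than once; it does not call the module itself.

```perl
use strict;
use warnings;

# Hypothetical sample in the shape described for get_dcs():
# a list of [name, value] array references.
my @dcs = (
    ['dc:title',   'An example'],
    ['dc:creator', 'Jane Doe'],
    ['dc:creator', 'John Doe'],
);

# Hypothetical helper: groups values by name, preserving their order.
sub group_pairs {
    my (@pairs) = @_;
    my %by_name;
    for my $pair (@pairs) {
        my ($name, $value) = @$pair;
        push @{ $by_name{$name} }, $value;
    }
    return \%by_name;
}

my $grouped = group_pairs(@dcs);
for my $name (sort keys %$grouped) {
    print "$name: ", join('; ', @{ $grouped->{$name} }), "\n";
}
```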

=head2 get($param)

Returns the setting for the attribute. 
"Special" parameters are

t/test-data/to-split/29.xml

        <modifiedDate>1146649940912</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) mod_fastcgi/2.4.2 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.2 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7i</httpServer>
        <urls>
          <url>http://www.searchenginejournal.com/?p=3363</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Yahoo’s YPN Says No to MySpace Traffic If you use MySpace profiles, blogs, comments, and mailings to spam or influence the teenie boppers over at MySpace to clickover to your website and that MySpace traffic is a major source of yo...
      <metaData>
        <meta name="title">Yahoo’s YPN Says No to MySpace Traffic</meta>
        <meta name="dc:type">text/html; charset=utf-8</meta>
      </metaData>
      <links>
        <outlinks>
          <link type="a">
            <anchorText>Jen Slegg</anchorText>
            <location>http://www.jensense.com/archives/2006/05/myspacecom_and.html</location>
          </link>
          <link type="a">
            <anchorText>Problogger.net</anchorText>
            <location>http://www.problogger.net/archives/2006/05/03/yahoo-publisher-network-terminates-more-publisher-accounts/</location>

t/test-data/to-split/29.xml

        <modifiedDate>1150315246240</modifiedDate>
        <httpServer>Apache/1.3.36 (Unix) mod_fastcgi/2.4.2 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.2 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://www.searchenginejournal.com/?p=3530</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>RSS - Things That Make You Go Hmmm Why doesn’t the new Yahoo Spark Blog publish an RSS feed? Of any kind? Not even an “add to my Yahoo” button? Why can’t I subscribe to the Technorati Hot Tags widget that’s (supposedly) upd...
      <metaData>
        <meta name="title">RSS - Things That Make You Go Hmmm</meta>
        <meta name="dc:type">text/html; charset=utf-8</meta>
      </metaData>
      <links>
        <outlinks>
          <link type="a">
            <anchorText>Technorati Hot Tags</anchorText>
            <location>http://www.technorati.com/tags/</location>
          </link>
          <link type="a">
            <anchorText>eBay</anchorText>
            <location>http://www2.ebay.com/aw/core/200603200913002.html</location>


