Alvis-Convert


t/test-data/original/0/101.alvis

<?xml version="1.0" encoding="UTF-8"?>
<documentCollection>
  <documentRecord id="0717FBB236A4A067DC9BE4FA48801BE3">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1141065614536</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) DAV/1.0.3 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.1 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://battellemedia.com/archives/2004_08.php</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>I'm slow to report the news here (the embargo lifted last night at 9 pm) but today Yahoo launched its local search product. I was on an informal "advisory board" for this product, but I have to admit that my focus on the book did not...
      <metaData>
        <meta name="title">John Battelle's Searchblog: August 2004 Archives</meta>
        <meta name="dc.date">2004-08-03</meta>
        <meta name="dc.type">text/html</meta>

t/test-data/original/1/1000.alvis

<?xml version="1.0" encoding="UTF-8"?>
<documentCollection>
  <documentRecord id="A40C00BE1D17086E4AE59ABB355FFF3C">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1141065636598</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) DAV/1.0.3 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.1 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://battellemedia.com/archives/2005_01.php</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Many of you may have noticed that last night Searchblog was hacked, apparently by someone in Albania (!). For a brief period of time my site redirected to a very odd page, and it appeared I had entirely lost my mind. All is well now,...
      <metaData>
        <meta name="title">John Battelle's Searchblog: January 2005 Archives</meta>
        <meta name="dc.date">2005-01-15</meta>
        <meta name="dc.type">text/html</meta>

t/test-data/to-split/29.xml

<?xml version="1.0" encoding="UTF-8"?>
<documentCollection  xmlns="http://alvis.info/enriched/" version="1.1">
<documentRecord id="A4AFC8E9BD3073A4EFADEB400B80D54A" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1146649940912</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) mod_fastcgi/2.4.2 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.2 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7i</httpServer>
        <urls>
          <url>http://www.searchenginejournal.com/?p=3363</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Yahoo’s YPN Says No to MySpace Traffic If you use MySpace profiles, blogs, comments, and mailings to spam or influence the teenie boppers over at MySpace to clickover to your website and that MySpace traffic is a major source of yo...
      <metaData>
        <meta name="title">Yahoo’s YPN Says No to MySpace Traffic</meta>
        <meta name="dc:type">text/html; charset=utf-8</meta>
      </metaData>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Yahoo Publisher Network</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google AdSense</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="A62EEF2D8BE45A8D097087B515598C68" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1148355445154</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) DAV/1.0.3 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.1 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://battellemedia.com/archives/002584.php</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Two items of very related interest today: 1. Wired News Releases Full Text of AT&amp;T NSA Document (Slashdot). 2. Gonzales Says Publishing Leaks Is A Crime (Also Slashdot) Thank God for outlets like Wired. And best of luck.</section...
      <metaData>
        <meta name="title">Wired News: Will the US Sue?</meta>
        <meta name="dc:type">text/html</meta>
      </metaData>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Google Blog Search</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google News</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="48FFC0A03C2756C583F6D80C9E527393" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1142422246164</modifiedDate>
        <httpServer>Apache/1.3.33 (Unix)</httpServer>
        <urls>
          <url>http://blog.outer-court.com/archive/2006-03-15-n42.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Google releases their desktop search tool in an updated version today. Among some bugfixes, there’s a new Quick Search box. Hit Ctrl twice to make it appear in the middle of your desktop, and then search for anything – your compu...
      <metaData>
        <meta name="title">Google Desktop's Quick Search Box</meta>
        <meta name="dc:date">Wed, 15 Mar 2006 11:20:57 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Google</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google Desktop</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="18C9FD35812DFC4D4CCF0FD6AC1646BC" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1149133052555</modifiedDate>
        <httpServer>Apache/1.3.33 (Unix)</httpServer>
        <urls>
          <url>http://blog.outer-court.com/archive/2006-05-30-n12.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Some bloggers are complaining that Google didn’t have a Memorial day logo yesterday. Memorial Day “commemorates U.S. men and women who have died in military service,”Wikipedia explains. From a comment at Newsbusters by Warner T...
      <metaData>
        <meta name="title">Complaints Due to Lack of Google Memorial Day Logo</meta>
        <meta name="dc:date">Thu, 01 Jun 2006 02:44:56 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Wikipedia</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="0770964CAC923ACCDC189E0EA4208AE0" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1141993156883</modifiedDate>
        <httpServer>Apache/1.3.34 (Unix) DAV/1.0.3 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.1 FrontPage/5.0.2.2635 mod_ssl/2.8.25 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://battellemedia.com/archives/002391.php</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>From a Reuters story: Sen. Ron Wyden on Thursday proposed legislation aimed at preventing high-speed Internet service providers from charging content companies extra so consumers have faster access to their Web sites or receive speci...
      <metaData>
        <meta name="title">Net Neutrality Bill Unveiled</meta>
        <meta name="dc:type">text/html</meta>
      </metaData>

t/test-data/to-split/29.xml

  <linguisticAnalysis>
    <semantic_unit_level>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="35D3C71D8D04A7A782CD2E8CBF17220C" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1144681935588</modifiedDate>
        <httpServer>Apache/1.3.28 (Unix) mod_gzip/1.3.26.1a PHP/4.3.10 mod_ssl/2.8.15 OpenSSL/0.9.7c</httpServer>
        <urls>
          <url>http://www.seroundtable.com/archives/003633.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>A featured Search Engine Watch Forum thread named SEO &amp; Newspapers discusses a recent NYTimes article named This Boring Headline Is Written for Google. The first paragraph of the article somes it up; Journalists over the years ha...
      <metaData>
        <meta name="title">New York Times Changes Web Only Headlines To Be Search Engine Friendly</meta>
        <meta name="dc:date">Mon, 10 Apr 2006 13:37:11 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Google Search</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google Video</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="7F0D97BDACC9D73DA79364ADF93A9080" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1144768340466</modifiedDate>
        <httpServer>Apache/1.3.28 (Unix) mod_gzip/1.3.26.1a PHP/4.3.10 mod_ssl/2.8.15 OpenSSL/0.9.7c</httpServer>
        <urls>
          <url>http://www.seroundtable.com/archives/003639.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>There is a DigitalPoint Forum thread named that discusses a neat PageRank tool at http://www.webmastereyes.com/. The PageRank tool is different from others, in that it will enable you to plug in a URL and it will then place graphical...
      <metaData>
        <meta name="title">New Google PageRank Tool Plots PR Values Overlays On Page</meta>
        <meta name="dc:date">Tue, 11 Apr 2006 12:40:49 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Google</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Google PageRank</form><named_entity_type>soft</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="E25E5DBF90E6C6A3CDF200F61F6A20E6" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1150315246240</modifiedDate>
        <httpServer>Apache/1.3.36 (Unix) mod_fastcgi/2.4.2 mod_auth_passthrough/1.8 mod_log_bytes/1.2 mod_bwlimited/1.4 PHP/4.4.2 FrontPage/5.0.2.2635.SR1.2 mod_ssl/2.8.27 OpenSSL/0.9.7a</httpServer>
        <urls>
          <url>http://www.searchenginejournal.com/?p=3530</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>RSS - Things That Make You Go Hmmm Why doesn’t the new Yahoo Spark Blog publish an RSS feed? Of any kind? Not even an “add to my Yahoo” button? Why can’t I subscribe to the Technorati Hot Tags widget that’s (supposedly) upd...
      <metaData>
        <meta name="title">RSS - Things That Make You Go Hmmm</meta>
        <meta name="dc:type">text/html; charset=utf-8</meta>
      </metaData>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Yahoo</form><named_entity_type>comp</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Technorati</form><named_entity_type>comp</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="070E7EB628CC943FBF90E7C6A703D9B2" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1149606759016</modifiedDate>
        <httpServer>Apache/1.3.28 (Unix) mod_gzip/1.3.26.1a PHP/4.3.10 mod_ssl/2.8.15 OpenSSL/0.9.7c</httpServer>
        <urls>
          <url>http://www.seroundtable.com/archives/003894.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Any SEO/M will tell you their job description sucks because in the process of describing exactly what they do, they nearly always watch the listener's eyes glaze over, waiting for a topic that may make better sense. Same thing with u...
      <metaData>
        <meta name="title">Officer Usability and General SEO</meta>
        <meta name="dc:date">Mon, 05 Jun 2006 11:52:34 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

  <linguisticAnalysis>
    <semantic_unit_level>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="C5E3217E0849D4E0F5C78C132B7E826D" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1146772829195</modifiedDate>
        <httpServer>Apache/1.3.28 (Unix) mod_gzip/1.3.26.1a PHP/4.3.10 mod_ssl/2.8.15 OpenSSL/0.9.7c</httpServer>
        <urls>
          <url>http://www.seroundtable.com/archives/003764.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>Yesterday, I posted at SEW blog Ask.com Second TV Blitz Stars Chief Scientist Guru, Apostolos Gerasoulis. I have now spotted the commercials that you can view for yourself at http://about.ask.com/docs/about/televisionads.shtml. Yes, ...
      <metaData>
        <meta name="title">Ask.com's New TV Commercials Sport Apostolos Gerasoulis, Ask.com's Technology Founder</meta>
        <meta name="dc:date">Thu, 04 May 2006 19:35:39 GMT</meta>
        <meta name="dc:type">text/html</meta>

t/test-data/to-split/29.xml

      <semantic_unit><named_entity><form>Rutgers University</form><named_entity_type>comp</named_entity_type></named_entity></semantic_unit>
      <semantic_unit><named_entity><form>Scient</form><named_entity_type>comp</named_entity_type></named_entity></semantic_unit>
    </semantic_unit_level>
  </linguisticAnalysis>

  </documentRecord>
<documentRecord id="F3F560D7ED8DE899CD17D9302AADE8EF" xmlns="http://alvis.info/enriched/">
    <acquisition>
      <acquisitionData>
        <modifiedDate>1147377627223</modifiedDate>
        <httpServer>Apache/1.3.28 (Unix) mod_gzip/1.3.26.1a PHP/4.3.10 mod_ssl/2.8.15 OpenSSL/0.9.7c</httpServer>
        <urls>
          <url>http://www.seroundtable.com/archives/003799.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>This morning I described what is Google Co-op, but I also promised I would try to implement an example for this site. Well, we have implemented phase one of Google Co-op subscription links for this site. You can subscribe to the coop...
      <metaData>
        <meta name="title">Dynamic Implementation of Google Co-op for Search Engine Roundtable</meta>
        <meta name="dc:date">Thu, 11 May 2006 19:35:25 GMT</meta>
        <meta name="dc:type">text/html</meta>


