Alvis-NLPPlatform

examples/InputDocument.xml

<?xml version="1.0" encoding="UTF-8"?>
<documentCollection xmlns="http://alvis.info/enriched/" version="1.1">
<documentRecord xmlns="http://alvis.info/enriched/" id="EE84646F3CDF765B8EE759DC235DF475">
  <acquisition>
      <acquisitionData>
        <modifiedDate>2007-12-13 22:03:06</modifiedDate>
        <urls>
          <url>file://Fichiers-Test/NLPPlatform.html</url>
        </urls>
      </acquisitionData>
      <canonicalDocument>        
        <section>
          <list>
            <item>NAME</item> 
            <item>SYNOPSIS</item> 
            <item>DESCRIPTION</item> 
            <item>Linguistic annotation: requirements</item> 
            <item>METHODS</item> 
            <item>compute_dependencies()</item> 
            <item>starttimer()</item> 
            <item>endtimer()</item> 
            <item>linguistic_annotation()</item> 
            <item>standalone()</item> 
            <item>standalone_main()</item> 
            <item>client_main()</item> 
            <item>load_config()</item> 
            <item>client()</item> 
            <item>sigint_handler()</item> 
            <item>server()</item> 
            <item>disp_log()</item> 
            <item>split_to_docRecs()</item> 
            <item>sub_dir_from_id()</item> 
            <item>record_id()</item> 
            <item>delete_id()</item> 
            <item>init_server()</item> 
            <item>token_id_is_in_list_refid_token()</item> 
            <item>token_id_follows_list_refid_token()</item> 
            <item>token_id_just_before_last_of_list_refid_token()</item> 
            <item>unparseable_id()</item></list> PLATFORM CONFIGURATION DEFAULT INTEGRATED/WRAPPED NLP TOOLS 
          <list>
            <item>Named Entity Tagger</item> 
            <item>Word and sentence segmenter</item> 
            <item>Part-of-Speech Tagger</item> 
            <item>Term Tagger</item> 
            <item>Part-of-Speech specialized for Biological texts</item> 
            <item>Parser</item> 
            <item>Parser specialized for biological texts</item></list> TUNING THE NLP PLATFORM PROTOCOL SEE ALSO AUTHORS LICENSE  
          <section title="NAME">
            <section>NAME</section> 
            <section>Alvis::NLPPlatform - Perl extension for linguistically annotating XML documents in Alvis</section></section>
          <section title="SYNOPSIS">
            <section>SYNOPSIS</section> 
            <list>
              <item>Standalone mode: use Alvis::NLPPlatform; Alvis::NLPPlatform::standalone_main(\%config, $doc_xml, \*STDOUT);</item> 
              <item>Distributed mode: # Server process use Alvis::NLPPlatform; Alvis::NLPPlatform::server($rcfile); # Client process use Alvis::NLPPlatform; Alvis::NLPPlatform::client($rcfile);</item></list></section>
          <section title="DESCRIPTION">
            <section>DESCRIPTION</section> 
            <section>This module is the main part of the Alvis NLP platform. It provides overall methods for the linguistic annotation of web documents. Linguistic annotations depend on the configuration variables and dependencies between linguistic ...
            <section>Input documents are assumed to be in the ALVIS XML format ( standalone_main ) or to be loaded in a hashtable ( client_main ). The annotated document is recorded in the given descriptor ( standalone_main ) or returned as a hashtab...
            <section>The ALVIS format is described here:</section> 
            <section><ulink url="http://www.alvis.info/alvis/Architecture_2fFormats?action=show&amp;redirect=architecture%2Fformats#documents">http://www.alvis.info/alvis/Architecture_2fFormats?action=show&amp;redirect=architecture%2Fformats#document...
            <section>The DTD and XSD are provided in etc/alvis-nlpplatform.</section></section>
          <section title="Linguistic annotation: requirements">
            <section>Linguistic annotation: requirements</section> 
            <list>
              <item>Tokenization: this step has no dependency. It is required for any following annotation level.</item> 
              <item>Named Entity Tagging: this step requires tokenization.</item> 
              <item>Word segmentation: this step requires tokenization. The Named Entity Tagging step is recommended to improve the segmentation.</item> 
              <item>Sentence segmentation: this step requires tokenization. The Named Entity Tagging step is recommended to improve the segmentation.</item> 
              <item>Part-of-Speech Tagging: this step requires tokenization, and word and sentence segmentation.</item> 
              <item>Lemmatization: this step requires tokenization, word and sentence segmentation, and Part-of-Speech tagging.</item> 
              <item>Term Tagging: this step requires tokenization, word and sentence segmentation, and Part-of-Speech tagging. Lemmatization is recommended to improve the term recognition.</item> 
              <item>Parsing: this step requires tokenization, word and sentence segmentation. Term tagging is recommended to improve the parsing of noun phrases.</item> 
              <item>Semantic feature tagging: To be determined</item> 
              <item>Semantic relation tagging: To be determined</item> 
              <item>Anaphora resolution: To be determined</item></list></section>
          <section title="METHODS">
            <section>METHODS</section>  
            <section title="compute_dependencies()">
              <section>compute_dependencies()</section> compute_dependencies($hashtable_config); 
              <section>This method processes the configuration variables defining the linguistic annotation steps. $hash_config is the reference to the hashtable containing the variables defined in the configuration file. The dependencies of the ling...
            <section title="starttimer()">
              <section>starttimer()</section> starttimer() 
              <section>This method records the current date and time. It is used to measure the duration of a processing step.</section></section>
            <section title="endtimer()">
              <section>endtimer()</section> endtimer(); 
              <section>This method stops the timer and returns the duration of a processing step, measured from the time recorded by starttimer() .</section></section>
            <section title="linguistic_annotation()">
              <section>linguistic_annotation()</section> linguistic_annotation($h_config,$doc_hash); 
              <section>This method carries out the linguistic annotation according to the list of required annotations. Required annotations are defined by the configuration variables ( $hash_config is the reference to the hashtable containing the v...
              <section>The document to annotate is passed as a hash table ( $doc_hash ). The method adds annotation to this hash table.</section></section>
            <section title="standalone()">
              <section>standalone()</section> standalone($config, $HOSTNAME, $doc); 
              <section>This method is used to annotate a document in the standalone mode of the platform. The document $doc is given in the ALVIS XML format.</section> 
              <section>The reference to the hashtable $config contains the configuration variables. The variable $HOSTNAME is the host name.</section> 
              <section>The method returns the annotated document.</section></section>
            <section title="standalone_main()">
              <section>standalone_main()</section> standalone_main($hash_config, $doc_xml, \*STDOUT); 
              <section>This method is used to annotate a document in the standalone mode of the platform. The document ( $doc_xml ) is given in the ALVIS XML format.</section> 
              <section>The document is loaded into memory and then annotated according to the steps defined in the configuration variables ( $hash_config is the reference to the hashtable containing the variables defined in the configuration file). T...
              <section>The function returns the time of the XML rendering.</section></section>
            <section title="client_main()">
              <section>client_main()</section> client_main($doc_hash, $r_config); 
              <section>This method is used to annotate a document in the distributed mode of the NLP platform. The document, given in the ALVIS XML format, is already loaded into memory ( $doc_hash ).</section> 
              <section>The document is annotated according to the steps defined in the configuration variables. The annotated document is returned to the calling method.</section></section>
            <section title="load_config()">
              <section>load_config()</section> load_config($rcfile); 
              <section>The method loads the configuration of the NLP Platform by reading the configuration file given as its argument.</section></section>
            <section title="client()">
              <section>client()</section> client($rcfile) 
              <section>This is the main method for the client process. $rcfile is the name of the file containing the configuration.</section></section>
            <section title="sigint_handler()">
              <section>sigint_handler()</section> sigint_handler($signal); 
              <section>This method is used to catch the INT signal and send an ABORTING message to the server.</section></section>
            <section title="server()">
              <section>server()</section> server($rcfile) 
              <section>This is the main method for the server process. $rcfile is the name of the file containing the configuration.</section></section>
            <section title="disp_log()">
              <section>disp_log()</section> disp_log($hostname,$message); 
              <section>This method prints the message ( $message ) on the standard error output, in a formatted way:</section> 
              <section>date: (client=hostname) message</section></section>
            <section title="split_to_docRecs()">
              <section>split_to_docRecs()</section> split_to_docRecs($xml_docs); 
              <section>This method splits a list of documents into a table and returns it. Each element of the table is a two-element table containing the document id and the document.</section></section>
            <section title="sub_dir_from_id()">
              <section>sub_dir_from_id()</section> sub_dir_from_id($doc_id) 
              <section>This method returns the subdirectory where the annotated document will be stored. It computes the subdirectory from the first two characters of the document id ( $doc_id ).</section></section>
            <section title="record_id()">
              <section>record_id()</section> record_id($doc_id, $r_config); 
              <section>This method records the id of the document that has been sent to the client in the file $ALVISTMP/.proc_id .</section></section>
            <section title="delete_id()">
              <section>delete_id()</section> delete_id($doc_id,$r_config); 
              <section>This method deletes the id of the document that has been sent to the client from the file $ALVISTMP/.proc_id .</section></section>
            <section title="init_server()">
              <section>init_server()</section> init_server($r_config); 
              <section>This method initializes the server. It reads the document id from the file $ALVISTMP/.proc_id and loads the corresponding documents i.e. documents which have been annotated but not recorded due to a server crash.</section></sec...
            <section title="token_id_is_in_list_refid_token()">
              <section>token_id_is_in_list_refid_token()</section> token_id_is_in_list_refid_token($list_refid_token, $token_to_search); 
              <section>The method returns 1 if the token $token_to_search is in the list $list_refid_token , and 0 otherwise.</section></section>
            <section title="token_id_follows_list_refid_token()">
              <section>token_id_follows_list_refid_token()</section> token_id_follows_list_refid_token($list_refid_token, $token_to_search); 
              <section>The method returns 1 if the token $token_to_search is the token following the last token of the list $list_refid_token , and 0 otherwise.</section></section>
            <section title="token_id_just_before_last_of_list_refid_token()">
              <section>token_id_just_before_last_of_list_refid_token()</section> token_id_just_before_last_of_list_refid_token($list_refid_token, $token_to_search); 
              <section>The method returns 1 if the token $token_to_search is just before the first token of the list $list_refid_token , and 0 otherwise.</section></section>
            <section title="unparseable_id()">
              <section>unparseable_id()</section> unparseable_id($id) 


