AI-Categorizer

 view release on metacpan or  search on metacpan

README  view on Meta::CPAN

    A "knowledge set" is defined as a collection of documents, together with
    some information on the categories each document belongs to. Note that this
    term is somewhat unique to this project - other sources may call it a
    "training corpus", or "prior knowledge". A knowledge set also contains some
    information on how documents will be parsed and how their features (words)
    will be extracted and turned into meaningful representations. In this sense,
    a knowledge set represents not only a collection of data, but a particular
    view on that data.

    A knowledge set is encapsulated by the "AI::Categorizer::KnowledgeSet"
    class. Before you can start playing with categorizers, you will have to
    start playing with knowledge sets, so that the categorizers have some data
    to train on. See the documentation for the "AI::Categorizer::KnowledgeSet"
    module for information on its interface.

   Feature selection

    Deciding which features are the most important is a very large part of the
    categorization task - you cannot simply consider all the words in all the
    documents when training, and all the words in the document being
    categorized. There are two main reasons for this - first, it would mean that
    your training and categorizing processes would take forever and use tons of

eg/categorizer  view on Meta::CPAN

  my ($section, $stage, $do_stage) = @_;
  return unless $do_stage->{$stage};
  if (keys %$do_stage > 1) {
    print " % $0 @ARGV -$stage\n" if $c->verbose;
    die "$0 is not executable, please change its execution permissions"
      unless -x $0;
    system($0, @ARGV, "-$stage") == 0
      or die "$0 returned nonzero status, \$?=$?";
    return;
  }
  my $start = new Benchmark;
  $c->$section();
  my $end = new Benchmark;
  my $summary = timestr(timediff($end, $start));
  my ($rss, $vsz) = memory_usage();
  print "$summary (memory: rss=$rss, vsz=$vsz)\n" if $c->verbose;
  print $out_fh "Stage $stage: $summary (memory: rss=$rss, vsz=$vsz)\n" if $out_fh;
}

sub parse_command_line {
  my (%opt, %do_stage);

  while (@_) {
    if ($_[0] =~ /^-(\d+)$/) {

lib/AI/Categorizer.pm  view on Meta::CPAN

A "knowledge set" is defined as a collection of documents, together
with some information on the categories each document belongs to.
Note that this term is somewhat unique to this project - other sources
may call it a "training corpus", or "prior knowledge".  A knowledge
set also contains some information on how documents will be parsed and
how their features (words) will be extracted and turned into
meaningful representations.  In this sense, a knowledge set represents
not only a collection of data, but a particular view on that data.

A knowledge set is encapsulated by the
C<AI::Categorizer::KnowledgeSet> class.  Before you can start playing
with categorizers, you will have to start playing with knowledge sets,
so that the categorizers have some data to train on.  See the
documentation for the C<AI::Categorizer::KnowledgeSet> module for
information on its interface.

=head3 Feature selection

Deciding which features are the most important is a very large part of
the categorization task - you cannot simply consider all the words in
all the documents when training, and all the words in the document
being categorized.  There are two main reasons for this - first, it

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

  # it is a hash which includes a pair of <elementName, weight>
  my $elementWeight= $args{elementWeight};	

  # construct Handler which receive event of element, data, comment, processing_instruction
  # And convert their values into a sequence  of string and save it into buffer
  my $xmlHandler = $self->create_contained_object('xml_handler', weights => $elementWeight);

  # construct parser
  my $xmlParser= XML::SAX::ParserFactory->parser(Handler => $xmlHandler);

  # let's start parsing XML, where the methids of Handler will be called
  $xmlParser->parse_string($body);

  # extract the converted string from Handler
  $body= $xmlHandler->getContent;

  # Now, construct Document Object and return it
  return { body => $body };
}

##########################################################################

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

  # weight is times duplication of corresponding element
  # It is provided by caller(one of parameters) at construction, and
  # we must save it in order to use doing duplication at end_element
  $self->{weightHash} = $args{weights};

  # It is storage to store the data produced by Text, CDataSection and etc.
  $self->{content} = '';

  # This array is used to store the data for every element from root to the current visiting element.
  # Thus, data of 0~($levelPointer-1)th in the array is only valid.
  # The array which store the starting location(index) of the content for an element, 
  # From it, we can know all the data produced by an element at the end_element
  # It is needed at the duplication of the data produced by the specific element
  $self->{locationArray} = [];

  return $self;
}
	
# Input: None
# Output: None
# Description:
# 	it is called whenever the parser meets the document
# 	it will be called at once
#	Currently, confirm if the content buffer is an empty
sub start_document{
  my ($self, $doc)= @_;

  # The level(depth) of the last called element in XML tree
  # Calling of start_element is the preorder of the tree traversal.
  # The level is the level of current visiting element in tree.
  # the first element is 0-level
  $self->{levelPointer} = 0;

  # all data will be saved into here, initially, it is an empty
  $self->{content} = "";

  #$self->SUPER::start_document($doc);
}

# Input: None
# Output: None
# Description:
# 	it is called whenever the parser ends the document
# 	it will be called at once
#	Nothing to do
sub end_document{
  my ($self, $doc)= @_;

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN

#	Attributes	$el->{Attributes}
#	for each attribute
#		LocalName: 	$el->{LocalName}
#		NamespaceURI: 	$el->{NamespaceURI}
#		Name		$el->{Name}
#		Prefix		$el->{Prefix}
#		Value		$el->{Value}
# Output: None
# Description:
# 	it is called whenever the parser meets the element
sub start_element{
  my ($self, $el)= @_;

  # find the last location of the content
  # its meaning is to append the new data at this location
  my $location= length $self->{content};

  # save the last location of the current content
  # so that at end_element the starting location of data of this element can be known
  $self->{locationArray}[$self->{levelPointer}] = $location;

  # for the next element, increase levelPointer
  $self->{levelPointer}++;

  #$self->SUPER::start_document($el);
}

# Input: None
# Output: None
# Description:
# 	it is called whenever the parser ends the element
sub end_element{
  my ($self, $el)= @_;

  $self->{levelPointer}--;

lib/AI/Categorizer/Document/XML.pm  view on Meta::CPAN


  # 1 - dont duplicate
  if($weight == 1){
    return;
  }
  
  # n - duplicate data by n times
  # get new content
  my $newContent= substr($self->{content}, $location);

  # start to copy
  for(my $i=1; $i<$weight;$i++){
    $self->{content} .= $newContent;
  }

  #$self->SUPER::end_document($el);
}

# Input: a hash which consists of pair <Data, Value>
# Output: None
# Description:



( run in 0.750 second using v1.01-cache-2.11-cpan-0d8aa00de5b )