AI-MicroStructure
 Copyright (C) 1989 Free Software Foundation, Inc.
 51 Franklin St, Fifth Floor, Boston, MA  02110-1301  USA
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.
                            Preamble
  The license agreements of most software companies try to keep users
at the mercy of those companies.  By contrast, our General Public
License is intended to guarantee your freedom to share and change free
software--to make sure the software is free for all its users.  The
General Public License applies to the Free Software Foundation's
software and to any other program whose authors commit to using it.
You can use it for your programs, too.
  When we speak of free software, we are referring to freedom, not
price.  Specifically, the General Public License is designed to make
sure that you have the freedom to give away or sell copies of free
software, that you receive source code or can get it if you want it,
that you can change the software or use pieces of it in new free
programs; and that you know you can do these things.
  To protect your rights, we need to make restrictions that forbid
anyone to deny you these rights or to ask you to surrender the rights.
These restrictions translate to certain responsibilities for you if you
distribute copies of the software, or if you modify it.
  For example, if you distribute copies of a such a program, whether
gratis or for a fee, you must give the recipients all the rights that
you have.  You must make sure that they, too, receive or can get the
General Public License and to the absence of any warranty; and give any
other recipients of the Program a copy of this General Public License
along with the Program.  You may charge a fee for the physical act of
transferring a copy.
  2. You may modify your copy or copies of the Program or any portion of
it, and copy and distribute such modifications under the terms of Paragraph
1 above, provided that you also do the following:
    a) cause the modified files to carry prominent notices stating that
    you changed the files and the date of any change; and
    b) cause the whole of any work that you distribute or publish, that
    in whole or in part contains the Program or any part thereof, either
    with or without modifications, to be licensed at no charge to all
    third parties under the terms of this General Public License (except
    that you may choose to grant warranty protection to some or all
    third parties, at your option).
    c) If the modified program normally reads commands interactively when
    run, you must cause it, when started running for such interactive use
    in the simplest and most usual way, to print or display an
    announcement including an appropriate copyright notice and a notice
    that there is no warranty (or else, saying that you provide a
    warranty) and that users may redistribute the program under these
    conditions, and telling the user how to view a copy of this General
    Public License.
    d) You may charge a fee for the physical act of transferring a
    copy, and you may at your option offer warranty protection in
    exchange for a fee.
Mere aggregation of another independent work with the Program (or its
derivative) on a volume of a storage or distribution medium does not bring
the other work under the scope of these terms.
  3. You may copy and distribute the Program (or a portion or derivative of
it, under Paragraph 2) in object code or executable form under the terms of
Paragraphs 1 and 2 above provided that you also do one of the following:
    a) accompany it with the complete corresponding machine-readable
YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
POSSIBILITY OF SUCH DAMAGES.
                     END OF TERMS AND CONDITIONS
        Appendix: How to Apply These Terms to Your New Programs
  If you develop a new program, and you want it to be of the greatest
possible use to humanity, the best way to achieve this is to make it
free software which everyone can redistribute and change under these
terms.
  To do so, attach the following notices to the program.  It is safest to
attach them to the start of each source file to most effectively convey
the exclusion of warranty; and each file should have at least the
"copyright" line and a pointer to where the full notice is found.
    <one line to give the program's name and a brief idea of what it does.>
    Copyright (C) 19yy  <name of author>
1. You may make and give away verbatim copies of the source form of the
Standard Version of this Package without restriction, provided that you
duplicate all of the original copyright notices and associated disclaimers.
2. You may apply bug fixes, portability fixes and other modifications derived
from the Public Domain or from the Copyright Holder. A Package modified in such
a way shall still be considered the Standard Version.
3. You may otherwise modify your copy of this Package in any way, provided that
you insert a prominent notice in each changed file stating how and when you
changed that file, and provided that you do at least ONE of the following:
  a) place your modifications in the Public Domain or otherwise make them
     Freely Available, such as by posting said modifications to Usenet or an
     equivalent medium, or placing the modifications on a major archive site
     such as ftp.uu.net, or by allowing the Copyright Holder to include your
     modifications in the Standard Version of the Package.
  b) use the modified Package only within your corporation or organization.
  c) rename any non-standard executables so the names do not conflict with
bin/micro-dict view on Meta::CPAN
else
cmd=echo
fi
stop=$(perl -MAI::MicroStructure::WordBlacklist -E  "my \$s=AI::MicroStructure::WordBlacklist::getStopWords('de'); my @s = keys %\$s; print join('|',@s);")
IFS=$'\n';
$cmd $1 |   tr A-Z a-z |                # Convert to lowercase.
        tr ' ' '_' |             # Change spaces to underscores; a later tr turns them into newlines.
       #tr -cd '\012[a-z][0-9]' |   #  Get rid of everything
                                    #+ non-alphanumeric (in orig. script).
        tr -c '\012a-z'  '\012' |   #  Rather than deleting non-alpha
        egrep -v '^#' |              # Delete lines starting with hashmark.
        egrep -v "^[ ]*([A-Za-z][A-Za-z]|[A-Za-z])$" | egrep -v "^$" | egrep -v -i "^ (denkbarer|ganze|bez|veröffentlichtes|unsägliches|ungewöhnliche|vollstaendig|erstem|Inf.|titel|unsaeglichem|beforehand|denkbares|yours|contains|gedurft|seithe...
 stop=$(perl -MAI::MicroStructure::WordBlacklist -E  "my \$s=AI::MicroStructure::WordBlacklist::getStopWords('de'); my @s = keys %\$s; print join('|',@s);")
 cat /tmp/micro-dict.tmp | sort -n | egrep -v "^.*.[\ ].*.[1-9][\:][\ ][\ ]($stop)";
 #if [ !  "$(echo  "$stop" | egrep -i zzzzzzzzzzzz)" ]; then  echo cool; fi
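The snippet above filters tokens by joining the stop words into one large regex alternation. As a minimal sketch of the same idea, and not the script's actual method, the filter below matches whole lines against fixed strings instead. The name `filter_stopwords` and the five-word sample list are assumptions made for illustration; the real script obtains its list from `AI::MicroStructure::WordBlacklist::getStopWords('de')`.

```shell
#!/usr/bin/env bash
# Hypothetical sample list; the real script obtains its words from
# AI::MicroStructure::WordBlacklist::getStopWords('de').
stopwords=$(printf '%s\n' der die das und nicht)

# Drop every input line that exactly equals one of the stop words.
#   -F  treat each pattern as a fixed string, so "z.B." is not a regex
#   -x  match whole lines only, so "und" does not remove "hund"
#   -v  print the lines that do NOT match
filter_stopwords() {
    grep -F -x -v -e "$stopwords"
}

# One token per line in, stop words removed out.
printf '%s\n' der hund und die katze | filter_stopwords
```

Fixed-string matching sidesteps the escaping problems that entries such as `z.B.` or `Inf.` would cause inside a regex alternation.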
bin/micro-dict view on Meta::CPAN
function masher(){
if [ -f "$1" ]
then                                #+ valid file argument.
cmd=cat
else
cmd=echo
fi
$cmd $1 | tr A-Z a-z |                # Convert to lowercase.
        tr ' ' '\012' |             # New: change spaces to newlines.
   #    tr -cd '\012[a-z][0-9]' |   #  Get rid of everything
                                    #+ non-alphanumeric (in orig. script).
        tr -c '\012a-z'  '\012' |   #  Rather than deleting non-alpha
                                    #+ chars, change them to newlines.
        egrep -v '^#' |              # Delete lines starting with hashmark.
        egrep -v "^[ ]*([A-Za-z][A-Za-z]|[A-Za-z])$" |
        egrep -v '^$'
}
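The steps of `masher` above (lowercase, split on non-letters, drop one- and two-letter words and empty lines) can be tried on a sample string. This is a minimal sketch assuming a POSIX `tr` and `grep -E`; the function name `tokenize` and the sample sentence are hypothetical and not part of the distribution.

```shell
#!/usr/bin/env bash
# Hypothetical helper modeled on masher; not part of the distribution.
tokenize() {
    # Read the argument as a file if it names one, otherwise as literal text.
    if [ -f "$1" ]; then
        cat -- "$1"
    else
        printf '%s\n' "$1"
    fi |
        tr 'A-Z' 'a-z' |            # Convert to lowercase.
        tr -c 'a-z\n' '\n' |        # Turn every non-letter into a newline.
        grep -E -v '^[a-z]{0,2}$'   # Drop empty lines and 1-2 letter words.
}

tokenize "The quick Fox, an OX & 2 snakes"
```

With the sample input this prints `the`, `quick`, `fox` and `snakes`, one per line: `an` and `ox` are too short, and the digit and punctuation only produce empty lines.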
bin/micro-dict~ view on Meta::CPAN
cmd=cat
else
cmd=echo
fi
stop=$(perl -MAI::MicroStructure::WordBlacklist -E  "my \$s=AI::MicroStructure::WordBlacklist::getStopWords('de'); my @s = keys %\$s; print join('|',@s);")
res=$($cmd $1 |   tr A-Z a-z |                # Convert to lowercase.
        tr ' ' '_' |             # Change spaces to underscores; the next tr turns them into newlines.
        tr -c '\012a-z'  '\012' |   #  Rather than deleting non-alpha
        egrep -v "^[ ]*([A-Za-z][A-Za-z]|[A-Za-z])$" | egrep -v "^$");
             echo "$res"
bin/micro-dict~ view on Meta::CPAN
function masher(){
if [ -f "$1" ]
then                                #+ valid file argument.
cmd=cat
else
cmd=echo
fi
$cmd $1 | tr A-Z a-z |                # Convert to lowercase.
        tr ' ' '\012' |             # New: change spaces to newlines.
   #    tr -cd '\012[a-z][0-9]' |   #  Get rid of everything
                                    #+ non-alphanumeric (in orig. script).
        tr -c '\012a-z'  '\012' |   #  Rather than deleting non-alpha
                                    #+ chars, change them to newlines.
        egrep -v '^#' |              # Delete lines starting with hashmark.
        egrep -v "^[ ]*([A-Za-z][A-Za-z]|[A-Za-z])$" |
        egrep -v '^$'
}
lib/AI/MicroStructure/WordBlacklist.pm view on Meta::CPAN
use strict;
use warnings;
use Exporter;
our @ISA = qw(Exporter);
our %EXPORT_TAGS = ( 'all' => [ qw( getStopWords ) ] );
our @EXPORT_OK = ( @{ $EXPORT_TAGS{'all'} } );
sub getStopWordsSmall{
my @search = ("a","a's","able","about","above","according","accordingly","across","actually","after","afterwards","again","against","ain't","all","allow","allows","almost","alone","along","already","also","although","always","am","among","amongst","a...
return @search;
}
sub getStopWords {
if ( @_ and $_[0] eq 'UTF-8' ) {
# adding U0 causes the result to be flagged as UTF-8
my %stoplist = map { ( pack("U0a*", $_), 1 ) } qw(
a able about above according accordingly across actually after afterwards again against aint all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere a...
b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by
c came can cannot cant cant cause causes certain certainly changes clearly cmon co com come comes concerning consequently consider considering contain containing contains corresponding could couldnt course cs currently
d definitely described despite did didnt different do does doesnt doing done dont down downwards during
e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except
f far few fifth first five followed following follows for former formerly forth four from further furthermore
g get gets getting given gives go goes going gone got gotten greetings
h had hadnt happens hardly has hasnt have havent having he hello help hence her here hereafter hereby herein heres hereupon hers herself hes hi him himself his hither hopefully how howbeit however
i id ie if ignored ill im immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isnt it itd itll its its itself ive
j just k keep keeps kept know known knows
l last lately later latter latterly least less lest let lets like liked likely little look looking looks ltd
m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself
n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere
lib/AI/MicroStructure/WordBlacklist.pm view on Meta::CPAN
where why how all any both each few more most other some such
no nor not only own same so than too very a a's able about above
according accordingly across actually after afterwards again against
ain't all allow allows almost alone along already also although always
am among amongst an and another any anybody anyhow anyone anything
anyway anyways anywhere apart appear appreciate appropriate are aren't
around as aside ask asking associated at available away awfully b be
became because become becomes becoming been before beforehand behind
being believe below beside besides best better between beyond both brief
but by c c'mon c's came can can't cannot cant cause causes certain
certainly changes clearly co com come comes concerning consequently
consider considering contain containing contains corresponding could
 couldn't course currently d definitely described despite did didn't
 different do does doesn't doing don't done down downwards during e each
  edu eg eight either else elsewhere enough entirely especially et etc
  even ever every everybody everyone everything everywhere ex exactly
  example except f far few fifth first five followed following follows
  for former formerly forth four from further furthermore g get gets
  getting given gives go goes going gone got gotten greetings h had
   hadn't happens hardly has hasn't have haven't having he he's hello
   help hence her here here's hereafter hereby herein hereupon hers
lib/AI/MicroStructure/WordBlacklist.pm view on Meta::CPAN
seine seinem seinen seiner seines seit seitdem seite seiten seither selbe selben selber selbst selbstredend selbstredende selbstredendem selbstredenden selbstredender selbstredendes seltsamerweise senke senken senkt senkte senkten setzen setzt setzte...
unmaßgeblichem unmaßgeblichen unmaßgeblicher unmaßgebliches unmoeglich unmoegliche unmoeglichem unmoeglichen unmoeglicher unmoegliches unmöglich unmögliche unmöglichen unmöglicher unnötig uns unsaeglich unsaegliche unsaeglichem unsaeglichen unsaeglic...
vollends vollstaendig vollstaendige vollstaendigem vollstaendigen vollstaendiger vollstaendiges vollständig vollständige vollständigem vollständigen vollständiger vollständiges vom von vor voran vorbei vorgestern vorher vorherig vorherige vorherigem ...
würde würden während währenddessen wär wäre wären x übel über überall überallhin überaus überdies überhaupt übermorgen üblicherweise übrig übrigens z.B. zahlreich zahlreichem zahlreicher zB zb. zehn zeitweise zeitweisem zeitweisen zeitweiser ziehen z...
return \%stoplist;
}
else {
my %stoplist = map { ( $_, 1 ) } qw(
a able about above according accordingly across actually after afterwards again against aint all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywhere a...
b be became because become becomes becoming been before beforehand behind being believe below beside besides best better between beyond both brief but by
c came can cannot cant cant cause causes certain certainly changes clearly cmon co com come comes concerning consequently consider considering contain containing contains corresponding could couldnt course cs currently
d definitely described despite did didnt different do does doesnt doing done dont down downwards during
e each edu eg eight either else elsewhere enough entirely especially et etc even ever every everybody everyone everything everywhere ex exactly example except
f far few fifth first five followed following follows for former formerly forth four from further furthermore
g get gets getting given gives go goes going gone got gotten greetings
h had hadnt happens hardly has hasnt have havent having he hello help hence her here hereafter hereby herein heres hereupon hers herself hes hi him himself his hither hopefully how howbeit however
i id ie if ignored ill im immediate in inasmuch inc indeed indicate indicated indicates inner insofar instead into inward is isnt it itd itll its its itself ive
j just k keep keeps kept know known knows
l last lately later latter latterly least less lest let lets like liked likely little look looking looks ltd
m mainly many may maybe me mean meanwhile merely might more moreover most mostly much must my myself
n name namely nd near nearly necessary need needs neither never nevertheless new next nine no nobody non none noone nor normally not nothing novel now nowhere
lib/AI/MicroStructure/WordBlacklist.pm view on Meta::CPAN
i'd you'd he'd she'd we'd they'd i'll you'll he'll she'll we'll
they'll isn't aren't wasn't weren't hasn't haven't hadn't
doesn't don't didn't won't wouldn't shan't shouldn't can't
cannot couldn't mustn't let's that's who's what's here's
there's when's where's why's how's a an the and but if or
because as until while of at by for with about against between
into through during before after above below to from up down in
out on off over under again further then once here there when
where why how all any both each few more most other some such
no nor not only own same so than too very
a a's able about above according accordingly across actually after afterwards again against ain't all allow allows almost alone along already also although always am among amongst an and another any anybody anyhow anyone anything anyway anyways anywh...
qqq rrr sss ttt uuu vvv www xxx yyy zzz .... unsere ihrer uns wurde wer gegen diesem bis nur wieder unserem einer war man bei wir einen vom einem unter jeder werden wie als durch zum hat vor unseres email bel ihnen unseren bzw lieber uft kommen nicht...
anderweitigen anderweitiger anderweitiges anerkannt anerkannte anerkannter anerkanntes anfangen anfing angefangen angesetze angesetzt angesetzten angesetzter ans anscheinend ansetzen ansonst ansonsten anstatt anstelle arbeiten auch auf aufgehört aufg...
bessere besserem besseren besserer besseres bestehen besteht bestenfalls bestimmt bestimmte bestimmtem bestimmten bestimmter bestimmtes betraechtlich betraechtliche betraechtlichem betraechtlichen betraechtlicher betraechtliches betreffend betreffend...
diesseitiges diesseits dinge dir direkt direkte direkten direkter doch doppelt dort dorther dorthin dran drauf drei dreißig drin dritte drueber drum drunter drüber du dunklen durch durchaus durchweg durchwegs durfte durften dürfen dürfte eben ebenfal...
entsprechender entsprechendes entweder er ergo ergänze ergänzen ergänzte ergänzten erhalten erhielt erhielten erhält erneut erst erste erstem ersten erster erstere ersterem ersteren ersterer ersteres erstes eröffne eröffnen eröffnet eröffnete eröffne...
häufige häufigem häufigen häufiger häufigere häufigeren häufigerer häufigeres höchst höchstens ich igitt ihm ihn ihnen ihr ihre ihrem ihren ihrer ihres ihretwegen im immer immerhin immerwaehrend immerwaehrende immerwaehrendem immerwaehrenden immerwae...
jeglichen jeglicher jegliches jemals jemand jene jenem jenen jener jenes jenseitig jenseitigem jenseitiger jenseits jetzt jährig jährige jährigem jährigen jähriges kaeumlich kam kann kannst kaum kein keine keinem keinen keiner keinerlei keines keines...
naechste naemlich nahm naturgemaess naturgemaeß naturgemäss naturgemäß natürlich neben nebenan nehmen nein neu neue neuem neuen neuer neuerdings neuerlich neuerliche neuerlichem neuerlicher neuerliches neues neulich neun nicht nichts nichtsdestotrotz...
seine seinem seinen seiner seines seit seitdem seite seiten seither selbe selben selber selbst selbstredend selbstredende selbstredendem selbstredenden selbstredender selbstredendes seltsamerweise senke senken senkt senkte senkten setzen setzt setzte...
unmaßgeblichem unmaßgeblichen unmaßgeblicher unmaßgebliches unmoeglich unmoegliche unmoeglichem unmoeglichen unmoeglicher unmoegliches unmöglich unmögliche unmöglichen unmöglicher unnötig uns unsaeglich unsaegliche unsaeglichem unsaeglichen unsaeglic...
  ok($docs->{"First Document"}, "contains first document");
  ok($docs->{"Third Document"}, "contains third document");
  is(sprintf("%2.2f", $docs->{"First Document"}), 18.86, 'correct relevance on search doc #1');
  is(sprintf("%2.2f", $docs->{"Third Document"}), 35.35, 'correct relevance on search doc #3');
  ( $docs, $words ) = $cg->search('snake');
  is(sprintf("%2.2f", $docs->{"Third Document"}), 35.35, "repeating search does not change results" );
  ( $docs, $words ) = $cg->search('pony');
  #Test search starting at singleton node
  #is(sprintf("%2.2f", $docs->{"Second Document"}), '50.00', "search starting at singleton");
  # Try adding a duplicate title
  eval{ $cg->add_documents( %docs ); };
  ok(  $@ =~ /^Tried to add document with duplicate identifier:/,
     "complained about duplicate title");
  is ( $cg->doc_count(), 5, "document count is correct" );
  # Check that the word count is right
  my @words = $cg->term_list();
  is ( scalar @words, 9, "word count is correct" );
  my $flat = join '', sort @words;
  is ( $flat, 'boabullcamelconstrictoreagleelephantfoxponysnake', "word list is correct" );
  ( $docs, $words ) = $cg->search('pony');
  is(sprintf("%2.2f", $docs->{"Second Document"}), '50.00', "singleton search did not change");
  my $raw = $cg->raw_search('T:pony');
  is(sprintf("%2.2f", $raw->{"D:Second Document"}), '50.00', "raw search gives same result");
  ( $docs, $words ) = $cg->search('snake');
  is(sprintf("%2.2f", $docs->{"First Document"}), 28.36, 'result changed for non-singleton search');
  ( $docs, $words ) = $cg->find_similar('First Document');
  is(sprintf("%2.2f", $docs->{"First Document"}), 122.34, 'find similar search correct');
  is(sprintf("%2.2f", $docs->{"Fourth Document"}), 5.57, 'find similar search correct');
  # Try storing the sucker
  if ( !$XS ) {
    my $path = "Search::ContextGraph::Test::Stored";
    eval { $cg->store( $path ) };