DTA-CAB
* moved orig/cab.cmdi-xml back to .
* added WebLichtWebServices.url
* moved WebLichtWebServices.url -> WebLichtWebServices.url_old
* fixed TCF parsing bug
v1.65 2014-12-02 moocow
* don't let topkwrap ignore mapclass attribute in tei mode
* TEIws format update
- allow #-prefixed IDs in @prev,@next attributes gracefully
* disabled debug code
* ignore some stuff
* tcf tweaks: encode tei in textCorpus/textSource as schema trunk describes
* tei-in-tcf embedding uses textSource element
v1.64 2014-11-27 moocow
* disable cab demo debug
* Format/JSON fix: don't output scalar references (e.g. teibufr, textbufr)
* tcf token id fix
* tcf sentence id fix
* fixed TCF typos
* always include //sentence/@ID for TCF format
v1.63 2014-11-25 moocow
* htdocs/demo.js fixes for implicit tokenization of un-tokenized tcf
- effectively ignore 'tokenize' checkbox for tcf
* clean Version.pm
* TCF format fixes and updates
- improved tcf parsing using getChildrenByLocalName() instead of findnodes()
- added tcf tokenization if only 'text' layer is present using DTA::CAB::Format::Raw
* ifmt is safe too
* improved tcf parsing
v1.62 2014-11-12 moocow
* added 'ofmt' to list of safe pass-through parameters
* status home link: .. (for demo)
* demo fix: disable raw text for live-mode
* demo.js fixes for inline return
* more tcf options
* output format option only for upload gui
* more tcf i/o tweaks
* more tei/tcf and server i/o format tweaks: looks good, go live on MONDAY
* different in- and output-formats for server, TEI, TCF format tweaks using doc->{textbufr}
v1.61 2014-10-16 moocow
* added eval files
* don't output sentence comments for ExpandList
* verbose logging options
* log-stderr typo
* added playground/logo as symlink
* removed old logo/ symlink ; replacing with real mccoy
* cabx directory basically in place
* automaton resultfst crashing
* added logos
* cab demo: added logo
* added 48p logo
* tag-hacks: added mathematical operators to 'punctuation-like' class
* MootSub tag-tweaking hacks: avoid 'normal' tags for non-wordlike tokens
v1.60 2014-08-22 moocow
* fixed DTA::CAB::Analyzer::_am_wordlike_regex() to allow combining diacriticals wherever [[:alpha:]] is included
- unicode should really call these things alphabetic, imho, but it doesn't
v1.59 2014-06-24 moocow
* added dta 'lemma', 'lemma1' chains (with exlex)
* sleep between stop and start actions on restart
* allow direct demo-gui display of xml responses
- fixed 'pretty' parameter pass-through bug in DTA::CAB::Format::Registry::newFormat()
- stop tcf format complaining about missing document for spliceback (avoid garbage in apache logs)
v1.58 2014-06-16 moocow
* added example scripts cab-curl-post.sh, cab-curl-xpost.sh
* reapClient chost fix2
* daemonMode=fork for DTA::CAB::Server::HTTP
- only for POST queries
* xlit-http.plm : turned down logLevel
* server status tweaks
v1.57 2014-06-13 moocow
* added OpenThesaurus expander to dta chain (uses Analyzer::GermaNet class)
* added OpenThesaurus expander
v1.56 2014-06-11 moocow
* GermaNet : allow synset names as 'lemma' queries
* apache-cgi-wrap default host = localhost
* ExpandList/LemmaList alias fixes (no CODE refs in default formats)
* v1.56: added ExpandList aliases LemmaList,llist,ll,lemmata,lemmas,lemma
+ added Chain::DTA analyzers default.lemma, default.lemma1
* added LemmaList|llist|ll|lemmata|lemmas alias for ExpandList
+ using CODE-ref hack to extract non-root attribute moot/lemma
+ better solution would be to polish up and use (something like) Data::ZPath
v1.55 2014-05-27 moocow
* moved tagh-http.plm to taghx-http-9098.plm
* eliminated 'ge|' prefix removal hack for tagh-lemmatization
- for compatibility with dwds-kc20 lemmatization
v1.54 2014-05-15 moocow
* updated format docs
* replace 'xml' with 'txml' in demo list
* allow lowercase letters in morph tags parsed by Analyzer.pm accessor macro am_tagh_fst2moota
- fixes bogus VV* tags for new [roman] pseudo-analyses from dta-morph-additions
v1.53 2014-03-16 moocow
* set default CAB_SLEEP=5
- try to avoid restart failures on services (Cannot bind socket 0.0.0.0 port 9099: Address already in use);
- but SO_REUSEADDR ought to be set - what gives?
* don't set ReusePort, since it gives errors: "Your vendor has not defined Socket macro SO_REUSEPORT"
* documented ExpandList
* added csv1g formatter
* added moot/details field: best analysis, for saving tagh analyses
- new moot/details should be swept by analyzeClean
v1.52 2014-01-31 moocow
* tei: disabled debug
* added twTokenizeClass pass-through to DTA::TokWrap
* fixed tei rmtree() bug on multiple processes
* apostrophe-s handling
* v1.52: updated 'word-like' regex to include 's suffixes
+ centralized word-like regex to DTA::CAB::Analyzer::_am_wordlike_regex()
+ updated/unified email address to moocow@cpan.org
* re-built logos using inkscape
* added new compatibility symlink cab-favicon.png
* removed old cab-favicon.png
* added new logos
* added caberr-64.png
* updated cab favicon
* MorphSafe badTypes map now maps (text=>isGood) rather than (text=>isBad)
- fixes bug in which badMorph heuristics were overriding a
__good__ entry in badTypes file (Gutherzigkeit)
v1.44 2013-07-22 moocow
* tcf / format fixes
v1.43 2013-07-11 moocow
* TCF format fix: reset temp variables ($pos,$lemma,$orth) between words
* added TCF to demo formats
* default TOKENIZE_CLASS='auto' for TEI via TokWrap
* checkin with updated Version.pm
* first version with TCF support
- how finicky do we need to be with offset-based tokens, sentences, etc?
- and how do we handle metadata?
* added basic TCF format (output only atm)
v1.42 2013-06-23 moocow
* -fc option added to dta-cab-splice-syncope.perl
* better version check
* TEI format debugging and tweaks
- can now set -fo=txmlfmt=XmlTokWrapFast for e.g. fast TEI-format input, but this slows down TEI-format output
- best results seem to be with -io=txmlfmt=XmlTokWrapFast -oo=XmlTokWrap for plain convert; ymmv with actual analysis going on
* lots of debugging code
* better TEI format debugging with e.g. -fo teilog=debug
* removed Format::TEI debug flag
* fixed ugly regex-slowing $POSTMATCH in CAB::Format::XmlNative::blockScanFoot()
- use perl 5.10 /p modifier and ${^POSTMATCH} instead
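Note on the ${^POSTMATCH} entry above: a minimal sketch of the /p idiom, assuming perl 5.10 or later. The buffer contents and the helper name rest_after_first_close() are made up for illustration; this is not the actual blockScanFoot() code. On perls before 5.20, mentioning $POSTMATCH (or $') anywhere in a program makes every successful match save a copy of its target string, whereas /p requests that only for the match that carries the modifier.

```perl
use strict;
use warnings;
use 5.010;

# Hypothetical helper: return everything after the first "</s>" in a buffer,
# or undef if there is no such tag.
sub rest_after_first_close {
    my ($buf) = @_;
    # /p makes ${^PREMATCH}, ${^MATCH} and ${^POSTMATCH} available for
    # this match only.
    return $buf =~ m{</s>}p ? ${^POSTMATCH} : undef;
}

my $rest = rest_after_first_close("<s>foo</s><s>bar</s>");
print "rest: $rest\n";
```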
v1.41 2013-06-05 moocow
* default xml format now resolves to tei
* cab.perl: read dirname($0)/.htcabrc for local overrides
* cab.perl: read cab.perl.rc
* demo.js: fix cab_url_base guessing regex if parameters are specified
- e.g. http://localhost:9099/?q=foo
* MootSub lemmatization: honor 'FM.*' tags
* cab demo: pass through 'file' parameter
* demo links seem to work now!
* demo init: fix links
* demo.js &-expansion woes
* workaround for Unify.pm choking on REGEXPs in Format::Registry
- implement STORABLE_(freeze|thaw) for Format::Registry
- allows rollback of Unify.pm changes in r9738 (explicit
DS-traversal with potential cycles, caused infinite allocation
loop and memory explosion in 'real' CAB servers)
* added /upload and /file paths to cab-http.plm
* demo/upload tweaks (don't call it 'upload')
* file upload updates
* merged in branch htdocs-1.41-upload -r9728:9736
* fixed YAML dispatch
* updated demo.js: make traffic-light frame work in proxy mode
* language guesser tests
* wrap various YAML implementations directly in YAML.pm (rather than subclass hacks)
* LangId::Simple: only use unicode character block hacks for words of length >= 2
* hasmorph for text-mode output
* updated DTAClean: added 'hasmorph' key
* prune analyzers in cab.perl wrapper
* dingler: try to enable autoclean
* cab-http-9099: auto-clean on
* trimmed cab-http-9099.plm to ignore authentication
* updates from kaskade2 for debian/wheezy
* lang-guesser updates: unicode hacks
* Morph::Latin : only analyze if isLatinExt
* Moot: use FM.$lang as tag for language-guesser hack
* XML formatting woes
* built in langid heuristics to Moot/Boltzmann and Moot
* added LangId::Simple analyzer, built into DTA chain as 'langid'
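Note on the STORABLE_(freeze|thaw) entry above: a rough sketch of how Storable hooks can keep compiled regexes out of serialization, assuming the object can rebuild them from pattern strings. The class My::Registry and its fields are hypothetical; how Format::Registry actually implements its hooks is not shown in this changelog.

```perl
use strict;
use warnings;

package My::Registry;   # hypothetical class, for illustration only

sub new {
    my ($class, @patterns) = @_;
    return bless { src => [@patterns], re => [map { qr/$_/ } @patterns] }, $class;
}

# Storable calls this instead of walking the object, so it never
# sees the compiled regexes: only the pattern sources are stored.
sub STORABLE_freeze {
    my ($self, $cloning) = @_;
    return join("\n", @{ $self->{src} });
}

# Storable hands back an empty blessed hash plus the frozen string;
# recompile the regexes from their sources.
sub STORABLE_thaw {
    my ($self, $cloning, $serialized) = @_;
    my @src = split /\n/, $serialized;
    $self->{src} = \@src;
    $self->{re}  = [map { qr/$_/ } @src];
    return $self;
}

package main;
use Storable qw(dclone);

my $copy = dclone(My::Registry->new('^tei', '^tcf'));
print "copy still matches\n" if 'tei-xml' =~ $copy->{re}[0];
```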
v1.40 2013-04-30 moocow
* smarter verbosity for cab-rc-update.sh
* updated to use (my own) GermaNet::Flat API module, rather than clunky google code variant
* added -begin and -end CODE options to dta-cab-analyze.perl
* Format::Raw : parse underscores as word-like
v1.39 2013-04-24 moocow
* removed xlemma stuff again
* MootSub: generate moot/xlemma field: raw TAGH segmentation for best lemma
* bugfix lemma(Christentum) -> Christenenum (cab lemmatizer ~e)
* lemmatizer: rename verb inflections
* GermaNet runs sentence-wise, in order to access moot/lemma
+ added GermaNet::Synonyms
+ changed GermaNet labels to:
- gn-syn (Synonyms)
- gn-isa (Hyperonyms~superclasses)
- gn-asi (Hyponyms~subclasses)
+ added GermaNet analyzer option LABEL_max_depth e.g. gn-syn_max_depth for some control of resolution
* oops: fixed multi-load of GermaNet and descendants
* added germanet hyponyms to DTA
* added and tested basic GermaNet relation closures
* added GermaNet/{RelationClosure,Hyperonyms,Hyponyms}.pm
* added Analyzer::GermaNet.pm
v1.38 2013-03-11 moocow
* added xlist format to demo
* ExpandList fix
* pretty-printing for ExpandList
* TokPP: replaced some bad [[:digit:]]* with [[:digit:]]+ regexes
- upshot: don't analyze empty string as CARD
* Analyzer::Morph::Latin::CDB : use _am_xlit rather than $_->{text} as key
- fixes caberr bug #66980 (Phaſmate -> Faßmate != Phasmate) b/c utf8 variant isn't in latin lexicon
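Note on the TokPP [[:digit:]]* entry above: a small sketch of why the quantifier matters. The two patterns are simplified stand-ins, not the actual TokPP regexes.

```perl
use strict;
use warnings;

# Simplified stand-ins for a "cardinal number" pattern.
my $card_star = qr/^[[:digit:]]*$/;   # zero or more digits: also matches ''
my $card_plus = qr/^[[:digit:]]+$/;   # one or more digits: rejects ''

for my $tok ('', '42', 'abc') {
    printf "%-5s star=%d plus=%d\n", "'$tok'",
        ($tok =~ $card_star ? 1 : 0),
        ($tok =~ $card_plus ? 1 : 0);
}
```

With the * quantifier the empty string satisfies the pattern, which is how an empty token could end up tagged as CARD; with + it cannot.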
v1.37 2013-03-08 moocow
* added dingler server, running on kaskade @ port 9097
* added dingler server configs
* fix typo
* add FM,XY moot analyses for words with non-latin characters
* v1.37: dmoot: leave as-is if !isLatinExt
v1.36 2013-02-22 moocow
* syncope csv format: let "'s" be LOWERCASE_WORD (python regex compatibility hack)
* v1.36: fixed moot bug resulting in e.g. --/NE
- problem was bad propagation of tokenizer (toka) tags of the form [$(] through _am_tagh_list2moota resp. _am_tagh_fst2moota
v1.35 2013-02-11 moocow
* updated lemmatization heuristics: punish orgnames
v1.34 2013-02-05 moocow
* format/syncope/csv: 'digit' type now includes dotted numerics
* ignore dta-syncope-ner.*
* remove debug code from dta-cab-convert.perl
* Format::TEI fix: include PID in tmpdir name so parallelization works
* morph fst: check_symbols=>0
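Note on the Format::TEI tmpdir entry above: a minimal sketch of a per-process temporary directory name. The helper pid_tmpdir() and the 'cab_tei_demo' prefix are assumptions for illustration, not the actual Format::TEI code.

```perl
use strict;
use warnings;
use File::Spec;
use File::Path qw(make_path remove_tree);

# Hypothetical helper: build a temp directory name that includes the PID.
sub pid_tmpdir {
    my ($prefix) = @_;
    # $$ is the current process ID, so processes running in parallel
    # get distinct directory names.
    return File::Spec->catdir(File::Spec->tmpdir, "${prefix}_$$");
}

my $dir = pid_tmpdir('cab_tei_demo');
make_path($dir);
print "working in $dir\n";
remove_tree($dir);   # each process removes only its own directory
```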
v1.32 2012-10-04 moocow
* fixed more tokwrap v0.37 bugs (explicit <toka> grouping now output by tokwrap)
* fixes for dta-tokwrap v0.37
* updated Client::HTTP docs
* added 'ws' attribute to XmlTokWrapFast
* got Format::TEIws working
+ updated for dta-tokwrap v0.36
v1.31 2012-09-24 moocow
* moved gfsmxl parameters from old setLookupOptions() API to new 'analyzePre' key for Analyzer::Automaton subclasses
+ more flexible in general
+ updated cab.plm to reflect changes in semantics
+ old-style code using max_paths, max_weight, and max_ops should still work if no 'analyzePre' key is present
* updated cab-rc-update.sh: changed source url from 'dta2012' back to 'dta'
v1.30 2012-09-18 moocow
* content-length fixes for kaskade
* updated demo.js, demo.html.tpl: fixes for apache-cgi-wrap/
* added generic apache cgi wrapper dir: system/apache-cgi-wrap
* updated CAB::Format::TEI for dta-tokwrap v0.35
v1.29 2012-09-05 moocow
* Format::SQLite updates for almost-ready eval-corpus
* syncope-tab alias for SynCoPe::CSV
* another name change: now in XmlTokWrapFast
* oops: another id->nid rename
* syncope/ner fixes: 'id' is a bad attribute name for subsequent splice
* syncope splice fixes
* added dta-cab-splice-syncope.perl
* use HYPHEN-MINUS instead of HYPHEN_MINUS for syncope csv
* add sid,wid numeric suffixes to syncope-csv location
* oops: mapclass was already in XmlTokWrapFast
* added mapclass attribute to Format::XmlTokWrapFast
* removed analyzeDebug option from Analyzer::Moot::Boltzmann
* copy fixes for dmoot
* empty sentence fix for moot,dmoot
* added dmoot flag 'lctags': bash dmoot tags to lower case
+ added moot flag 'lctext': bash text to lower-case
+ for use with new build hmms '*.lc.(1|12|123).hmm'
* abs() rule for TJ : level=-2 --> -text, +canonical
* added dta-cab-eval.perl
v1.28 2012-07-23 moocow
* SQLite changes: history now stored directly as json (TODO: move to version control)
* improved Format/SQLite parsing -- throughput up from <100 tok/sec to >15k tok/sec
* added CAB::Format::SQLite.pm for EvalCorpus
v1.27 2012-07-18 moocow
* updated default.(base|type) chains in CAB/Chain/DTA.pm
* map 'old' key to 'text' in Format::XmlTokWrap
* v1.27: blockScan fixes for Format::XmlNative (and by inheritance Format::XmlTokWrapFast)
- fixes mantis bug #543 : disappearing pages
- this worked with negative lookahead regexes, but those crash perl on some inputs (grr....)
v1.26 2012-07-06 moocow
* debug
* cab-rc-update.sh: pull from dta2012/cab rather than ddc/cab
* real new DTA-unknown-char U+FFFC (object replacement character), various bugfixes
v1.25 2012-07-04 moocow
* cab improvements for dealing with unicode replacement character (U+FFFD) as unknown-text marker
* workaround for blockScan() segfault: slower but works on plato
* segfault bughunt / kaskade:
- dying at Format/XmlNative.pm line 146 (regex match in blockScanFoot) for
ddc/dta2012/build/xml_tok/campe_robinson02_1780.TEI-P5.chr.ddc.t.xml
in build/cab_corpus
- only dying under make (make -j , -blockSize don't matter)
- segfault backtrace:
0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
(gdb) bt
#0 0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
#1 0x00002b26f7896fd0 in ?? () from /usr/lib/libperl.so.5.10
#2 0x00002b26f789ad29 in Perl_regexec_flags () from /usr/lib/libperl.so.5.10
#3 0x00002b26f7837e76 in Perl_pp_match () from /usr/lib/libperl.so.5.10
#4 0x00002b26f7831392 in Perl_runops_standard () from /usr/lib/libperl.so.5.10
#5 0x00002b26f782c5df in perl_run () from /usr/lib/libperl.so.5.10
#6 0x0000000000400d0c in main ()
* more choice stuff!
* 'null' analyzer fix
* add explicit 'null' analyzer (not just empty chain) to DTA
* tei re-fix (revision 7415:7416 broke DTAQ)
* added DTA pseudo-analyzer 'null'
* tei fix
* ner fix
* added NER to DTA chain
* moved nerec/ into tests/
* added nerec/ test directory for syncope ne-recognition
* added Analyzer::SynCoPe::NER : named-entity recognition via SynCoPe XML-RPC server
v1.24 2012-03-28 moocow
* dta-cab-analyze.perl -fo option fix
* even more msafe adaptation; use unicode class \p{Letter}
* more msafe adaptation
* typo fix
* updated MorphSafe:
- all-non-alphabetic tokens are now considered "safe" (replaces /^[[:punct:][:digit:]]*$/ heuristic)
* add U+A75B (r rotunda) to latin1x-safe symbols
* added rudimentary query handling to cab demo.js, demo.html.tpl
* improved lemmatization for XY (no lower-case bashing)
* added canonical option to Format::TJ if level>=0
* hack: remove ge\| prefixes in lemmatizer
* added live javascript demo.js to taghx-http.plm
* updated MANIFEST: remove CAB/Format/JSON/*.pm, CAB/Format/YAML/*.pm
* fixed cab/moot bug 'nachgesucht->VVFIN'
- problem was inconsistency between model (uses TAGH tags for lex
classes e.g. VVPP2) and CAB-generated input (used translated
tags, VVPP2->VVPP)
- CAB now uses raw (tagh) tags for input and applies the tag
translation dict __after__ tagging (so lemmatization should still work)
* fixed utf-8 bug in dta-cab-http-client.perl
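Note on the MorphSafe entries above: a sketch comparing the old punctuation-or-digit pattern with a "no letters at all" check based on \p{Letter}. The function no_letters() is a hypothetical stand-in; the real MorphSafe logic may differ.

```perl
use strict;
use warnings;

# Old heuristic, as quoted in the entry above.
my $old_safe = qr/^[[:punct:][:digit:]]*$/;

# Hypothetical stand-in for the new rule: a token counts as "safe"
# if it contains no Unicode letter at all.
sub no_letters {
    my ($tok) = @_;
    return $tok !~ /\p{Letter}/ ? 1 : 0;
}

# "\x{2192}" is RIGHTWARDS ARROW, a non-ASCII math symbol.
for my $tok ('42', '...', "\x{2192}", 'Haus') {
    my $codepoints = join '.', map { sprintf '%04X', ord($_) } split //, $tok;
    printf "U+%s old=%d new=%d\n", $codepoints,
        ($tok =~ $old_safe ? 1 : 0),
        no_letters($tok);
}
```

A symbol-only token such as the arrow fails the old pattern, since it is neither a digit nor covered by [[:punct:]], but passes the no-letters check.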
v1.23 2012-01-17 moocow
* sysv-ified dta-cab.sh
* improved demo: added arbitrary user options (JSON-encoded)
* allow non-refs in JSON input
+ also updated demo page to use backgrounded javascript-based queries a la cab error db
v1.22 2011-12-16 moocow
* services fixes
+ http server response logging option (srv->{logResponse})
* fixed "'frobble' is not a HASH reference in Format/TT.pm" bug with eqlemma as array-of-strings
v1.21 2011-12-09 moocow
* changed undef to 'off' in cab-http.plm (avoid unification glitch)
* fixed rmlog actions on check-ok
* improved cab-rc-update.sh cron script
* added caberr1, norm1 chains
* removed local ssh keys; use id_dsa by default
* changed default actions for cab-rc-update.sh to 'check update': no implicit restart
* fixed JSON format bug blowing up logs e.g. on services
* updated cab-rc-update.sh script for resources.new->resources renaming
* rc changes (services)
* moved resources.new/ pointers to resources/
* moved resources.new/ -> resources/
* removed stale resources/ dir
* turned up CAB_SLEEP to 3 in dta-cab-server.sh: auto-restart was failing
* cabEval fix (global %::analyzeOpts)
* added logResponse option to cab-http.plm
* default restartable servers
* TEI format fixes
* updated cab-rc-update.sh (added basic actions to command-line)
* added and tested CAB/Analyzer/EqRW/JsonCDB.pm
* added and tested CAB/Analyzer/EqPho/JsonCDB.pm
* added CAB/Analyzer/EqLemma/JsonCDB : new moot-only lemma-equivalence
v1.20 2011-09-15 moocow
* explicitly set static type keys
* static typeKeys fixes: auto-scan on prepareLoaded()
+ MootSub bug fix
* lemmatizer fixes
* updated MootSub: now basically tomasotath-compatible
* added stringsim/testme.perl : string similarity benchmarking