DTA-CAB

Changes  view on Meta::CPAN

	* moved orig/cab.cmdi-xml back to .
	* added WebLichtWebServices.url
	* moved WebLichtWebServices.url -> WebLichtWebServices.url_old
	* fixed TCF parsing bug

v1.65 2014-12-02  moocow
	* don't let topkwrap ignore mapclass attribute in tei mode
	* TEIws format update
	  - allow #-prefixed IDs in @prev,@next attributes gracefully
	* disabled debug code
	* ignore some stuff
	* tcf tweaks: encode tei in textCorpus/textSource as schema trunk describes
	* tei-in-tcf embedding uses textSource element

v1.64 2014-11-27  moocow
	* disable cab demo debug
	* Format/JSON fix: don't output scalar references (e.g. teibufr, textbufr)
	* tcf token id fix
	* tcf sentence id fix
	* fixed TCF typos
	* always include //sentence/@ID for TCF format

v1.63 2014-11-25  moocow
	* htdocs/demo.js fixes for implicit tokenization of un-tokenized tcf
	  - effectively ignore 'tokenize' checkbox for tcf
	* clean Version.pm
	* TCF format fixes and updates
	  - improved tcf parsing using getChildrenByLocalName() instead of findnodes()
	  - added tcf tokenization if only 'text' layer is present using DTA::CAB::Format::Raw
	* ifmt is safe too
	* improved tcf parsing

v1.62 2014-11-12  moocow
	* added 'ofmt' to list of safe pass-through parameters
	* status home link: .. (for demo)
	* demo fix: disable raw text for live-mode
	* demo.js fixes for inline return
	* more tcf options
	* output format option only for upload gui
	* more tcf i/o tweaks
	* more tei/tcf and server i/o format tweaks: looks good, go live on MONDAY
	* different in- and output-formats for server, TEI, TCF format tweaks using doc->{textbufr}

v1.61 2014-10-16  moocow
	* added eval files
	* don't output sentence comments for ExpandList
	* verbose logging options
	* log-stderr typo
	* added playground/logo as symlink
	* removed old logo/ symlink ; replacing with real mccoy
	* cabx directory basically in place
	* automaton resultfst crashing
	* added logos
	* cab demo: added logo
	* added 48p logo
	* tag-hacks: added mathematical operators to 'punctuation-like' class
	* MootSub tag-tweaking hacks: avoid 'normal' tags for non-wordlike tokens

v1.60 2014-08-22  moocow
	* fixed DTA::CAB::Analyzer::_am_wordlike_regex() to allow combining diacriticals wherever [[:alpha:]] is included
	  - unicode should really call these things alphabetic, imho, but it doesn't

v1.59 2014-06-24  moocow
	* added dta 'lemma', 'lemma1' chains (with exlex)
	* sleep between stop and start actions on restart
	* allow direct demo-gui display of xml responses
	  - fixed 'pretty' parameter pass-through bug in DTA::CAB::Format::Registry::newFormat()
	  - stop tcf format complaining about missing document for spliceback (avoid garbage in apache logs)

v1.58 2014-06-16  moocow
	* added example scripts cab-curl-post.sh, cab-curl-xpost.sh
	* reapClient chost fix2
	* daemonMode=fork for DTA::CAB::Server::HTTP
	  - only for POST queries
	* xlit-http.plm : turned down logLevel
	* server status tweaks

v1.57 2014-06-13  moocow
	* added OpenThesaurus expander to dta chain (uses Analyzer::GermaNet class)
	* added OpenThesaurus expander

v1.56 2014-06-11  moocow
	* GermaNet : allow synset names as 'lemma' queries
	* apache-cgi-wrap default host = localhost
	* ExpandList/LemmaList alias fixes (no CODE refs in default formats)
	* v1.56: added ExpandList aliases LemmaList,llist,ll,lemmata,lemmas,lemma
	  + added Chain::DTA analyzers default.lemma, default.lemma1
	* added LemmaList|llist|ll|lemmata|lemmas alias for ExpandList
	  + using CODE-ref hack to extract non-root attribute moot/lemma
	  + better solution would be to polish up and use (something like) Data::ZPath

v1.55 2014-05-27  moocow
	* moved tagh-http.plm to taghx-http-9098.plm
	* eliminated 'ge|' prefix removal hack for tagh-lemmatization
	  - for compatibility with dwds-kc20 lemmatization

v1.54 2014-05-15  moocow
	* updated format docs
	* replace 'xml' with 'txml' in demo list
	* allow lowercase letters in morph tags parsed by Analyzer.pm accessor macro am_tagh_fst2moota
	  - fixes bogus VV* tags for new [roman] pseudo-analyses from dta-morph-additions

v1.53 2014-03-16  moocow
	* set default CAB_SLEEP=5
	  - try to avoid restart failures on services (Cannot bind socket 0.0.0.0 port 9099: Address already in use);
	  - but SO_REUSEADDR ought to be set - what gives?
	* don't set ReusePort, since it gives errors: "Your vendor has not defined Socket macro SO_REUSEPORT"
	* documented ExpandList
	* added csv1g formatter
	* added moot/details field: best analysis, for saving tagh analyses
	  - new moot/details should be swept by analyzeClean

v1.52 2014-01-31  moocow
	* tei: disabled debug
	* added twTokenizeClass pass-through to DTA::TokWrap
	* fixed tei rmtree() bug on multiple processes
	* apostrophe-s handling
	* v1.52: updated 'word-like' regex to include 's suffixes
	  + centralized word-like regex to DTA::CAB::Analyzer::_am_wordlike_regex()
	  + updated/unified email address to moocow@cpan.org


	* re-built logos using inkscape
	* added new compatibility symlink cab-favicon.png
	* removed old cab-favicon.png
	* added new logos
	* added caberr-64.png
	* updated cab favicon
	* MorphSafe badTypes map now maps (text=>isGood) rather than (text=>isBad)
	  - fixes bug in which badMorph heuristics were overriding a
	    __good__ entry in badTypes file (Gutherzigkeit)

v1.44 2013-07-22  moocow
	* tcf / format fixes

v1.43 2013-07-11  moocow
	* TCF format fix: reset temp variables ($pos,$lemma,$orth) between words
	* added TCF to demo formats
	* default TOKENIZE_CLASS='auto' for TEI via TokWrap
	* checkin with updated Version.pm
	* first version with TCF support
	  - how finicky do we need to be with offset-based tokens, sentences, etc?
	  - and how do we handle metadata?
	* added basic TCF format (output only atm)

v1.42 2013-06-23  moocow
	* -fc option added to dta-cab-splice-syncope.perl
	* better version check
	* TEI format debugging and tweaks
	  - can now set -fo=txmlfmt=XmlTokWrapFast for e.g. fast TEI-format input, but this slows down TEI-format output
	  - best results seem to be with -io=txmlfmt=XmlTokWrapFast -oo=XmlTokWrap for plain convert; ymmv with actual analysis going on
	* lots of debugging code
	* better TEI format debugging with e.g. -fo teilog=debug
	* removed Format::TEI debug flag
	* fixed ugly regex-slowing $POSTMATCH in CAB::Format::XmlNative::blockScanFoot()
	  - use perl 5.10 /p modifier and ${^POSTMATCH} instead

v1.41 2013-06-05  moocow
	* default xml format now resolves to tei
	* cab.perl: read dirname($0)/.htcabrc for local overrides
	* cab.perl: read cab.perl.rc
	* demo.js: fix cab_url_base guessing regex if parameters are specified
	  - e.g. http://localhost:9099/?q=foo
	* MootSub lemmatization: honor 'FM.*' tags
	* cab demo: pass through 'file' parameter
	* demo links seem to work now!
	* demo init: fix links
	* demo.js &-expansion woes
	* workaround for Unify.pm choking on REGEXPs in Format::Registry
	  - implement STORABLE_(freeze|thaw) for Format::Registry
	  - allows rollback of Unify.pm changes in r9738 (explicit
	    DS-traversal with potential cycles, caused infinite allocation
	    loop and memory explosion in 'real' CAB servers)
	* added /upload and /file paths to cab-http.plm
	* demo/upload tweaks (don't call it 'upload')
	* file upload updates
	* merged in branch htdocs-1.41-upload -r9728:9736
	* fixed YAML dispatch
	* updated demo.js: make traffic-light frame work in proxy mode
	* language guesser tests
	* wrap various YAML implementations directly in YAML.pm (rather than subclass hacks)
	* LangId::Simple: only use unicode character block hacks for words of length >= 2
	* hasmorph for text-mode output
	* updated DTAClean: added 'hasmorph' key
	* prune analyzers in cab.perl wrapper
	* dingler: try to enable autoclean
	* cab-http-9099: auto-clean on
	* trimmed cab-http-9099.plm to ignore authentication
	* updates from kaskade2 for debian/wheezy
	* lang-guesser updates: unicode hacks
	* Morph::Latin : only analyze if isLatinExt
	* Moot: use FM.$lang as tag for language-guesser hack
	* XML formatting woes
	* built in langid heuristics to Moot/Boltzmann and Moot
	* added LangId::Simple analyzer, built into DTA chain as 'langid'

v1.40 2013-04-30  moocow
	* smarter verbosity for cab-rc-update.sh
	* updated to use (my own) GermaNet::Flat API module, rather than clunky google code variant
	* added -begin and -end CODE options to dta-cab-analyze.perl
	* Format::Raw : parse underscores as word-like

v1.39 2013-04-24  moocow
	* removed xlemma stuff again
	* MootSub: generate moot/xlemma field: raw TAGH segmentation for best lemma
	* bugfix lemma(Christentum) -> Christenenum (cab lemmatizer ~e)
	* lemmatizer: rename verb inflections
	* GermaNet runs sentence-wise, in order to access moot/lemma
	  + added GermaNet::Synonyms
	  + changed GermaNet labels to:
	    - gn-syn (Synonyms)
	    - gn-isa (Hyperonyms~superclasses)
	    - gn-asi (Hyponyms~subclasses)
	  + added GermaNet analyzer option LABEL_max_depth e.g. gn-syn_max_depth for some control of resolution
	* oops: fixed multi-load of GermaNet and descendants
	* added germanet hyponyms to DTA
	* added and tested basic GermaNet relation closures
	* added GermaNet/{RelationClosure,Hyperonyms,Hyponyms}.pm
	* added Analyzer::GermaNet.pm

v1.38 2013-03-11  moocow
	* added xlist format to demo
	* ExpandList fix
	* pretty-printing for ExpandList
	* TokPP: replaced some bad [[:digit:]]* with [[:digit:]]+ regexes
	  - upshot: don't analyze empty string as CARD
	* Analyzer::Morph::Latin::CDB : use _am_xlit rather than $_->{text} as key
	  - fixes caberr bug #66980 (Phaſmate -> Faßmate != Phasmate) b/c utf8 variant isn't in latin lexicon

v1.37 2013-03-08  moocow
	* added dingler server, running on kaskade @ port 9097
	* added dingler server configs
	* fix typo
	* add FM,XY moot analyses for words with non-latin characters
	* v1.37: dmoot: leave as-is if !isLatinExt

v1.36 2013-02-22  moocow
	* syncope csv format: let "'s" be LOWERCASE_WORD (python regex compatibility hack)
	* v1.36: fixed moot bug resulting in e.g. --/NE
	  - problem was bad propagation of tokenizer (toka) tags of the form [$(] through _am_tagh_list2moota rsp _am_tagh_fst2moota

v1.35 2013-02-11  moocow
	* updated lemmatization heuristics: punish orgnames

v1.34 2013-02-05  moocow
	* format/syncope/csv: 'digit' type now includes dotted numerics
	* ignore dta-syncope-ner.*
	* remove debug code from dta-cab-convert.perl
	* Format::TEI fix: include PID in tmpdir name so parallelization works
	* morph fst: check_symbols=>0

v1.32 2012-10-04  moocow
	* fixed more tokwrap v0.37 bugs (explicit <toka> grouping now output by tokwrap)
	* fixes for dta-tokwrap v0.37
	* updated Client::HTTP docs
	* added 'ws' attribute to XmlTokWrapFast
	* got Format::TEIws working
	  + updated for dta-tokwrap v0.36

v1.31 2012-09-24  moocow
	* moved gfsmxl parameters from old setLookupOptions() API to new 'analyzePre' key for Analyzer::Automaton subclasses
	  + more flexible in general
	  + updated cab.plm to reflect changes in semantics
	  + old-style code using max_paths, max_weight, and max_ops should still work if no 'analyzePre' key is present
	* updated cab-rc-update.sh: changed source url from 'dta2012' back to 'dta'

v1.30 2012-09-18  moocow
	* content-length fixes for kaskade
	* updated demo.js, demo.html.tpl: fixes for apache-cgi-wrap/
	* added generic apache cgi wrapper dir: system/apache-cgi-wrap
	* updated CAB::Format::TEI for dta-tokwrap v0.35

v1.29 2012-09-05  moocow
	* Format::SQLite updates for almost-ready eval-corpus
	* syncope-tab alias for SynCoPe::CSV
	* another name change: now in XmlTokWrapFast
	* oops: another id->nid rename
	* syncope/ner fixes: 'id' is a bad attribute name for subsequent splice
	* syncope splice fixes
	* added dta-cab-splice-syncope.perl
	* use HYPHEN-MINUS instead of HYPHEN_MINUS for syncope csv
	* add sid,wid numeric suffixes to syncope-csv location
	* oops: mapclass was already in XmlTokWrapFast
	* added mapclass attribute to Format::XmlTokWrapFast
	* removed analyzeDebug option from Analyzer::Moot::Boltzmann
	* copy fixes for dmoot
	* empty sentence fix for moot,dmoot
	* added dmoot flag 'lctags': bash dmoot tags to lower case
	  + added moot flag 'lctext': bash text to lower-case
	  + for use with new build hmms '*.lc.(1|12|123).hmm'
	* abs() rule for TJ : level=-2 --> -text, +canonical
	* added dta-cab-eval.perl

v1.28 2012-07-23  moocow
	* SQLite changes: history now stored directly as json (TODO: move to version control)
	* improved Format/SQLite parsing -- throughput up from <100 tok/sec to >15k tok/sec
	* added CAB::Format::SQLite.pm for EvalCorpus

v1.27 2012-07-18  moocow
	* updated default.(base|type) chains in CAB/Chain/DTA.pm
	* map 'old' key to 'text' in Format::XmlTokWrap
	* v1.27: blockScan fixes for Format::XmlNative (and by inheritance Format::XmlTokWrapFast)
	  - fixes mantis bug #543 : disappearing pages
	  - this worked with negative lookahead regexes, but those crash perl on some inputs (grr....)

v1.26 2012-07-06  moocow
	* debug
	* cab-rc-update.sh: pull from dta2012/cab rather than ddc/cab
	* real new DTA-unknown-char U+FFFC (object replacement character), various bugfixes

v1.25 2012-07-04  moocow
	* cab improvements for dealing with unicode replacement character (U+FFFD) as unknown-text marker
	* workaround for blockScan() segfault: slower but works on plato
	* segfault bughunt / kaskade:
	  - dying at Format/XmlNative.pm line 146 (regex match in blockScanFoot) for
	    ddc/dta2012/build/xml_tok/campe_robinson02_1780.TEI-P5.chr.ddc.t.xml
	    in build/cab_corpus
	  - only dying under make (make -j , -blockSize don't matter)
	  - segfault backtrace:
	  0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
	  (gdb) bt
	  #0 0x00002b26f788ef77 in ?? () from /usr/lib/libperl.so.5.10
	  #1 0x00002b26f7896fd0 in ?? () from /usr/lib/libperl.so.5.10
	  #2 0x00002b26f789ad29 in Perl_regexec_flags () from /usr/lib/libperl.so.5.10
	  #3 0x00002b26f7837e76 in Perl_pp_match () from /usr/lib/libperl.so.5.10
	  #4 0x00002b26f7831392 in Perl_runops_standard () from /usr/lib/libperl.so.5.10
	  #5 0x00002b26f782c5df in perl_run () from /usr/lib/libperl.so.5.10
	  #6 0x0000000000400d0c in main ()
	* more choice stuff!
	* 'null' analyzer fix
	* add explicit 'null' analyzer (not just empty chain) to DTA
	* tei re-fix (revision 7415:7416 broke DTAQ)
	* added DTA pseudo-analyzer 'null'
	* tei fix
	* ner fix
	* added NER to DTA chain
	* moved nerec/ into tests/
	* added nerec/ test directory for syncope ne-recognition
	* added Analyzer::SynCoPe::NER : named-entity recognition via SynCoPe XML-RPC server

v1.24 2012-03-28  moocow
	* dta-cab-analyze.perl -fo option fix
	* even more msafe adaptation; use unicode class \p{Letter}
	* more msafe adaptation
	* typo fix
	* updated MorphSafe:
	  - all-non-alphabetic tokens are now considered "safe" (replaces /^[[:punct:][:digit:]]*$/ heuristic)
	* add U+A75B (r rotunda) to latin1x-safe symbols
	* added rudimentary query handling to cab demo.js, demo.html.tpl
	* improved lemmatization for XY (no lower-case bashing)
	* added canonical option to Format::TJ if level>=0
	* hack: remove ge\| prefixes in lemmatizer
	* added live javascript demo.js to taghx-http.plm
	* updated MANIFEST: remove CAB/Format/JSON/*.pm, CAB/Format/YAML/*.pm
	* fixed cab/moot bug 'nachgesucht->VVFIN'
	  - problem was inconsistency between model (uses TAGH tags for lex
	    classes e.g. VVPP2) and CAB-generated input (used translated
	    tags, VVPP2->VVPP)
	  - CAB now uses raw (tagh) tags for input and applies the tag
	    translation dict __after__ tagging (so lemmatization should still work)
	* fixed utf-8 bug in dta-cab-http-client.perl

v1.23 2012-01-17  moocow
	* sysv-ified dta-cab.sh
	* improved demo: added arbitrary user options (JSON-encoded)
	* allow non-refs in JSON input
	  + also updated demo page to use backgrounded javascript-based queries a la cab error db

v1.22 2011-12-16  moocow
	* services fixes
	  + http server response logging option (srv->{logResponse})
	* fixed "'frobble' is not a HASH reference in Format/TT.pm" bug with eqlemma as array-of-strings

v1.21 2011-12-09  moocow
	* changed undef to 'off' in cab-http.plm (avoid unification glitch)
	* fixed rmlog actions on check-ok
	* improved cab-rc-update.sh cron script
	* added caberr1, norm1 chains
	* removed local ssh keys; use id_dsa by default
	* changed default actions for cab-rc-update.sh to 'check update': no implicit restart
	* fixed JSON format bug blowing up logs e.g. on services
	* updated cab-rc-update.sh script for resources.new->resources renaming
	* rc changes (services)
	* moved resources.new/ pointers to resources/
	* moved resources.new/ -> resources/
	* removed stale resources/ dir
	* turned up CAB_SLEEP to 3 in dta-cab-server.sh: auto-restart was failing
	* cabEval fix (global %::analyzeOpts)
	* added logResponse option to cab-http.plm
	* default restartable servers
	* TEI format fixes
	* updated cab-rc-update.sh (added basic actions to command-line)
	* added and tested CAB/Analyzer/EqRW/JsonCDB.pm
	* added and tested CAB/Analyzer/EqPho/JsonCDB.pm
	* added CAB/Analyzer/EqLemma/JsonCDB : new moot-only lemma-equivalence

v1.20 2011-09-15  moocow
	* explicitly set static type keys
	* static typeKeys fixes: auto-scan on prepareLoaded()
	  + MootSub bug fix
	* lemmatizer fixes
	* updated MootSub: now basically tomasotath-compatible
	* added stringsim/testme.perl : string similarity benchmarking


