DTA-CAB
* dta-cab.sh: merged changes from bogus fork in dstar/cabx/
* added cgiwrap for version
* web-howto typos
* updated 'fliegen' example in web-howto
* clean Version.pm
* WebServiceHowto updates for XmlLing
* alias tweaks
* XmlLing for server mode
* added support for TEI att.linguistic features
- new formatter Format::XmlLing (flat att.linguistic features, with optional TokWrap compatibility for later spliceback)
- new TEI and TEIws options 'att.linguistic=bool' : force use of XmlLing sub-formatter with appropriate options
- new TEI and TEIws aliases (ltei ... ling-tei-xml, lteiws ... ling-tei-ws)
- updated Format SUBCLASSES docs and examples
- still TODO: integrate new formats into CAB demo web-GUI and HOWTO
* added format XmlLing: use TEI att.linguistic attributes
v1.102 2018-06-20 moocow
* howto updates for spliced2ling
* added spliced2ling xsl stuff
* HttpProtocol.pod: added explicit 'xpost' reference
* DSGVO stuff
* clean Version.pm
* attempt to ensure Listen=SOMAXCONN for DTA::CAB::Server::HTTP::UNIX
v1.101 2018-04-13 moocow
* dta-cab-server.sh: handle tcp<->unix relay via new variables
+ added -verbose LEVEL option for debugging
+ added 'config|debug' action to view configuration variables
* system/xlit-unix.plm: test tcp relay handling by sysv-like dta-cab-server.sh
* more cab-v1.101 check tweaks
(icinga/pnp4nagios doesn't like floats in engineering notation)
* dta-cab-http-check.perl: v1.101 perfdata fixes
* status.html.tpl: compatibility fixes for transition
* added rss and exponential moving average query times to CAB status output
- implements mantis #26054
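The exponential moving average of query times mentioned for v1.101 can be illustrated with a minimal Python sketch. This is not CAB's Perl implementation: the class name `QueryTimeEMA` and the smoothing factor `alpha` are hypothetical and chosen only for illustration.

```python
class QueryTimeEMA:
    """Exponentially weighted moving average of query times (seconds)."""

    def __init__(self, alpha=0.1):
        # alpha is a hypothetical smoothing factor: larger values weight
        # recent queries more heavily.
        if not 0.0 < alpha <= 1.0:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value = None  # no observations yet

    def update(self, elapsed):
        """Fold one new query time into the average and return the result."""
        if self.value is None:
            # The first observation seeds the average.
            self.value = float(elapsed)
        else:
            self.value = self.alpha * elapsed + (1.0 - self.alpha) * self.value
        return self.value
```

An average of this kind needs only one stored number per statistic, which is presumably why it suits a status page that is polled by a monitoring system.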
v1.100 2018-03-21 moocow
* dta-cab-server.sh:
- disable watchdog by default (let icinga do this)
- use administrative lock-files to avoid concurrent operations
* minor tempfile tweaks attempting to get at mantis #25739
v1.99 2018-03-07 moocow
* wd_verbose=1 after r27799 debugging left it at 2
* dta-cab-server.sh: tweaks for process groups (UNIX socket server + socat relay)
* clean Version.pm
* UNIX process group tweaks
* dta-cab-server.sh: kill whole process group on 'stop'
* clean Version.pm
* v1.99: improved handling for pathological Server::HTTP::UNIX conditions
(stale unix socket, stale relay process)
- server now only WARNs for stale relay sockets; dodgy 'fix' for
mantis bug #25326 (should be a valid fix for identical relay
command-lines as in bug #25326)
v1.98 2018-02-21 moocow
* moot langid FM.* pseudo-tags: keep CARD analyses too
* check for undef pid_cmd() output in Server::UNIX -- avoid heinous death in File::Basename::basename()
v1.97 2018-02-12 moocow
* v1.97: peerenv() optimization for DTA::CAB::Server::HTTP::UNIX::ClientConn
- only call peerenv() for peer command 'socat'
+ support http+unix:// scheme in DTA::CAB::Client::HTTP::lwpUrl()
v1.96 2018-02-09 moocow
* check for existing rc-file
* clean Version.pm
* tweaks for implicit creation of parent directories for unix sockets
* fixed Server::HTTP::UNIX destructor code
- was killing off relay process via signal for post-on-fork destruction
* documented new UNIX socket stuff
* added support for UNIX server sockets in CAB/Client/HTTP.pm, dta-cab-http-client.perl
* DTA::CAB::Server::HTTP::UNIX seems to be working
- built-in socat relay
- emulation of peerhost() and peerport() for relayed sockets via socat EXEC:'socat - UNIX-CLIENT:/socket/path' idiom + /proc/PEERPID/environ
* removed stale t.t
* xlit-http: disable cache again
* svn:ignore cleanup on plato
* started working on Server::HTTP::UNIX (should work more or less transparently with dta-cab-http-server.perl)
v1.95 2018-01-15 moocow
* Unicode::CharName version fix
* report memory usage in kB, not pages
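The switch from pages to kB in v1.95 amounts to a unit conversion. The changelog does not say where CAB reads its page count, so the following Python sketch only shows the arithmetic; the function name `pages_to_kb` is hypothetical.

```python
import os


def pages_to_kb(pages, page_size=None):
    """Convert a memory size given in pages to kB (units of 1024 bytes)."""
    if page_size is None:
        # Bytes per page on the current host (Unix only).
        page_size = os.sysconf("SC_PAGE_SIZE")
    return pages * page_size // 1024
```

Reporting kB keeps the numbers comparable across hosts whose page sizes differ.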
v1.94 2017-11-13 moocow
* fix mantis bug #23127, introduced in v1.93
v1.93 2017-11-10 moocow
* dta-cab-analyze.perl: removed debug code
* db flags O_RDONLY fix for Dict::DBD
* don't include 'mhessen' in dmoot/morph
- if we've non-trivially normalized via dmoot, we probably don't want it
- plus, we're not sure if it's enabled anyways
* added Analyzer/Morph/Extra hacks; based on Morph/Latin/*, tested with Morph/Extra/OrtLexHessen
v1.92 2017-11-09 moocow
* *.cmdi-xml: added 'landing pages'
* added getcmdi.sh: fetch current CMDI record
* Raw::Waste utf8 handling woes
* check defined(ENV{HOME}) for Format::Raw::Waste (docker irritations)
* debugging for Format::Raw::Waste cache-clearance
* new default raw subclass=Raw::Waste; added shared model caching and auto-update to Format::Raw::Waste
* added support for environment variable DTA_CAB_FORMAT_RAW_DEFAULT_SUBCLASS
v1.91 2017-09-05 moocow
* removed stale test data cz.*
* cab-demo script cab.perl : updated target server to 194.95.188.42:9099 (data.dwds.de:9099)
* hack to allow global alternate default waste config dir (for cabx servers)
+ 'raw' input still uses default HTTP subclass
v1.90 2017-05-24 moocow
* blockscan debugging / kira
* cleaned up some debugging code
* fix optimization for Format::XmlNative::blockScanBody()
* optimization for Format::XmlNative::blockScanBody()
v1.89 2017-05-19 moocow
v1.51 2014-01-13 moocow
* Cab/Analyzer/MootSub
- fixed bug assigning lowercase lemma 'urteilen' to urteil/NN~urteil~en[VVIMP]
- CAB/Format/TT : fixed (d|m)oot analysis parsing
* TokPP/Waste: fixed again
* TokPP/Waste-related segfaults on services
* CAB/Analyzer/TokPP/Waste.pm : don't try to store annot key (avoid segfaults)
* basic redundancy handling for moot/analysis and dmoot/morph (mostly just aesthetic)
* TokPP analyzer re-factored to use Moot::Waste::Annotator by default
v1.50 2013-12-10 moocow
* dmoot fix for list-valued $w->{lang}
* new raw input modes
* improved raw-text input using moot/waste
- either locally (CAB::Format::Raw::Waste)
- or via http (CAB::Format::Raw::HTTP)
* added CAB::Format::Raw::Waste : waste tokenization
- currently only works by writing a temporary string buffer and passing to Format::TT for final document construction: UGLY
- we should probably use the waste buffer classes for this (making these visible to perl)
- better yet, this is a poster child for perl-level TokenWriter subclassing
* XmlTokWrapFast: read //w/moot/@* into $w->{moot}{$_}
v1.49 2013-12-09 moocow
* updated to v1.49
v1.48 2013-12-06 moocow
* added capsFallback automaton option; set by default for Analyzer::Morph
* cab automaton-based analyzers: set check_symbols=>0
v1.47 2013-12-05 moocow
* added system/dwds/ and system/init/dwds-http-9096.rc
* added dwds-http-9096.plm wrapper
- removed request-size limit (maxRequestSize=undef)
- disable autoclean mode
* fewer unknown-symbol warnings (once per symbol per object)
- XmlTokWrapFast: output //s/@pn
* CAB/Format/TEI: default tokenizer class back to http
* fix warning for missing content-length
* TCF: default to format level=1
* Moot:
- compatibility fix: apply tag-translation table BEFORE model lookup
* set global server maxRequestSize=512k for cab-http.plm
* added maxRequestSize key to CAB::Server::HTTP and CAB::Server::HTTP::Handler::Query
* allow TEI to support -fo=txmlfmt=XmlTokWrapFast
- 2x faster than default, but doesn't support all keys
* CAB/Chain.pm: propagate logTrace from opts if set there
v1.46 2013-10-10 moocow
* edited cab.cmdi-xml with local export (Edmund): sending to Frank
* removed bogus debug code from dta-cab-analyze.perl
* cab.plm: moot,dmoot use 'dtiger' infix instead of tiger
- centralized training source in moot-models/dta-dtiger
* Format/Raw.pm : handle U+00AD (SOFT HYPHEN)
* LangId::Simple : don't output lang_counts by default
* cab-rc-update.sh: update from kaskade
* Raw tokenizer: handle '[Formel]'
* improved LangId::Simple
- now counts number of stopword CHARACTERS (vs tokens)
- added better 'xy' rules, also added an xy 'stopword' list in
cab_automata/langid/data/xy.t
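The difference between counting stopword characters and counting stopword tokens, as described for LangId::Simple in v1.46, can be sketched in Python. This is an illustration only and not the analyzer's actual code: the stopword lists, the language codes, and the function names are all hypothetical.

```python
from collections import Counter

# Tiny hypothetical stopword lists; real lists would be far larger.
STOPWORDS = {
    "de": {"der", "die", "und", "nicht"},
    "la": {"et", "in", "est", "non"},
}


def count_stopwords(tokens, by_chars=True):
    """Return per-language stopword scores for a list of tokens."""
    scores = Counter()
    for tok in tokens:
        word = tok.lower()
        for lang, words in STOPWORDS.items():
            if word in words:
                # Character counting lets long stopwords weigh more.
                scores[lang] += len(word) if by_chars else 1
    return scores


def guess_lang(tokens, by_chars=True):
    """Return the best-scoring language, or None if no stopword matched."""
    scores = count_stopwords(tokens, by_chars)
    return max(scores, key=scores.get) if scores else None
```

With the toy lists above, `["et", "in", "nicht"]` scores as Latin when tokens are counted (2 against 1) but as German when characters are counted (5 against 4), which shows how the two schemes can disagree on short inputs.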
v1.45 2013-09-03 moocow
* CAB::Analyzer::LangId : got working again; results not very encouraging
* special handling for double-initial caps in Analyzer::Unicruft: updated version
* special handling for double-initial caps
* re-built logos using inkscape
* added new compatibility symlink cab-favicon.png
* removed old cab-favicon.png
* added new logos
* added caberr-64.png
* updated cab favicon
* MorphSafe badTypes map now maps (text=>isGood) rather than (text=>isBad)
- fixes bug in which badMorph heuristics were overriding a
__good__ entry in badTypes file (Gutherzigkeit)
v1.44 2013-07-22 moocow
* tcf / format fixes
v1.43 2013-07-11 moocow
* TCF format fix: reset temp variables ($pos,$lemma,$orth) between words
* added TCF to demo formats
* default TOKENIZE_CLASS='auto' for TEI via TokWrap
* checkin with updated Version.pm
* first version with TCF support
- how finicky do we need to be with offset-based tokens, sentences, etc?
- and how do we handle metadata?
* added basic TCF format (output only atm)
v1.42 2013-06-23 moocow
* -fc option added to dta-cab-splice-syncope.perl
* better version check
* TEI format debugging and tweaks
- can now set -fo=txmlfmt=XmlTokWrapFast for e.g. fast TEI-format input, but this slows down TEI-format output
- best results seem to be with -io=txmlfmt=XmlTokWrapFast
-oo=XmlTokWrap for plain convert; ymmv with actual analysis going on
* lots of debugging code
* better TEI format debugging with e.g. -fo teilog=debug
* removed Format::TEI debug flag
* fixed ugly regex-slowing $POSTMATCH in CAB::Format::XmlNative::blockScanFoot()
- use perl 5.10 /p modifier and ${^POSTMATCH} instead
v1.41 2013-06-05 moocow
* default xml format now resolves to tei
* cab.perl: read dirname($0)/.htcabrc for local overrides
* cab.perl: read cab.perl.rc
* demo.js: fix cab_url_base guessing regex if parameters are specified
- e.g. http://localhost:9099/?q=foo
* MootSub lemmatization: honor 'FM.*' tags
* cab demo: pass through 'file' parameter
* demo links seem to work now!
* demo init: fix links
* demo.js &-expansion woes
* workaround for Unify.pm choking on REGEXPs in Format::Registry
- implement STORABLE_(freeze|thaw) for Format::Registry
- allows rollback of Unify.pm changes in r9738 (explicit
DS-traversal with potential cycles, caused infinite allocation
loop and memory explosion in 'real' CAB servers)
* added /upload and /file paths to cab-http.plm
* demo/upload tweaks (don't call it 'upload')
* file upload updates
* merged in branch htdocs-1.41-upload -r9728:9736
* fixed YAML dispatch
* updated demo.js: make traffic-light frame work in proxy mode
* language guesser tests
* wrap various YAML implementations directly in YAML.pm (rather than subclass hacks)
* LangId::Simple: only use unicode character block hacks for words of length >= 2
* hasmorph for text-mode output
* updated DTAClean: added 'hasmorph' key
* prune analyzers in cab.perl wrapper
* dingler: try to enable autoclean
* cab-http-9099: auto-clean on
* trimmed cab-http-9099.plm to ignore authentication
* updates from kaskade2 for debian/wheezy
* lang-guesser updates: unicode hacks
* Morph::Latin : only analyze if isLatinExt
* Moot: use FM.$lang as tag for language-guesser hack
* XML formatting woes
* built in langid heuristics to Moot/Boltzmann and Moot
* added LangId::Simple analyzer, built into DTA chain as 'langid'
v1.40 2013-04-30 moocow
* smarter verbosity for cab-rc-update.sh
* updated to use (my own) GermaNet::Flat API module, rather than clunky google code variant
* added -begin and -end CODE options to dta-cab-analyze.perl
* Format::Raw : parse underscores as word-like
v1.39 2013-04-24 moocow
* removed xlemma stuff again
* MootSub: generate moot/xlemma field: raw TAGH segmentation for best lemma
* bugfix lemma(Christentum) -> Christenenum (cab lemmatizer ~e)
* lemmatizer: rename verb inflections
* GermaNet runs sentence-wise, in order to access moot/lemma
+ added GermaNet::Synonyms
+ changed GermaNet labels to:
- gn-syn (Synonyms)
- gn-isa (Hyperonyms~superclasses)
- gn-asi (Hyponyms~subclasses)
+ added GermaNet analyzer option LABEL_max_depth e.g. gn-syn_max_depth for some control of resolution
* oops: fixed multi-load of GermaNet and descendants
* added GermaNet hyponyms to DTA
* added and tested basic GermaNet relation closures
* added GermaNet/{RelationClosure,Hyperonyms,Hyponyms}.pm
* added Analyzer::GermaNet.pm
v1.38 2013-03-11 moocow
* added xlist format to demo
* ExpandList fix
* pretty-printing for ExpandList
* TokPP: replaced some bad [[:digit:]]* with [[:digit:]]+ regexes
- upshot: don't analyze empty string as CARD
* Analyzer::Morph::Latin::CDB : use _am_xlit rather than $_->{text} as key
- fixes caberr bug #66980 (Phaſmate -> Faßmate != Phasmate) b/c utf8 variant isn't in latin lexicon
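The TokPP regex fix noted for v1.38 rests on the difference between the `*` and `+` quantifiers. A minimal Python sketch, assuming a much simplified cardinal-number pattern: Python's `re` module has no POSIX `[[:digit:]]` class, so `[0-9]` stands in for it here, and the names `CARD_STAR`, `CARD_PLUS`, and `is_card` are hypothetical.

```python
import re

# Hypothetical simplifications of a "cardinal number" pattern.
CARD_STAR = re.compile(r"[0-9]*")  # zero or more digits: matches ""
CARD_PLUS = re.compile(r"[0-9]+")  # one or more digits: rejects ""


def is_card(pattern, text):
    """True if the whole of text matches the given pattern."""
    return pattern.fullmatch(text) is not None
```

With `*`, the empty string satisfies the pattern and would be tagged as a number; with `+`, at least one digit is required.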
v1.37 2013-03-08 moocow
* added dingler server, running on kaskade @ port 9097
* added dingler server configs
* fix typo
* add FM,XY moot analyses for words with non-latin characters
* v1.37: dmoot: leave as-is if !isLatinExt
v1.36 2013-02-22 moocow
* syncope csv format: let "'s" be LOWERCASE_WORD (python regex compatibility hack)
* v1.36: fixed moot bug resulting in e.g. --/NE
- problem was bad propagation of tokenizer (toka) tags of the form [$(] through _am_tagh_list2moota resp. _am_tagh_fst2moota
v1.35 2013-02-11 moocow
* updated lemmatization heuristics: punish orgnames
v1.34 2013-02-05 moocow
* format/syncope/csv: 'digit' type now includes dotted numerics
* ignore dta-syncope-ner.*
* remove debug code from dta-cab-convert.perl
* Format::TEI fix: include PID in tmpdir name so parallelization works
* morph fst: check_symbols=>0
* Format/XmlXsl gone
* removed some debug code from cab.plm
* resource changes (dta-cabopt.mak: eqphox_xocoef* -> eqp_xocoef_*)
* ignore dta-cabopt.mak
* set dta-cabopt.mak.v0
* more block-scanning tests: moving to tests/blockscan/
* added test xmlbscan.perl: try to get blockScan(), blockMerge() working for flat XML files
* got cab-analyze.perl working with new UNIX-socket based queue
- block scan & merge works with TT, TJ formats, even in -list mode
- TODO (?): extend blockScan() + blockAppend() API to other (e.g. xml-based) formats?
v1.19 2011-08-31 moocow
* revised CAB/Fork/Pool.pm to use new CAB/Queue/Server.pm rather than clunky Queue::File
- started working new Fork/Pool.pm stuff into dta-cab-analyze.perl
- continue at or around line 407 (post queue population)
* more queue tests in (increasingly poorly-named) tests/sysv
+ looks good: should be ready to integrate into command-line analyzer
* JobManager update
- todo: JobManager::Client (in JobManager.pm), update analyze script
* added CAB/Queue/JobManager.pm for block-savvy DTA::CAB::Analyze queue management
* got basic blockScan(), blockAppend() APIs in place for Format::TT
* added tt-blockscan.perl
* got dta-cab-analyze.perl working with new format semantics
+ todo: UNIX socket queue, better block handling
* got HTTP, XmlRpc server and client working with new format semantics
* updated dta-cab-(http|xmlrpc)-client.perl to use new format semantics
* removed stale dta-cab-xml-format.perl
* removed stale cachegen, compile, dict-convert scripts
* removed old YAML directory: stick to YAML::XS
* finished updating toString,toFile,toFh semantics in CAB formats
* re-working CAB::Format API: toFh(), toString()
- done formats: JSON, Null, Storable, ExpandList, TJ, Text, TT, Raw, CSV, Perl
- todo: YAML, Xml*
+ next: kludge a generic block-handling API into DTA::CAB::Format (@blocks=->block_scan(); ->block_append(,))
* re-factored CAB/Queue/(Socket|Client|Server) to CAB/Socket, CAB/Socket/UNIX, CAB/Queue/(Client|Server)
* more UNIX socket queue tests
* more tests: tests/sysv/cq(test|client).perl -- working again (it seems)
* broke things
* socket queue-server work
* more queue tests
- best candidate so far: qsrv.perl : dedicated 'master' queue server using UNIX sockets
- idea: separate scan- and process- fork-pools (like now)
- scan pool scans for block boundaries (test: blockscan.perl: use byte offsets, lengths, seek(), tell())
- process pool does actual processing
(like current dta-cab-analyze.perl, but must send data BACK to server; see qsrv.perl)
- master process maintains queue (qsrv.perl) and merges processed blocks into final output files (blockmerge.perl)
* added qtest.perl: works (single-file binary-safe message queue using flock)
* more bdb/cdb fixes
* added sysv tests: semaphores ought to work; message queues look a bit dodgy...
* added Cache::Static; moved bdb->cdb
* added Analyzer::Cache::Static sub-hierarchy
* bdb->cdb: system/cab.plm
* bdb->cdb: analyzer aliases
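The scan, process, and merge idea from the v1.19 queue notes (block boundaries as byte offsets and lengths, located with seek() and tell()) can be sketched in Python. This is a single-process illustration under stated assumptions, not the blockscan.perl or blockmerge.perl code: it assumes a line-oriented input in which a blank line ends a sentence, and the function names and the `min_size` parameter are hypothetical.

```python
def scan_blocks(path, min_size=1024):
    """Return (offset, length) pairs that together cover the whole file.

    A block ends after a blank line (assumed sentence boundary) once it
    holds at least min_size bytes.
    """
    blocks = []
    start = 0
    with open(path, "rb") as fh:
        while True:
            line = fh.readline()
            if not line:
                break
            pos = fh.tell()  # byte offset just past the line we read
            if line.strip() == b"" and pos - start >= min_size:
                blocks.append((start, pos - start))
                start = pos
        end = fh.tell()
    if end > start:
        blocks.append((start, end - start))  # trailing partial block
    return blocks


def read_block(path, offset, length):
    """Read exactly one block by seeking to its byte offset."""
    with open(path, "rb") as fh:
        fh.seek(offset)
        return fh.read(length)


def merge_blocks(path, blocks, process):
    """Process each block independently and join the results in order."""
    return b"".join(process(read_block(path, off, n)) for off, n in blocks)
```

Because each block is described only by an offset and a length, worker processes could fetch and handle blocks independently, while a master merges the results in the original order.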
v1.18 2011-08-22 moocow
* split ExLex into {BDB,CDB} subclasses: todo: replace BDB by CDB for db-based lookups (ca 25% faster)
* removed stale BDB directory
* added Format::XmlTokWrapFast : quick+dirty fast output for feeding to dtatw-xml2ddc.perl
* more fixes (short format alias 'bin' for Storable)
* kaskade fixes for big dta build
* fixed wide-character bug in tj output
* update script debugging
* added documentation to README.update
* changed alias structure in Chain::DTA (default->norm rather than norm->default)
- no functional difference
* don't start langid server by default
* README: newline at EOF
* fixed CAB_RCDIR
* cab_corpus/ build: fixes & adjustments
* fixed TJ format bug for sentence attributes
* version, analyze verbosity for spawn
* got forked block-processing working
* pre-split blocks in dta-cab-analyze.perl
v1.17 2011-08-12 moocow
* work on new system/resources/ dir (as system/resources.new)
* default update from kaskade
* added ssh keypair cab-rc-update.dsa
- pubkey must be authorized for update user on build host
* added svnignore, update script
* re-added forced lower-case for mlatin db lookups
* added watchdog links and README in old system/watchdog/ directory
* changed watchdog defaults to live in CAB_ROOT/(run|log) by default
* added cab-xlit-9099.rc for init-script debugging
+ added forkit, watchdog calls to dta-cab-server.sh (see CAB_WD_* options in dta-cab-server.sh)
+ old watchdog scripts should now be obsolete
* tt2tj fixes
* added c,b tokwrap attributes to Format::TT
* added dta-cab-convert -list option (list known formats)
* updated CAB/Format/TT : added new tokwrap/ddc attributes xr,xc,bb,pb,lb,...
* updated demo template
* typo fix
* added exlex checkbutton to demo.html.tpl
* added exlex checkbox
* TEI fixes
* runtime updates from services
* pathological fix for MootSub (undef prob)
* fixed annoying dmoot bug with temp-variable re-use in analysis closure
* startup logic fixes for watchdog-related race condition in dta-cab-http-server.perl
* added -guess option for dtatw-add-c.perl to TEI format
* TEI format tweaks & fixes
* got TEI format working with splice-back
* added format 'TEI': input from raw TEI-XML with or without //c; output as TokWrapXml
* fixed <a>-multiplication in TokWrapXml format
* dtaq optimization tests:
- looks like CAB client is the real bottleneck (1.8s cab / 2.6s total = 69% cab time for cab.sh script)
- problem doesn't seem immediately fixable
+ format is fixed by tokwrap and expected by dtaq
+ moving server to localhost shaves off some time occasionally, but not much
+ removing verbose messages gets us only a whopping 1% improvement
+ using curl instead of cab-http-client is actually slower (on kaskade)
* forking dta-cab-analyze.perl
* dta-cab-analyze.perl: fork maintenance polishing
+ added -keep , -nokeep args for queue management debugging
+ improved automatic queue deletion
+ added signal handlers for INT,HUP,TERM,ABRT to main process (aborts subprocesses)
+ changed JSON::XS utf8() flag to 0: expect and return wide strings (with utf8::is_utf8($str)==1)
* tested forks in dta-cab-analyze.perl: all seems good
* added File::Temp dependency to Makefile.PL
* more temp-related options
v1.16 2011-07-13 moocow
* more work on fork pool
- abstract queue-savvy fork pool now in CAB/Fork/Pool.pm
- uses CAB::Queue::File::Locked for queue
- some basic checking for abnormal exit status in children