DTA-CAB

Changes  view on Meta::CPAN

	* dta-cab.sh: merged changes from bogus fork in dstar/cabx/
	* added cgiwrap for version
	* web-howto typos
	* updated 'fliegen' example in web-howto
	* clean Version.pm
	* WebServiceHowto updates for XmlLing
	* alias tweaks
	* XmlLing for server mode
	* added support for TEI att.linguistic features
	  - new formatter Format::XmlLing (flat att.linguistic features, with optional TokWrap compatibility for later spliceback)
	  - new TEI and TEIws options 'att.linguistic=bool' : force use of XmlLing sub-formatter with appropriate options
	  - new TEI and TEIws aliases (ltei ... ling-tei-xml, lteiws ... ling-tei-ws)
	  - updated Format SUBCLASSES docs and examples
	  - still TODO: integrate new formats into CAB demo web-GUI and HOWTO
	* added format XmlLing: use TEI att.linguistic attributes

v1.102 2018-06-20  moocow
	* howto updates for spliced2ling
	* added spliced2ling xsl stuff
	* HttpProtocol.pod: added explicit 'xpost' reference
	* DSGVO stuff
	* clean Version.pm
	* attempt to ensure Listen=SOMAXCONN for DTA::CAB::Server::HTTP::UNIX

v1.101 2018-04-13  moocow
	* dta-cab-server.sh: handle tcp<->unix relay via new variables
	  + added -verbose LEVEL option for debugging
	  + added 'config|debug' action to view configuration variables
	* system/xlit-unix.plm: test tcp relay handling by sysv-like dta-cab-server.sh
	* more cab-v1.101 check tweaks
	  (icinga/pnp4nagios doesn't like floats in engineering notation)
	* dta-cab-http-check.perl: v1.101 perfdata fixes
	* status.html.tpl: compatibility fixes for transition
	* added rss and exponential moving average query times to CAB status output
	  - implements mantis #26054

v1.100 2018-03-21  moocow
	* dta-cab-server.sh:
	  - disable watchdog by default (let icinga do this)
	  - use administrative lock-files to avoid concurrent operations
	* minor tempfile tweaks attempting to get at mantis #25739

v1.99 2018-03-07  moocow
	* wd_verbose=1 after r27799 debugging left it at 2
	* dta-cab-server.sh: tweaks for process groups (UNIX socket server + socat relay)
	* clean Version.pm
	* UNIX process group tweaks
	* dta-cab-server.sh: kill whole process group on 'stop'
	* clean Version.pm
	* v1.99: improved handling for pathological Server::HTTP::UNIX conditions
	  (stale unix socket, stale relay process)
	  - server now only WARNs for stale relay sockets; dodgy 'fix' for
	    mantis bug #25326 (should be a valid fix for identical relay
	    command-lines as in bug #25326)

v1.98 2018-02-21  moocow
	* moot langid FM.* pseudo-tags: keep CARD analyses too
	* check for undef pid_cmd() output in Server::UNIX -- avoid heinous death in File::Basename::basename()

v1.97 2018-02-12  moocow
	* v1.97: peerenv() optimization for DTA::CAB::Server::HTTP::UNIX::ClientConn
	  - only call peerenv() for peer command 'socat'
	  + support http+unix:// scheme in DTA::CAB::Client::HTTP::lwpUrl()

v1.96 2018-02-09  moocow
	* check for existing rc-file
	* clean Version.pm
	* tweaks for implicit creation of parent directories for unix sockets
	* fixed Server::HTTP::UNIX destructor code
	  - was killing off relay process via signal for post-on-fork destruction
	* documented new UNIX socket stuff
	* added support for UNIX server sockets in CAB/Client/HTTP.pm, dta-cab-http-client.perl
	* DTA::CAB::Server::HTTP::UNIX seems to be working
	  - built-in socat relay
	  - emulation of peerhost() and peerport() for relayed sockets via socat EXEC:'socat - UNIX-CLIENT:/socket/path' idiom + /proc/PEERPID/environ
	* removed stale t.t
	* xlit-http: disable cache again
	* svn:ignore cleanup on plato
	* started working on Server::HTTP::UNIX (should work more or less transparently with dta-cab-http-server.perl)

v1.95 2018-01-15  moocow
	* Unicode::CharName version fix
	* report memory usage in kB, not pages

v1.94 2017-11-13  moocow
	* fix mantis bug #23127, introduced in v1.93

v1.93 2017-11-10  moocow
	* dta-cab-analyze.perl: removed debug code
	* db flags O_RDONLY fix for Dict::DBD
	* don't include 'mhessen' in dmoot/morph
	  - if we've non-trivially normalized via dmoot, we probably don't want it
	  - plus, we're not sure if it's enabled anyways
	* added Analyzer/Morph/Extra hacks; based on Morph/Latin/*, tested with Morph/Extra/OrtLexHessen

v1.92 2017-11-09  moocow
	* *.cmdi-xml: added 'landing pages'
	* added getcmdi.sh: fetch current CMDI record
	* Raw::Waste utf8 handling woes
	* check defined(ENV{HOME}) for Format::Raw::Waste (docker irritations)
	* debugging for Format::Raw::Waste cache-clearance
	* new default raw subclass=Raw::Waste; added shared model caching and auto-update to Format::Raw::Waste
	* added support for environment variable DTA_CAB_FORMAT_RAW_DEFAULT_SUBCLASS

v1.91 2017-09-05  moocow
	* removed stale test data cz.*
	* cab-demo script cab.perl : updated target server to 194.95.188.42:9099 (data.dwds.de:9099)
	* hack to allow global alternate default waste config dir (for cabx servers)
	  + 'raw' input still uses default HTTP subclass

v1.90 2017-05-24  moocow
	* blockscan debugging / kira
	* cleaned up some debugging code
	* fix optimization for Format::XmlNative::blockScanBody()
	* optimization for Format::XmlNative::blockScanBody()

v1.89 2017-05-19  moocow


v1.51 2014-01-13  moocow
	* Cab/Analyzer/MootSub
	  - fixed bug assigning lowercase lemma 'urteilen' to urteil/NN~urteil~en[VVIMP]
	  - CAB/Format/TT : fixed (d|m)oot analysis parsing
	* TokPP/Waste: fixed again
	* TokPP/Waste-related segfaults on services
	* CAB/Analyzer/TokPP/Waste.pm : don't try to store annot key (avoid segfaults)
	* basic redundancy handling for moot/analysis and dmoot/morph (mostly just aesthetic)
	* TokPP analyzer re-factored to use Moot::Waste::Annotator by default

v1.50 2013-12-10  moocow
	* dmoot fix for list-valued $w->{lang}
	* new raw input modes
	* improved raw-text input using moot/waste
	  - either locally (CAB::Format::Raw::Waste)
	  - or via http (CAB::Format::Raw::HTTP)
	* added CAB::Format::Raw::Waste : waste tokenization
	  - currently only works by writing a temporary string buffer and passing to Format::TT for final document construction: UGLY
	  - we should probably use the waste buffer classes for this (making these visible to perl)
	  - better yet, this is a poster child for perl-level TokenWriter subclassing
	* XmlTokWrapFast: read //w/moot/@* into $w->{moot}{$_}

v1.49 2013-12-09  moocow
	* updated to v1.49

v1.48 2013-12-06  moocow
	* added capsFallback automaton option; set by default for Analyzer::Morph
	* cab automaton-based analyzers: set check_symbols=>0

v1.47 2013-12-05  moocow
	* added system/dwds/ and system/init/dwds-http-9096.rc
	* added dwds-http-9096.plm wrapper
	  - removed request-size limit (maxRequestSize=undef)
	  - disable autoclean mode
	* fewer unknown-symbol warnings (once per symbol per object)
	  - XmlTokWrapFast: output //s/@pn
	* CAB/Format/TEI: default tokenizer class back to http
	* fix warning for missing content-length
	* TCF: default to format level=1
	* Moot:
	  - compatibility fix: apply tag-translation table BEFORE model lookup
	* set global server maxRequestSize=512k for cab-http.plm
	* added maxRequestSize key to CAB::Server::HTTP and CAB::Server::HTTP::Handler::Query
	* allow TEI to support -fo=txmlfmt=XmlTokWrapFast
	  - 2x faster than default, but doesn't support all keys
	* CAB/Chain.pm: propagate logTrace from opts if set there

v1.46 2013-10-10  moocow
	* edited cab.cmdi-xml with local export (Edmund): sending to Frank
	* removed bogus debug code from dta-cab-analyze.perl
	* cab.plm: moot,dmoot use 'dtiger' infix instead of tiger
	  - centralized training source in moot-models/dta-dtiger
	* Format/Raw.pm : handle U+00AD (SOFT HYPHEN)
	* LangId::Simple : don't output lang_counts by default
	* cab-rc-update.sh: update from kaskade
	* Raw tokenizer: handle '[Formel]'
	* improved LangId::Simple
	  - now counts number of stopword CHARACTERS (vs tokens)
	  - added better 'xy' rules, also added an xy 'stopword' list in
	    cab_automata/langid/data/xy.t

v1.45 2013-09-03  moocow
	* CAB::Analyzer::LangId : got working again; results not very encouraging
	* special handling for double-initial caps in Analyzer::Unicruft: updated version
	* special handling for double-initial caps
	* re-built logos using inkscape
	* added new compatibility symlink cab-favicon.png
	* removed old cab-favicon.png
	* added new logos
	* added caberr-64.png
	* updated cab favicon
	* MorphSafe badTypes map now maps (text=>isGood) rather than (text=>isBad)
	  - fixes bug in which badMorph heuristics were overriding a
	    __good__ entry in badTypes file (Gutherzigkeit)

v1.44 2013-07-22  moocow
	* tcf / format fixes

v1.43 2013-07-11  moocow
	* TCF format fix: reset temp variables ($pos,$lemma,$orth) between words
	* added TCF to demo formats
	* default TOKENIZE_CLASS='auto' for TEI via TokWrap
	* checkin with updated Version.pm
	* first version with TCF support
	  - how finicky do we need to be with offset-based tokens, sentences, etc?
	  - and how do we handle metadata?
	* added basic TCF format (output only atm)

v1.42 2013-06-23  moocow
	* -fc option added to dta-cab-splice-syncope.perl
	* better version check
	* TEI format debugging and tweaks
	  - can now set -fo=txmlfmt=XmlTokWrapFast for e.g. fast TEI-format input, but this slows down TEI-format output
	  - best results seem to be with -io=txmlfmt=XmlTokWrapFast -oo=XmlTokWrap for plain convert; ymmv with actual analysis going on
	* lots of debugging code
	* better TEI format debugging with e.g. -fo teilog=debug
	* removed Format::TEI debug flag
	* fixed ugly regex-slowing $POSTMATCH in CAB::Format::XmlNative::blockScanFoot()
	  - use perl 5.10 /p modifier and ${^POSTMATCH} instead

v1.41 2013-06-05  moocow
	* default xml format now resolves to tei
	* cab.perl: read dirname($0)/.htcabrc for local overrides
	* cab.perl: read cab.perl.rc
	* demo.js: fix cab_url_base guessing regex if parameters are specified
	  - e.g. http://localhost:9099/?q=foo
	* MootSub lemmatization: honor 'FM.*' tags
	* cab demo: pass through 'file' parameter
	* demo links seem to work now!
	* demo init: fix links
	* demo.js &-expansion woes
	* workaround for Unify.pm choking on REGEXPs in Format::Registry
	  - implement STORABLE_(freeze|thaw) for Format::Registry
	  - allows rollback of Unify.pm changes in r9738 (explicit
	    DS-traversal with potential cycles, caused infinite allocation
	    loop and memory explosion in 'real' CAB servers)
	* added /upload and /file paths to cab-http.plm
	* demo/upload tweaks (don't call it 'upload')
	* file upload updates
	* merged in branch htdocs-1.41-upload -r9728:9736
	* fixed YAML dispatch
	* updated demo.js: make traffic-light frame work in proxy mode
	* language guesser tests
	* wrap various YAML implementations directly in YAML.pm (rather than subclass hacks)
	* LangId::Simple: only use unicode character block hacks for words of length >= 2
	* hasmorph for text-mode output
	* updated DTAClean: added 'hasmorph' key
	* prune analyzers in cab.perl wrapper
	* dingler: try to enable autoclean
	* cab-http-9099: auto-clean on
	* trimmed cab-http-9099.plm to ignore authentication
	* updates from kaskade2 for debian/wheezy
	* lang-guesser updates: unicode hacks
	* Morph::Latin : only analyze if isLatinExt
	* Moot: use FM.$lang as tag for language-guesser hack
	* XML formatting woes
	* built in langid heuristics to Moot/Boltzmann and Moot
	* added LangId::Simple analyzer, built into DTA chain as 'langid'

v1.40 2013-04-30  moocow
	* smarter verbosity for cab-rc-update.sh
	* updated to use (my own) GermaNet::Flat API module, rather than clunky google code variant
	* added -begin and -end CODE options to dta-cab-analyze.perl
	* Format::Raw : parse underscores as word-like

v1.39 2013-04-24  moocow
	* removed xlemma stuff again
	* MootSub: generate moot/xlemma field: raw TAGH segmentation for best lemma
	* bugfix lemma(Christentum) -> Christenenum (cab lemmatizer ~e)
	* lemmatizer: rename verb inflections
	* GermaNet runs sentence-wise, in order to access moot/lemma
	  + added GermaNet::Synonyms
	  + changed GermaNet labels to:
	    - gn-syn (Synonyms)
	    - gn-isa (Hyperonyms~superclasses)
	    - gn-asi (Hyponyms~subclasses)
	  + added GermaNet analyzer option LABEL_max_depth e.g. gn-syn_max_depth for some control of resolution
	* oops: fixed multi-load of GermaNet and descendants
	* added germanet hyponyms to DTA
	* added and tested basic GermaNet relation closures
	* added GermaNet/{RelationClosure,Hyperonyms,Hyponyms}.pm
	* added Analyzer::GermaNet.pm

v1.38 2013-03-11  moocow
	* added xlist format to demo
	* ExpandList fix
	* pretty-printing for ExpandList
	* TokPP: replaced some bad [[:digit:]]* with [[:digit:]]+ regexes
	  - upshot: don't analyze empty string as CARD
	* Analyzer::Morph::Latin::CDB : use _am_xlit rather than $_->{text} as key
	  - fixes caberr bug #66980 (Phaſmate -> Faßmate != Phasmate) b/c utf8 variant isn't in latin lexicon

v1.37 2013-03-08  moocow
	* added dingler server, running on kaskade @ port 9097
	* added dingler server configs
	* fix typo
	* add FM,XY moot analyses for words with non-latin characters
	* v1.37: dmoot: leave as-is if !isLatinExt

v1.36 2013-02-22  moocow
	* syncope csv format: let "'s" be LOWERCASE_WORD (python regex compatibility hack)
	* v1.36: fixed moot bug resulting in e.g. --/NE
	  - problem was bad propagation of tokenizer (toka) tags of the form [$(] through _am_tagh_list2moota resp. _am_tagh_fst2moota

v1.35 2013-02-11  moocow
	* updated lemmatization heuristics: punish orgnames

v1.34 2013-02-05  moocow
	* format/syncope/csv: 'digit' type now includes dotted numerics
	* ignore dta-syncope-ner.*
	* remove debug code from dta-cab-convert.perl
	* Format::TEI fix: include PID in tmpdir name so parallelization works
	* morph fst: check_symbols=>0
	* Format/XmlXsl gone
	* removed some debug code from cab.plm
	* resource changes (dta-cabopt.mak: eqphox_xocoef* -> eqp_xocoef_*)
	* ignore dta-cabopt.mak
	* set dta-cabopt.mak.v0


	* more block-scanning tests: moving to tests/blockscan/
	* added test xmlbscan.perl: try to get blockScan(), blockMerge() working for flat XML files
	* got cab-analyze.perl working with new UNIX-socket based queue
	  - block scan & merge works with TT, TJ formats, even in -list mode
	  - TODO (?): extend blockScan() + blockAppend() API to other (e.g. xml-based) formats?

v1.19 2011-08-31  moocow
	* revised CAB/Fork/Pool.pm to use new CAB/Queue/Server.pm rather than clunky Queue::File
	  - started working new Fork/Pool.pm stuff into dta-cab-analyze.perl
	  - continue at or around line 407 (post queue population)
	* more queue tests in (increasingly poorly-named) tests/sysv
	  + looks good: should be ready to integrate into command-line analyzer
	* JobManager update
	  - todo: JobManager::Client (in JobManager.pm), update analyze script
	* added CAB/Queue/JobManager.pm for block-savvy DTA::CAB::Analyze queue management
	* got basic blockScan(), blockAppend() APIs in place for Format::TT
	* added tt-blockscan.perl
	* got dta-cab-analyze.perl working with new format semantics
	  + todo: UNIX socket queue, better block handling
	* got HTTP, XmlRpc server and client working with new format semantics
	* updated dta-cab-(http|xmlrpc)-client.perl to use new format semantics
	* removed stale dta-cab-xml-format.perl
	* removed stale cachegen, compile, dict-convert scripts
	* removed old YAML directory: stick to YAML::XS
	* finished updating toString,toFile,toFh semantics in CAB formats
	* re-working CAB::Format API: toFh(), toString()
	  - done formats: JSON, Null, Storable, ExpandList, TJ, Text, TT, Raw, CSV, Perl
	  - todo: YAML, Xml*
	    + next: kludge a generic block-handling API into DTA::CAB::Format (@blocks=->block_scan(); ->block_append(,))
	* re-factored CAB/Queue/(Socket|Client|Server) to CAB/Socket, CAB/Socket/UNIX, CAB/Queue/(Client|Server)
	* more UNIX socket queue tests
	* more tests: tests/sysv/cq(test|client).perl -- working again (it seems)
	* broke things
	* socket queue-server work
	* more queue tests
	  - best candidate so far: qsrv.perl : dedicated 'master' queue server using UNIX sockets
	  - idea: separate scan- and process- fork-pools (like now)
	  - scan pool scans for block boundaries (test: blockscan.perl: use byte offsets, lengths, seek(), tell())
	  - process pool does actual processing
	    (like current dta-cab-analyze.perl, but must send data BACK to server; see qsrv.perl)
	  - master process maintains queue (qsrv.perl) and merges processed blocks into final output files (blockmerge.perl)
	* added qtest.perl: works (single-file binary-safe message queue using flock)
	* more bdb/cdb fixes
	* added sysv tests: semaphores ought to work; message queues look a bit dodgy...
	* added Cache::Static; moved bdb->cdb
	* added Analyzer::Cache::Static sub-hierarchy
	* bdb->cdb: system/cab.plm
	* bdb->cdb: analyzer aliases

v1.18 2011-08-22  moocow
	* split ExLex into {BDB,CDB} subclasses: todo: replace BDB by CDB for db-based lookups (ca 25% faster)
	* removed stale BDB directory
	* added Format::XmlTokWrapFast : quick+dirty fast output for feeding to dtatw-xml2ddc.perl
	* more fixes (short format alias 'bin' for Storable)
	* kaskade fixes for big dta build
	* fixed wide-character bug in tj output
	* update script debugging
	* added documentation to README.update
	* changed alias structure in Chain::DTA (default->norm rather than norm->default)
	  - no functional difference
	* don't start langid server by default
	* README: newline at EOF
	* fixed CAB_RCDIR
	* cab_corpus/ build: fixes & adjustments
	* fixed TJ format bug for sentence attributes
	* version, analyze verbosity for spawn
	* got forked block-processing working
	* pre-split blocks in dta-cab-analyze.perl

v1.17 2011-08-12  moocow
	* work on new system/resources/ dir (as system/resources.new)
	* default update from kaskade
	* added ssh keypair cab-rc-update.dsa
	  - pubkey must be authorized for update user on build host
	* added svnignore, update script
	* re-added forced lower-case for mlatin db lookups
	* added watchdog links and README in old system/watchdog/ directory
	* changed watchdog defaults to live in CAB_ROOT/(run|log) by default
	* added cab-xlit-9099.rc for init-script debugging
	  + added forkit, watchdog calls to dta-cab-server.sh (see CAB_WD_* options in dta-cab-server.sh)
	  + old watchdog scripts should now be obsolete
	* tt2tj fixes
	* added c,b tokwrap attributes to Format::TT
	* added dta-cab-convert -list option (list known formats)
	* updated CAB/Format/TT : added new tokwrap/ddc attributes xr,xc,bb,pb,lb,...
	* updated demo template
	* typo fix
	* added exlex checkbutton to demo.html.tpl
	* added exlex checkbox
	* TEI fixes
	* runtime updates from services
	* pathological fix for MootSub (undef prob)
	* fixed annoying dmoot bug with temp-variable re-use in analysis closure
	* startup logic fixes for watchdog-related race condition in dta-cab-http-server.perl
	* added -guess option for dtatw-add-c.perl to TEI format
	* TEI format tweaks & fixes
	* got TEI format working with splice-back
	* added format 'TEI': input from raw TEI-XML with or without //c; output as TokWrapXml
	* fixed <a>-multiplication in TokWrapXml format
	* dtaq optimization tests:
	  - looks like CAB client is the real bottleneck (1.8s cab / 2.6s total = 69% cab time for cab.sh script)
	  - problem doesn't seem immediately fixable
	    + format is fixed by tokwrap and expected by dtaq
	    + moving server to localhost shaves off some time occasionally, but not much
	    + removing verbose messages gets us only a whopping 1% improvement
	    + using curl instead of cab-http-client is actually slower (on kaskade)
	* forking dta-cab-analyze.perl
	* dta-cab-analyze.perl: fork maintenance polishing
	  + added -keep , -nokeep args for queue management debugging
	  + improved automatic queue deletion
	  + added signal handlers for INT,HUP,TERM,ABRT to main process (aborts subprocesses)
	  + changed JSON::XS utf8() flag to 0: expect and return wide strings (with utf8::is_utf8($str)==1)
	* tested forks in dta-cab-analyze.perl: all seems good
	* added File::Temp dependency to Makefile.PL
	* more temp-related options

v1.16 2011-07-13  moocow
	* more work on fork pool
	  - abstract queue-savvy fork pool now in CAB/Fork/Pool.pm
	  - uses CAB::Queue::File::Locked for queue
	  - some basic checking for abnormal exit status in children


