IO-Compress-Brotli


brotli/tests/testdata/lcet10.txt  view on Meta::CPAN

humanities scholars is not new.  In fact, it has been underway for more
than forty years in the humanities, since Father Roberto Busa began
developing an electronic concordance of the works of Saint Thomas Aquinas
in 1949.  What we are witnessing today, MICHELSON contended, is not the
beginning of this on-line transition but, for at least some humanities
scholars, the turning point in the transition from a print to an
electronic working context.  Coinciding with the on-line transition, the
second striking change is the extent to which research and education
networks are becoming the new medium of scholarly communication.  The
existing Internet and the pending National Education and Research Network
(NREN) represent the new meeting ground where scholars are going for
bibliographic information, scholarly dialogue and feedback, the most
current publications in their field, and high-level educational
offerings.  Traditional scholarly practices are undergoing tremendous
transformations as a result of the emergence and growing prominence of
what is called network-mediated scholarship.

MICHELSON next turned to the second element of the framework she proposed
at the outset of her talk for evaluating the prospects for electronic
text, namely the key information technology trends affecting the conduct
of scholarly communication over the next decade:  1) end-user computing
and 2) connectivity.

End-user computing means that the person touching the keyboard, or
performing computations, is the same as the person who initiates or
consumes the computation.  The emergence of personal computers, along
with a host of other forces, such as ubiquitous computing, advances in
interface design, and the on-line transition, is prompting the consumers
of computation to do their own computing, and is thus rendering obsolete
the traditional distinction between end users and ultimate users.

The trend toward end-user computing is significant to consideration of
the prospects for electronic texts because it means that researchers are
becoming more adept at doing their own computations and, thus, more
competent in the use of electronic media.  By avoiding programmer
intermediaries, computation is becoming central to the researcher's
thought process.  This direct involvement in computing is changing the
researcher's perspective on the nature of research itself, that is, the
kinds of questions that can be posed, the analytical methodologies that
can be used, the types and amount of sources that are appropriate for
analyses, and the form in which findings are presented.  The trend toward
end-user computing means that, increasingly, electronic media and
computation are being infused into all processes of humanities
scholarship, inspiring remarkable transformations in scholarly
communication.

The trend toward greater connectivity suggests that researchers are using
computation increasingly in network environments.  Connectivity is
important to scholarship because it erases the distance that separates
students from teachers and scholars from their colleagues, while allowing
users to access remote databases, share information in many different
media, connect to their working context wherever they are, and
collaborate in all phases of research.

The combination of the trend toward end-user computing and the trend
toward connectivity suggests that the scholarly use of electronic
resources, already evident among some researchers, will soon become an
established feature of scholarship.  The effects of these trends, along
with ongoing changes in scholarly practices, point to a future in which
humanities researchers will use computation and electronic communication
to help them formulate ideas, access sources, perform research,
collaborate with colleagues, seek peer review, publish and disseminate
results, and engage in many other professional and educational activities.

In summary, MICHELSON emphasized four points:  1) A portion of humanities
scholars already consider electronic texts the preferred format for
analysis and dissemination.  2) Scholars are using these electronic
texts, in conjunction with other electronic resources, in all the
processes of scholarly communication.  3) The humanities scholars'
working context is in the process of changing from print technology to
electronic technology, in many ways mirroring transformations that have
occurred or are occurring within the scientific community.  4) These
changes are occurring in conjunction with the development of a new
communication medium:  research and education networks that are
characterized by their capacity to advance scholarship in a wholly unique
way.

MICHELSON also reiterated her three principal arguments:  1) Electronic
texts are best understood in terms of the relationship to other
electronic resources and the growing prominence of network-mediated
scholarship.  2) The prospects for electronic texts lie in their capacity
to be integrated into the on-line network of electronic resources that
comprise the new working context for scholars.  3) Retrospective conversion
of portions of the scholarly record should be a key strategy as information
providers respond to changes in scholarly communication practices.

                                 ******

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
VECCIA * AM's evaluation project and public users of electronic resources
* AM and its design * Site selection and evaluating the Macintosh
implementation of AM * Characteristics of the six public libraries
selected * Characteristics of AM's users in these libraries * Principal
ways AM is being used *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Susan VECCIA, team leader, and Joanne FREEMAN, associate coordinator,
American Memory, Library of Congress, gave a joint presentation.  First,
by way of introduction, VECCIA explained her and FREEMAN's roles in
American Memory (AM).  Serving principally as an observer, VECCIA has
assisted with the evaluation project of AM, placing AM collections in a
variety of different sites around the country and helping to organize and
implement that project.  FREEMAN has been an associate coordinator of AM
and has been involved principally with the interpretative materials,
preparing some of the electronic exhibits and printed historical
information that accompanies AM and that is requested by users.  VECCIA
and FREEMAN shared anecdotal observations concerning AM with public users
of electronic resources.  Notwithstanding a fairly structured evaluation
in progress, both VECCIA and FREEMAN chose not to report on specifics in
terms of numbers, etc., because they felt it was too early in the
evaluation project to do so.

AM is an electronic archive of primary source materials from the Library
of Congress, selected collections representing a variety of formats--
photographs, graphic arts, recorded sound, motion pictures, broadsides,
and soon, pamphlets and books.  In terms of the design of this system,
the interpretative exhibits have been kept separate from the primary
resources, with good reason.  Accompanying this collection are printed
documentation and user guides, as well as guides that FREEMAN prepared for
teachers so that they may begin using the content of the system at once.



William HOOTON, vice president of operations, I-NET, moderated this session.

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
KENNEY * Factors influencing development of CXP * Advantages of using
digital technology versus photocopy and microfilm * A primary goal of
CXP; publishing challenges * Characteristics of copies printed * Quality
of samples achieved in image capture * Several factors to be considered
in choosing scanning * Emphasis of CXP on timely and cost-effective
production of black-and-white printed facsimiles * Results of producing
microfilm from digital files * Advantages of creating microfilm * Details
concerning production * Costs * Role of digital technology in library
preservation *
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Anne KENNEY, associate director, Department of Preservation and
Conservation, Cornell University, opened her talk by observing that the
Cornell Xerox Project (CXP) has been guided by the assumption that the
ability to produce printed facsimiles or to replace paper with paper
would be important, at least for the present generation of users and
equipment.  She described three factors that influenced development of
the project:  1) Because the project has emphasized the preservation of
deteriorating brittle books, the quality of what was produced had to be
sufficiently high to return a paper replacement to the shelf.  2) CXP was
interested only in a system that was cost-effective, which meant that it
had to be cost-competitive with the processes currently available,
principally photocopy and microfilm.  3) CXP was likewise interested only
in new or currently available product hardware and software.

KENNEY described the advantages that using digital technology offers over
both photocopy and microfilm:  1) The potential exists to create a higher
quality reproduction of a deteriorating original than conventional
light-lens technology.  2) Because a digital image is an encoded
representation, it can be reproduced again and again with no resulting
loss of quality, as opposed to the situation with light-lens processes,
in which there is a discernible difference between a second and a
subsequent generation of an image.  3) A digital image can be manipulated
in a number of ways to improve image capture; for example, Xerox has
developed a windowing application that enables one to capture a page
containing both text and illustrations in a manner that optimizes the
reproduction of both.  (With light-lens technology, one must choose which
to optimize, text or the illustration; in preservation microfilming, the
current practice is to shoot an illustrated page twice, once to highlight
the text and the second time to provide the best capture for the
illustration.)  4) A digital image can also be edited, with density levels
adjusted to remove underlining and stains and to increase legibility for
faint documents.  5) On-screen inspection can take place at the time of
initial setup and adjustments made prior to scanning, factors that
substantially reduce the number of retakes required in quality control.

A primary goal of CXP has been to evaluate the paper output printed on
the Xerox DocuTech, a high-speed printer that produces 600-dpi pages from
scanned images at a rate of 135 pages a minute.  KENNEY recounted several
publishing challenges in producing faithful and legible reproductions of
the originals, challenges that the 600-dpi copy for the most part
successfully met.  For example, many of the deteriorating volumes in the
project were heavily illustrated with fine line drawings or halftones or
came in languages such as Japanese, in which the buildup of characters
composed of varying strokes is difficult to reproduce at lower resolutions; a
surprising number of them came with annotations and mathematical
formulas, which it was critical to be able to duplicate exactly.

KENNEY noted that 1) the copies are being printed on paper that meets the
ANSI standards for performance, 2) the DocuTech printer meets the machine
and toner requirements for proper adhesion of print to page, as described
by the National Archives, and thus 3) the paper product is considered to be
the archival equivalent of preservation photocopy.

KENNEY then discussed several samples of the quality achieved in the
project that had been distributed in a handout, for example, a copy of a
print-on-demand version of the 1911 Reed lecture on the steam turbine,
which contains halftones, line drawings, and illustrations embedded in
text; the first four loose pages in the volume compared the capture
capabilities of scanning to photocopy for a standard test target, the
IEEE standard 167A 1987 test chart.  In all instances scanning proved
superior to photocopy, though only slightly more so in one.

Conceding the simplistic nature of her comparison of the quality of
scanning with that of photocopy, KENNEY described it as one representation
of the kinds of
settings that could be used with scanning capabilities on the equipment
CXP uses.  KENNEY also pointed out that CXP investigated the quality
achieved with binary scanning only, and noted the great promise in gray
scale and color scanning, whose advantages and disadvantages need to be
examined.  She argued further that scanning resolutions and file formats
can represent a complex trade-off among the time it takes to capture
material, file size, fidelity to the original, on-screen display,
printing, and equipment availability.  All these factors must be taken
into consideration.

CXP placed primary emphasis on the production in a timely and
cost-effective manner of printed facsimiles that consisted largely of
black-and-white text.  With binary scanning, large files may be
compressed efficiently and in a lossless manner (i.e., no data is lost in
the process of compressing [and decompressing] an image--the exact
bit-representation is maintained) using Group 4 CCITT (i.e., the French
acronym for International Consultative Committee for Telegraph and
Telephone) compression.  CXP was getting compression ratios of about
forty to one.  Gray-scale compression, which primarily uses JPEG, is much
less economical and can represent a lossy compression (i.e., not
lossless), so that as one compresses and decompresses, the illustration
is subtly changed.  While binary files produce a high-quality printed
version, it appears 1) that other combinations of spatial resolution with
gray and/or color hold great promise as well, and 2) that gray scale can
represent a tremendous advantage for on-screen viewing.  The quality
associated with binary and gray scale also depends on the equipment used. 
For instance, binary scanning produces a much better copy on a binary
printer.
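
The lossless property and the notion of a compression ratio described
above can be illustrated with a minimal Python sketch.  It rests on several
assumptions that are not from the project:  zlib (DEFLATE) stands in for
Group 4 CCITT, which the Python standard library does not provide; the
page is assumed to be 8.5 by 11 inches at 600 dpi; and the bitmap is
synthetic rather than a real scan, so the ratio it yields says nothing
about the ratios CXP obtained.

```python
import zlib

# Assumed geometry: an 8.5 x 11 inch page at 600 dpi, 1 bit per pixel.
DPI = 600
WIDTH_PX = int(8.5 * DPI)           # 5100 pixels per row
HEIGHT_PX = 11 * DPI                # 6600 rows
ROW_BYTES = (WIDTH_PX + 7) // 8     # 8 pixels are packed into each byte


def synthetic_page() -> bytes:
    """Build a mostly white bitmap with regular bands of 'ink'."""
    white_row = bytes(ROW_BYTES)                  # all zero bits
    ink_row = bytes([0b10101010]) * ROW_BYTES     # alternating pixels
    rows = [
        ink_row if (y % 100) < 20 else white_row  # 20 inked rows per 100
        for y in range(HEIGHT_PX)
    ]
    return b"".join(rows)


raw = synthetic_page()
packed = zlib.compress(raw, 9)      # DEFLATE, a stand-in for Group 4
restored = zlib.decompress(packed)

# Lossless: the exact bit representation survives the round trip.
assert restored == raw

ratio = len(raw) / len(packed)
print(f"uncompressed: {len(raw)} bytes")
print(f"compressed:   {len(packed)} bytes")
print(f"ratio:        {ratio:.0f} to 1")

# Size of the same page at the forty-to-one ratio reported in the text.
print(f"at 40 to 1:   {len(raw) // 40} bytes")
```

Under these assumptions an uncompressed binary page occupies about 4.2
megabytes, so a forty-to-one ratio would bring it down to roughly 105
kilobytes.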

Among CXP's findings concerning the production of microfilm from digital
files, KENNEY reported that the digital files for the same Reed lecture
were used to produce sample film using an electron beam recorder.  The
resulting film was faithful to the image capture of the digital files,
and while CXP felt that the text and image pages represented in the Reed
lecture were superior to those of the light-lens film, the resolution
readings for the 600 dpi were not as high as standard microfilming. 
KENNEY argued that the standards defined for light-lens technology are
not totally transferable to a digital environment.  Moreover, they are
based on a definition of quality for a preservation copy.  Although making
this case will prove to be a long, uphill struggle, CXP plans to continue
to investigate the issue over the course of the next year.


implicit and explicit, as applied to text, and their capabilities.  He
argued that these grammars correspond to different models of text that
different developers have.  For example, one implicit model of the text
is that there is no internal structure, but just one thing after another,
a few characters and then perhaps a start-title command, and then a few
more characters and an end-title command.  SPERBERG-McQUEEN also
distinguished several kinds of text that have a sort of hierarchical
structure that is not very well defined, which, typically, corresponds
to grammars that are not very well defined, as well as hierarchies that
are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely
complicated things such as SGML, which handle strictly hierarchical data
very nicely.
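
The contrast between the flat model and a strictly hierarchical one can be
sketched in Python.  This is an illustration only:  the stream layout, the
("start", name) and ("end", name) tuples, and the to_tree function are
hypothetical and do not reproduce any markup system named in the talk.

```python
# Flat model: one thing after another, character runs and commands.
flat = [
    "A few characters ",
    ("start", "title"),
    "My Title",
    ("end", "title"),
    " and a few more.",
]


def to_tree(stream):
    """Read the flat stream as a hierarchy, rejecting unbalanced commands."""
    root = {"name": "text", "children": []}
    stack = [root]                        # innermost open element is last
    for item in stream:
        if isinstance(item, str):
            stack[-1]["children"].append(item)
            continue
        kind, name = item
        if kind == "start":
            node = {"name": name, "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif kind == "end":
            if len(stack) == 1 or stack[-1]["name"] != name:
                raise ValueError(f"unbalanced end-{name} command")
            stack.pop()
        else:
            raise ValueError(f"unknown command kind: {kind}")
    if len(stack) != 1:
        raise ValueError(f"unclosed element: {stack[-1]['name']}")
    return root


tree = to_tree(flat)
print(tree)
```

The flat model accepts any sequence of items, whereas the hierarchical
reading enforces a rule, here that every start command has a matching end
command, which is the kind of constraint a well-defined grammar supplies.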

SPERBERG-McQUEEN conceded that one other model not illustrated on his two
displays was the model of text as a bit-mapped image, an image of a page,
and confessed to having been converted to a limited extent by the
Workshop to the view that electronic images constitute a promising,
probably superior alternative to microfilming.  But he was not convinced
that electronic images represent a serious attempt to represent text in
electronic form.  Many of their problems stem from the fact that they are
not direct attempts to represent the text but attempts to represent the
page, thus making them representations of representations.

In this situation of increasingly complicated textual information and the
need to control that complexity in a useful way (which begs the question
of the need for good textual grammars), one has the introduction of SGML. 
With SGML, one can develop specific document-type declarations
for specific text types or, as with the TEI, attempt to generate
general document-type declarations that can handle all sorts of text.
The TEI is an attempt to develop formats for text representation that
will ensure the kind of reusability and longevity of data discussed earlier.
It offers a way to stay alive in the state of permanent technological
revolution.

It has been a continuing challenge in the TEI to create document grammars
that do some work in controlling the complexity of the textual object but
also allow one to represent the real text that one will find. 
Fundamental to the notion of the TEI is that TEI conformance allows one
the ability to extend or modify the TEI tag set so that it fits the text
that one is attempting to represent.

SPERBERG-McQUEEN next outlined the administrative background of the TEI. 
The TEI is an international project to develop and disseminate guidelines
for the encoding and interchange of machine-readable text.  It is
sponsored by the Association for Computers in the Humanities, the
Association for Computational Linguistics, and the Association for
Literary and Linguistic Computing.  Representatives of numerous other
professional societies sit on its advisory board.  The TEI has a number
of affiliated projects that have provided assistance by testing drafts of
the guidelines.

Among the design goals for the TEI tag set, the scheme first of all must
meet the needs of research, because the TEI came out of the research
community, which did not feel adequately served by existing tag sets. 
The tag set must be extensive as well as compatible with existing and
emerging standards.  In 1990, version 1.0 of the Guidelines was released
(SPERBERG-McQUEEN illustrated their contents).

SPERBERG-McQUEEN noted that one problem besetting electronic text has
been the lack of adequate internal or external documentation for many
existing electronic texts.  The TEI guidelines as currently formulated
contain few fixed requirements, but one of them is this:  There must
always be a document header, an in-file SGML tag that provides
1) a bibliographic description of the electronic object one is talking
about (that is, who included it, when, what for, and under which title);
and 2) the copy text from which it was derived, if any.  If there was
no copy text or if the copy text is unknown, then one states as much.
Version 2.0 of the Guidelines was scheduled to be completed in fall 1992
and a revised third version is to be presented to the TEI advisory board
for its endorsement this coming winter.  The TEI itself exists to provide
a markup language, not a marked-up text.
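
The header requirement described above can be modeled in a short Python
sketch.  The class and field names are hypothetical and are not TEI tag
names; the sketch records only the items the text lists (who, when, what
for, under which title, and the copy text) and is not an implementation of
the Guidelines.

```python
from dataclasses import dataclass
from typing import List, Optional

# Items of the bibliographic description that must be present.
REQUIRED = ("title", "included_by", "when", "purpose")


@dataclass
class DocumentHeader:
    """Hypothetical stand-in for the in-file document header."""
    title: str
    included_by: str
    when: str
    purpose: str
    copy_text: Optional[str] = None   # None: no copy text, or unknown

    def copy_text_statement(self) -> str:
        """State the copy text, or state explicitly that there is none."""
        if self.copy_text:
            return f"Derived from: {self.copy_text}"
        return "No copy text, or copy text unknown"


def missing_fields(header: DocumentHeader) -> List[str]:
    """Return the names of required items that were left blank."""
    return [name for name in REQUIRED if not getattr(header, name).strip()]


header = DocumentHeader(
    title="Sample electronic text",
    included_by="A. Scholar",
    when="1992",
    purpose="Illustration only",
)
print(missing_fields(header))
print(header.copy_text_statement())
```

The copy text is optional in the sketch, but the statement about it is
not:  when no copy text is given, the header still says so, in keeping
with the requirement that one state as much.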

Among the challenges the TEI has attempted to face is the need for a
markup language that will work for existing projects, that is, handle the
level of markup that people are using now to tag only chapter, section,
and paragraph divisions and not much else.  At the same time, such a
language also will be able to scale up gracefully to handle the highly
detailed markup which many people foresee as the future destination of
much electronic text, and which is not the future destination but the
present home of numerous electronic texts in specialized areas.

SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as
unable to support the kind of applications that draw people who have
never been in the public library regularly before, and make them come
back.  He advocated more interesting text and more intelligent text. 
Asserting that it is not beyond economic feasibility to have good texts,
SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags
contains tags that one is expected to enter every time the relevant
textual feature occurs.  It contains all the tags that people need now,
and it is not expected that everyone will tag things in the same way.

The question of how people will tag the text is in large part a function
of their reaction to what SPERBERG-McQUEEN termed the issue of
reproducibility.  What one needs to be able to reproduce are the things
one wants to work with.  Perhaps a more useful concept than that of
reproducibility or recoverability is that of processability, that is,
what can one get from an electronic text without reading it again
in the original.  He illustrated this contention with a page from
Jan Comenius's bilingual Introduction to Latin.

SPERBERG-McQUEEN returned at length to the issue of images as simulacra
for the text, in order to reiterate his belief that in the long run more
than images of pages of particular editions of the text are needed,
because just as second-generation photocopies and second-generation
microfilm degenerate, so second-generation representations tend to
degenerate, and one tends to overstress some relatively trivial aspects
of the text such as its layout on the page, which is not always
significant, despite what the text critics might say, and slight other
pieces of information such as the very important lexical ties between the
English and Latin versions of Comenius's bilingual text, for example. 
Moreover, in many crucial respects it is easy to fool oneself concerning
what a scanned image of the text will accomplish.  For example, in order
to study the transmission of texts, information concerning the text
carrier is necessary, which scanned images simply do not always handle. 
Further, even the high-quality materials being produced at Cornell lose
much of the information that one would need if studying those books as
physical objects.  It is a choice that has been made.  It is an arguably
justifiable choice, but one does not know what color those pen strokes in
the margin are or whether there was a stain on the page, because it has
been filtered out.  One does not know whether there were rips in the page
because they do not show up, and on a couple of the marginal marks one


