IO-Compress-Brotli
brotli/tests/testdata/lcet10.txt view on Meta::CPAN
standards precipitously can inhibit creativity, but delay can result in
chaos, she advised.
In part, BATTIN's position reflected the unsettled nature of image-format
standards, and attendees could hear echoes of this unsettledness in the
comments of various speakers. For example, Jean BARONAS reviewed the
status of several formal standards moving through committees of experts;
and Clifford LYNCH encouraged the use of a new guideline for transmitting
document images on Internet. Testimony from participants in the National
Agricultural Library's (NAL) Text Digitization Program and LC's American
Memory project highlighted some of the challenges to the actual creation
or interchange of images, including difficulties in converting
preservation microfilm to digital form. Donald WATERS reported on the
progress of a master plan for a project at Yale University to convert
books on microfilm to digital image sets, Project Open Book (POB).
The Workshop offered rather less of an imaging practicum than planned,
but "how-to" hints emerge at various points, for example, throughout
KENNEY's presentation and in the discussion of arcana such as
thresholding and dithering offered by George THOMA and FLEISCHHAUER.
NOTES:
(3) Although there is a sense in which any reproductions of
historical materials preserve the human record, specialists in the
field have developed particular guidelines for the creation of
acceptable preservation copies.
(4) Titles and affiliations of presenters are given at the
beginning of their respective talks and in the Directory of
Participants (Appendix III).
THE MACHINE-READABLE TEXT: MARKUP AND USE
The sections of the Workshop that dealt with machine-readable text tended
to be more concerned with access and use than with preservation, at least
in the narrow technical sense. Michael SPERBERG-McQUEEN made a forceful
presentation on the Text Encoding Initiative's (TEI) implementation of
the Standard Generalized Markup Language (SGML). His ideas were echoed
by Susan HOCKEY, Elli MYLONAS, and Stuart WEIBEL. While the
presentations made by the TEI advocates contained no practicum, their
discussion focused on the value of the finished product, what the
European Community calls reusability, but what may also be termed
durability. They argued that marking up--that is, coding--a text in a
well-conceived way will permit it to be moved from one computer
environment to another, as well as to be used by various users. Two
kinds of markup were distinguished: 1) procedural markup, which
describes the features of a text (e.g., dots on a page), and 2)
descriptive markup, which describes the structure or elements of a
document (e.g., chapters, paragraphs, and front matter).
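The reusability argument can be sketched in code. The example below is a minimal illustration, not TEI or SGML itself: the element names (chapter_title, paragraph), the sample text, and both rendering functions are hypothetical. It shows one descriptively marked-up text feeding two different output environments without being re-encoded.

```python
# Descriptive markup: each piece of text is labeled by what it is,
# not by how it should look. Element names here are hypothetical.
document = [
    ("chapter_title", "Markup and Use"),
    ("paragraph", "A text can be coded in a well-conceived way."),
]


def render_plain(doc):
    """Render for a plain-text environment: titles in capitals."""
    lines = []
    for element, text in doc:
        if element == "chapter_title":
            lines.append(text.upper())
        else:
            lines.append(text)
    return "\n".join(lines)


def render_html(doc):
    """Render the same structure for an HTML environment."""
    tags = {"chapter_title": "h1", "paragraph": "p"}
    return "\n".join(f"<{tags[e]}>{t}</{tags[e]}>" for e, t in doc)


if __name__ == "__main__":
    print(render_plain(document))
    print(render_html(document))
```

A procedurally marked-up version would instead store instructions such as "set this line in capitals," which ties the text to a single presentation.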
The TEI proponents emphasized the importance of texts to scholarship.
They explained how heavily coded (and thus analyzed and annotated) texts
can underlie research, play a role in scholarly communication, and
facilitate classroom teaching. SPERBERG-McQUEEN reminded listeners that
a written or printed item (e.g., a particular edition of a book) is
merely a representation of the abstraction we call a text. To concern
ourselves with faithfully reproducing a printed instance of the text,
SPERBERG-McQUEEN argued, is to concern ourselves with the representation
of a representation ("images as simulacra for the text"). The TEI proponents'
interest in images tends to focus on corollary materials for use in teaching,
for example, photographs of the Acropolis to accompany a Greek text.
By the end of the Workshop, SPERBERG-McQUEEN confessed to having been
converted to a limited extent to the view that electronic images
constitute a promising alternative to microfilming; indeed, an
alternative probably superior to microfilming. But he was not convinced
that electronic images constitute a serious attempt to represent text in
electronic form. HOCKEY and MYLONAS also conceded that their experience
at the Pierce Symposium the previous week at Georgetown University and
the present conference at the Library of Congress had compelled them to
reevaluate their perspective on the usefulness of text as images.
Attendees could see that the text and image advocates were in
constructive tension, so to speak.
Three non-TEI presentations described approaches to preparing
machine-readable text that are less rigorous and thus less expensive. In
the case of the Papers of George Washington, Dorothy TWOHIG explained
that the digital version will provide a not-quite-perfect rendering of
the transcribed text--some 135,000 documents--available for research
during the decades while the perfect or print version is completed.
Members of the American Memory team and the staff of NAL's Text
Digitization Program (see below) also outlined a middle ground concerning
searchable texts. In the case of American Memory, contractors produce
texts with about 99-percent accuracy that serve as "browse" or
"reference" versions of written or printed originals. End users who need
faithful copies or perfect renditions must refer to accompanying sets of
digital facsimile images or consult copies of the originals in a nearby
library or archive. American Memory staff argued that the high cost of
producing 100-percent accurate copies would prevent LC from offering
access to large parts of its collections.
THE MACHINE-READABLE TEXT: METHODS OF CONVERSION
Although the Workshop did not include a systematic examination of the
methods for converting texts from paper (or from facsimile images) into
machine-readable form, nevertheless, various speakers touched upon this
matter. For example, WEIBEL reported that OCLC has experimented with a
merging of multiple optical character recognition systems that will
reduce errors from an unacceptable rate of 5 characters out of every
1,000 to an unacceptable rate of 2 characters out of every 1,000.
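To give a rough sense of what these rates mean in practice, the sketch below converts errors per 1,000 characters into a character-accuracy percentage and an expected number of errors per page. The page length of 2,000 characters is an assumption chosen for illustration, not a figure from the Workshop, and the function names are hypothetical.

```python
def accuracy_percent(errors_per_thousand):
    """Character accuracy implied by an error rate per 1,000 characters."""
    return 100.0 * (1 - errors_per_thousand / 1000.0)


def expected_errors_per_page(errors_per_thousand, chars_per_page=2000):
    """Expected errors on a page; chars_per_page is an assumed value."""
    return errors_per_thousand * chars_per_page / 1000.0


if __name__ == "__main__":
    for rate in (5, 2):
        print(rate, accuracy_percent(rate), expected_errors_per_page(rate))
```

Under that assumed page length, even the improved rate leaves several errors on every page, which may help explain why both rates were described as unacceptable.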
Pamela ANDRE presented an overview of NAL's Text Digitization Program and
Judith ZIDAR discussed the technical details. ZIDAR explained how NAL
purchased hardware and software capable of performing optical character
recognition (OCR) and text conversion and used its own staff to convert
texts. The process, ZIDAR said, required extensive editing and project
staff found themselves considering alternatives, including rekeying
and/or creating abstracts or summaries of texts. NAL reckoned costs at
$7 per page. By way of contrast, Ricky ERWAY explained that American
Memory had decided from the start to contract out conversion to external
service bureaus. The criteria used to select these contractors were cost
and quality of results, as opposed to methods of conversion. ERWAY noted
that historical documents or books often do not lend themselves to OCR.
Bound materials represent a special problem. In her experience, quality
control--inspecting incoming materials, counting errors in samples--posed
the most time-consuming aspect of contracting out conversion. ERWAY
reckoned American Memory's costs at $4 per page, but cautioned that fewer
cost-elements had been included than in NAL's figure.
networks, LYNCH, Howard BESSER, Ronald LARSEN, and Edwin BROWNRIGG
highlighted the virtues of Internet today and of the network that will
evolve from Internet. Listeners could discern in these narratives a
vision of an information democracy in which millions of citizens freely
find and use what they need. LYNCH noted that a lack of standards
inhibits disseminating multimedia on the network, a topic also discussed
by BESSER. LARSEN addressed the issues of network scalability and
modularity and commented upon the difficulty of anticipating the effects
of growth in orders of magnitude. BROWNRIGG talked about the ability of
packet radio to provide certain links in a network without the need for
wiring. However, the presenters also called attention to the
shortcomings and incongruities of present-day computer networks. For
example: 1) Network use is growing dramatically, but much network
traffic consists of personal communication (E-mail). 2) Large bodies of
information are available, but a user's ability to search across their
entirety is limited. 3) There are significant resources for science and
technology, but few network sources provide content in the humanities.
4) Machine-readable texts are commonplace, but the capability of the
system to deal with images (let alone other media formats) lags behind.
A glimpse of a multimedia future for networks, however, was provided by
Maria LEBRON in her overview of the Online Journal of Current Clinical
Trials (OJCCT), and the process of scholarly publishing on-line.
The contrasting form of the CD-ROM disk was never systematically
analyzed, but attendees could glean an impression from several of the
show-and-tell presentations. The Perseus and American Memory examples
demonstrated recently published disks, while the descriptions of the
IBYCUS version of the Papers of George Washington and Chadwyck-Healey's
Patrologia Latina Database (PLD) told of disks to come. According to
Eric CALALUCA, PLD's principal focus has been on converting Jacques-Paul
Migne's definitive collection of Latin texts to machine-readable form.
Although everyone could share the network advocates' enthusiasm for an
on-line future, the possibility of rolling up one's sleeves for a session
with a CD-ROM containing both textual materials and a powerful retrieval
engine made the disk seem an appealing vessel indeed. The overall
discussion suggested that the transition from CD-ROM to on-line networked
access may prove far slower and more difficult than has been anticipated.
WHO ARE THE USERS AND WHAT DO THEY DO?
Although concerned with the technicalities of production, the Workshop
never lost sight of the purposes and uses of electronic versions of
textual materials. As noted above, those interested in imaging discussed
the problematical matter of digital preservation, while the TEI proponents
described how machine-readable texts can be used in research. This latter
topic received thorough treatment in the paper read by Avra MICHELSON.
She placed the phenomenon of electronic texts within the context of
broader trends in information technology and scholarly communication.
Among other things, MICHELSON described on-line conferences that
represent a vigorous and important intellectual forum for certain
disciplines. Internet now carries more than 700 conferences, with about
80 percent of these devoted to topics in the social sciences and the
humanities. Other scholars use on-line networks for "distance learning."
Meanwhile, there has been a tremendous growth in end-user computing;
professors today are less likely than their predecessors to ask the
campus computer center to process their data. Electronic texts are one
key to these sophisticated applications, MICHELSON reported, and more and
more scholars in the humanities now work in an on-line environment.
Toward the end of the Workshop, Michael LESK presented a corollary to
MICHELSON's talk, reporting the results of an experiment that compared
the work of one group of chemistry students using traditional printed
texts and two groups using electronic sources. The experiment
demonstrated that in the event one does not know what to read, one needs
the electronic systems; the electronic systems hold no advantage at the
moment if one knows what to read, but neither do they impose a penalty.
DALY provided an anecdotal account of the revolutionizing impact of the
new technology on his previous methods of research in the field of classics.
His account, by extrapolation, served to illustrate in part the arguments
made by MICHELSON concerning the positive effects of the sudden and radical
transformation being wrought in the ways scholars work.
Susan VECCIA and Joanne FREEMAN delineated the use of electronic
materials outside the university. The most interesting aspect of their
use, FREEMAN said, could be seen as a paradox: teachers in elementary
and secondary schools requested access to primary source materials but,
at the same time, found that "primariness" itself made these materials
difficult for their students to use.
OTHER TOPICS
Marybeth PETERS reviewed copyright law in the United States and offered
advice during a lively discussion of this subject. But uncertainty
remains concerning the price of copyright in a digital medium, because a
solution remains to be worked out concerning management and synthesis of
copyrighted and out-of-copyright pieces of a database.
As moderator of the final session of the Workshop, Prosser GIFFORD directed
discussion to future courses of action and the potential role of LC in
advancing them. Among the recommendations that emerged were the following:
* Workshop participants should 1) begin to think about working
with image material, but structure and digitize it in such a
way that at a later stage it can be interpreted into text, and
2) find a common way to build text and images together so that
they can be used jointly at some stage in the future, with
appropriate network support, because that is how users will want
to access these materials. The Library might encourage attempts
to bring together people who are working on texts and images.
* A network version of American Memory should be developed or
consideration should be given to making the data in it
available to people interested in doing network multimedia.
Given the current dearth of digital data that is appealing and
unencumbered by extremely complex rights problems, developing a
network version of American Memory could do much to help make
network multimedia a reality.
* Concerning the thorny issue of electronic deposit, LC should
initiate a catalytic process in terms of distributed
responsibility, that is, bring together the distributed
organizations and set up a study group to look at all the
issues related to electronic deposit and see where we as a
nation should move. For example, LC might attempt to persuade
one major library in each state to deal with its state
equivalent publisher, which might produce a cooperative project
that would be equitably distributed around the country, and one
in which LC would be dealing with a minimal number of publishers