1
2
3The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS
4
5
6
7
8                      WORKSHOP ON ELECTRONIC TEXTS
9
10                               PROCEEDINGS
11
12
13
14                          Edited by James Daly
15
16
17
18
19
20
21
22                             9-10 June 1992
23
24
25                           Library of Congress
26                            Washington, D.C.
27
28
29
30    Supported by a Grant from the David and Lucile Packard Foundation
31
32
33               ***   ***   ***   ******   ***   ***   ***
34
35
36                            TABLE OF CONTENTS
37
38
39Acknowledgements
40
41Introduction
42
43Proceedings
44   Welcome
45      Prosser Gifford and Carl Fleischhauer
46
47   Session I.  Content in a New Form:  Who Will Use It and What Will They Do?
48      James Daly (Moderator)
49      Avra Michelson, Overview
50      Susan H. Veccia, User Evaluation
51      Joanne Freeman, Beyond the Scholar
52         Discussion
53
54   Session II.  Show and Tell
55      Jacqueline Hess (Moderator)
56      Elli Mylonas, Perseus Project
57         Discussion
58      Eric M. Calaluca, Patrologia Latina Database
59      Carl Fleischhauer and Ricky Erway, American Memory
60         Discussion
61      Dorothy Twohig, The Papers of George Washington
62         Discussion
63      Maria L. Lebron, The Online Journal of Current Clinical Trials
64         Discussion
65      Lynne K. Personius, Cornell mathematics books
66         Discussion
67
68   Session III.  Distribution, Networks, and Networking:
69                 Options for Dissemination
70      Robert G. Zich (Moderator)
71      Clifford A. Lynch
72         Discussion
73      Howard Besser
74         Discussion
75      Ronald L. Larsen
76      Edwin B. Brownrigg
77         Discussion
78
79   Session IV.  Image Capture, Text Capture, Overview of Text and
80                Image Storage Formats
81         William L. Hooton (Moderator)
82      A) Principal Methods for Image Capture of Text:
83            direct scanning, use of microform
84         Anne R. Kenney
85         Pamela Q.J. Andre
86         Judith A. Zidar
87         Donald J. Waters
88            Discussion
89      B) Special Problems:  bound volumes, conservation,
90                            reproducing printed halftones
91         George Thoma
92         Carl Fleischhauer
93            Discussion
94      C) Image Standards and Implications for Preservation
95         Jean Baronas
96         Patricia Battin
97            Discussion
98      D) Text Conversion:  OCR vs. rekeying, standards of accuracy
99                           and use of imperfect texts, service bureaus
100         Michael Lesk
101         Ricky Erway
102         Judith A. Zidar
103            Discussion
104
105   Session V.  Approaches to Preparing Electronic Texts
106      Susan Hockey (Moderator)
107      Stuart Weibel
108         Discussion
109      C.M. Sperberg-McQueen
110         Discussion
111      Eric M. Calaluca
112         Discussion
113
114   Session VI.  Copyright Issues
115      Marybeth Peters
116
117   Session VII.  Conclusion
118      Prosser Gifford (Moderator)
119      General discussion
120
121Appendix I:  Program
122
123Appendix II:  Abstracts
124
125Appendix III:  Directory of Participants
126
127
128               ***   ***   ***   ******   ***   ***   ***
129
130
131                            Acknowledgements
132
133I would like to thank Carl Fleischhauer and Prosser Gifford for the
134opportunity to learn about areas of human activity unknown to me a scant
135ten months ago, and the David and Lucile Packard Foundation for
136supporting that opportunity.  The help given by others is acknowledged on
137a separate page.
138
139                                                          19 October 1992
140
141
142               ***   ***   ***   ******   ***   ***   ***
143
144
145                              INTRODUCTION
146
147The Workshop on Electronic Texts (1) drew together representatives of
148various projects and interest groups to compare ideas, beliefs,
149experiences, and, in particular, methods of placing and presenting
150historical textual materials in computerized form.  Most attendees gained
151much in insight and outlook from the event.  But the assembly did not
152form a new nation, or, to put it another way, the diversity of projects
153and interests was too great to draw the representatives into a cohesive,
154action-oriented body.(2)
155
156Everyone attending the Workshop shared an interest in preserving and
157providing access to historical texts.  But within this broad field the
158attendees represented a variety of formal, informal, figurative, and
159literal groups, with many individuals belonging to more than one.  These
160groups may be defined roughly according to the following topics or
161activities:
162
163* Imaging
164* Searchable coded texts
165* National and international computer networks
166* CD-ROM production and dissemination
167* Methods and technology for converting older paper materials into
168electronic form
169* Study of the use of digital materials by scholars and others
170
171This summary is arranged thematically and does not follow the actual
172sequence of presentations.
173
174NOTES:
175     (1)  In this document, the phrase electronic text is used to mean
176     any computerized reproduction or version of a document, book,
177     article, or manuscript (including images), and not merely a machine-
178     readable or machine-searchable text.
179
180     (2)  The Workshop was held at the Library of Congress on 9-10 June
181     1992, with funding from the David and Lucile Packard Foundation.
182     The document that follows represents a summary of the presentations
183     made at the Workshop and was compiled by James DALY.  This
184     introduction was written by DALY and Carl FLEISCHHAUER.
185
186
187PRESERVATION AND IMAGING
188
189Preservation, as that term is used by archivists,(3) was most explicitly
190discussed in the context of imaging.  Anne KENNEY and Lynne PERSONIUS
191explained how the concept of a faithful copy and the user-friendliness of
192the traditional book have guided their project at Cornell University.(4)
193Although interested in computerized dissemination, participants in the
194Cornell project are creating digital image sets of older books in the
195public domain as a source for a fresh paper facsimile or, in a future
196phase, microfilm.  The books returned to the library shelves are
197high-quality and useful replacements on acid-free paper that should last
198a long time.  To date, the Cornell project has placed little or no
199emphasis on creating searchable texts; one would not be surprised to find
200that the project participants view such texts as new editions, and thus
201not as faithful reproductions.
202
203In her talk on preservation, Patricia BATTIN struck an ecumenical and
204flexible note as she endorsed the creation and dissemination of a variety
205of types of digital copies.  Do not be too narrow in defining what counts
206as a preservation element, BATTIN counseled; for the present, at least,
207digital copies made with preservation in mind cannot be as narrowly
208standardized as, say, microfilm copies with the same objective.  Setting
209standards precipitously can inhibit creativity, but delay can result in
210chaos, she advised.
211
212In part, BATTIN's position reflected the unsettled nature of image-format
213standards, and attendees could hear echoes of this unsettledness in the
214comments of various speakers.  For example, Jean BARONAS reviewed the
215status of several formal standards moving through committees of experts;
216and Clifford LYNCH encouraged the use of a new guideline for transmitting
217document images on Internet.  Testimony from participants in the National
218Agricultural Library's (NAL) Text Digitization Program and LC's American
219Memory project highlighted some of the challenges to the actual creation
220or interchange of images, including difficulties in converting
221preservation microfilm to digital form.  Donald WATERS reported on the
222progress of a master plan for a project at Yale University to convert
223books on microfilm to digital image sets, Project Open Book (POB).
224
225The Workshop offered rather less of an imaging practicum than planned,
226but "how-to" hints emerge at various points, for example, throughout
227KENNEY's presentation and in the discussion of arcana such as
228thresholding and dithering offered by George THOMA and FLEISCHHAUER.
229
230NOTES:
231     (3)  Although there is a sense in which any reproductions of
232     historical materials preserve the human record, specialists in the
233     field have developed particular guidelines for the creation of
234     acceptable preservation copies.
235
236     (4)  Titles and affiliations of presenters are given at the
237     beginning of their respective talks and in the Directory of
238     Participants (Appendix III).
239
240
241THE MACHINE-READABLE TEXT:  MARKUP AND USE
242
243The sections of the Workshop that dealt with machine-readable text tended
244to be more concerned with access and use than with preservation, at least
245in the narrow technical sense.  Michael SPERBERG-McQUEEN made a forceful
246presentation on the Text Encoding Initiative's (TEI) implementation of
247the Standard Generalized Markup Language (SGML).  His ideas were echoed
248by Susan HOCKEY, Elli MYLONAS, and Stuart WEIBEL.  While the
249presentations made by the TEI advocates contained no practicum, their
250discussion focused on the value of the finished product, what the
251European Community calls reusability, but what may also be termed
252durability.  They argued that marking up--that is, coding--a text in a
253well-conceived way will permit it to be moved from one computer
254environment to another, as well as to be used by various users.  Two
255kinds of markup were distinguished:  1) procedural markup, which
256describes the features of a text (e.g., dots on a page), and 2)
257descriptive markup, which describes the structure or elements of a
258document (e.g., chapters, paragraphs, and front matter).
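
The distinction can be illustrated with a minimal sketch in Python.  It is
not drawn from any Workshop presentation:  XML stands in for SGML, and the
element names (text, chapter, title, para) and both rendering functions are
hypothetical, chosen only for illustration and not taken from the TEI tag
set.  Because the markup records what each part of the document is, rather
than how it should look, the same file can serve two different uses.

```python
import xml.etree.ElementTree as ET

# Descriptive markup: tags name what each part *is*, not how it should look.
# XML stands in for SGML here; the element names are illustrative only.
SOURCE = """<text>
  <chapter>
    <title>Welcome</title>
    <para>First paragraph.</para>
    <para>Second paragraph.</para>
  </chapter>
</text>"""


def render_plain(root):
    """One presentation: upper-case titles, indented paragraphs."""
    lines = []
    for chapter in root.iter("chapter"):
        lines.append(chapter.findtext("title").upper())
        for para in chapter.findall("para"):
            lines.append("    " + para.text)
    return "\n".join(lines)


def render_outline(root):
    """A second use of the same file: chapter titles with paragraph counts."""
    return [(c.findtext("title"), len(c.findall("para")))
            for c in root.iter("chapter")]


root = ET.fromstring(SOURCE)
plain = render_plain(root)
outline = render_outline(root)
print(plain)
print(outline)
```

A procedurally marked text would instead fix one layout in the file itself,
so that a second use of this kind would typically require recoding.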
259
260The TEI proponents emphasized the importance of texts to scholarship.
261They explained how heavily coded (and thus analyzed and annotated) texts
262can underlie research, play a role in scholarly communication, and
263facilitate classroom teaching.  SPERBERG-McQUEEN reminded listeners that
264a written or printed item (e.g., a particular edition of a book) is
265merely a representation of the abstraction we call a text.  To concern
266ourselves with faithfully reproducing a printed instance of the text,
267SPERBERG-McQUEEN argued, is to concern ourselves with the representation
268of a representation ("images as simulacra for the text").  The TEI proponents'
269interest in images tends to focus on corollary materials for use in teaching,
270for example, photographs of the Acropolis to accompany a Greek text.
271
272By the end of the Workshop, SPERBERG-McQUEEN confessed to having been
273converted to a limited extent to the view that electronic images
274constitute a promising alternative to microfilming; indeed, an
275alternative probably superior to microfilming.  But he was not convinced
276that electronic images constitute a serious attempt to represent text in
277electronic form.  HOCKEY and MYLONAS also conceded that their experience
278at the Pierce Symposium the previous week at Georgetown University and
279the present conference at the Library of Congress had compelled them to
280reevaluate their perspective on the usefulness of text as images.
281Attendees could see that the text and image advocates were in
282constructive tension, so to say.
283
284Three non-TEI presentations described approaches to preparing
285machine-readable text that are less rigorous and thus less expensive.  In
286the case of the Papers of George Washington, Dorothy TWOHIG explained
287that the digital version will provide a not-quite-perfect rendering of
288the transcribed text--some 135,000 documents, available for research
289during the decades while the perfect or print version is completed.
290Members of the American Memory team and the staff of NAL's Text
291Digitization Program (see below) also outlined a middle ground concerning
292searchable texts.  In the case of American Memory, contractors produce
293texts with about 99-percent accuracy that serve as "browse" or
294"reference" versions of written or printed originals.  End users who need
295faithful copies or perfect renditions must refer to accompanying sets of
296digital facsimile images or consult copies of the originals in a nearby
297library or archive.  American Memory staff argued that the high cost of
298producing 100-percent accurate copies would prevent LC from offering
299access to large parts of its collections.
300
301
302THE MACHINE-READABLE TEXT:  METHODS OF CONVERSION
303
304Although the Workshop did not include a systematic examination of the
305methods for converting texts from paper (or from facsimile images) into
306machine-readable form, nevertheless, various speakers touched upon this
307matter.  For example, WEIBEL reported that OCLC has experimented with a
308merging of multiple optical character recognition systems that will
309reduce errors from an unacceptable rate of 5 characters out of every 1,000
310to an unacceptable rate of 2 characters out of every 1,000.
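
How merging several recognition systems can lower the error rate may be
sketched as a character-by-character majority vote.  This is an
illustration only and not a description of OCLC's experiment, whose method
the summary does not specify.  The function names and sample strings are
hypothetical, and the sketch assumes the outputs are already aligned and of
equal length, which real OCR output rarely is.

```python
from collections import Counter


def vote(outputs):
    """Majority vote per character position across equal-length OCR outputs."""
    if len({len(o) for o in outputs}) != 1:
        raise ValueError("this sketch assumes outputs are already aligned")
    merged = []
    for chars in zip(*outputs):
        # Pick the most frequent character at this position.
        merged.append(Counter(chars).most_common(1)[0][0])
    return "".join(merged)


def errors_per_thousand(text, truth):
    """Character errors per 1,000 characters, compared position by position."""
    wrong = sum(a != b for a, b in zip(text, truth))
    return 1000 * wrong / len(truth)


# Hypothetical outputs from three systems, each wrong in a different place.
truth = "library of congress"
outputs = [
    "1ibrary of congress",
    "library of c0ngress",
    "library of congrcss",
]
merged = vote(outputs)
print(merged, errors_per_thousand(merged, truth))
```

The vote helps only where the systems err in different places; where a
majority share the same misreading, the error survives the merge.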
311
312Pamela ANDRE presented an overview of NAL's Text Digitization Program and
313Judith ZIDAR discussed the technical details.  ZIDAR explained how NAL
314purchased hardware and software capable of performing optical character
315recognition (OCR) and text conversion and used its own staff to convert
316texts.  The process, ZIDAR said, required extensive editing and project
317staff found themselves considering alternatives, including rekeying
318and/or creating abstracts or summaries of texts.  NAL reckoned costs at
319$7 per page.  By way of contrast, Ricky ERWAY explained that American
320Memory had decided from the start to contract out conversion to external
321service bureaus.  The criteria used to select these contractors were cost
322and quality of results, as opposed to methods of conversion.  ERWAY noted
323that historical documents or books often do not lend themselves to OCR.
324Bound materials represent a special problem.  In her experience, quality
325control--inspecting incoming materials, counting errors in samples--posed
326the most time-consuming aspect of contracting out conversion.  ERWAY
327reckoned American Memory's costs at $4 per page, but cautioned that fewer
328cost-elements had been included than in NAL's figure.
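
The sample-based quality control ERWAY described can be sketched as
follows.  The sketch is an illustration under stated assumptions, not
American Memory's actual procedure:  the function names are hypothetical,
the reference transcriptions are assumed to be independently verified, and
the 99-percent threshold is borrowed from the accuracy figure cited earlier
for American Memory's contractors.  Errors are counted as the edit distance
between a delivered sample and its reference.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1 each."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def accept_batch(samples, threshold=0.99):
    """Accept a delivery if sampled accuracy meets the threshold.

    samples is a list of (delivered_text, reference_text) pairs.
    """
    total_chars = sum(len(ref) for _, ref in samples)
    if total_chars == 0:
        raise ValueError("no reference characters to compare against")
    total_errors = sum(edit_distance(d, ref) for d, ref in samples)
    return 1 - total_errors / total_chars >= threshold


# Hypothetical samples: 1 error and 4 errors in 200 characters.
reference = "x" * 200
good = "x" * 199 + "y"
bad = "y" * 4 + "x" * 196
print(accept_batch([(good, reference)]))
print(accept_batch([(bad, reference)]))
```

In this example the first sample passes and the second is rejected.  The
result is only as reliable as the sample is representative of the delivery.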
329
330
331OPTIONS FOR DISSEMINATION
332
333The topic of dissemination proper emerged at various points during the
334Workshop.  At the session devoted to national and international computer
335networks, LYNCH, Howard BESSER, Ronald LARSEN, and Edwin BROWNRIGG
336highlighted the virtues of Internet today and of the network that will
337evolve from Internet.  Listeners could discern in these narratives a
338vision of an information democracy in which millions of citizens freely
339find and use what they need.  LYNCH noted that a lack of standards
340inhibits disseminating multimedia on the network, a topic also discussed
341by BESSER.  LARSEN addressed the issues of network scalability and
342modularity and commented upon the difficulty of anticipating the effects
343of growth in orders of magnitude.  BROWNRIGG talked about the ability of
344packet radio to provide certain links in a network without the need for
345wiring.  However, the presenters also called attention to the
346shortcomings and incongruities of present-day computer networks.  For
347example:  1) Network use is growing dramatically, but much network
348traffic consists of personal communication (E-mail).  2) Large bodies of
349information are available, but a user's ability to search across their
350entirety is limited.  3) There are significant resources for science and
351technology, but few network sources provide content in the humanities.
3524) Machine-readable texts are commonplace, but the capability of the
353system to deal with images (let alone other media formats) lags behind.
354A glimpse of a multimedia future for networks, however, was provided by
355Maria LEBRON in her overview of the Online Journal of Current Clinical
356Trials (OJCCT), and the process of scholarly publishing on-line.
357
358The contrasting form of the CD-ROM disk was never systematically
359analyzed, but attendees could glean an impression from several of the
360show-and-tell presentations.  The Perseus and American Memory examples
361demonstrated recently published disks, while the descriptions of the
362IBYCUS version of the Papers of George Washington and Chadwyck-Healey's
363Patrologia Latina Database (PLD) told of disks to come.  According to
364Eric CALALUCA, PLD's principal focus has been on converting Jacques-Paul
365Migne's definitive collection of Latin texts to machine-readable form.
366Although everyone could share the network advocates' enthusiasm for an
367on-line future, the possibility of rolling up one's sleeves for a session
368with a CD-ROM containing both textual materials and a powerful retrieval
369engine made the disk seem an appealing vessel indeed.  The overall
370discussion suggested that the transition from CD-ROM to on-line networked
371access may prove far slower and more difficult than has been anticipated.
372
373
374WHO ARE THE USERS AND WHAT DO THEY DO?
375
376Although concerned with the technicalities of production, the Workshop
377never lost sight of the purposes and uses of electronic versions of
378textual materials.  As noted above, those interested in imaging discussed
379the problematical matter of digital preservation, while the TEI proponents
380described how machine-readable texts can be used in research.  This latter
381topic received thorough treatment in the paper read by Avra MICHELSON.
382She placed the phenomenon of electronic texts within the context of
383broader trends in information technology and scholarly communication.
384
385Among other things, MICHELSON described on-line conferences that
386represent a vigorous and important intellectual forum for certain
387disciplines.  Internet now carries more than 700 conferences, with about
38880 percent of these devoted to topics in the social sciences and the
389humanities.  Other scholars use on-line networks for "distance learning."
390Meanwhile, there has been a tremendous growth in end-user computing;
391professors today are less likely than their predecessors to ask the
392campus computer center to process their data.  Electronic texts are one
393key to these sophisticated applications, MICHELSON reported, and more and
394more scholars in the humanities now work in an on-line environment.
395Toward the end of the Workshop, Michael LESK presented a corollary to
396MICHELSON's talk, reporting the results of an experiment that compared
397the work of one group of chemistry students using traditional printed
398texts and two groups using electronic sources.  The experiment
399demonstrated that in the event one does not know what to read, one needs
400the electronic systems; the electronic systems hold no advantage at the
401moment if one knows what to read, but neither do they impose a penalty.
402
403DALY provided an anecdotal account of the revolutionizing impact of the
404new technology on his previous methods of research in the field of classics.
405His account, by extrapolation, served to illustrate in part the arguments
406made by MICHELSON concerning the positive effects of the sudden and radical
407transformation being wrought in the ways scholars work.
408
409Susan VECCIA and Joanne FREEMAN delineated the use of electronic
410materials outside the university.  The most interesting aspect of their
411use, FREEMAN said, could be seen as a paradox:  teachers in elementary
412and secondary schools requested access to primary source materials but,
413at the same time, found that "primariness" itself made these materials
414difficult for their students to use.
415
416
417OTHER TOPICS
418
419Marybeth PETERS reviewed copyright law in the United States and offered
420advice during a lively discussion of this subject.  But uncertainty
421remains concerning the price of copyright in a digital medium, because a
422solution remains to be worked out concerning management and synthesis of
423copyrighted and out-of-copyright pieces of a database.
424
425As moderator of the final session of the Workshop, Prosser GIFFORD directed
426discussion to future courses of action and the potential role of LC in
427advancing them.  Among the recommendations that emerged were the following:
428
429     * Workshop participants should 1) begin to think about working
430     with image material, but structure and digitize it in such a
431     way that at a later stage it can be interpreted into text, and
432     2) find a common way to build text and images together so that
433     they can be used jointly at some stage in the future, with
434     appropriate network support, because that is how users will want
435     to access these materials.  The Library might encourage attempts
436     to bring together people who are working on texts and images.
437
438     * A network version of American Memory should be developed or
439     consideration should be given to making the data in it
440     available to people interested in doing network multimedia.
441     Given the current dearth of digital data that is appealing and
442     unencumbered by extremely complex rights problems, developing a
443     network version of American Memory could do much to help make
444     network multimedia a reality.
445
446     * Concerning the thorny issue of electronic deposit, LC should
447     initiate a catalytic process in terms of distributed
448     responsibility, that is, bring together the distributed
449     organizations and set up a study group to look at all the
450     issues related to electronic deposit and see where we as a
451     nation should move.  For example, LC might attempt to persuade
452     one major library in each state to deal with its state
453     equivalent publisher, which might produce a cooperative project
454     that would be equitably distributed around the country, and one
455     in which LC would be dealing with a minimal number of publishers
456     and minimal copyright problems.  LC must also deal with the
457     concept of on-line publishing, determining, among other things,
458     how serials such as OJCCT might be deposited for copyright.
459
460     * Since a number of projects are planning to carry out
461     preservation by creating digital images that will end up in
462     on-line or near-line storage at some institution, LC might play
463     a helpful role, at least in the near term, by accelerating work on
464     how to catalog that information into the Research Libraries Information
465     Network (RLIN) and then into OCLC, so that it would be accessible.
466     This would reduce the possibility of multiple institutions digitizing
467     the same work.
468
469
470CONCLUSION
471
472The Workshop was valuable because it brought together partisans from
473various groups and provided an occasion to compare goals and methods.
474The more committed partisans frequently communicate with others in their
475groups, but less often across group boundaries.  The Workshop was also
476valuable to attendees--including those involved with American Memory--who
477came less committed to particular approaches or concepts.  These
478attendees learned a great deal, and plan to select and employ elements of
479imaging, text-coding, and networked distribution that suit their
480respective projects and purposes.
481
482Still, reality rears its ugly head:  no breakthrough has been achieved.
483On the imaging side, one confronts a proliferation of competing
484data-interchange standards and a lack of consensus on the role of digital
485facsimiles in preservation.  In the realm of machine-readable texts, one
486encounters a reasonably mature standard but methodological difficulties
487and high costs.  These latter problems, of course, represent a special
488impediment to the desire, as it is sometimes expressed in the popular
489press, "to put the [contents of the] Library of Congress on line."  In
490the words of one participant, there was "no solution to the economic
491problems--the projects that are out there are surviving, but it is going
492to be a lot of work to transform the information industry, and so far the
493investment to do that is not forthcoming" (LESK, per litteras).
494
495
496               ***   ***   ***   ******   ***   ***   ***
497
498
499                               PROCEEDINGS
500
501
502WELCOME
503
504+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
505GIFFORD * Origin of Workshop in current Librarian's desire to make LC's
506collections more widely available * Desiderata arising from the prospect
507of greater interconnectedness *
508+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
509
510After welcoming participants on behalf of the Library of Congress,
511American Memory (AM), and the National Demonstration Lab, Prosser
512GIFFORD, director for scholarly programs, Library of Congress, located
513the origin of the Workshop on Electronic Texts in a conversation he had
514had considerably more than a year ago with Carl FLEISCHHAUER concerning
515some of the issues faced by AM.  On the assumption that numerous other
516people were asking the same questions, the decision was made to bring
517together as many of these people as possible to ask the same questions
518together.  In a deeper sense, GIFFORD said, the origin of the Workshop
519lay in the desire of the current Librarian of Congress, James H.
520Billington, to make the collections of the Library, especially those
521offering unique or unusual testimony on aspects of the American
522experience, available to a much wider circle of users than those few
523people who can come to Washington to use them.  This meant that the
524emphasis of AM, from the outset, has been on archival collections of the
525basic material, and on making these collections themselves available,
526rather than selected or heavily edited products.
527
528From AM's emphasis followed the questions with which the Workshop began:
529who will use these materials, and in what form will they wish to use
530them.  But an even larger issue deserving mention, in GIFFORD's view, was
531the phenomenal growth in Internet connectivity.  He expressed the hope
532that the prospect of greater interconnectedness than ever before would
533lead to:  1) much more cooperative and mutually supportive endeavors; 2)
534development of systems of shared and distributed responsibilities to
535avoid duplication and to ensure accuracy and preservation of unique
536materials; and 3) agreement on the necessary standards and development of
537the appropriate directories and indices to make navigation
538straightforward among the varied resources that are, and increasingly
539will be, available.  In this connection, GIFFORD requested that
540participants reflect from the outset upon the sorts of outcomes they
541thought the Workshop might have.  Did those present constitute a group
542with sufficient common interests to propose a next step or next steps,
543and if so, what might those be?  They would return to these questions the
544following afternoon.
545
546                                 ******
547
548+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
549FLEISCHHAUER * Core of Workshop concerns preparation and production of
550materials * Special challenge in conversion of textual materials *
551Quality versus quantity * Do the several groups represented share common
552interests? *
553+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
554
555Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress,
556emphasized that he would attempt to represent the people who perform some
557of the work of converting or preparing materials and that the core of
558the Workshop had to do with preparation and production.  FLEISCHHAUER
559then drew a distinction between the long term, when many things would be
560available and connected in the ways that GIFFORD described, and the short
561term, in which AM not only has wrestled with the issue of what is the
562best course to pursue but also has faced a variety of technical
563challenges.
564
565FLEISCHHAUER remarked on AM's endeavors to deal with a wide range of library
566formats, such as motion picture collections, sound-recording collections,
567and pictorial collections of various sorts, especially collections of
568photographs.  In the course of these efforts, AM kept coming back to
569textual materials--manuscripts or rare printed matter, bound materials,
570etc.  Text posed the greatest conversion challenge of all.  Thus, the
571genesis of the Workshop, which reflects the problems faced by AM.  These
572problems include physical problems.  For example, those in the library
573and archive business deal with collections made up of fragile and rare
574manuscript items, bound materials, especially the notoriously brittle
575bound materials of the late nineteenth century.  These are precious
576cultural artifacts, however, as well as interesting sources of
577information, and LC desires to retain and conserve them.  AM needs to
578handle things without damaging them.  Guillotining a book to run it
579through a sheet feeder must be avoided at all costs.
580
581Beyond physical problems, issues pertaining to quality arose.  For
582example, the desire to provide users with a searchable text is affected
583by the question of acceptable level of accuracy.  One hundred percent
584accuracy is tremendously expensive.  On the other hand, the output of
585optical character recognition (OCR) can be tremendously inaccurate.
586Although AM has attempted to find a middle ground, uncertainty persists
587as to whether or not it has discovered the right solution.
588
589Questions of quality arose concerning images as well.  FLEISCHHAUER
590contrasted the extremely high level of quality of the digital images in
591the Cornell Xerox Project with AM's efforts to provide a browse-quality
592or access-quality image, as opposed to an archival or preservation image.
593FLEISCHHAUER therefore welcomed the opportunity to compare notes.
594
595FLEISCHHAUER observed in passing that his conversations about
596networks had begun to suggest that, for various forms of media, a
597browse-quality or distribution-and-access-quality item may coexist
598in some systems with a higher-quality archival item that would be
599inconvenient to send through the network because of its size.
600FLEISCHHAUER referred, of course, to images more than to searchable
601text.
602
603As AM considered those questions, several conceptual issues arose:  ought
604AM occasionally to reproduce materials entirely through an image set, at
605other times, entirely through a text set, and in some cases, a mix?
606There probably would be times when the historical authenticity of an
607artifact would require that its image be used.  An image might be
608desirable as a recourse for users if one could not provide 100-percent
609accurate text.  Again, AM wondered, as a practical matter, if a
610distinction could be drawn between unique items and rare printed matter
611that might exist in multiple collections--that is, in ten or fifteen
612libraries.  For the latter, the need for perfect reproduction would be
613less.  Implicit in his remarks, FLEISCHHAUER conceded, was the admission
614that AM has been tilting strongly towards quantity and drawing back a
615little from perfect quality.  That is, it seemed to AM that society would
616be better served if more things were distributed by LC--even if they were
617not quite perfect--than if fewer things, perfectly represented, were
618distributed.  This was stated as a proposition to be tested, with
619responses to be gathered from users.
620
621In thinking about issues related to reproduction of materials and seeing
622other people engaged in parallel activities, AM deemed it useful to
623convene a conference.  Hence, the Workshop.  FLEISCHHAUER thereupon
624surveyed the several groups represented:  1) the world of images (image
625users and image makers); 2) the world of text and scholarship and, within
626this group, those concerned with language--FLEISCHHAUER confessed to finding
627delightful irony in the fact that some of the most advanced thinkers on
628computerized texts are those dealing with ancient Greek and Roman materials;
6293) the network world; and 4) the general world of library science, which
630includes people interested in preservation and cataloging.
631
632FLEISCHHAUER concluded his remarks with special thanks to the David and
633Lucile Packard Foundation for its support of the meeting, the American
634Memory group, the Office for Scholarly Programs, the National
635Demonstration Lab, and the Office of Special Events.  He expressed the
636hope that David Woodley Packard might be able to attend, noting that
637Packard's work and the work of the foundation had sponsored a number of
638projects in the text area.
639
640                                 ******
641
SESSION I.  CONTENT IN A NEW FORM:  WHO WILL USE IT AND WHAT WILL THEY DO?
643
644+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
645DALY * Acknowledgements * A new Latin authors disk *  Effects of the new
646technology on previous methods of research *
647+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
648
649Serving as moderator, James DALY acknowledged the generosity of all the
650presenters for giving of their time, counsel, and patience in planning
651the Workshop, as well as of members of the American Memory project and
652other Library of Congress staff, and the David and Lucile Packard
653Foundation and its executive director, Colburn S. Wilbur.
654
655DALY then recounted his visit in March to the Center for Electronic Texts
656in the Humanities (CETH) and the Department of Classics at Rutgers
657University, where an old friend, Lowell Edmunds, introduced him to the
658department's IBYCUS scholarly personal computer, and, in particular, the
659new Latin CD-ROM, containing, among other things, almost all classical
660Latin literary texts through A.D. 200.  Packard Humanities Institute
661(PHI), Los Altos, California, released this disk late in 1991, with a
662nominal triennial licensing fee.
663
664Playing with the disk for an hour or so at Rutgers brought home to DALY
665at once the revolutionizing impact of the new technology on his previous
666methods of research.  Had this disk been available two or three years
667earlier, DALY contended, when he was engaged in preparing a commentary on
668Book 10 of Virgil's Aeneid for Cambridge University Press, he would not
669have required a forty-eight-square-foot table on which to spread the
670numerous, most frequently consulted items, including some ten or twelve
671concordances to key Latin authors, an almost equal number of lexica to
672authors who lacked concordances, and where either lexica or concordances
673were lacking, numerous editions of authors antedating and postdating Virgil.
674
675Nor, when checking each of the average six to seven words contained in
676the Virgilian hexameter for its usage elsewhere in Virgil's works or
677other Latin authors, would DALY have had to maintain the laborious
678mechanical process of flipping through these concordances, lexica, and
679editions each time.  Nor would he have had to frequent as often the
680Milton S. Eisenhower Library at the Johns Hopkins University to consult
681the Thesaurus Linguae Latinae.  Instead of devoting countless hours, or
682the bulk of his research time, to gathering data concerning Virgil's use
683of words, DALY--now freed by PHI's Latin authors disk from the
684tyrannical, yet in some ways paradoxically happy scholarly drudgery--
685would have been able to devote that same bulk of time to analyzing and
686interpreting Virgilian verbal usage.
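
As a rough illustration of the kind of lookup a printed concordance
supports, the sketch below lists each occurrence of a word together with
a few words of context on either side.  It is a minimal Python example
written for this purpose, not the search software of the IBYCUS system or
of the PHI disk; the function name keyword_in_context and the one-line
sample (the opening line of the Aeneid) are chosen here for illustration.

```python
import re


def keyword_in_context(text: str, keyword: str, width: int = 2) -> list[str]:
    """Return every occurrence of `keyword` with up to `width` words
    of context on each side (a simple keyword-in-context listing)."""
    # Lowercase and split into word tokens so matching ignores
    # capitalization and punctuation.
    words = re.findall(r"\w+", text.lower())
    target = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        if word == target:
            left = words[max(0, i - width):i]
            right = words[i + 1:i + 1 + width]
            hits.append(" ".join(left + [f"[{word}]"] + right))
    return hits


# Illustrative sample only: the first line of Virgil's Aeneid.
sample = "Arma virumque cano, Troiae qui primus ab oris"
hits = keyword_in_context(sample, "cano")
print(hits)
```

Run against a full corpus rather than a single line, the same loop would
return every passage in which the word appears, which is the task DALY
describes performing by hand with concordances and lexica.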
687
688Citing Theodore Brunner, Gregory Crane, Elli MYLONAS, and Avra MICHELSON,
689DALY argued that this reversal in his style of work, made possible by the
690new technology, would perhaps have resulted in better, more productive
691research.  Indeed, even in the course of his browsing the Latin authors
692disk at Rutgers, its powerful search, retrieval, and highlighting
693capabilities suggested to him several new avenues of research into
694Virgil's use of sound effects.  This anecdotal account, DALY maintained,
695may serve to illustrate in part the sudden and radical transformation
696being wrought in the ways scholars work.
697
698                                 ******
699
700++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
701MICHELSON * Elements related to scholarship and technology * Electronic
702texts within the context of broader trends within information technology
703and scholarly communication * Evaluation of the prospects for the use of
704electronic texts * Relationship of electronic texts to processes of
705scholarly communication in humanities research * New exchange formats
706created by scholars * Projects initiated to increase scholarly access to
707converted text * Trend toward making electronic resources available
708through research and education networks * Changes taking place in
709scholarly communication among humanities scholars * Network-mediated
710scholarship transforming traditional scholarly practices * Key
711information technology trends affecting the conduct of scholarly
712communication over the next decade * The trend toward end-user computing
713* The trend toward greater connectivity * Effects of these trends * Key
714transformations taking place * Summary of principal arguments *
715++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
716
717Avra MICHELSON, Archival Research and Evaluation Staff, National Archives
718and Records Administration (NARA), argued that establishing who will use
719electronic texts and what they will use them for involves a consideration
720of both information technology and scholarship trends.  This
721consideration includes several elements related to scholarship and
722technology:  1) the key trends in information technology that are most
723relevant to scholarship; 2) the key trends in the use of currently
724available technology by scholars in the nonscientific community; and 3)
725the relationship between these two very distinct but interrelated trends.
726The investment in understanding this relationship being made by
727information providers, technologists, and public policy developers, as
728well as by scholars themselves, seems to be pervasive and growing,
729MICHELSON contended.  She drew on collaborative work with Jeff Rothenberg
730on the scholarly use of technology.
731
732MICHELSON sought to place the phenomenon of electronic texts within the
733context of broader trends within information technology and scholarly
734communication.  She argued that electronic texts are of most use to
735researchers to the extent that the researchers' working context (i.e.,
736their relevant bibliographic sources, collegial feedback, analytic tools,
737notes, drafts, etc.), along with their field's primary and secondary
738sources, also is accessible in electronic form and can be integrated in
739ways that are unique to the on-line environment.
740
741Evaluation of the prospects for the use of electronic texts includes two
742elements:  1) an examination of the ways in which researchers currently
743are using electronic texts along with other electronic resources, and 2)
744an analysis of key information technology trends that are affecting the
745long-term conduct of scholarly communication.  MICHELSON limited her
746discussion of the use of electronic texts to the practices of humanists
747and noted that the scientific community was outside the panel's overview.
748
749MICHELSON examined the nature of the current relationship of electronic
750texts in particular, and electronic resources in general, to what she
751maintained were, essentially, five processes of scholarly communication
752in humanities research.  Researchers 1) identify sources, 2) communicate
753with their colleagues, 3) interpret and analyze data, 4) disseminate
754their research findings, and 5) prepare curricula to instruct the next
generation of scholars and students.  This examination would produce a
clearer understanding of the synergy among these five processes, which
is what leads the use of electronic resources for one process to
stimulate their use for other processes of scholarly communication.
759
For the first process of scholarly communication, the identification of
sources, MICHELSON remarked on the opportunity scholars now enjoy to
supplement traditional word-of-mouth searches for sources among their
colleagues with new forms of electronic searching.  So, for example,
instead of having to visit the library, researchers are able to explore
descriptions of holdings from their offices.  Furthermore, if their own
institutions' holdings prove insufficient, scholars can access more than
200 major American library catalogues over Internet, including those of
the universities of California, Michigan, Pennsylvania, and Wisconsin.
Direct access to the bibliographic databases offers intellectual
empowerment to scholars by presenting a comprehensive means of browsing
through libraries from their homes and offices at their convenience.
772
773The second process of communication involves communication among
774scholars.  Beyond the most common methods of communication, scholars are
775using E-mail and a variety of new electronic communications formats
776derived from it for further academic interchange.  E-mail exchanges are
777growing at an astonishing rate, reportedly 15 percent a month.  They
778currently constitute approximately half the traffic on research and
779education networks.  Moreover, the global spread of E-mail has been so
780rapid that it is now possible for American scholars to use it to
781communicate with colleagues in close to 140 other countries.
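
To give a sense of scale for the reported growth figure, the short
calculation below compounds 15 percent monthly growth over a year.  It
assumes, purely for illustration, that the rate stays constant and
compounds monthly; the presentation reports only the monthly rate.

```python
import math


def compound_growth(monthly_rate: float, months: int) -> float:
    """Growth factor after `months` of constant compound monthly growth."""
    return (1.0 + monthly_rate) ** months


# Assumption: the reported 15 percent monthly rate holds for a full year.
annual_factor = compound_growth(0.15, 12)

# Months needed for traffic to double at the same constant rate.
doubling_months = math.log(2) / math.log(1.15)

print(f"growth over 12 months: about {annual_factor:.2f} times")
print(f"doubling time: about {doubling_months:.1f} months")
```

Under that assumption, traffic would grow a little more than fivefold in
a year and double roughly every five months.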
782
783Other new exchange formats created by scholars and operating on Internet
784include more than 700 conferences, with about 80 percent of these devoted
785to topics in the social sciences and humanities.  The rate of growth of
these scholarly electronic conferences also is astonishing.  From 1990 to
1991, 200 new conferences were identified on Internet.  From October 1991
to June 1992, an additional 150 conferences in the social sciences and
humanities were added to this directory of listings.  Scholars have
established conferences in virtually every field, within every different
discipline.  For example, there are currently close to 600 active social
science and humanities conferences on topics such as art and
793architecture, ethnomusicology, folklore, Japanese culture, medical
794education, and gifted and talented education.  The appeal to scholars of
795communicating through these conferences is that, unlike any other medium,
796electronic conferences today provide a forum for global communication
797with peers at the front end of the research process.
798
799Interpretation and analysis of sources constitutes the third process of
800scholarly communication that MICHELSON discussed in terms of texts and
801textual resources.  The methods used to analyze sources fall somewhere on
802a continuum from quantitative analysis to qualitative analysis.
803Typically, evidence is culled and evaluated using methods drawn from both
804ends of this continuum.  At one end, quantitative analysis involves the
805use of mathematical processes such as a count of frequencies and
806distributions of occurrences or, on a higher level, regression analysis.
807At the other end of the continuum, qualitative analysis typically
808involves nonmathematical processes oriented toward language
809interpretation or the building of theory.  Aspects of this work involve
810the processing--either manual or computational--of large and sometimes
811massive amounts of textual sources, although the use of nontextual
812sources as evidence, such as photographs, sound recordings, film footage,
813and artifacts, is significant as well.
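
The quantitative end of this continuum can be sketched in a few lines.
The Python example below counts word frequencies in a passage, the
simplest of the mathematical processes mentioned above.  The function
name word_frequencies and the sample sentence are hypothetical, invented
here for illustration; no particular project's tools are implied.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count how often each word occurs, ignoring case and punctuation."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


# Hypothetical sample text, made up for this illustration.
sample = "The scholar reads the text, and the text rewards the scholar."
freqs = word_frequencies(sample)

# List the three most frequent words with their counts.
for word, count in freqs.most_common(3):
    print(f"{word}: {count}")
```

A distribution of occurrences across several texts would follow from
running the same count on each text and comparing the results.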
814
815Scholars have discovered that many of the methods of interpretation and
816analysis that are related to both quantitative and qualitative methods
817are processes that can be performed by computers.  For example, computers
818can count.  They can count brush strokes used in a Rembrandt painting or
819perform regression analysis for understanding cause and effect.  By means
820of advanced technologies, computers can recognize patterns, analyze text,
821and model concepts.  Furthermore, computers can complete these processes
822faster with more sources and with greater precision than scholars who
823must rely on manual interpretation of data.  But if scholars are to use
824computers for these processes, source materials must be in a form
825amenable to computer-assisted analysis.  For this reason many scholars,
826once they have identified the sources that are key to their research, are
827converting them to machine-readable form.  Thus, a representative example
828of the numerous textual conversion projects organized by scholars around
829the world in recent years to support computational text analysis is the
830TLG, the Thesaurus Linguae Graecae.  This project is devoted to
831converting the extant ancient texts of classical Greece.  (Editor's note:
according to the TLG Newsletter of May 1992, TLG was in use in thirty-two
833different countries.  This figure updates MICHELSON's previous count by one.)
834
835The scholars performing these conversions have been asked to recognize
836that the electronic sources they are converting for one use possess value
837for other research purposes as well.  As a result, during the past few
838years, humanities scholars have initiated a number of projects to
839increase scholarly access to converted text.  So, for example, the Text
840Encoding Initiative (TEI), about which more is said later in the program,
841was established as an effort by scholars to determine standard elements
842and methods for encoding machine-readable text for electronic exchange.
843In a second effort to facilitate the sharing of converted text, scholars
844have created a new institution, the Center for Electronic Texts in the
845Humanities (CETH).  The center estimates that there are 8,000 series of
846source texts in the humanities that have been converted to
847machine-readable form worldwide.  CETH is undertaking an international
848search for converted text in the humanities, compiling it into an
849electronic library, and preparing bibliographic descriptions of the
850sources for the Research Libraries Information Network's (RLIN)
851machine-readable data file.  The library profession has begun to initiate
852large conversion projects as well, such as American Memory.
853
854While scholars have been making converted text available to one another,
855typically on disk or on CD-ROM, the clear trend is toward making these
856resources available through research and education networks.  Thus, the
857American and French Research on the Treasury of the French Language
858(ARTFL) and the Dante Project are already available on Internet.
859MICHELSON summarized this section on interpretation and analysis by
860noting that:  1) increasing numbers of humanities scholars in the library
861community are recognizing the importance to the advancement of
862scholarship of retrospective conversion of source materials in the arts
863and humanities; and 2) there is a growing realization that making the
864sources available on research and education networks maximizes their
865usefulness for the analysis performed by humanities scholars.
866
867The fourth process of scholarly communication is dissemination of
868research findings, that is, publication.  Scholars are using existing
869research and education networks to engineer a new type of publication:
870scholarly-controlled journals that are electronically produced and
871disseminated.  Although such journals are still emerging as a
872communication format, their number has grown, from approximately twelve
873to thirty-six during the past year (July 1991 to June 1992).  Most of
874these electronic scholarly journals are devoted to topics in the
875humanities.  As with network conferences, scholarly enthusiasm for these
876electronic journals stems from the medium's unique ability to advance
877scholarship in a way that no other medium can do by supporting global
878feedback and interchange, practically in real time, early in the research
process.  Beyond scholarly journals, MICHELSON remarked on the delivery
of commercial full-text products, such as articles in professional
journals, newsletters, magazines, wire services, and reference sources.
These are
882being delivered via on-line local library catalogues, especially through
883CD-ROMs.  Furthermore, according to MICHELSON, there is general optimism
884that the copyright and fees issues impeding the delivery of full text on
885existing research and education networks soon will be resolved.
886
887The final process of scholarly communication is curriculum development
888and instruction, and this involves the use of computer information
889technologies in two areas.  The first is the development of
890computer-oriented instructional tools, which includes simulations,
891multimedia applications, and computer tools that are used to assist in
892the analysis of sources in the classroom, etc.  The Perseus Project, a
893database that provides a multimedia curriculum on classical Greek
894civilization, is a good example of the way in which entire curricula are
being recast using information technologies.  It is anticipated that the
current difficulty in electronically exchanging computer-based
instructional software, which in turn makes it difficult for one scholar
to build upon the work of others, will be resolved before too long.
899Stand-alone curricular applications that involve electronic text will be
900sharable through networks, reinforcing their significance as intellectual
901products as well as instructional tools.
902
903The second aspect of electronic learning involves the use of research and
904education networks for distance education programs.  Such programs
905interactively link teachers with students in geographically scattered
906locations and rely on the availability of electronic instructional
907resources.  Distance education programs are gaining wide appeal among
908state departments of education because of their demonstrated capacity to
909bring advanced specialized course work and an array of experts to many
910classrooms.  A recent report found that at least 32 states operated at
911least one statewide network for education in 1991, with networks under
912development in many of the remaining states.
913
914MICHELSON summarized this section by noting two striking changes taking
915place in scholarly communication among humanities scholars.  First is the
916extent to which electronic text in particular, and electronic resources
917in general, are being infused into each of the five processes described
918above.  As mentioned earlier, there is a certain synergy at work here.
The use of electronic resources for one process tends to stimulate their
use for other processes, because the chief course of movement is toward a
921comprehensive on-line working context for humanities scholars that
922includes on-line availability of key bibliographies, scholarly feedback,
923sources, analytical tools, and publications.  MICHELSON noted further
924that the movement toward a comprehensive on-line working context for
925humanities scholars is not new.  In fact, it has been underway for more
926than forty years in the humanities, since Father Roberto Busa began
927developing an electronic concordance of the works of Saint Thomas Aquinas
928in 1949.  What we are witnessing today, MICHELSON contended, is not the
929beginning of this on-line transition but, for at least some humanities
930scholars, the turning point in the transition from a print to an
931electronic working context.  Coinciding with the on-line transition, the
932second striking change is the extent to which research and education
933networks are becoming the new medium of scholarly communication.  The
existing Internet and the pending National Research and Education Network
935(NREN) represent the new meeting ground where scholars are going for
936bibliographic information, scholarly dialogue and feedback, the most
937current publications in their field, and high-level educational
938offerings.  Traditional scholarly practices are undergoing tremendous
939transformations as a result of the emergence and growing prominence of
940what is called network-mediated scholarship.
941
942MICHELSON next turned to the second element of the framework she proposed
943at the outset of her talk for evaluating the prospects for electronic
944text, namely the key information technology trends affecting the conduct
945of scholarly communication over the next decade:  1) end-user computing
946and 2) connectivity.
947
948End-user computing means that the person touching the keyboard, or
949performing computations, is the same as the person who initiates or
950consumes the computation.  The emergence of personal computers, along
951with a host of other forces, such as ubiquitous computing, advances in
952interface design, and the on-line transition, is prompting the consumers
953of computation to do their own computing, and is thus rendering obsolete
954the traditional distinction between end users and ultimate users.
955
956The trend toward end-user computing is significant to consideration of
957the prospects for electronic texts because it means that researchers are
958becoming more adept at doing their own computations and, thus, more
competent in the use of electronic media.  As researchers bypass
programmer intermediaries, computation is becoming central to their
thought process.  This direct involvement in computing is changing the
962researcher's perspective on the nature of research itself, that is, the
963kinds of questions that can be posed, the analytical methodologies that
964can be used, the types and amount of sources that are appropriate for
965analyses, and the form in which findings are presented.  The trend toward
966end-user computing means that, increasingly, electronic media and
967computation are being infused into all processes of humanities
968scholarship, inspiring remarkable transformations in scholarly
969communication.
970
971The trend toward greater connectivity suggests that researchers are using
972computation increasingly in network environments.  Connectivity is
973important to scholarship because it erases the distance that separates
974students from teachers and scholars from their colleagues, while allowing
975users to access remote databases, share information in many different
976media, connect to their working context wherever they are, and
977collaborate in all phases of research.
978
979The combination of the trend toward end-user computing and the trend
980toward connectivity suggests that the scholarly use of electronic
981resources, already evident among some researchers, will soon become an
982established feature of scholarship.  The effects of these trends, along
983with ongoing changes in scholarly practices, point to a future in which
984humanities researchers will use computation and electronic communication
985to help them formulate ideas, access sources, perform research,
986collaborate with colleagues, seek peer review, publish and disseminate
987results, and engage in many other professional and educational activities.
988
989In summary, MICHELSON emphasized four points:  1) A portion of humanities
990scholars already consider electronic texts the preferred format for
991analysis and dissemination.  2) Scholars are using these electronic
992texts, in conjunction with other electronic resources, in all the
993processes of scholarly communication.  3) The humanities scholars'
994working context is in the process of changing from print technology to
995electronic technology, in many ways mirroring transformations that have
996occurred or are occurring within the scientific community.  4) These
997changes are occurring in conjunction with the development of a new
998communication medium:  research and education networks that are
999characterized by their capacity to advance scholarship in a wholly unique
1000way.
1001
MICHELSON also reiterated her three principal arguments:  1) Electronic
texts are best understood in terms of their relationship to other
electronic resources and the growing prominence of network-mediated
scholarship.  2) The prospects for electronic texts lie in their capacity
to be integrated into the on-line network of electronic resources that
constitute the new working context for scholars.  3) Retrospective
conversion of portions of the scholarly record should be a key strategy
as information providers respond to changes in scholarly communication
practices.
1010
1011                                 ******
1012
1013+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1014VECCIA * AM's evaluation project and public users of electronic resources
1015* AM and its design * Site selection and evaluating the Macintosh
1016implementation of AM * Characteristics of the six public libraries
1017selected * Characteristics of AM's users in these libraries * Principal
1018ways AM is being used *
1019+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1020
1021Susan VECCIA, team leader, and Joanne FREEMAN, associate coordinator,
1022American Memory, Library of Congress, gave a joint presentation.  First,
1023by way of introduction, VECCIA explained her and FREEMAN's roles in
1024American Memory (AM).  Serving principally as an observer, VECCIA has
1025assisted with the evaluation project of AM, placing AM collections in a
1026variety of different sites around the country and helping to organize and
1027implement that project.  FREEMAN has been an associate coordinator of AM
1028and has been involved principally with the interpretative materials,
1029preparing some of the electronic exhibits and printed historical
1030information that accompanies AM and that is requested by users.  VECCIA
1031and FREEMAN shared anecdotal observations concerning AM with public users
1032of electronic resources.  Notwithstanding a fairly structured evaluation
1033in progress, both VECCIA and FREEMAN chose not to report on specifics in
1034terms of numbers, etc., because they felt it was too early in the
1035evaluation project to do so.
1036
1037AM is an electronic archive of primary source materials from the Library
1038of Congress, selected collections representing a variety of formats--
1039photographs, graphic arts, recorded sound, motion pictures, broadsides,
1040and soon, pamphlets and books.  In terms of the design of this system,
1041the interpretative exhibits have been kept separate from the primary
1042resources, with good reason.  Accompanying this collection are printed
1043documentation and user guides, as well as guides that FREEMAN prepared for
1044teachers so that they may begin using the content of the system at once.
1045
1046VECCIA described the evaluation project before talking about the public
1047users of AM, limiting her remarks to public libraries, because FREEMAN
1048would talk more specifically about schools from kindergarten to twelfth
grade (K-12).  Having started in spring 1991, the evaluation currently
1050involves testing of the Macintosh implementation of AM.  Since the
1051primary goal of this evaluation is to determine the most appropriate
1052audience or audiences for AM, very different sites were selected.  This
1053makes evaluation difficult because of the varying degrees of technology
1054literacy among the sites.  AM is situated in forty-four locations, of
1055which six are public libraries and sixteen are schools.  Represented
1056among the schools are elementary, junior high, and high schools.
1057District offices also are involved in the evaluation, which will
1058conclude in summer 1993.
1059
1060VECCIA focused the remainder of her talk on the six public libraries, one
1061of which doubles as a state library.  They represent a range of
1062geographic areas and a range of demographic characteristics.  For
1063example, three are located in urban settings, two in rural settings, and
1064one in a suburban setting.  A range of technical expertise is to be found
1065among these facilities as well.  For example, one is an "Apple library of
1066the future," while two others are rural one-room libraries--in one, AM
1067sits at the front desk next to a tractor manual.
1068
1069All public libraries have been extremely enthusiastic, supportive, and
1070appreciative of the work that AM has been doing.  VECCIA characterized
1071various users:  Most users in public libraries describe themselves as
1072general readers; of the students who use AM in the public libraries,
1073those in fourth grade and above seem most interested.  Public libraries
1074in rural sites tend to attract retired people, who have been highly
1075receptive to AM.  Users tend to fall into two additional categories:
1076people interested in the content and historical connotations of these
1077primary resources, and those fascinated by the technology.  The format
1078receiving the most comments has been motion pictures.  The adult users in
1079public libraries are more comfortable with IBM computers, whereas young
1080people seem comfortable with either IBM or Macintosh, although most of
1081them seem to come from a Macintosh background.  This same tendency is
1082found in the schools.
1083
1084What kinds of things do users do with AM?  In a public library there are
1085two main goals or ways that AM is being used:  as an individual learning
1086tool, and as a leisure activity.  Adult learning was one area that VECCIA
1087would highlight as a possible application for a tool such as AM.  She
1088described a patron of a rural public library who comes in every day on
1089his lunch hour and literally reads AM, methodically going through the
1090collection image by image.  At the end of his hour he makes an electronic
1091bookmark, puts it in his pocket, and returns to work.  The next day he
1092comes in and resumes where he left off.  Interestingly, this man had
1093never been in the library before he used AM.  In another small, rural
1094library, the coordinator reports that AM is a popular activity for some
1095of the older, retired people in the community, who ordinarily would not
use "those things"--computers.  Another example of adult learning in
1097public libraries is book groups, one of which, in particular, is using AM
1098as part of its reading on industrialization, integration, and urbanization
1099in the early 1900s.
1100
1101One library reports that a family is using AM to help educate their
1102children.  In another instance, individuals from a local museum came in
1103to use AM to prepare an exhibit on toys of the past.  These two examples
1104emphasize the mission of the public library as a cultural institution,
1105reaching out to people who do not have the same resources available to
1106those who live in a metropolitan area or have access to a major library.
1107One rural library reports that junior high school students in large
1108numbers came in one afternoon to use AM for entertainment.  A number of
1109public libraries reported great interest among postcard collectors in the
1110Detroit collection, which was essentially a collection of images used on
1111postcards around the turn of the century.  Train buffs are similarly
1112interested because that was a time of great interest in railroading.
1113People, it was found, relate to things that they know of firsthand.  For
1114example, in both rural public libraries where AM was made available,
1115observers reported that the older people with personal remembrances of
1116the turn of the century were gravitating to the Detroit collection.
1117These examples served to underscore MICHELSON's observation re the
1118integration of electronic tools and ideas--that people learn best when
1119the material relates to something they know.
1120
1121VECCIA made the final point that in many cases AM serves as a
1122public-relations tool for the public libraries that are testing it.  In
1123one case, AM is being used as a vehicle to secure additional funding for
1124the library.  In another case, AM has served as an inspiration to the
1125staff of a major local public library in the South to think about ways to
1126make its own collection of photographs more accessible to the public.
1127
1128                                  ******
1129
1130+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1131FREEMAN * AM and archival electronic resources in a school environment *
1132Questions concerning context * Questions concerning the electronic format
1133itself * Computer anxiety * Access and availability of the system *
1134Hardware * Strengths gained through the use of archival resources in
1135schools *
1136+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1137
1138Reiterating an observation made by VECCIA, that AM is an archival
1139resource made up of primary materials with very little interpretation,
1140FREEMAN stated that the project has attempted to bridge the gap between
1141these bare primary materials and a school environment, and in that cause
1142has created guided introductions to AM collections.  Loud demand from the
1143educational community, chiefly from teachers working with the upper
1144grades of elementary school through high school, greeted the announcement
1145that AM would be tested around the country.
1146
1147FREEMAN reported not only on what was learned about AM in a school
1148environment, but also on several universal questions that were raised
1149concerning archival electronic resources in schools.  She discussed
1150several strengths of this type of material in a school environment as
1151opposed to a highly structured resource that offers a limited number of
1152paths to follow.
1153
1154FREEMAN first raised several questions about using AM in a school
1155environment.  There is often some difficulty in developing a sense of
1156what the system contains.  Many students sit down at a computer resource
1157and assume that, because AM comes from the Library of Congress, all of
1158American history is now at their fingertips.  As a result of that sort of
1159mistaken judgment, some students are known to conclude that AM contains
1160nothing of use to them when they look for one or two things and do not
1161find them.  It is difficult to discover that middle ground where one has
1162a sense of what the system contains.  Some students grope toward the idea
1163of an archive, a new idea to them, since they have not previously
1164experienced what it means to have access to a vast body of somewhat
1165random information.
1166
1167Other questions raised by FREEMAN concerned the electronic format itself.
1168For instance, in a school environment it is often difficult both for
1169teachers and students to gain a sense of what it is they are viewing.
1170They understand that it is a visual image, but they do not necessarily
1171know that it is a postcard from the turn of the century, a panoramic
1172photograph, or even machine-readable text of an eighteenth-century
1173broadside, a twentieth-century printed book, or a nineteenth-century
1174diary.  That distinction is often difficult for people in a school
1175environment to grasp.  Because of that, it occasionally becomes difficult
1176to draw conclusions from what one is viewing.
1177
1178FREEMAN also noted the obvious fear of the computer, which constitutes a
1179difficulty in using an electronic resource.  Though students in general
1180did not suffer from this anxiety, several older students feared that they
1181were computer-illiterate, an assumption that became self-fulfilling when
1182they searched for something but failed to find it.  FREEMAN said she
1183believed that some teachers also fear computer resources, because they
1184believe they lack complete control.  FREEMAN related the example of
1185teachers shooing away students because it was not their time to use the
1186system.  This was a case in which the situation had to be extremely
1187structured so that the teachers would not feel that they had lost their
1188grasp on what the system contained.
1189
1190A final question raised by FREEMAN concerned access and availability of
1191the system.  She noted the occasional existence of a gap in communication
1192between school librarians and teachers.  Often AM sits in a school
1193library and the librarian is the person responsible for monitoring the
1194system.  Teachers do not always take into their world new library
1195resources about which the librarian is excited.  Indeed, at the sites
1196where AM had been used most effectively within a library, the librarian
1197was required to go to specific teachers and instruct them in its use.  As
1198a result, several AM sites will have in-service sessions over a summer,
1199in the hope that perhaps, with a more individualized link, teachers will
1200be more likely to use the resource.
1201
1202A related issue in the school context concerned the number of
1203workstations available at any one location.  Centralization of equipment
1204at the district level, with teachers invited to download things and walk
1205away with them, proved unsuccessful because the hours these offices were
1206open were also school hours.
1207
1208Another issue was hardware.  As VECCIA observed, a range of sites exists,
1209some technologically advanced and others essentially acquiring their
1210first computer for the primary purpose of using it in conjunction with
1211AM's testing.  Users at technologically sophisticated sites want even
1212more sophisticated hardware, so that they can perform even more
1213sophisticated tasks with the materials in AM.  But once they acquire a
1214newer piece of hardware, they must learn how to use that also; at an
1215unsophisticated site it takes an extremely long time simply to become
1216accustomed to the computer, not to mention the program offered with the
1217computer.  All of these small issues raise one large question, namely,
1218are systems like AM truly rewarding in a school environment, or do they
1219simply act as innovative toys that do little more than spark interest?
1220
1221FREEMAN contended that the evaluation project has revealed several strengths
1222that were gained through the use of archival resources in schools, including:
1223
1224     * Psychic rewards from using AM as a vast, rich database, with
1225     teachers assigning various projects to students--oral presentations,
1226     written reports, a documentary, a turn-of-the-century newspaper--
1227     projects that start with the materials in AM but are completed using
1228     other resources; AM thus is used as a research tool in conjunction
1229     with other electronic resources, as well as with books and items in
1230     the library where the system is set up.
1231
1232     * Students are acquiring computer literacy in a humanities context.
1233
1234     * This sort of system is overcoming the isolation between disciplines
1235     that often exists in schools.  For example, many English teachers are
1236     requiring their students to write papers on historical topics
1237     represented in AM.  Numerous teachers have reported that their
1238     students are learning critical thinking skills using the system.
1239
1240     * On a broader level, AM is introducing primary materials, not only
1241     to students but also to teachers, in an environment where often
1242     simply none exist--an exciting thing for the students because it
1243     helps them learn to conduct research, to interpret, and to draw
1244     their own conclusions.  In learning to conduct research and what it
1245     means, students are motivated to seek knowledge.  That relates to
1246     another positive outcome--a high level of personal involvement of
1247     students with the materials in this system and greater motivation to
1248     conduct their own research and draw their own conclusions.
1249
1250     * Perhaps the most ironic strength of these kinds of archival
1251     electronic resources is that many of the teachers AM interviewed
1252     were desperate, it is no exaggeration to say, not only for primary
1253     materials but for unstructured primary materials.  These would, they
1254     thought, foster personally motivated research, exploration, and
1255     excitement in their students.  Indeed, these materials have done
1256     just that.  Ironically, however, this lack of structure produces
1257     some of the confusion to which the newness of these kinds of
1258     resources may also contribute.  The key to effective use of archival
1259     products in a school environment is a clear, effective introduction
1260     to the system and to what it contains.
1261
1262                                 ******
1263
1264+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1265DISCUSSION * Nothing known, quantitatively, about the number of
1266humanities scholars who must see the original versus those who would
1267settle for an edited transcript, or about the ways in which humanities
1268scholars are using information technology * Firm conclusions concerning
1269the manner and extent of the use of supporting materials in print
1270provided by AM to await completion of evaluative study * A listener's
1271reflections on additional applications of electronic texts * Role of
1272electronic resources in teaching elementary research skills to students *
1273+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1274
1275During the discussion that followed the presentations by MICHELSON,
1276VECCIA, and FREEMAN, additional points emerged.
1277
1278LESK asked if MICHELSON could give any quantitative estimate of the
1279number of humanities scholars who must see or want to see the original,
1280or the best possible version of the material, versus those who typically
1281would settle for an edited transcript.  While unable to provide a figure,
1282she offered her impressions as an archivist who has done some reference
1283work and has discussed this issue with other archivists who perform
1284reference, that those who use archives and those who use primary sources
1285for what would be considered very high-level scholarly research, as
1286opposed to, say, undergraduate papers, were few in number, especially
1287given the public interest in using primary sources to conduct
1288genealogical or avocational research and the kind of professional
1289research done by people in private industry or the federal government.
1290More important in MICHELSON's view was that, quantitatively, nothing is
1291known about the ways in which, for example, humanities scholars are using
1292information technology.  No studies exist to offer guidance in creating
1293strategies.  The most recent study was conducted in 1985 by the American
1294Council of Learned Societies (ACLS), and what it showed was that 50
1295percent of humanities scholars at that time were using computers.  That
1296constitutes the extent of our knowledge.
1297
1298Concerning AM's strategy for orienting people toward the scope of
1299electronic resources, FREEMAN could offer no hard conclusions at this
1300point, because she and her colleagues were still waiting to see,
1301particularly in the schools, what has been made of their efforts.  Within
1302the system, however, AM has provided what are called electronic exhibits--
1303such as introductions to time periods and materials--and these are
1304intended to offer a student user a sense of what a broadside is and what
1305it might tell her or him.  But FREEMAN conceded that the project staff
1306would have to talk with students next year, after teachers have had a
1307summer to use the materials, and attempt to discover what the students
1308were learning from the materials.  In addition, FREEMAN described
1309supporting materials in print provided by AM at the request of local
1310teachers during a meeting held at LC.  These included time lines,
1311bibliographies, and other materials that could be reproduced on a
1312photocopier in a classroom.  Teachers could walk away with and use these,
1313and in this way gain a better understanding of the contents.  But again,
1314reaching firm conclusions concerning the manner and extent of their use
1315would have to wait until next year.
1316
1317As to the changes she saw occurring at the National Archives and Records
1318Administration (NARA) as a result of the increasing emphasis on
1319technology in scholarly research, MICHELSON stated that NARA at this
1320point was absorbing the report by her and Jeff Rothenberg addressing
1321strategies for the archival profession in general, although not for the
1322National Archives specifically.  NARA is just beginning to establish its
1323role and what it can do.  In terms of changes and initiatives that NARA
1324can take, no clear response could be given at this time.
1325
1326GREENFIELD remarked on two trends mentioned in the session.  Reflecting on
1327DALY's opening comments on how he could have used a Latin collection of
1328text in an electronic form, he said that at first he thought most scholars
1329would be unwilling to do that.  But as he thought of that in terms of the
1330original meaning of research--that is, having already mastered these texts,
1331researching them for critical and comparative purposes--for the first time,
1332the electronic format made a lot of sense.  GREENFIELD could envision
1333growing numbers of scholars learning the new technologies for that very
1334aspect of their scholarship and for convenience's sake.
1335
1336Listening to VECCIA and FREEMAN, GREENFIELD thought of an additional
1337application of electronic texts.  He realized that AM could be used as a
1338guide to lead someone to original sources.  Students cannot be expected
1339to have mastered these sources, things they have never known about
1340before.  Thus, AM is leading them, in theory, to a vast body of
1341information and giving them a superficial overview of it, enabling them
1342to select parts of it.  GREENFIELD asked if any evidence exists that this
1343resource will indeed teach the new user, the K-12 students, how to do
1344research.  Scholars already know how to do research and are applying
1345these new tools.  But he wondered why students would go beyond picking
1346out things that were most exciting to them.
1347
1348FREEMAN conceded the correctness of GREENFIELD's observation as applied
1349to a school environment.  The risk is that a student would sit down at a
1350system, play with it, find some things of interest, and then walk away.
1351But in the relatively controlled situation of a school library, much will
1352depend on the instructions a teacher or a librarian gives a student.  She
1353viewed the situation not as one of fine-tuning research skills but of
1354involving students at a personal level in understanding and researching
1355things.  Given the guidance one can receive at school, it then becomes
1356possible to teach elementary research skills to students, which in fact
1357one particular librarian said she was teaching her fifth graders.
1358FREEMAN concluded that introducing the idea of following one's own path
1359of inquiry, which is essentially what research entails, involves more
1360than teaching specific skills.  To these comments VECCIA added the
1361observation that the individual teacher and the use of a creative
1362resource, rather than AM itself, seemed to make the key difference.
1363Some schools and some teachers are making excellent use of the nature
1364of critical thinking and teaching skills, she said.
1365
1366Concurring with these remarks, DALY closed the session with the thought that
1367the more that producers produced for teachers and for scholars to use with
1368their students, the more successful their electronic products would prove.
1369
1370                                 ******
1371
1372SESSION II.  SHOW AND TELL
1373
1374Jacqueline HESS, director, National Demonstration Laboratory, served as
1375moderator of the "show-and-tell" session.  She noted that a
1376question-and-answer period would follow each presentation.
1377
1378+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1379MYLONAS * Overview and content of Perseus * Perseus' primary materials
1380exist in a system-independent, archival form * A concession * Textual
1381aspects of Perseus * Tools to use with the Greek text * Prepared indices
1382and full-text searches in Perseus * English-Greek word search leads to
1383close study of words and concepts * Navigating Perseus by tracing down
1384indices * Using the iconography to perform research *
1385+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1386
1387Elli MYLONAS, managing editor, Perseus Project, Harvard University, first
1388gave an overview of Perseus, a large, collaborative effort based at
1389Harvard University but with contributors and collaborators located at
1390numerous universities and colleges in the United States (e.g., Bowdoin,
1391Maryland, Pomona, Chicago, Virginia).  Funded primarily by the
1392Annenberg/CPB Project, with additional funding from Apple, Harvard, and
1393the Packard Humanities Institute, among others, Perseus is a multimedia,
1394hypertextual database for teaching and research on classical Greek
1395civilization, which was released in February 1992 in version 1.0 and
1396distributed by Yale University Press.
1397
1398Consisting entirely of primary materials, Perseus includes ancient Greek
1399texts and translations of those texts; catalog entries--that is, museum
1400catalog entries, not library catalog entries--on vases, sites, coins,
1401sculpture, and archaeological objects; maps; and a dictionary, among
1402other sources.  A number of the objects for which catalog entries exist
1403are accompanied by thousands of color images, which
1404constitute a major feature of the database.  Perseus contains
1405approximately 30 megabytes of text, an amount that will double in
1406subsequent versions.  In addition to these primary materials, the Perseus
1407Project has been building tools for using them, making access and
1408navigation easier, the goal being to build part of the electronic
1409environment discussed earlier in the morning in which students or
1410scholars can work with their sources.
1411
1412The demonstration of Perseus will show only a fraction of the real work
1413that has gone into it, because the project had to face the dilemma of
1414what to enter when putting something into machine-readable form:  should
1415one aim for very high quality or make concessions in order to get the
1416material in?  Since Perseus decided to opt for very high quality, all of
1417its primary materials exist in a system-independent--insofar as it is
1418possible to be system-independent--archival form.  Deciding what that
1419archival form would be and attaining it required much work and thought.
1420For example, all the texts are marked up in SGML, which will be made
1421compatible with the guidelines of the Text Encoding Initiative (TEI) when
1422they are issued.
1423
1424Drawings are PostScript files, not meeting international standards, but
1425at least designed to go across platforms.  Images, or rather the real
1426archival forms, consist of the best available slides, which are being
1427digitized.  Much of the catalog material exists in database form--a form
1428that the average user could use, manipulate, and display on a personal
1429computer, but only at great cost.  Thus, this is where the concession
1430comes in:  All of this rich, well-marked-up information is stripped of
1431much of its content; the images are converted into bit-maps and the text
1432into small formatted chunks.  All this information can then be imported
1433into HyperCard and run on a mid-range Macintosh, which is what Perseus
1434users have.  This fact has made it possible for Perseus to attain wide
1435use fairly rapidly.  Without those archival forms the HyperCard version
1436being demonstrated could not be made easily, and the project would not
1437have the potential to move to other forms, machines, and software as
1438they appear.  None of that archival information is in Perseus on the CD.
1439
1440Of the numerous multimedia aspects of Perseus, MYLONAS focused on the
1441textual.  Part of what makes Perseus such a pleasure to use, MYLONAS
1442said, is this effort at seamless integration and the ability to move
1443around both visual and textual material.  Perseus also made the decision
1444not to attempt to interpret its material any more than one interprets by
1445selecting.  But, MYLONAS emphasized, Perseus is not courseware:  No
1446syllabus exists.  There is no effort to define how one teaches a topic
1447using Perseus, although the project may eventually collect papers by
1448people who have used it to teach.  Rather, Perseus aims to provide
1449primary material in a kind of electronic library, an electronic sandbox,
1450so to speak, in which students and scholars who are working on this
1451material can explore by themselves.  With that, MYLONAS demonstrated
1452Perseus, beginning with the Perseus gateway, the first thing one sees
1453upon opening Perseus--an effort in part to solve the contextualizing
1454problem--which tells the user what the system contains.
1455
1456MYLONAS demonstrated only a very small portion, beginning with primary
1457texts and running off the CD-ROM.  Having selected Aeschylus' Prometheus
1458Bound, which was viewable in Greek and English pretty much in the same
1459segments together, MYLONAS demonstrated tools to use with the Greek text,
1460something not possible with a book:  looking up the dictionary entry form
1461of an unfamiliar word in Greek after subjecting it to Perseus'
1462morphological analysis for all the texts.  After finding out about a
1463word, a user may then decide to see if it is used anywhere else in Greek.
1464Because vast amounts of indexing support all of the primary material, one
1465can find out where else all forms of a particular Greek word appear--
1466often not a trivial matter because Greek is highly inflected.  Further,
1467since the story of Prometheus has to do with the origins of sacrifice, a
1468user may wish to study and explore sacrifice in Greek literature; by
1469typing sacrifice into a small window, a user goes to the English-Greek
1470word list--something one cannot do without the computer (Perseus has
1471indexed the definitions of its dictionary)--where the string sacrifice
1472appears in the definitions of sixty-five words.  One may then find out
1473where any of those words is used in the work(s) of a particular author.
1474The English definitions are not lemmatized.
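
The English-Greek word search described above can be pictured as an
inverted index over dictionary definitions.  The following is only a
minimal sketch under stated assumptions:  the sample entries, the
transliterations, and the function names are hypothetical and are not
taken from Perseus.  It matches whole English tokens exactly, so, in
keeping with the remark that the definitions are not lemmatized, a
search for "sacrifices" would not find entries defined with "sacrifice".

```python
import re
from collections import defaultdict

# Hypothetical sample entries (transliterated headword -> English definition).
# Illustrative only; not taken from the Perseus dictionary.
LEXICON = {
    "thuo": "to offer by burning, to sacrifice",
    "thusia": "an offering, a sacrifice",
    "hiereion": "a victim, an animal for sacrifice",
    "bomos": "a raised platform, an altar",
}


def build_definition_index(lexicon):
    """Map each lowercase English token to the headwords whose definition contains it."""
    index = defaultdict(set)
    for headword, definition in lexicon.items():
        for token in re.findall(r"[a-z]+", definition.lower()):
            index[token].add(headword)
    return index


def english_to_greek(index, word):
    """Return, in sorted order, the headwords whose definition contains the exact token."""
    return sorted(index.get(word.lower(), set()))


index = build_definition_index(LEXICON)
matches = english_to_greek(index, "sacrifice")
```

Each headword returned this way could then serve as the starting point
for a search of the Greek texts themselves.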
1475
1476All of the indices driving this kind of usage were originally devised for
1477speed, MYLONAS observed; in other words, all that kind of information--
1478all forms of all words, where they exist, the dictionary form they belong
1479to--was collected into databases to expedite searching.  Then
1480it was discovered that one can do things searching in these databases
1481that could not be done searching in the full texts.  Thus, although there
1482are full-text searches in Perseus, much of the work is done behind the
1483scenes, using prepared indices.  Re the indexing that is done behind the
1484scenes, MYLONAS pointed out that without the SGML forms of the text, it
1485could not be done effectively.  Much of this indexing is based on the
1486structures that are made explicit by the SGML tagging.
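
As a rough illustration of such prepared indices, the sketch below
stores, for each inflected form, its dictionary form, and, for each
dictionary form, every place where any of its forms occurs.  It is a
minimal sketch and not the Perseus implementation:  the token list, the
work names, and the function names are hypothetical, and it assumes each
form belongs to a single dictionary form, which is a simplification for
a highly inflected language.

```python
from collections import defaultdict

# Hypothetical pre-analyzed tokens: (work, line, inflected form, dictionary form).
# In a real system the dictionary form would come from morphological analysis.
TOKENS = [
    ("Work A", 1, "luei", "luo"),
    ("Work A", 7, "elusa", "luo"),
    ("Work B", 3, "luomen", "luo"),
    ("Work B", 9, "graphei", "grapho"),
]


def build_indices(tokens):
    """Build two lookup tables: form -> dictionary form, and dictionary form -> locations."""
    form_to_lemma = {}
    lemma_to_locations = defaultdict(list)
    for work, line, form, lemma in tokens:
        form_to_lemma[form] = lemma
        lemma_to_locations[lemma].append((work, line, form))
    return form_to_lemma, lemma_to_locations


def find_all_forms(form, form_to_lemma, lemma_to_locations):
    """Given one inflected form, list every occurrence of any form of the same word."""
    lemma = form_to_lemma.get(form)
    if lemma is None:
        return []
    return lemma_to_locations[lemma]


form_to_lemma, lemma_to_locations = build_indices(TOKENS)
hits = find_all_forms("elusa", form_to_lemma, lemma_to_locations)
```

A plain full-text search for the string "elusa" would find only the one
line; the prepared index also returns the other forms of the same word.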
1487
1488It was found that one of the things many of Perseus' non-Greek-reading
1489users do is start from the dictionary and then move into the close study
1490of words and concepts via this kind of English-Greek word search, by which
1491means they might select a concept.  This exercise has been assigned to
1492students in core courses at Harvard--to study a concept by looking for the
1493English word in the dictionary, finding the Greek words, and then finding
1494the words in the Greek but, of course, reading across in the English.
1495That tells them a great deal about what a translation means as well.
1496
1497Should one also wish to see images that have to do with sacrifice, that
1498person would go to the object key word search, which allows one to
1499perform a similar kind of index retrieval on the database of
1500archaeological objects.  Without words, pictures are useless; Perseus has
1501not reached the point where it can do much with images that are not
1502cataloged.  Thus, although it is possible in Perseus with text and images
1503to navigate by knowing where one wants to end up--for example, a
1504red-figure vase from the Boston Museum of Fine Arts--one can perform this
1505kind of navigation very easily by tracing down indices.  MYLONAS
1506illustrated several generic scenes of sacrifice on vases.  The features
1507demonstrated derived from Perseus 1.0; version 2.0 will implement even
1508better means of retrieval.
1509
1510MYLONAS closed by looking at one of the pictures and noting again that
1511one can do a great deal of research using the iconography as well as the
1512texts.  For instance, students in a core course at Harvard this year were
1513highly interested in Greek concepts of foreigners and representations of
1514non-Greeks.  So they performed a great deal of research, both with texts
1515(e.g., Herodotus) and with iconography on vases and coins, on how the
1516Greeks portrayed non-Greeks.  At the same time, art historians who study
1517iconography were also interested, and were able to use this material.
1518
1519                                 ******
1520
1521+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1522DISCUSSION * Indexing and searchability of all English words in Perseus *
1523Several features of Perseus 1.0 * Several levels of customization
1524possible * Perseus used for general education * Perseus' effects on
1525education * Contextual information in Perseus * Main challenge and
1526emphasis of Perseus *
1527+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1528
1529Several points emerged in the discussion that followed MYLONAS's presentation.
1530
1531Although MYLONAS had not demonstrated Perseus' ability to cross-search
1532documents, she confirmed that all English words in Perseus are indexed
1533and can be searched.  So, for example, sacrifice could have been searched
1534in all texts, the historical essay, and all the catalog entries with
1535their descriptions--in short, in all of Perseus.
1536
1537Boolean logic is not in Perseus 1.0 but will be added to the next
1538version, although an effort is being made not to restrict Perseus to a
1539database in which one just performs searching, Boolean or otherwise.  It
1540is possible to move laterally through the documents by selecting a word
1541one is interested in and selecting an area of information one is
1542interested in and trying to look that word up in that area.
1543
1544Since Perseus was developed in HyperCard, several levels of customization
1545are possible.  Simple authoring tools exist that allow one to create
1546annotated paths through the information, which are useful for note-taking
1547and for guided tours for teaching purposes and for expository writing.
1548With a little more ingenuity it is possible to begin to add or substitute
1549material in Perseus.
1550
1551Perseus has not been used so much for classics education as for general
1552education, where it seemed to have an impact on the students in the core
1553course at Harvard (a general required course that students must take in
1554certain areas).  Students were able to use primary material much more.
1555
1556The Perseus Project has an evaluation team at the University of Maryland
1557that has been documenting Perseus' effects on education.  Perseus is very
1558popular, and anecdotal evidence indicates that it is having an effect at
1559places other than Harvard, for example, test sites at Ball State
1560University, Drury College, and numerous small places where opportunities
1561to use vast amounts of primary data may not exist.  One documented effect
1562is that archaeological, anthropological, and philological research is
1563being done by the same person instead of by three different people.
1564
1565The contextual information in Perseus includes an overview essay, a
1566fairly linear historical essay on the fifth century B.C. that provides
1567links into the primary material (e.g., Herodotus, Thucydides, and
1568Plutarch), via small gray underscoring (on the screen) of linked
1569passages.  These are handmade links into other material.
1570
1571To different extents, most of the production work was done at Harvard,
1572where the people and the equipment are located.  Much of the
1573collaborative activity involved data collection and structuring, because
1574the main challenge and the emphasis of Perseus is the gathering of
1575primary material, that is, building a useful environment for studying
1576classical Greece, collecting data, and making it useful.
1577Systems-building is definitely not the main concern.  Thus, much of the
1578work has involved writing essays, collecting information, rewriting it,
1579and tagging it.  That can be done off site.  The creative link for the
1580overview essay as well as for both systems and data was collaborative,
1581and was forged via E-mail and paper mail with professors at Pomona and
1582Bowdoin.
1583
1584                                 ******
1585
1586+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1587CALALUCA * PLD's principal focus and contribution to scholarship *
1588Various questions preparatory to beginning the project * Basis for
1589project * Basic rule in converting PLD * Concerning the images in PLD *
1590Running PLD under a variety of retrieval softwares * Encoding the
1591database a hard-fought issue * Various features demonstrated * Importance
1592of user documentation * Limitations of the CD-ROM version *
1593+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1594
1595Eric CALALUCA, vice president, Chadwyck-Healey, Inc., demonstrated a
1596software interpretation of the Patrologia Latina Database (PLD).  PLD's
1597principal focus from the beginning of the project about three-and-a-half
1598years ago was on converting Migne's Latin series, and in the end,
1599CALALUCA suggested, conversion of the text will be the major contribution
1600to scholarship.  CALALUCA stressed that, as possibly the only private
1601publishing organization at the Workshop, Chadwyck-Healey had sought no
1602federal funds or national foundation support before embarking upon the
1603project, but instead had relied upon a great deal of homework and
1604marketing to accomplish the task of conversion.
1605
1606Ever since the possibilities of computer-searching emerged, scholars
1607in the field of late ancient and early medieval studies (philosophers,
1608theologians, classicists, and those studying the history of natural law
1609and the history of the legal development of Western civilization) have
1610been longing for a fully searchable version of Western literature, for
1611example, all the texts of Augustine and Bernard of Clairvaux and
1612Boethius, not to mention all the secondary and tertiary authors.
1613
1614Various questions arose, CALALUCA said.  Should one convert Migne?
1615Should the database be encoded?  Is it necessary to do that?  How should
1616it be delivered?  What about CD-ROM?  Since this is a transitional
1617medium, why even bother to create software to run on a CD-ROM?  Since
1618everybody knows people will be networking information, why go to the
1619trouble--which is far greater with CD-ROM than with the production of
1620magnetic data?  Finally, how does one make the data available?  Can many
1621of the hurdles to using electronic information that some publishers have
1622imposed upon databases be eliminated?
1623
1624The PLD project was based on the principle that computer-searching of
1625texts is most effective when it is done with a large database.  Because
1626PLD represented a collection that serves so many disciplines across so
1627many periods, it was irresistible.
1628
1629The basic rule in converting PLD was to do no harm, to avoid the sins of
1630intrusion in such a database:  no introduction of newer editions, no
1631on-the-spot changes, no eradicating of all possible falsehoods from an
1632edition.  Thus, PLD is not the final act in electronic publishing for
1633this discipline, but simply the beginning.  The conversion of PLD has
1634evoked numerous unanticipated questions:  How will information be used?
1635What about networking?  Can the rights of a database be protected?
1636Should one protect the rights of a database?  How can it be made
1637available?
1638
1639Those converting PLD also tried to avoid the sins of omission, that is,
1640excluding portions of the collections or whole sections.  What about the
1641images?  PLD is full of images; some are extremely pious
1642nineteenth-century representations of the Fathers, while others contain
1643highly interesting elements.  The goal was to cover all the text of Migne
1644(including notes, in Greek and in Hebrew, the latter of which, in
1645particular, causes problems in creating a search structure), all the
1646indices, and even the images, which are being scanned in separately
1647searchable files.
1648
1649Several North American institutions that have placed acquisition requests
1650for the PLD database have requested it in magnetic form without software,
1651which means they are already running it without software, without
1652anything demonstrated at the Workshop.
1653
1654What cannot practically be done is to go back and reconvert and re-encode
1655data, a time-consuming and extremely costly enterprise.  CALALUCA sees
1656PLD as a database that can, and should, be run under a variety of
1657retrieval softwares.  This will permit the widest possible searches.
1658Consequently, the need to produce a CD-ROM of PLD, as well as to develop
1659software that could handle some 1.3 gigabytes of heavily encoded text,
1660developed out of conversations with collection development and reference
1661librarians who wanted software both compassionate enough for the
1662pedestrian and capable of incorporating the most detailed
1663lexicographical studies that a user desires to conduct.  In the end, the
1664encoding and conversion of the data will prove the most enduring
1665testament to the value of the project.
1666
1667The encoding of the database was also a hard-fought issue:  Did the
1668database need to be encoded? Were there normative structures for encoding
1669humanist texts?  Should it be SGML?  What about the TEI--will it last,
1670will it prove useful?  CALALUCA expressed some minor doubts as to whether
1671a data bank can be fully TEI-conformant.  Every effort can be made, but
1672in the end to be TEI-conformant means to accept the need to make some
1673firm encoding decisions that can, indeed, be disputed.  The TEI points
1674the publisher in a proper direction but does not presume to make all the
1675decisions for him or her.  Essentially, the goal of encoding was to
1676eliminate, as much as possible, the hindrances to information-networking,
1677so that if an institution acquires a database, everybody associated with
1678the institution can have access to it.
1679
1680CALALUCA demonstrated a portion of Volume 160, because it had the most
1681anomalies in it.  The software was created by Electronic Book
1682Technologies of Providence, RI, and is called Dynatext.  The software
1683works only with SGML-coded data.
1684
1685Viewing a table of contents on the screen, the audience saw how Dynatext
1686treats each element as a book and attempts to simplify movement through a
1687volume.  Familiarity with the Patrologia in print (i.e., the text, its
1688source, and the editions) will make the machine-readable versions highly
1689useful.  (Software with a Windows application was sought for PLD,
1690CALALUCA said, because this was the main trend for scholarly use.)
1691
1692CALALUCA also demonstrated how a user can perform a variety of searches
1693and quickly move to any part of a volume; the look-up screen provides
1694some basic, simple word-searching.
1695
1696CALALUCA argued that one of the major difficulties is not the software.
1697Rather, in creating a product that will be used by scholars representing
1698a broad spectrum of computer sophistication, user documentation proves
1699to be the most important service one can provide.
1700
1701CALALUCA next illustrated a truncated search under mysterium within ten
1702words of virtus and how one would be able to find its contents throughout
1703the entire database.  He said that the exciting thing about PLD is that
1704many of the applications in the retrieval software being written for it
1705will exceed the capabilities of the software employed now for the CD-ROM
1706version.  The CD-ROM faces genuine limitations, in terms of speed and
1707comprehensiveness, in the creation of a retrieval software to run it.
1708CALALUCA said he hoped that individual scholars will download the data,
1709if they wish, to their personal computers, and have ready access to
1710important texts on a constant basis, which they will be able to use in
1711their research and from which they might even be able to publish.
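
The kind of search CALALUCA illustrated, a truncated term within ten words
of another term, can be sketched roughly as follows.  This is only an
illustration under stated assumptions, not a description of how Dynatext or
any other PLD retrieval software works:  the function name proximity_hits,
the crude tokenizer, and the sample Latin sentence are all invented here.

```python
import re

def proximity_hits(text, stem, other, window=10):
    """Return (i, j) word positions where a word beginning with `stem`
    (the truncated term) lies within `window` words of the word `other`."""
    # Crude tokenizer (an assumption): lowercase runs of ASCII letters only.
    words = re.findall(r"[a-z]+", text.lower())
    left = [i for i, w in enumerate(words) if w.startswith(stem)]
    right = [j for j, w in enumerate(words) if w == other]
    return [(i, j) for i in left for j in right if abs(i - j) <= window]

# Made-up sample sentence, not a quotation from Migne.
sample = "Magnum est mysterium fidei, et virtus Dei in eo manifestatur."
print(proximity_hits(sample, "myster", "virtus"))  # [(2, 5)]
```

Truncating to "myster" would also match inflected forms such as mysterii or
mysteria; narrowing the window to two words returns no hits for this
sample.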
1712
1713(CALALUCA explained that the blue numbers represented Migne's column numbers,
1714which are the standard scholarly references.  Pulling up a note, he stated
1715that these texts were heavily edited and the image files would appear simply
1716as a note as well, so that one could quickly access an image.)
1717
1718                                 ******
1719
1720+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1721FLEISCHHAUER/ERWAY * Several problems with which AM is still wrestling *
1722Various search and retrieval capabilities * Illustration of automatic
1723stemming and a truncated search * AM's attempt to find ways to connect
1724cataloging to the texts * AM's gravitation towards SGML * Striking a
1725balance between quantity and quality * How AM furnishes users recourse to
1726images * Conducting a search in a full-text environment * Macintosh and
1727IBM prototypes of AM * Multimedia aspects of AM *
1728+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1729
1730A demonstration of American Memory by its coordinator, Carl FLEISCHHAUER,
1731and Ricky ERWAY, associate coordinator, Library of Congress, concluded
1732the morning session.  Beginning with a collection of broadsides from the
1733Continental Congress and the Constitutional Convention, the only text
1734collection in a presentable form at the time of the Workshop, FLEISCHHAUER
1735highlighted several of the problems with which AM is still wrestling.
1736(In its final form, the disk will contain two collections, not only the
1737broadsides but also the full text with illustrations of a set of
1738approximately 300 African-American pamphlets from the period 1870 to 1910.)
1739
1740As FREEMAN had explained earlier, AM has attempted to use a small amount
1741of interpretation to introduce collections.  In the present case, the
1742contractor, a company named Quick Source, in Silver Spring, Md., used
1743software called Toolbook and put together a modestly interactive
1744introduction to the collection.  Like the two preceding speakers,
1745FLEISCHHAUER argued that the real asset was the underlying collection.
1746
1747FLEISCHHAUER proceeded to describe various search and retrieval
1748capabilities while ERWAY worked the computer.  In this particular package
1749the "go to" pull-down allowed the user in effect to jump out of Toolbook,
1750where the interactive program was located, and enter the third-party
1751software used by AM for this text collection, which is called Personal
1752Librarian.  This was the Windows version of Personal Librarian, a
1753software application put together by a company in Rockville, Md.
1754
1755Since the broadsides came from the Revolutionary War period, a search was
1756conducted using the words British or war, with the default operator reset
1757as or.  FLEISCHHAUER demonstrated both automatic stemming (which finds
1758other forms of the same root) and a truncated search.  One of Personal
1759Librarian's strongest features, the relevance ranking, was represented by
1760a chart that indicated how often words being sought appeared in
1761documents, with the one receiving the most "hits" obtaining the highest
1762score.  The "hit list" that is supplied takes the relevance ranking into
1763account, making the first hit, in effect, the one the software has
1764selected as the most relevant example.
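
A search of this general shape (terms joined by "or," matched on word
stems, with documents ranked by number of hits) can be sketched as below.
It is a toy under loud assumptions and not Personal Librarian's actual
algorithm, which the presentation did not describe in detail:  crude_stem,
rank_by_hits, and the three sample documents are hypothetical.

```python
import re
from collections import Counter

def crude_stem(word):
    """Strip a few common English suffixes; a stand-in for real stemming."""
    for suffix in ("ing", "es", "s", "ed"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def rank_by_hits(docs, query_terms):
    """OR-search: score each document by how many of its words share a
    stem with any query term, and list the highest scores first."""
    stems = {crude_stem(t.lower()) for t in query_terms}
    scores = Counter()
    for name, text in docs.items():
        words = re.findall(r"[a-z]+", text.lower())
        hits = sum(1 for w in words if crude_stem(w) in stems)
        if hits:
            scores[name] = hits
    return scores.most_common()

# Invented sample texts, not taken from the broadsides collection.
docs = {
    "broadside_a": "The British fleet sailed; the war continues and wars are costly.",
    "broadside_b": "Congress resolved to raise funds.",
    "broadside_c": "News of the war reached Philadelphia.",
}
print(rank_by_hits(docs, ["British", "war"]))
# [('broadside_a', 3), ('broadside_c', 1)]
```

Here "wars" counts as a hit for "war" because both reduce to the same stem,
which is the effect of automatic stemming described above.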
1765
1766While in the text of one of the broadside documents, FLEISCHHAUER
1767remarked on AM's attempt to find ways to connect cataloging to the texts,
1768which it does in different ways in different manifestations.  In the case
1769shown, the cataloging was pasted on:  AM took MARC records that were
1770written as on-line records right into one of the Library's mainframe
1771retrieval programs, pulled them out, and handed them off to the contractor,
1772who massaged them somewhat to display them in the manner shown.  One of
1773AM's questions is, Does the cataloguing normally performed in the mainframe
1774work in this context, or ought AM to think through adjustments?
1775
1776FLEISCHHAUER made the additional point that, as far as the text goes, AM
1777has gravitated towards SGML (he pointed to the boldface in the upper part
1778of the screen).  Although extremely limited in its ability to translate
1779or interpret SGML, Personal Librarian will furnish both bold and italics
1780on screen, a fairly easy thing to do but one of the ways in which
1781SGML is useful.
1782
1783Striking a balance between quantity and quality has been a major concern
1784of AM, with accuracy being one of the places where project staff have
1785felt that less than 100-percent accuracy was acceptable.
1786FLEISCHHAUER cited the example of the standard of the rekeying industry,
1787namely 99.95 percent; as one service bureau informed him, to go from
178899.95 to 100 percent would double the cost.
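
To give the 99.95 percent figure some scale, the short calculation below
estimates the errors it leaves behind.  It assumes the standard is measured
per character, which the presentation did not specify, and it uses a
hypothetical page of 2,000 characters.

```python
def expected_errors(characters, accuracy_percent):
    """Expected number of wrong characters at a given accuracy level."""
    return characters * (100 - accuracy_percent) / 100

CHARS_PER_PAGE = 2000  # hypothetical page size, not a Workshop figure

print(round(expected_errors(CHARS_PER_PAGE, 99.95), 2))  # 1.0
```

Under those assumptions the standard allows roughly one wrong character per
page, or five per 10,000 characters.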
1789
1790FLEISCHHAUER next demonstrated how AM furnishes users recourse to images,
1791and at the same time recalled LESK's pointed question concerning the
1792number of people who would look at those images and the number who would
1793work only with the text.  If the implication of LESK's question was
1794sound, FLEISCHHAUER said, it raised the stakes for text accuracy and
1795reduced the value of the strategy for images.
1796
1797Contending that preservation is always a bugaboo, FLEISCHHAUER
1798demonstrated several images derived from a scan of a preservation
1799microfilm that AM had made.  He awarded a grade of C at best, perhaps a
1800C minus or a C plus, for how well it worked out.  Indeed, the matter of
1801learning if other people had better ideas about scanning in general, and,
1802in particular, scanning from microfilm, was one of the factors that drove
1803AM to attempt to think through the agenda for the Workshop.  Skew, for
1804example, was one of the issues that AM in its ignorance had not reckoned
1805would prove so difficult.
1806
1807Further, the handling of images of the sort shown, in a desktop computer
1808environment, involved a considerable amount of zooming and scrolling.
1809Ultimately, AM staff feel that perhaps the paper copy that is printed out
1810might be the most useful one, but they remain uncertain as to how much
1811on-screen reading users will do.
1812
1813Returning to the text, FLEISCHHAUER asked viewers to imagine a person who
1814might be conducting a search in a full-text environment.  With this
1815scenario, he proceeded to illustrate other features of Personal Librarian
1816that he considered helpful; for example, it provides the ability to
1817notice words as one reads.  Clicking the "include" button on the bottom
1818of the search window pops the words that have been highlighted into the
1819search.  Thus, a user can refine the search as he or she reads,
1820re-executing the search and continuing to find things in the quest for
1821materials.  This software not only contains relevance ranking, Boolean
1822operators, and truncation, it also permits one to perform word algebra,
1823so to say, where one puts two or three words in parentheses and links
1824them with one Boolean operator and then a couple of words in another set
1825of parentheses and asks for things within so many words of others.
1826
1827Until they became acquainted recently with some of the work being done in
1828classics, the AM staff had not realized that a large number of the
1829projects that involve electronic texts were being done by people with a
1830profound interest in language and linguistics.  Their search strategies
1831and thinking are oriented to those fields, as is shown in particular by
1832the Perseus example.  As amateur historians, the AM staff were thinking
1833more of searching for concepts and ideas than for particular words.
1834Obviously, FLEISCHHAUER conceded, searching for concepts and ideas and
1835searching for words may be two rather closely related things.
1836
1837While displaying several images, FLEISCHHAUER observed that the Macintosh
1838prototype built by AM contains a greater diversity of formats.  Echoing a
1839previous speaker, he said that it was easier to stitch things together in
1840the Macintosh, though it tended to be a little more anemic in search and
1841retrieval.  AM, therefore, increasingly has been investigating
1842sophisticated retrieval engines in the IBM format.
1843
1844FLEISCHHAUER demonstrated several additional examples of the prototype
1845interfaces:  One was AM's metaphor for the network future, in which a
1846kind of reading-room graphic suggests how one would be able to go around
1847to different materials.  AM contains a large number of photographs in
1848analog video form worked up from a videodisc, which enable users to make
1849copies to print or incorporate in digital documents.  A frame-grabber is
1850built into the system, making it possible to bring an image into a window
1851and digitize or print it out.
1852
1853FLEISCHHAUER next demonstrated sound recording, which included texts.
1854Recycled from a previous project, the collection included sixty 78-rpm
1855phonograph records of political speeches that were made during and
1856immediately after World War I.  These constituted approximately three
1857hours of audio, as AM has digitized it, which occupy 150 megabytes on a
1858CD.  Thus, they are considerably compressed.  From the catalogue card,
1859FLEISCHHAUER proceeded to a transcript of a speech with the audio
1860available and with highlighted text following it as it played.
1861A photograph has been added and a transcription made.
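
The degree of compression can be estimated from the figures given.  The
sketch below assumes decimal megabytes and uses uncompressed compact-disc
audio (44,100 samples per second, 16 bits, two channels) as the reference
point; the format AM actually used was not stated, and the source
recordings were monaural, so the ratio is only indicative.

```python
def average_bit_rate(megabytes, hours):
    """Average bits per second for audio of the given size and duration."""
    bits = megabytes * 1_000_000 * 8   # assumes decimal megabytes
    seconds = hours * 3600
    return bits / seconds

CD_AUDIO_BPS = 44_100 * 16 * 2         # uncompressed CD stereo: 1,411,200 bit/s

rate = average_bit_rate(150, 3)
print(round(rate))                     # 111111
print(round(CD_AUDIO_BPS / rate, 1))   # 12.7
```

On these assumptions the speeches occupy about 111 kilobits per second,
roughly one-thirteenth of the uncompressed compact-disc rate.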
1862
1863Considerable value has been added beyond what the Library of Congress
1864normally would do in cataloguing a sound recording, which raises several
1865questions for AM concerning where to draw lines about how much value it can
1866afford to add and at what point, perhaps, this becomes more than AM could
1867reasonably do or reasonably wish to do.  FLEISCHHAUER also demonstrated
1868a motion picture.  As FREEMAN had reported earlier, the motion picture
1869materials have proved the most popular, not surprisingly.  This says more
1870about the medium, he thought, than about AM's presentation of it.
1871
1872Because AM's goal was to bring together things that could be used by
1873historians or by people who were curious about history,
1874turn-of-the-century footage seemed to represent the most appropriate
1875collections from the Library of Congress in motion pictures. These were
1876the very first films made by Thomas Edison's company and some others at
1877that time.  The particular example illustrated was a Biograph film,
1878brought in with a frame-grabber into a window.  A single videodisc
1879contains about fifty titles and pieces of film from that period, all of
1880New York City.  Taken together, AM believes, they provide an interesting
1881documentary resource.
1882
1883                                 ******
1884
1885+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1886DISCUSSION * Using the frame-grabber in AM * Volume of material processed
1887and to be processed * Purpose of AM within LC * Cataloguing and the
1888nature of AM's material * SGML coding and the question of quality versus
1889quantity *
1890+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1891
1892During the question-and-answer period that followed FLEISCHHAUER's
1893presentation, several clarifications were made.
1894
1895AM is bringing in motion pictures from a videodisc.  The frame-grabber
1896devices create a window on a computer screen, which permits users to
1897digitize a single frame of the movie or one of the photographs.  It
1898produces a crude, rough-and-ready image that high school students can
1899incorporate into papers, and that has worked very nicely in this way.
1900
1901Commenting on FLEISCHHAUER's assertion that AM was looking more at
1902searching ideas than words, MYLONAS argued that without words an idea
1903does not exist.  FLEISCHHAUER conceded that he ought to have articulated
1904his point more clearly.  MYLONAS stated that they were in fact both
1905talking about the same thing.  By searching for words and by forcing
1906people to focus on the word, the Perseus Project felt that they would get
1907them to the idea.  The way one reviews results is tailored more to one
1908kind of user than another.
1909
1910Concerning the total volume of material that has been processed in this
1911way, AM at this point has in retrievable form seven or eight collections,
1912all of them photographic.  In the Macintosh environment, for example,
1913there probably are 35,000-40,000 photographs.  The sound recordings
1914number sixty items.  The broadsides number about 300 items.  There are
1915500 political cartoons in the form of drawings.  The motion pictures, as
1916individual items, number sixty to seventy.
1917
1918AM also has a manuscript collection, the life history portion of one of
1919the federal project series, which will contain 2,900 individual
1920documents, all first-person narratives.  AM has in process about 350
1921African-American pamphlets, or about 12,000 printed pages for the period
19221870-1910.  Also in the works are some 4,000 panoramic photographs.  AM
1923has recycled a fair amount of the work done by LC's Prints and
1924Photographs Division during the Library's optical disk pilot project in
1925the 1980s.  For example, a special division of LC has tooled up and
1926thought through all the ramifications of electronic presentation of
1927photographs.  Indeed, they are wheeling them out in great barrel loads.
1928The purpose of AM within the Library, it is hoped, is to catalyze several
1929of the other special collection divisions that have no particular
1930experience with, and in some cases mixed feelings about, an activity such as
1931AM.  Moreover, in many cases the divisions may be characterized as
1932lacking experience not only in "electronifying" things but also in automated
1933cataloguing.  MARC cataloguing as practiced in the United States is
1934heavily weighted toward the description of monograph and serial
1935materials, but is much thinner when one enters the world of manuscripts
1936and things that are held in the Library's music collection and other
1937units.  In response to a comment by LESK, that AM's material is very
1938heavily photographic, and is so primarily because individual records have
1939been made for each photograph, FLEISCHHAUER observed that an item-level
1940catalog record exists, for example, for each photograph in the Detroit
1941Publishing collection of 25,000 pictures.  In the case of the Federal
1942Writers Project, for which nearly 3,000 documents exist, representing
1943information from twenty-six different states, AM with the assistance of
1944Karen STUART of the Manuscript Division will attempt to find some way not
1945only to have a collection-level record but perhaps a MARC record for each
1946state, which will then serve as an umbrella for the 100-200 documents
1947that come under it.  But that drama remains to be enacted.  The AM staff
1948is conservative and clings to cataloguing, though of course visitors tout
1949artificial intelligence and neural networks in a manner that suggests that
1950perhaps one need not have cataloguing or that much of it could be put aside.
1951
1952The matter of SGML coding, FLEISCHHAUER conceded, returned the discussion
1953to the earlier treated question of quality versus quantity in the Library
1954of Congress.  Of course, text conversion can be done with 100-percent
1955accuracy, but it means that when one's holdings are as vast as LC's only
1956a tiny amount will be exposed, whereas permitting lower levels of
1957accuracy can lead to exposing or sharing larger amounts, but with the
1958quality correspondingly impaired.
1959
1960                                 ******
1961
1962+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1963TWOHIG * A contrary experience concerning electronic options * Volume of
1964material in the Washington papers and a suggestion of David Packard *
1965Implications of Packard's suggestion * Transcribing the documents for the
1966CD-ROM * Accuracy of transcriptions * The CD-ROM edition of the Founding
1967Fathers documents *
1968+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1969
1970Finding encouragement in a comment of MICHELSON's from the morning
1971session--that numerous people in the humanities were choosing electronic
1972options to do their work--Dorothy TWOHIG, editor, The Papers of George
1973Washington, opened her illustrated talk by noting that her experience
1974with literary scholars and numerous people in editing was contrary to
1975MICHELSON's.  TWOHIG emphasized literary scholars' complete ignorance of
1976the technological options available to them or their reluctance or, in
1977some cases, their downright hostility toward these options.
1978
1979After providing an overview of the five Founding Fathers projects (in
1980addition to her own, Jefferson at Princeton, Franklin at Yale, John Adams
1981at the Massachusetts Historical Society, and Madison down the hall from
1982her at the University of Virginia), TWOHIG observed that the Washington
1983papers, like all of the projects, include both sides of the Washington
1984correspondence and deal with some 135,000 documents to be published with
1985extensive annotation in eighty to eighty-five volumes, a project that
1986will not be completed until well into the next century.  Thus, it was
1987with considerable enthusiasm several years ago that the Washington Papers
1988Project (WPP) greeted David Packard's suggestion that the papers of the
1989Founding Fathers could be published easily and inexpensively, and to the
1990great benefit of American scholarship, via CD-ROM.
1991
1992In pragmatic terms, funding from the Packard Foundation would expedite
1993the transcription of thousands of documents waiting to be put on disk in
1994the WPP offices.  Further, since the costs of collecting, editing, and
1995converting the Founding Fathers documents into letterpress editions were
1996running into the millions of dollars, and the considerable staffs
1997involved in all of these projects were devoting their careers to
1998producing the work, the Packard Foundation's suggestion had a
1999revolutionary aspect:  Transcriptions of the entire corpus of the
2000Founding Fathers papers would be available on CD-ROM to public and
2001college libraries, even high schools, at a fraction of the cost--
2002$100-$150 for the annual license fee--to produce a limited university
2003press run of 1,000 of each volume of the published papers at $45-$150 per
2004printed volume.  Given the current budget crunch in educational systems
2005and the corresponding constraints on librarians in smaller institutions
2006who wish to add these volumes to their collections, producing the
2007documents on CD-ROM would likely open a greatly expanded audience for the
2008papers.  TWOHIG stressed, however, that development of the Founding
2009Fathers CD-ROM is still in its infancy.  Serious software problems remain
2010to be resolved before the material can be put into readable form.
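
The scale of the difference can be worked out from the figures TWOHIG
cited.  The sketch below prices a complete printed set of the Washington
papers alone (eighty to eighty-five volumes at $45-$150 each) against the
annual CD-ROM license; it is a rough comparison only, since a license fee
recurs each year while a printed set is bought once, and the function name
print_set_cost is invented for the illustration.

```python
def print_set_cost(volumes, price_per_volume):
    """Cost in dollars of buying every printed volume once."""
    return volumes * price_per_volume

low = print_set_cost(80, 45)     # fewest volumes at the lowest price
high = print_set_cost(85, 150)   # most volumes at the highest price
print(low, high)                 # 3600 12750

# Years of licensing at the top fee of $150 that the cheapest set would buy.
print(low // 150)                # 24
```

Even the cheapest printed set of one Founder's papers would thus cover more
than two decades of license fees at the highest quoted rate.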
2011
2012Funding from the Packard Foundation resulted in a major push to
2013transcribe the 75,000 or so documents of the Washington papers remaining
2014to be transcribed onto computer disks.  Slides illustrated several of the
2015problems encountered, for example, the present inability of CD-ROM to
2016indicate the cross-outs (deleted material) in eighteenth-century
2017documents.  TWOHIG next described documents from various periods in the
2018eighteenth century that have been transcribed in chronological order and
2019delivered to the Packard offices in California, where they are converted
2020to the CD-ROM, a process that is expected to consume five years to
2021complete (that is, reckoning from David Packard's suggestion made several
2022years ago, until about July 1994).  TWOHIG found an encouraging
2023indication of the project's benefits in the ongoing use made by scholars
2024of the search functions of the CD-ROM, particularly in reducing the time
2025spent in manually turning the pages of the Washington papers.
2026
2027TWOHIG next furnished details concerning the accuracy of transcriptions.
2028For instance, the insertion of thousands of documents on the CD-ROM
2029currently does not permit each document to be verified against the
2030original manuscript several times as in the case of documents that appear
2031in the published edition.  However, the transcriptions receive a cursory
2032check for obvious typos, the misspellings of proper names, and other
2033errors from the WPP CD-ROM editor.  Eventually, all documents that appear
2034in the electronic version will be checked by project editors.  Although
2035this process has met with opposition from some of the editors on the
2036grounds that imperfect work may leave their offices, the advantages in
2037making this material available as a research tool outweigh fears about the
2038misspelling of proper names and other relatively minor editorial matters.
2039
2040Completion of all five Founding Fathers projects (i.e., retrievability
2041and searchability of all of the documents by proper names, alternate
2042spellings, or varieties of subjects) will provide one of the richest
2043sources of this size for the history of the United States in the latter
2044part of the eighteenth century.  Further, publication on CD-ROM will
2045allow editors to include even minutiae, such as laundry lists, not
2046included in the printed volumes.
2047
2048It seems possible that the extensive annotation provided in the printed
2049volumes eventually will be added to the CD-ROM edition, pending
2050negotiations with the publishers of the papers.  At the moment, the
2051Founding Fathers CD-ROM is accessible only on the IBYCUS, a computer
2052developed out of the Thesaurus Linguae Graecae project and designed for
2053the use of classical scholars.  There are perhaps 400 IBYCUS computers in
2054the country, most of which are in university classics departments.
2055Ultimately, it is anticipated that the CD-ROM edition of the Founding
2056Fathers documents will run on any IBM-compatible or Macintosh computer
2057with a CD-ROM drive.  Numerous changes in the software will also occur
2058before the project is completed.  (Editor's note: an IBYCUS was
2059unavailable to demonstrate the CD-ROM.)
2060
2061                                 ******
2062
2063+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2064DISCUSSION * Several additional features of WPP clarified *
2065+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2066
2067Discussion following TWOHIG's presentation served to clarify several
2068additional features, including (1) that the project's primary
2069intellectual product consists in the electronic transcription of the
2070material; (2) that the text transmitted to the CD-ROM people is not
2071marked up; (3) that cataloging and subject-indexing of the material
2072remain to be worked out (though at this point material can be retrieved
2073by name); and (4) that because all the searching is done in the hardware,
2074the IBYCUS is designed to read a CD-ROM which contains only sequential
2075text files.  Technically, it then becomes very easy to read the material
2076off and put it on another device.
2077
2078                                 ******
2079
2080+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2081LEBRON * Overview of the history of the joint project between AAAS and
2082OCLC * Several practices the on-line environment shares with traditional
2083publishing on hard copy * Several technical and behavioral barriers to
2084electronic publishing * How AAAS and OCLC arrived at the subject of
2085clinical trials * Advantages of the electronic format and other features
2086of OJCCT * An illustrated tour of the journal *
2087+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2088
2089Maria LEBRON, managing editor, The Online Journal of Current Clinical
2090Trials (OJCCT), presented an illustrated overview of the history of the
2091joint project between the American Association for the Advancement of
2092Science (AAAS) and the Online Computer Library Center, Inc. (OCLC).  The
2093joint venture between AAAS and OCLC owes its beginning to a
2094reorganization launched by the new chief executive officer at OCLC about
2095three years ago and combines the strengths of these two disparate
2096organizations.  In short, OJCCT represents the process of scholarly
2097publishing on line.
2098
2099LEBRON next discussed several practices the on-line environment shares
2100with traditional publishing on hard copy--for example, peer review of
2101manuscripts--that are highly important in the academic world.  LEBRON
2102noted in particular the implications of citation counts for tenure
2103committees and grants committees.  In the traditional hard-copy
2104environment, citation counts are readily demonstrable, whereas the
2105on-line environment represents an ethereal medium to most academics.
2106
LEBRON remarked on several technical and behavioral barriers to electronic
2108publishing, for instance, the problems in transmission created by special
2109characters or by complex graphics and halftones.  In addition, she noted
2110economic limitations such as the storage costs of maintaining back issues
2111and market or audience education.
2112
2113Manuscripts cannot be uploaded to OJCCT, LEBRON explained, because it is
2114not a bulletin board or E-mail, forms of electronic transmission of
2115information that have created an ambience clouding people's understanding
2116of what the journal is attempting to do.  OJCCT, which publishes
2117peer-reviewed medical articles dealing with the subject of clinical
2118trials, includes text, tabular material, and graphics, although at this
2119time it can transmit only line illustrations.
2120
2121Next, LEBRON described how AAAS and OCLC arrived at the subject of
clinical trials:  The field 1) is a highly statistical discipline, 2) does
not require halftones but can satisfy the needs of its audience with line
illustrations and graphic material, and 3) needs speedy dissemination of
high-quality research results.  Clinical trials are
2126research activities that involve the administration of a test treatment
2127to some experimental unit in order to test its usefulness before it is
2128made available to the general population.  LEBRON proceeded to give
2129additional information on OJCCT concerning its editor-in-chief, editorial
2130board, editorial content, and the types of articles it publishes
2131(including peer-reviewed research reports and reviews), as well as
2132features shared by other traditional hard-copy journals.
2133
2134Among the advantages of the electronic format are faster dissemination of
2135information, including raw data, and the absence of space constraints
2136because pages do not exist.  (This latter fact creates an interesting
2137situation when it comes to citations.)  Nor are there any issues.  AAAS's
2138capacity to download materials directly from the journal to a
2139subscriber's printer, hard drive, or floppy disk helps ensure highly
2140accurate transcription.  Other features of OJCCT include on-screen alerts
2141that allow linkage of subsequently published documents to the original
2142documents; on-line searching by subject, author, title, etc.; indexing of
2143every single word that appears in an article; viewing access to an
2144article by component (abstract, full text, or graphs); numbered
paragraphs to replace page counts; publication in Science, every thirty
days, of indexing for all articles published in the journal;
2147typeset-quality screens; and Hypertext links that enable subscribers to
2148bring up Medline abstracts directly without leaving the journal.
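
Two of the features listed above, indexing of every word and numbered
paragraphs in place of page counts, can be illustrated together.  The
following is a minimal sketch only; the presentation did not describe how
OCLC implemented either feature, and the function names and the sample
article text are hypothetical.

```python
import re
from collections import defaultdict


def build_index(paragraphs):
    """Map every word to the set of paragraph numbers (1-based) containing it."""
    index = defaultdict(set)
    for number, text in enumerate(paragraphs, start=1):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(number)
    return index


def cite(index, word):
    """Return sorted paragraph numbers for a word, usable where a page
    number would appear in a print citation."""
    return sorted(index.get(word.lower(), ()))


# Hypothetical article, one string per numbered paragraph.
article = [
    "Patients were assigned at random to treatment or placebo.",
    "The treatment group showed a lower relapse rate.",
    "Placebo response was within the expected range.",
]
index = build_index(article)
print(cite(index, "treatment"))  # [1, 2]
print(cite(index, "placebo"))    # [1, 3]
```

Because the paragraph number is fixed when the article is published, a
citation of this kind stays stable however the text is displayed or
printed.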
2149
2150After detailing the two primary ways to gain access to the journal,
through the OCLC network and CompuServe if one desires graphics or through
2152the Internet if just an ASCII file is desired, LEBRON illustrated the
2153speedy editorial process and the coding of the document using SGML tags
2154after it has been accepted for publication.  She also gave an illustrated
2155tour of the journal, its search-and-retrieval capabilities in particular,
2156but also including problems associated with scanning in illustrations,
2157and the importance of on-screen alerts to the medical profession re
2158retractions or corrections, or more frequently, editorials, letters to
2159the editors, or follow-up reports.  She closed by inviting the audience
2160to join AAAS on 1 July, when OJCCT was scheduled to go on-line.
2161
2162                                 ******
2163
2164+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2165DISCUSSION * Additional features of OJCCT *
2166+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2167
2168In the lengthy discussion that followed LEBRON's presentation, these
2169points emerged:
2170
2171     * The SGML text can be tailored as users wish.
2172
2173     * All these articles have a fairly simple document definition.
2174
2175     * Document-type definitions (DTDs) were developed and given to OJCCT
2176     for coding.
2177
2178     * No articles will be removed from the journal.  (Because there are
2179     no back issues, there are no lost issues either.  Once a subscriber
2180     logs onto the journal he or she has access not only to the currently
2181     published materials, but retrospectively to everything that has been
2182     published in it.  Thus the table of contents grows bigger.  The date
2183     of publication serves to distinguish between currently published
2184     materials and older materials.)
2185
2186     * The pricing system for the journal resembles that for most medical
2187     journals:  for 1992, $95 for a year, plus telecommunications charges
     (there are no connect time charges); for 1993, $110 for the
2189     entire year for single users, though the journal can be put on a
2190     local area network (LAN).  However, only one person can access the
2191     journal at a time.  Site licenses may come in the future.
2192
2193     * AAAS is working closely with colleagues at OCLC to display
2194     mathematical equations on screen.
2195
2196     * Without compromising any steps in the editorial process, the
2197     technology has reduced the time lag between when a manuscript is
2198     originally submitted and the time it is accepted; the review process
2199     does not differ greatly from the standard six-to-eight weeks
2200     employed by many of the hard-copy journals.  The process still
2201     depends on people.
2202
2203     * As far as a preservation copy is concerned, articles will be
2204     maintained on the computer permanently and subscribers, as part of
2205     their subscription, will receive a microfiche-quality archival copy
2206     of everything published during that year; in addition, reprints can
2207     be purchased in much the same way as in a hard-copy environment.
2208     Hard copies are prepared but are not the primary medium for the
2209     dissemination of the information.
2210
2211     * Because OJCCT is not yet on line, it is difficult to know how many
2212     people would simply browse through the journal on the screen as
2213     opposed to downloading the whole thing and printing it out; a mix of
2214     both types of users likely will result.
2215
2216                                 ******
2217
2218+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2219PERSONIUS * Developments in technology over the past decade * The CLASS
2220Project * Advantages for technology and for the CLASS Project *
2221Developing a network application an underlying assumption of the project
2222* Details of the scanning process * Print-on-demand copies of books *
2223Future plans include development of a browsing tool *
2224+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2225
2226Lynne PERSONIUS, assistant director, Cornell Information Technologies for
2227Scholarly Information Services, Cornell University, first commented on
2228the tremendous impact that developments in technology over the past ten
2229years--networking, in particular--have had on the way information is
2230handled, and how, in her own case, these developments have counterbalanced
2231Cornell's relative geographical isolation.  Other significant technologies
2232include scanners, which are much more sophisticated than they were ten years
2233ago; mass storage and the dramatic savings that result from it in terms of
2234both space and money relative to twenty or thirty years ago; new and
2235improved printing technologies, which have greatly affected the distribution
2236of information; and, of course, digital technologies, whose applicability to
2237library preservation remains at issue.
2238
2239Given that context, PERSONIUS described the College Library Access and
2240Storage System (CLASS) Project, a library preservation project,
2241primarily, and what has been accomplished.  Directly funded by the
2242Commission on Preservation and Access and by the Xerox Corporation, which
2243has provided a significant amount of hardware, the CLASS Project has been
2244working with a development team at Xerox to develop a software
2245application tailored to library preservation requirements.  Within
2246Cornell, participants in the project have been working jointly with both
2247library and information technologies.  The focus of the project has been
2248on reformatting and saving books that are in brittle condition.
2249PERSONIUS showed Workshop participants a brittle book, and described how
2250such books were the result of developments in papermaking around the
2251beginning of the Industrial Revolution.  The papermaking process was
2252changed so that a significant amount of acid was introduced into the
2253actual paper itself, which deteriorates as it sits on library shelves.
2254
2255One of the advantages for technology and for the CLASS Project is that
2256the information in brittle books is mostly out of copyright and thus
2257offers an opportunity to work with material that requires library
2258preservation, and to create and work on an infrastructure to save the
2259material.  Acknowledging the familiarity of those working in preservation
2260with this information, PERSONIUS noted that several things are being
2261done:  the primary preservation technology used today is photocopying of
2262brittle material.  Saving the intellectual content of the material is the
2263main goal.  With microfilm copy, the intellectual content is preserved on
2264the assumption that in the future the image can be reformatted in any
2265other way that then exists.
2266
2267An underlying assumption of the CLASS Project from the beginning was
2268that it would develop a network application.  Project staff scan books
2269at a workstation located in the library, near the brittle material.
2270An image-server filing system is located at a distance from that
2271workstation, and a printer is located in another building.  All of the
2272materials digitized and stored on the image-filing system are cataloged
2273in the on-line catalogue.  In fact, a record for each of these electronic
2274books is stored in the RLIN database so that a record exists of what is
in the digital library through standard catalogue procedures.  In the
2276future, researchers working from their own workstations in their offices,
2277or their networks, will have access--wherever they might be--through a
2278request server being built into the new digital library.  A second
2279assumption is that the preferred means of finding the material will be by
2280looking through a catalogue.  PERSONIUS described the scanning process,
2281which uses a prototype scanner being developed by Xerox and which scans a
2282very high resolution image at great speed.  Another significant feature,
2283because this is a preservation application, is the placing of the pages
that fall apart one by one on the platen.  Ordinarily, a scanner could
be used with some sort of a document feeder, but in this application
that is not feasible.  Further, because CLASS is a
preservation application, after the paper replacement is made, a
2288very careful quality control check is performed.  An original book is
2289compared to the printed copy and verification is made, before proceeding,
2290that all of the image, all of the information, has been captured.  Then,
2291a new library book is produced:  The printed images are rebound by a
2292commercial binder and a new book is returned to the shelf.
2293Significantly, the books returned to the library shelves are beautiful
2294and useful replacements on acid-free paper that should last a long time,
2295in effect, the equivalent of preservation photocopies.  Thus, the project
2296has a library of digital books.  In essence, CLASS is scanning and
2297storing books as 600 dot-per-inch bit-mapped images, compressed using
2298Group 4 CCITT (i.e., the French acronym for International Consultative
2299Committee for Telegraph and Telephone) compression.  They are stored as
2300TIFF files on an optical filing system that is composed of a database
2301used for searching and locating the books and an optical jukebox that
2302stores 64 twelve-inch platters.  A very-high-resolution printed copy of
2303these books at 600 dots per inch is created, using a Xerox DocuTech
2304printer to make the paper replacements on acid-free paper.
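
The storage implications of 600-dot-per-inch bit-mapped scanning can be
worked through with simple arithmetic.  The sketch below is illustrative
only:  the 6-by-9-inch page size, the 15:1 compression ratio, and the
300-page book length are assumptions made for the example, not figures
reported by the project.

```python
def raw_page_bytes(width_in, height_in, dpi=600):
    """Uncompressed size in bytes of a 1-bit-per-pixel page image.

    Each scan line is padded to a whole number of bytes.
    """
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


ASSUMED_RATIO = 15   # assumed Group 4 compression ratio; varies by page
ASSUMED_PAGES = 300  # assumed length of one book

page = raw_page_bytes(6, 9)        # assumed 6 x 9 inch page
compressed = page // ASSUMED_RATIO
book = compressed * ASSUMED_PAGES

print(page)        # 2430000
print(compressed)  # 162000
print(book)        # 48600000
```

Under these assumptions a page occupies about 2.4 megabytes before
compression and well under a quarter of a megabyte after it, which shows
why compression matters at this resolution.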
2305
2306PERSONIUS maintained that the CLASS Project presents an opportunity to
2307introduce people to books as digital images by using a paper medium.
2308Books are returned to the shelves while people are also given the ability
2309to print on demand--to make their own copies of books.  (PERSONIUS
2310distributed copies of an engineering journal published by engineering
2311students at Cornell around 1900 as an example of what a print-on-demand
2312copy of material might be like.  This very cheap copy would be available
2313to people to use for their own research purposes and would bridge the gap
2314between an electronic work and the paper that readers like to have.)
2315PERSONIUS then attempted to illustrate a very early prototype of
2316networked access to this digital library.  Xerox Corporation has
2317developed a prototype of a view station that can send images across the
2318network to be viewed.
2319
2320The particular library brought down for demonstration contained two
2321mathematics books.  CLASS is developing and will spend the next year
2322developing an application that allows people at workstations to browse
2323the books.  Thus, CLASS is developing a browsing tool, on the assumption
2324that users do not want to read an entire book from a workstation, but
2325would prefer to be able to look through and decide if they would like to
2326have a printed copy of it.
2327
2328                                 ******
2329
2330+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2331DISCUSSION * Re retrieval software * "Digital file copyright" * Scanning
2332rate during production * Autosegmentation * Criteria employed in
2333selecting books for scanning * Compression and decompression of images *
2334OCR not precluded *
2335+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2336
2337During the question-and-answer period that followed her presentation,
2338PERSONIUS made these additional points:
2339
2340     * Re retrieval software, Cornell is developing a Unix-based server
2341     as well as clients for the server that support multiple platforms
2342     (Macintosh, IBM and Sun workstations), in the hope that people from
2343     any of those platforms will retrieve books; a further operating
2344     assumption is that standard interfaces will be used as much as
2345     possible, where standards can be put in place, because CLASS
2346     considers this retrieval software a library application and would
2347     like to be able to look at material not only at Cornell but at other
2348     institutions.
2349
2350     * The phrase "digital file copyright by Cornell University" was
2351     added at the advice of Cornell's legal staff with the caveat that it
2352     probably would not hold up in court.  Cornell does not want people
2353     to copy its books and sell them but would like to keep them
2354     available for use in a library environment for library purposes.
2355
2356     * In production the scanner can scan about 300 pages per hour,
2357     capturing 600 dots per inch.
2358
2359     * The Xerox software has filters to scan halftone material and avoid
2360     the moire patterns that occur when halftone material is scanned.
2361     Xerox has been working on hardware and software that would enable
2362     the scanner itself to recognize this situation and deal with it
2363     appropriately--a kind of autosegmentation that would enable the
2364     scanner to handle halftone material as well as text on a single page.
2365
2366     * The books subjected to the elaborate process described above were
2367     selected because CLASS is a preservation project, with the first 500
2368     books selected coming from Cornell's mathematics collection, because
2369     they were still being heavily used and because, although they were
2370     in need of preservation, the mathematics library and the mathematics
2371     faculty were uncomfortable having them microfilmed.  (They wanted a
2372     printed copy.)  Thus, these books became a logical choice for this
2373     project.  Other books were chosen by the project's selection committees
2374     for experiments with the technology, as well as to meet a demand or need.
2375
2376     * Images will be decompressed before they are sent over the line; at
2377     this time they are compressed and sent to the image filing system
2378     and then sent to the printer as compressed images; they are returned
2379     to the workstation as compressed 600-dpi images and the workstation
2380     decompresses and scales them for display--an inefficient way to
2381     access the material though it works quite well for printing and
2382     other purposes.
2383
2384     * CLASS is also decompressing on Macintosh and IBM, a slow process
2385     right now.  Eventually, compression and decompression will take
2386     place on an image conversion server.  Trade-offs will be made, based
2387     on future performance testing, concerning where the file is
2388     compressed and what resolution image is sent.
2389
2390     * OCR has not been precluded; images are being stored that have been
2391     scanned at a high resolution, which presumably would suit them well
2392     to an OCR process.  Because the material being scanned is about 100
2393     years old and was printed with less-than-ideal technologies, very
2394     early and preliminary tests have not produced good results.  But the
2395     project is capturing an image that is of sufficient resolution to be
2396     subjected to OCR in the future.  Moreover, the system architecture
2397     and the system plan have a logical place to store an OCR image if it
2398     has been captured.  But that is not being done now.
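
Two of the figures mentioned in the points above lend themselves to a
quick calculation:  the production scanning rate of about 300 pages per
hour, and the need to scale 600-dpi images down for display.  This is a
rough sketch; the 300-page average book length and the 75-dpi screen
resolution are assumptions chosen for illustration, and the function names
are hypothetical.

```python
def scanning_hours(books, pages_per_book, pages_per_hour=300):
    """Hours of scanner time needed at a steady production rate."""
    return books * pages_per_book / pages_per_hour


def display_size(width_px, height_px, scan_dpi=600, screen_dpi=75):
    """Pixel dimensions after scaling a scanned image to screen resolution."""
    return (width_px * screen_dpi // scan_dpi,
            height_px * screen_dpi // scan_dpi)


# First 500 books, assuming an average of 300 pages each.
print(scanning_hours(500, 300))   # 500.0

# A 6 x 9 inch page at 600 dpi is 3600 x 5400 pixels.
print(display_size(3600, 5400))   # (450, 675)
```

The second result suggests why decompressing and scaling full 600-dpi
images at the workstation is inefficient for browsing:  only a small
fraction of the transmitted pixels reach the screen.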
2399
2400                                 ******
2401
2402SESSION III.  DISTRIBUTION, NETWORKS, AND NETWORKING:  OPTIONS FOR
2403DISSEMINATION
2404
2405+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2406ZICH * Issues pertaining to CD-ROMs * Options for publishing in CD-ROM *
2407+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2408
2409Robert ZICH, special assistant to the associate librarian for special
2410projects, Library of Congress, and moderator of this session, first noted
2411the blessed but somewhat awkward circumstance of having four very
2412distinguished people representing networks and networking or at least
2413leaning in that direction, while lacking anyone to speak from the
2414strongest possible background in CD-ROMs.  ZICH expressed the hope that
2415members of the audience would join the discussion.  He stressed the
2416subtitle of this particular session, "Options for Dissemination," and,
2417concerning CD-ROMs, the importance of determining when it would be wise
2418to consider dissemination in CD-ROM versus networks.  A shopping list of
2419issues pertaining to CD-ROMs included:  the grounds for selecting
2420commercial publishers, and in-house publication where possible versus
2421nonprofit or government publication.  A similar list for networks
2422included:  determining when one should consider dissemination through a
2423network, identifying the mechanisms or entities that exist to place items
2424on networks, identifying the pool of existing networks, determining how a
producer would choose between networks, and identifying the elements of
2426a business arrangement in a network.
2427
2428Options for publishing in CD-ROM:  an outside publisher versus
2429self-publication.  If an outside publisher is used, it can be nonprofit,
2430such as the Government Printing Office (GPO) or the National Technical
2431Information Service (NTIS), in the case of government.  The pros and cons
2432associated with employing an outside publisher are obvious.  Among the
2433pros, there is no trouble getting accepted.  One pays the bill and, in
2434effect, goes one's way.  Among the cons, when one pays an outside
2435publisher to perform the work, that publisher will perform the work it is
2436obliged to do, but perhaps without the production expertise and skill in
2437marketing and dissemination that some would seek.  There is the body of
2438commercial publishers that do possess that kind of expertise in
2439distribution and marketing but that obviously are selective.  In
2440self-publication, one exercises full control, but then one must handle
2441matters such as distribution and marketing.  Such are some of the options
2442for publishing in the case of CD-ROM.
2443
2444In the case of technical and design issues, which are also important,
2445there are many matters which many at the Workshop already knew a good
2446deal about:  retrieval system requirements and costs, what to do about
2447images, the various capabilities and platforms, the trade-offs between
2448cost and performance, concerns about local-area networkability,
2449interoperability, etc.
2450
2451                                 ******
2452
2453+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2454LYNCH * Creating networked information is different from using networks
2455as an access or dissemination vehicle * Networked multimedia on a large
2456scale does not yet work * Typical CD-ROM publication model a two-edged
2457sword * Publishing information on a CD-ROM in the present world of
2458immature standards * Contrast between CD-ROM and network pricing *
2459Examples demonstrated earlier in the day as a set of insular information
2460gems * Paramount need to link databases * Layering to become increasingly
2461necessary * Project NEEDS and the issues of information reuse and active
2462versus passive use * X-Windows as a way of differentiating between
2463network access and networked information * Barriers to the distribution
2464of networked multimedia information * Need for good, real-time delivery
2465protocols * The question of presentation integrity in client-server
2466computing in the academic world * Recommendations for producing multimedia
2467+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2468
2469Clifford LYNCH, director, Library Automation, University of California,
2470opened his talk with the general observation that networked information
2471constituted a difficult and elusive topic because it is something just
2472starting to develop and not yet fully understood.  LYNCH contended that
2473creating genuinely networked information was different from using
2474networks as an access or dissemination vehicle and was more sophisticated
2475and more subtle.  He invited the members of the audience to extrapolate,
2476from what they heard about the preceding demonstration projects, to what
sort of a world of electronic information--scholarly, archival,
2478cultural, etc.--they wished to end up with ten or fifteen years from now.
2479LYNCH suggested that to extrapolate directly from these projects would
2480produce unpleasant results.
2481
2482Putting the issue of CD-ROM in perspective before getting into
2483generalities on networked information, LYNCH observed that those engaged
2484in multimedia today who wish to ship a product, so to say, probably do
2485not have much choice except to use CD-ROM:  networked multimedia on a
2486large scale basically does not yet work because the technology does not
2487exist.  For example, anybody who has tried moving images around over the
2488Internet knows that this is an exciting touch-and-go process, a
2489fascinating and fertile area for experimentation, research, and
2490development, but not something that one can become deeply enthusiastic
2491about committing to production systems at this time.
2492
2493This situation will change, LYNCH said.  He differentiated CD-ROM from
2494the practices that have been followed up to now in distributing data on
2495CD-ROM.  For LYNCH the problem with CD-ROM is not its portability or its
2496slowness but the two-edged sword of having the retrieval application and
2497the user interface inextricably bound up with the data, which is the
2498typical CD-ROM publication model.  It is not a case of publishing data
2499but of distributing a typically stand-alone, typically closed system,
all--software, user interface, and data--on a little disk.  Hence arise
all the between-disk navigational issues as well as the impossibility in
most cases of integrating data on one disk with that on another.  Most CD-ROM
2503retrieval software does not network very gracefully at present.  However,
2504in the present world of immature standards and lack of understanding of
2505what network information is or what the ground rules are for creating or
2506using it, publishing information on a CD-ROM does add value in a very
2507real sense.
2508
2509LYNCH drew a contrast between CD-ROM and network pricing and in doing so
2510highlighted something bizarre in information pricing.  A large
2511institution such as the University of California has vendors who will
2512offer to sell information on CD-ROM for a price per year in four digits,
2513but for the same data (e.g., an abstracting and indexing database) on
2514magnetic tape, regardless of how many people may use it concurrently,
2515will quote a price in six digits.
2516
2517What is packaged with the CD-ROM in one sense adds value--a complete
2518access system, not just raw, unrefined information--although it is not
2519generally perceived that way.  This is because the access software,
2520although it adds value, is viewed by some people, particularly in the
2521university environment where there is a very heavy commitment to
2522networking, as being developed in the wrong direction.
2523
2524Given that context, LYNCH described the examples demonstrated as a set of
2525insular information gems--Perseus, for example, offers nicely linked
2526information, but would be very difficult to integrate with other
2527databases, that is, to link together seamlessly with other source files
2528from other sources.  It resembles an island, and in this respect is
2529similar to numerous stand-alone projects that are based on videodiscs,
2530that is, on the single-workstation concept.
2531
2532As scholarship evolves in a network environment, the paramount need will
2533be to link databases.  We must link personal databases to public
2534databases, to group databases, in fairly seamless ways--which is
2535extremely difficult in the environments under discussion with copies of
2536databases proliferating all over the place.
2537
2538The notion of layering also struck LYNCH as lurking in several of the
2539projects demonstrated.  Several databases in a sense constitute
2540information archives without a significant amount of navigation built in.
2541Educators, critics, and others will want a layered structure--one that
2542defines or links paths through the layers to allow users to reach
2543specific points.  In LYNCH's view, layering will become increasingly
2544necessary, and not just within a single resource but across resources
2545(e.g., tracing mythology and cultural themes across several classics
2546databases as well as a database of Renaissance culture).  This ability to
2547organize resources, to build things out of multiple other things on the
2548network or select pieces of it, represented for LYNCH one of the key
2549aspects of network information.
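
One way to picture a layer that cuts across resources is as an ordered
list of pointers, each naming a resource and an object within it.  The
sketch below is hypothetical throughout:  the class names, database names,
and object identifiers are invented for illustration and do not describe
any system shown at the Workshop.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    resource: str   # hypothetical database name
    object_id: str  # hypothetical identifier within that database


@dataclass
class Path:
    title: str
    stops: list

    def resources(self):
        """Distinct resources the path crosses, in first-visit order."""
        seen = []
        for stop in self.stops:
            if stop.resource not in seen:
                seen.append(stop.resource)
        return seen


path = Path("Hypothetical theme trail", [
    Pointer("classics-db-a", "text/101"),
    Pointer("classics-db-b", "image/7"),
    Pointer("renaissance-db", "essay/42"),
    Pointer("classics-db-a", "text/230"),
])
print(path.resources())
# ['classics-db-a', 'classics-db-b', 'renaissance-db']
```

The path itself holds no content; it depends entirely on each resource
being able to resolve its own identifiers, which is the capability LYNCH
found missing in stand-alone systems.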
2550
2551Contending that information reuse constituted another significant issue,
2552LYNCH commended to the audience's attention Project NEEDS (i.e., National
2553Engineering Education Delivery System).  This project's objective is to
2554produce a database of engineering courseware as well as the components
2555that can be used to develop new courseware.  In a number of the existing
2556applications, LYNCH said, the issue of reuse (how much one can take apart
2557and reuse in other applications) was not being well considered.  He also
2558raised the issue of active versus passive use, one aspect of which  is
2559how much information will be manipulated locally by users.  Most people,
2560he argued, may do a little browsing and then will wish to print.  LYNCH
2561was uncertain how these resources would be used by the vast majority of
2562users in the network environment.
2563
2564LYNCH next said a few words about X-Windows as a way of differentiating
2565between network access and networked information.  A number of the
2566applications demonstrated at the Workshop could be rewritten to use X
across the network, so that one could run them from any X-capable
device--a workstation, an X terminal--and transact with a database across the
2569network.  Although this opens up access a little, assuming one has enough
2570network to handle it, it does not provide an interface to develop a
2571program that conveniently integrates information from multiple databases.
2572X is a viewing technology that has limits.  In a real sense, it is just a
2573graphical version of remote log-in across the network.  X-type applications
2574represent only one step in the progression towards real access.
2575
2576LYNCH next discussed barriers to the distribution of networked multimedia
2577information.  The heart of the problem is a lack of standards to provide
2578the ability for computers to talk to each other, retrieve information,
2579and shuffle it around fairly casually.  At the moment, little progress is
2580being made on standards for networked information; for example, present
2581standards do not cover images, digital voice, and digital video.  A
2582useful tool kit of exchange formats for basic texts is only now being
2583assembled.  The synchronization of content streams (i.e., synchronizing a
2584voice track to a video track, establishing temporal relations between
2585different components in a multimedia object) constitutes another issue
2586for networked multimedia that is just beginning to receive attention.
2587
2588Underlying network protocols also need some work; good, real-time
2589delivery protocols on the Internet do not yet exist.  In LYNCH's view,
2590highly important in this context is the notion of networked digital
2591object IDs, the ability of one object on the network to point to another
2592object (or component thereof) on the network.  Serious bandwidth issues
2593also exist.  LYNCH was uncertain if billion-bit-per-second networks would
2594prove sufficient if numerous people ran video in parallel.
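
The bandwidth concern can be made concrete with a back-of-the-envelope
calculation.  In the sketch below only the billion-bit-per-second link
speed comes from the talk; the per-stream rates (1.5 megabits per second
for compressed video, and 640 x 480 pixels at 24 bits and 30 frames per
second for uncompressed video) are assumptions chosen for illustration,
and protocol overhead is ignored.

```python
def max_streams(link_bps, stream_bps):
    """Whole number of simultaneous streams a link can carry."""
    return link_bps // stream_bps


LINK_BPS = 1_000_000_000            # billion-bit-per-second network

COMPRESSED_BPS = 1_500_000          # assumed compressed video rate
UNCOMPRESSED_BPS = 640 * 480 * 24 * 30  # assumed raw video rate

print(UNCOMPRESSED_BPS)                         # 221184000
print(max_streams(LINK_BPS, COMPRESSED_BPS))    # 666
print(max_streams(LINK_BPS, UNCOMPRESSED_BPS))  # 4
```

On these assumptions a gigabit link carries a few hundred compressed
streams but only a handful of uncompressed ones, which is consistent with
LYNCH's doubt that such networks would suffice for widespread parallel
video use.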
2595
2596LYNCH concluded by offering an issue for database creators to consider,
2597as well as several comments about what might constitute good trial
2598multimedia experiments.  In a networked information world the database
2599builder or service builder (publisher) does not exercise the same
2600extensive control over the integrity of the presentation; strange
2601programs "munge" with one's data before the user sees it.  Serious
2602thought must be given to what guarantees integrity of presentation.  Part
2603of that is related to where one draws the boundaries around a networked
2604information service.  This question of presentation integrity in
2605client-server computing has not been stressed enough in the academic
2606world, LYNCH argued, though commercial service providers deal with it
2607regularly.
2608
Concerning multimedia, LYNCH observed that good multimedia at the moment
is hideously expensive to produce.  He recommended producing multimedia
that has a very high sale value, a very long life span, or a very broad
usage base whose costs can therefore be amortized among large numbers of
users.  In this connection, historical and humanistically oriented
material may be a good place to start, because it tends to have a longer
life span than much of the scientific material, as well as a wider user
base.  LYNCH noted, for example, that American Memory fits many of the
criteria outlined.  He remarked on the extensive discussion about
bringing the Internet or the National Research and Education Network
(NREN) into the K-12 environment as a way of helping the American
educational system.
2621
2622LYNCH closed by noting that the kinds of applications demonstrated struck
2623him as excellent justifications of broad-scale networking for K-12, but
2624that at this time no "killer" application exists to mobilize the K-12
2625community to obtain connectivity.
2626
2627                                 ******
2628
2629+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2630DISCUSSION * Dearth of genuinely interesting applications on the network
2631a slow-changing situation * The issue of the integrity of presentation in
2632a networked environment * Several reasons why CD-ROM software does not
2633network *
2634+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2635
2636During the discussion period that followed LYNCH's presentation, several
2637additional points were made.
2638
2639LYNCH reiterated even more strongly his contention that, historically,
2640once one goes outside high-end science and the group of those who need
2641access to supercomputers, there is a great dearth of genuinely
2642interesting applications on the network.  He saw this situation changing
2643slowly, with some of the scientific databases and scholarly discussion
2644groups and electronic journals coming on as well as with the availability
2645of Wide Area Information Servers (WAIS) and some of the databases that
2646are being mounted there.  However, many of those things do not seem to
2647have piqued great popular interest.  For instance, most high school
2648students of LYNCH's acquaintance would not qualify as devotees of serious
2649molecular biology.
2650
2651Concerning the issue of the integrity of presentation, LYNCH believed
2652that a couple of information providers have laid down the law at least on
2653certain things.  For example, his recollection was that the National
2654Library of Medicine feels strongly that one needs to employ the
2655identifier field if he or she is to mount a database commercially.  The
2656problem with a real networked environment is that one does not know who
is reformatting and reprocessing one's data when one enters a
client-server mode.  It becomes anybody's guess, for example, if the network
2659uses a Z39.50 server, or what clients are doing with one's data.  A data
2660provider can say that his contract will only permit clients to have
2661access to his data after he vets them and their presentation and makes
2662certain it suits him.  But LYNCH held out little expectation that the
2663network marketplace would evolve in that way, because it required too
2664much prior negotiation.
2665
2666CD-ROM software does not network for a variety of reasons, LYNCH said.
2667He speculated that CD-ROM publishers are not eager to have their products
2668really hook into wide area networks, because they fear it will make their
2669data suppliers nervous.  Moreover, until relatively recently, one had to
2670be rather adroit to run a full TCP/IP stack plus applications on a
2671PC-size machine, whereas nowadays it is becoming easier as PCs grow
2672bigger and faster.  LYNCH also speculated that software providers had not
2673heard from their customers until the last year or so, or had not heard
2674from enough of their customers.
2675
2676                                 ******
2677
2678+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2679BESSER * Implications of disseminating images on the network; planning
2680the distribution of multimedia documents poses two critical
2681implementation problems * Layered approach represents the way to deal
2682with users' capabilities * Problems in platform design; file size and its
2683implications for networking * Transmission of megabyte size images
2684impractical * Compression and decompression at the user's end * Promising
2685trends for compression * A disadvantage of using X-Windows * A project at
2686the Smithsonian that mounts images on several networks *
2687+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2688
2689Howard BESSER, School of Library and Information Science, University of
2690Pittsburgh, spoke primarily about multimedia, focusing on images and the
2691broad implications of disseminating them on the network.  He argued that
2692planning the distribution of multimedia documents posed two critical
2693implementation problems, which he framed in the form of two questions:
26941) What platform will one use and what hardware and software will users
2695have for viewing of the material?  and 2) How can one deliver a
2696sufficiently robust set of information in an accessible format in a
2697reasonable amount of time?  Depending on whether network or CD-ROM is the
2698medium used, this question raises different issues of storage,
2699compression, and transmission.
2700
2701Concerning the design of platforms (e.g., sound, gray scale, simple
2702color, etc.) and the various capabilities users may have, BESSER
2703maintained that a layered approach was the way to deal with users'
2704capabilities.  A result would be that users with less powerful
2705workstations would simply have less functionality.  He urged members of
2706the audience to advocate standards and accompanying software that handle
2707layered functionality across a wide variety of platforms.
2708
BESSER also addressed a problem in platform design, namely, deciding how
large a machine to design for when most users have the least capable
machines and one nevertheless desires higher functionality.  BESSER then
proceeded to the question of file size and its implications for
networking.  He discussed mainly still images.  For example, a digital
color image that fills the screen of a standard mega-pel workstation
(Sun or NeXT) will require one megabyte of storage
2716for an eight-bit image or three megabytes of storage for a true color or
2717twenty-four-bit image.  Lossless compression algorithms (that is,
2718computational procedures in which no data is lost in the process of
2719compressing [and decompressing] an image--the exact bit-representation is
2720maintained) might bring storage down to a third of a megabyte per image,
2721but not much further than that.  The question of size makes it difficult
2722to fit an appropriately sized set of these images on a single disk or to
2723transmit them quickly enough on a network.
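
The storage figures in this paragraph follow from simple arithmetic.  The
sketch below assumes that a mega-pel screen is 1,024 x 1,024 pixels and
that one megabyte is 2**20 bytes; neither figure is stated in the talk,
and the function name is purely illustrative.

```python
# Raw storage for an uncompressed screen image.
# Assumption: a "mega-pel" display is 1,024 x 1,024 pixels.
WIDTH, HEIGHT = 1024, 1024
BYTES_PER_MB = 2 ** 20  # assumption: binary megabyte


def raw_megabytes(width, height, bits_per_pixel):
    """Uncompressed image size in megabytes."""
    return width * height * bits_per_pixel / 8 / BYTES_PER_MB


eight_bit = raw_megabytes(WIDTH, HEIGHT, 8)    # 1.0 MB
true_color = raw_megabytes(WIDTH, HEIGHT, 24)  # 3.0 MB
print(eight_bit, true_color)
```

Under these assumptions the one-megabyte and three-megabyte figures fall
out directly; the "third of a megabyte" figure is simply the eight-bit
size divided by three.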
2724
With these full-screen mega-pel images that constitute a third of a
megabyte, one gets 1,000-3,000 full-screen images on a one-gigabyte disk;
a standard CD-ROM represents approximately 60 percent of that.  Storing
images the size of a PC screen (just 8-bit color) increases storage
2729capacity to 4,000-12,000 images per gigabyte; 60 percent of that gives
2730one the size of a CD-ROM, which in turn creates a major problem.  One
cannot have full-screen, full-color images with lossless compression; one
must accept lossy compression or use a lower resolution.  For
megabyte-size images,
2733anything slower than a T-1 speed is impractical.  For example, on a
2734fifty-six-kilobaud line, it takes three minutes to transfer a
2735one-megabyte file, if it is not compressed; and this speed assumes ideal
2736circumstances (no other user contending for network bandwidth).  Thus,
2737questions of disk access, remote display, and current telephone
2738connection speed make transmission of megabyte-size images impractical.
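
The capacity and transfer-time figures above can be checked with a short
calculation.  This is a rough sketch under stated assumptions:  a
one-gigabyte disk is taken as 1,000 megabytes, "kilobaud" is read as
kilobits per second, a T-1 line is taken as 1.544 megabits per second, and
the function names are hypothetical.

```python
# Back-of-the-envelope figures for storing and sending screen images.
# Assumptions: 1 MB = 2**20 bytes, a one-gigabyte disk holds 1,000 MB,
# "kilobaud" means kilobits per second, and T-1 is 1.544 Mbit/s.
BYTES_PER_MB = 2 ** 20


def images_per_disk(disk_mb, image_mb, compression_ratio=1):
    """Whole images that fit when each shrinks by compression_ratio."""
    return disk_mb * compression_ratio // image_mb


def transfer_seconds(size_mb, bits_per_second):
    """Ideal transfer time: no contention and no protocol overhead."""
    return size_mb * BYTES_PER_MB * 8 / bits_per_second


uncompressed = images_per_disk(1000, 1)      # 1,000 images per gigabyte
lossless = images_per_disk(1000, 1, 3)       # 3,000 at one third the size
slow_line = transfer_seconds(1, 56_000)      # about 150 seconds
t1_line = transfer_seconds(1, 1_544_000)     # about 5.4 seconds
print(uncompressed, lossless, round(slow_line), round(t1_line, 1))
```

The ideal figure of about two and a half minutes on the slower line is of
the same order as the three minutes cited; real lines add overhead and
contention.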
2739
2740BESSER then discussed ways to deal with these large images, for example,
2741compression and decompression at the user's end.  In this connection, the
2742issues of how much one is willing to lose in the compression process and
2743what image quality one needs in the first place are unknown.  But what is
2744known is that compression entails some loss of data.  BESSER urged that
2745more studies be conducted on image quality in different situations, for
2746example, what kind of images are needed for what kind of disciplines, and
2747what kind of image quality is needed for a browsing tool, an intermediate
2748viewing tool, and archiving.
2749
BESSER remarked on two promising trends for compression:  from a technical
2751perspective, algorithms that use what is called subjective redundancy
2752employ principles from visual psycho-physics to identify and remove
2753information from the image that the human eye cannot perceive; from an
2754interchange and interoperability perspective, the JPEG (i.e., Joint
2755Photographic Experts Group, an ISO standard) compression algorithms also
2756offer promise.  These issues of compression and decompression, BESSER
2757argued, resembled those raised earlier concerning the design of different
2758platforms.  Gauging the capabilities of potential users constitutes a
2759primary goal.  BESSER advocated layering or separating the images from
2760the applications that retrieve and display them, to avoid tying them to
2761particular software.
2762
2763BESSER detailed several lessons learned from his work at Berkeley with
2764Imagequery, especially the advantages and disadvantages of using
2765X-Windows.  In the latter category, for example, retrieval is tied
2766directly to one's data, an intolerable situation in the long run on a
networked system.  Finally, BESSER described a project of Jim Wallace at
the Smithsonian Institution, who is mounting images in an extremely
rudimentary way on the CompuServe and GEnie networks and is preparing to
mount them on America Online.  Although the average user takes over
thirty minutes to download these images (assuming a fairly fast modem),
they have nevertheless been downloaded 25,000 times.
2773
BESSER concluded his talk with several comments on the business
arrangement between the Smithsonian and CompuServe.  He contended that not
enough is known concerning the value of images.
2777
2778                                 ******
2779
2780+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2781DISCUSSION * Creating digitized photographic collections nearly
2782impossible except with large organizations like museums * Need for study
2783to determine quality of images users will tolerate *
2784+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2785
2786During the brief exchange between LESK and BESSER that followed, several
2787clarifications emerged.
2788
2789LESK argued that the photographers were far ahead of BESSER:  It is
2790almost impossible to create such digitized photographic collections
2791except with large organizations like museums, because all the
2792photographic agencies have been going crazy about this and will not sign
licensing agreements on any sort of reasonable terms.  LESK had heard
that National Geographic, for example, had tried to buy the right to use
images in an educational production for $100 per image, but the
photographers would not touch it.  They want accounting and payment
for each use, which cannot be accomplished within the system.  BESSER
2798responded that a consortium of photographers, headed by a former National
2799Geographic photographer, had started assembling its own collection of
2800electronic reproductions of images, with the money going back to the
2801cooperative.
2802
2803LESK contended that BESSER was unnecessarily pessimistic about multimedia
2804images, because people are accustomed to low-quality images, particularly
2805from video.  BESSER urged the launching of a study to determine what
2806users would tolerate, what they would feel comfortable with, and what
2807absolutely is the highest quality they would ever need.  Conceding that
2808he had adopted a dire tone in order to arouse people about the issue,
2809BESSER closed on a sanguine note by saying that he would not be in this
2810business if he did not think that things could be accomplished.
2811
2812                                 ******
2813
2814+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2815LARSEN * Issues of scalability and modularity * Geometric growth of the
2816Internet and the role played by layering * Basic functions sustaining
2817this growth * A library's roles and functions in a network environment *
2818Effects of implementation of the Z39.50 protocol for information
2819retrieval on the library system * The trade-off between volumes of data
2820and its potential usage * A snapshot of current trends *
2821+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2822
2823Ronald LARSEN, associate director for information technology, University
2824of Maryland at College Park, first addressed the issues of scalability
2825and modularity.  He noted the difficulty of anticipating the effects of
2826orders-of-magnitude growth, reflecting on the twenty years of experience
2827with the Arpanet and Internet.  Recalling the day's demonstrations of
2828CD-ROM and optical disk material, he went on to ask if the field has yet
2829learned how to scale new systems to enable delivery and dissemination
2830across large-scale networks.
2831
2832LARSEN focused on the geometric growth of the Internet from its inception
2833circa 1969 to the present, and the adjustments required to respond to
2834that rapid growth.  To illustrate the issue of scalability, LARSEN
2835considered computer networks as including three generic components:
2836computers, network communication nodes, and communication media.  Each
2837component scales (e.g., computers range from PCs to supercomputers;
2838network nodes scale from interface cards in a PC through sophisticated
2839routers and gateways; and communication media range from 2,400-baud
2840dial-up facilities through 4.5-Mbps backbone links, and eventually to
2841multigigabit-per-second communication lines), and architecturally, the
2842components are organized to scale hierarchically from local area networks
2843to international-scale networks.  Such growth is made possible by
2844building layers of communication protocols, as BESSER pointed out.
2845By layering both physically and logically, a sense of scalability is
2846maintained from local area networks in offices, across campuses, through
2847bridges, routers, campus backbones, fiber-optic links, etc., up into
2848regional networks and ultimately into national and international
2849networks.
2850
2851LARSEN then illustrated the geometric growth over a two-year period--
through September 1991--of the number of networks that make up the
2853Internet.  This growth has been sustained largely by the availability of
2854three basic functions:  electronic mail, file transfer (ftp), and remote
2855log-on (telnet).  LARSEN also reviewed the growth in the kind of traffic
2856that occurs on the network.  Network traffic reflects the joint contributions
2857of a larger population of users and increasing use per user.  Today one sees
2858serious applications involving moving images across the network--a rarity
2859ten years ago.  LARSEN recalled and concurred with BESSER's main point
2860that the interesting problems occur at the application level.
2861
2862LARSEN then illustrated a model of a library's roles and functions in a
2863network environment.  He noted, in particular, the placement of on-line
2864catalogues onto the network and patrons obtaining access to the library
2865increasingly through local networks, campus networks, and the Internet.
2866LARSEN supported LYNCH's earlier suggestion that we need to address
2867fundamental questions of networked information in order to build
2868environments that scale in the information sense as well as in the
2869physical sense.
2870
2871LARSEN supported the role of the library system as the access point into
2872the nation's electronic collections.  Implementation of the Z39.50
2873protocol for information retrieval would make such access practical and
feasible.  For example, this would enable patrons in Maryland to search
California libraries, or other libraries around the world that conform
to Z39.50, in a manner that is familiar to University of
Maryland patrons.  This client-server model also supports moving beyond
2878secondary content into primary content.  (The notion of how one links
2879from secondary content to primary content, LARSEN said, represents a
2880fundamental problem that requires rigorous thought.)  After noting
2881numerous network experiments in accessing full-text materials, including
2882projects supporting the ordering of materials across the network, LARSEN
2883revisited the issue of transmitting high-density, high-resolution color
2884images across the network and the large amounts of bandwidth they
2885require.  He went on to address the bandwidth and synchronization
2886problems inherent in sending full-motion video across the network.
2887
2888LARSEN illustrated the trade-off between volumes of data in bytes or
2889orders of magnitude and the potential usage of that data.  He discussed
2890transmission rates (particularly, the time it takes to move various forms
2891of information), and what one could do with a network supporting
2892multigigabit-per-second transmission.  At the moment, the network
2893environment includes a composite of data-transmission requirements,
2894volumes and forms, going from steady to bursty (high-volume) and from
2895very slow to very fast.  This aggregate must be considered in the design,
construction, and operation of multigigabit networks.
2897
2898LARSEN's objective is to use the networks and library systems now being
2899constructed to increase access to resources wherever they exist, and
2900thus, to evolve toward an on-line electronic virtual library.
2901
2902LARSEN concluded by offering a snapshot of current trends:  continuing
2903geometric growth in network capacity and number of users; slower
2904development of applications; and glacial development and adoption of
2905standards.  The challenge is to design and develop each new application
2906system with network access and scalability in mind.
2907
2908                                 ******
2909
2910+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2911BROWNRIGG * Access to the Internet cannot be taken for granted * Packet
2912radio and the development of MELVYL in 1980-81 in the Division of Library
Automation at the University of California * Design criteria for packet
2914radio * A demonstration project in San Diego and future plans * Spread
2915spectrum * Frequencies at which the radios will run and plans to
2916reimplement the WAIS server software in the public domain * Need for an
2917infrastructure of radios that do not move around *
2918+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2919
2920Edwin BROWNRIGG, executive director, Memex Research Institute, first
2921polled the audience in order to seek out regular users of the Internet as
2922well as those planning to use it some time in the future.  With nearly
2923everybody in the room falling into one category or the other, BROWNRIGG
made a point about access, namely that numerous individuals, especially those
2925who use the Internet every day, take for granted their access to it, the
2926speeds with which they are connected, and how well it all works.
2927However, as BROWNRIGG discovered between 1987 and 1989 in Australia,
2928if one wants access to the Internet but cannot afford it or has some
2929physical boundary that prevents her or him from gaining access, it can
2930be extremely frustrating.  He suggested that because of economics and
2931physical barriers we were beginning to create a world of haves and have-nots
2932in the process of scholarly communication, even in the United States.
2933
2934BROWNRIGG detailed the development of MELVYL in academic year 1980-81 in
2935the Division of Library Automation at the University of California, in
2936order to underscore the issue of access to the system, which at the
2937outset was extremely limited.  In short, the project needed to build a
2938network, which at that time entailed use of satellite technology, that is,
2939putting earth stations on campus and also acquiring some terrestrial links
2940from the State of California's microwave system.  The installation of
2941satellite links, however, did not solve the problem (which actually
2942formed part of a larger problem involving politics and financial resources).
2943For while the project team could get a signal onto a campus, it had no means
2944of distributing the signal throughout the campus.  The solution involved
2945adopting a recent development in wireless communication called packet radio,
2946which combined the basic notion of packet-switching with radio.  The project
2947used this technology to get the signal from a point on campus where it
2948came down, an earth station for example, into the libraries, because it
2949found that wiring the libraries, especially the older marble buildings,
2950would cost $2,000-$5,000 per terminal.
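
Packet-switching, the idea that packet radio borrows, breaks a message
into small numbered units that can travel independently and be put back
in order at the receiver.  The sketch below illustrates only that general
idea; the payload size, the tuple layout, and the function names are
assumptions made for illustration and do not describe the MELVYL radios
or any real packet radio protocol.

```python
import random


def packetize(message: bytes, payload_size: int = 8):
    """Split a message into (sequence_number, payload) packets."""
    return [
        (seq, message[start:start + payload_size])
        for seq, start in enumerate(range(0, len(message), payload_size))
    ]


def reassemble(packets):
    """Rebuild the message from packets received in any order."""
    return b"".join(payload for _, payload in sorted(packets))


message = b"catalog query from a library terminal"
packets = packetize(message)
random.shuffle(packets)  # packets may arrive out of order
assert reassemble(packets) == message
```

Because each packet carries its own sequence number, the receiver does
not depend on the order of arrival, which is what lets many terminals
share one radio channel a packet at a time.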
2951
2952BROWNRIGG noted that, ten years ago, the project had neither the public
2953policy nor the technology that would have allowed it to use packet radio
2954in any meaningful way.  Since then much had changed.  He proceeded to
2955detail research and development of the technology, how it is being
2956deployed in California, and what direction he thought it would take.
The design criteria are to produce a high-speed, one-time, low-cost,
high-quality, secure, license-free device (packet radio) that one can
plug in today, forget about, and use to gain access to the Internet.
By high speed, BROWNRIGG meant 1 megabit and 1.5 megabits per second.
Those units
2961have been built, he continued, and are in the process of being
2962type-certified by an independent underwriting laboratory so that they can
2963be type-licensed by the Federal Communications Commission.  As is the
2964case with citizens band, one will be able to purchase a unit and not have
2965to worry about applying for a license.
2966
2967The basic idea, BROWNRIGG elaborated, is to take high-speed radio data
2968transmission and create a backbone network that at certain strategic
2969points in the network will "gateway" into a medium-speed packet radio
2970(i.e., one that runs at 38.4 kilobytes), so that perhaps by 1994-1995
2971people, like those in the audience for the price of a VCR could purchase
2972a medium-speed radio for the office or home, have full network connectivity
2973to the Internet, and partake of all its services, with no need for an FCC
2974license and no regular bill from the local common carrier.  BROWNRIGG
2975presented several details of a demonstration project currently taking
2976place in San Diego and described plans, pending funding, to install a
2977full-bore network in the San Francisco area.  This network will have 600
2978nodes running at backbone speeds, and 100 of these nodes will be libraries,
2979which in turn will be the gateway ports to the 38.4 kilobyte radios that
2980will give coverage for the neighborhoods surrounding the libraries.
2981
2982BROWNRIGG next explained Part 15.247, a new rule within Title 47 of the
2983Code of Federal Regulations enacted by the FCC in 1985.  This rule
2984challenged the industry, which has only now risen to the occasion, to
2985build a radio that would run at no more than one watt of output power and
2986use a fairly exotic method of modulating the radio wave called spread
2987spectrum.  Spread spectrum in fact permits the building of networks so
2988that numerous data communications can occur simultaneously, without
2989interfering with each other, within the same wide radio channel.
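
One way spread spectrum lets several transmissions share a channel is
direct-sequence spreading:  each sender multiplies its bits by its own
code, and a receiver recovers one sender's bits by correlating the shared
signal against that sender's code.  The sketch below illustrates only
that principle.  The two four-chip codes, the function names, and the
noise-free channel are assumptions; nothing here is taken from Part
15.247 or from the radios BROWNRIGG described.

```python
# Two orthogonal spreading codes (rows of a 4x4 Walsh-Hadamard matrix).
CODE_A = [1, 1, -1, -1]
CODE_B = [1, -1, 1, -1]


def spread(bits, code):
    """Map each bit to +1/-1 and multiply it by every chip of the code."""
    chips = []
    for bit in bits:
        symbol = 1 if bit else -1
        chips.extend(symbol * chip for chip in code)
    return chips


def despread(signal, code):
    """Correlate the shared signal with one code to recover its bits."""
    width = len(code)
    bits = []
    for start in range(0, len(signal), width):
        window = signal[start:start + width]
        correlation = sum(s * c for s, c in zip(window, code))
        bits.append(1 if correlation > 0 else 0)
    return bits


bits_a = [1, 0, 1, 1]
bits_b = [0, 0, 1, 0]
# Both senders transmit at once; the channel simply adds their chips.
channel = [a + b for a, b in zip(spread(bits_a, CODE_A), spread(bits_b, CODE_B))]
assert despread(channel, CODE_A) == bits_a
assert despread(channel, CODE_B) == bits_b
```

Because the two codes are orthogonal, each sender's contribution
correlates to zero against the other's code, so both bit streams survive
in the same channel.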
2990
2991BROWNRIGG explained that the frequencies at which the radios would run
2992are very short wave signals.  They are well above standard microwave and
2993radar.  With a radio wave that small, one watt becomes a tremendous punch
2994per bit and thus makes transmission at reasonable speed possible.  In
2995order to minimize the potential for congestion, the project is
2996undertaking to reimplement software which has been available in the
2997networking business and is taken for granted now, for example, TCP/IP,
2998routing algorithms, bridges, and gateways.  In addition, the project
2999plans to take the WAIS server software in the public domain and
3000reimplement it so that one can have a WAIS server on a Mac instead of a
3001Unix machine.  The Memex Research Institute believes that libraries, in
3002particular, will want to use the WAIS servers with packet radio.  This
3003project, which has a team of about twelve people, will run through 1993
3004and will include the 100 libraries already mentioned as well as other
3005professionals such as those in the medical profession, engineering, and
3006law.  Thus, the need is to create an infrastructure of radios that do not
3007move around, which, BROWNRIGG hopes, will solve a problem not only for
3008libraries but for individuals who, by and large today, do not have access
3009to the Internet from their homes and offices.
3010
3011                                 ******
3012
3013+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3014DISCUSSION * Project operating frequencies *
3015+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3016
3017During a brief discussion period, which also concluded the day's
proceedings, BROWNRIGG stated that the project was operating at four
frequencies.  The slow-speed radio is operating at 435 megahertz and would
later go up to 920 megahertz.  At the high-speed frequencies, the
one-megabit radios will run at 2.4 gigahertz, and the 1.5-megabit radios
will run at 5.7 gigahertz.
3022At 5.7, rain can be a factor, but it would have to be tropical rain,
3023unlike what falls in most parts of the United States.
3024
3025                                 ******
3026
3027SESSION IV.  IMAGE CAPTURE, TEXT CAPTURE, OVERVIEW OF TEXT AND
3028             IMAGE STORAGE FORMATS
3029
3030William HOOTON, vice president of operations, I-NET, moderated this session.
3031
3032+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3033KENNEY * Factors influencing development of CXP * Advantages of using
3034digital technology versus photocopy and microfilm * A primary goal of
3035CXP; publishing challenges * Characteristics of copies printed * Quality
3036of samples achieved in image capture * Several factors to be considered
3037in choosing scanning * Emphasis of CXP on timely and cost-effective
3038production of black-and-white printed facsimiles * Results of producing
3039microfilm from digital files * Advantages of creating microfilm * Details
3040concerning production * Costs * Role of digital technology in library
3041preservation *
3042+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3043
3044Anne KENNEY, associate director, Department of Preservation and
3045Conservation, Cornell University, opened her talk by observing that the
3046Cornell Xerox Project (CXP) has been guided by the assumption that the
3047ability to produce printed facsimiles or to replace paper with paper
3048would be important, at least for the present generation of users and
3049equipment.  She described three factors that influenced development of
3050the project:  1) Because the project has emphasized the preservation of
3051deteriorating brittle books, the quality of what was produced had to be
3052sufficiently high to return a paper replacement to the shelf.  CXP was
3053only interested in using:  2) a system that was cost-effective, which
3054meant that it had to be cost-competitive with the processes currently
3055available, principally photocopy and microfilm, and 3) new or currently
3056available product hardware and software.
3057
3058KENNEY described the advantages that using digital technology offers over
3059both photocopy and microfilm:  1) The potential exists to create a higher
3060quality reproduction of a deteriorating original than conventional
3061light-lens technology.  2) Because a digital image is an encoded
3062representation, it can be reproduced again and again with no resulting
3063loss of quality, as opposed to the situation with light-lens processes,
3064in which there is discernible difference between a second and a
3065subsequent generation of an image.  3) A digital image can be manipulated
3066in a number of ways to improve image capture; for example, Xerox has
3067developed a windowing application that enables one to capture a page
3068containing both text and illustrations in a manner that optimizes the
3069reproduction of both.  (With light-lens technology, one must choose which
3070to optimize, text or the illustration; in preservation microfilming, the
3071current practice is to shoot an illustrated page twice, once to highlight
3072the text and the second time to provide the best capture for the
3073illustration.)  4) A digital image can also be edited, density levels
3074adjusted to remove underlining and stains, and to increase legibility for
3075faint documents.  5) On-screen inspection can take place at the time of
3076initial setup and adjustments made prior to scanning, factors that
3077substantially reduce the number of retakes required in quality control.
3078
3079A primary goal of CXP has been to evaluate the paper output printed on
3080the Xerox DocuTech, a high-speed printer that produces 600-dpi pages from
3081scanned images at a rate of 135 pages a minute.  KENNEY recounted several
3082publishing challenges to represent faithful and legible reproductions of
3083the originals that the 600-dpi copy for the most part successfully
3084captured.  For example, many of the deteriorating volumes in the project
3085were heavily illustrated with fine line drawings or halftones or came in
3086languages such as Japanese, in which the buildup of characters comprised
3087of varying strokes is difficult to reproduce at lower resolutions; a
3088surprising number of them came with annotations and mathematical
3089formulas, which it was critical to be able to duplicate exactly.
3090
KENNEY noted that 1) the copies are being printed on paper that meets the
ANSI standards for performance, 2) the DocuTech printer meets the machine
and toner requirements for proper adhesion of print to page, as described
by the National Archives, and thus 3) the paper product is considered to
be the archival equivalent of preservation photocopy.
3096
3097KENNEY then discussed several samples of the quality achieved in the
3098project that had been distributed in a handout, for example, a copy of a
3099print-on-demand version of the 1911 Reed lecture on the steam turbine,
3100which contains halftones, line drawings, and illustrations embedded in
3101text; the first four loose pages in the volume compared the capture
3102capabilities of scanning to photocopy for a standard test target, the
3103IEEE standard 167A 1987 test chart.  In all instances scanning proved
3104superior to photocopy, though only slightly more so in one.
3105
Conceding the simplistic nature of her comparison of scanning with
photocopy, KENNEY described it as one representation of the kinds of
settings that could be used with scanning capabilities on the equipment
CXP uses.  KENNEY also pointed out that CXP investigated the quality
achieved with binary scanning only, and noted the great promise of
gray-scale and color scanning, whose advantages and disadvantages need to
be examined.  She argued further that scanning resolutions and file
formats represent a complex trade-off among the time it takes to capture
material, file size, fidelity to the original, on-screen display,
printing, and equipment availability.  All these factors must be taken
into consideration.
3117
3118CXP placed primary emphasis on the production in a timely and
3119cost-effective manner of printed facsimiles that consisted largely of
3120black-and-white text.  With binary scanning, large files may be
3121compressed efficiently and in a lossless manner (i.e., no data is lost in
3122the process of compressing [and decompressing] an image--the exact
3123bit-representation is maintained) using Group 4 CCITT (i.e., the French
3124acronym for International Consultative Committee for Telegraph and
3125Telephone) compression.  CXP was getting compression ratios of about
3126forty to one.  Gray-scale compression, which primarily uses JPEG, is much
3127less economical and can represent a lossy compression (i.e., not
3128lossless), so that as one compresses and decompresses, the illustration
3129is subtly changed.  While binary files produce a high-quality printed
3130version, it appears 1) that other combinations of spatial resolution with
3131gray and/or color hold great promise as well, and 2) that gray scale can
3132represent a tremendous advantage for on-screen viewing.  The quality
3133associated with binary and gray scale also depends on the equipment used.
3134For instance, binary scanning produces a much better copy on a binary
3135printer.
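
As a rough illustration of what a forty-to-one ratio means in storage
terms, the following sketch (Python) computes the size of one binary page
image at 600 dpi.  The 8.5 x 11 inch page size and the function name are
assumptions made for the example, not figures from the talk; only the
600-dpi resolution and the forty-to-one ratio come from KENNEY's remarks.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size, in bytes, of a 1-bit-per-pixel page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8  # eight binary pixels are packed into each byte


# Assumed page size: 8.5 x 11 inches (not stated in the talk).
raw = bitonal_page_bytes(8.5, 11, 600)

# "About forty to one" is the Group 4 ratio reported for CXP.
compressed = raw / 40

print(f"uncompressed: {raw / 1e6:.1f} MB")
print(f"compressed:   {compressed / 1e3:.0f} KB")
```

Under these assumptions a page occupies roughly 4.2 MB uncompressed and
about 105 KB after compression; actual ratios vary with page content.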
3136
Among CXP's findings concerning the production of microfilm from digital
files, KENNEY reported that the digital files for the same Reed lecture
were used to produce sample film using an electron beam recorder.  The
resulting film was faithful to the image capture of the digital files,
and while CXP felt that the text and image pages represented in the Reed
lecture were superior to those of the light-lens film, the resolution
readings for the 600-dpi film were not as high as those for standard
microfilming.  KENNEY argued that the standards defined for light-lens
technology are not totally transferable to a digital environment.
Moreover, they are based on a definition of quality for a preservation
copy.  Although making this case will prove to be a long, uphill
struggle, CXP plans to continue to investigate the issue over the course
of the next year.
3149
3150KENNEY concluded this portion of her talk with a discussion of the
3151advantages of creating film:  it can serve as a primary backup and as a
3152preservation master to the digital file; it could then become the print
3153or production master and service copies could be paper, film, optical
3154disks, magnetic media, or on-screen display.
3155
3156Finally, KENNEY presented details re production:
3157
     * Development and testing of a moderately high-resolution
     production scanning workstation represented a third goal of CXP; to
     date, 1,000 volumes have been scanned, or about 300,000 images.
3161
3162     * The resulting digital files are stored and used to produce
3163     hard-copy replacements for the originals and additional prints on
3164     demand; although the initial costs are high, scanning technology
3165     offers an affordable means for reformatting brittle material.
3166
3167     * A technician in production mode can scan 300 pages per hour when
3168     performing single-sheet scanning, which is a necessity when working
3169     with truly brittle paper; this figure is expected to increase
3170     significantly with subsequent iterations of the software from Xerox;
3171     a three-month time-and-cost study of scanning found that the average
3172     300-page book would take about an hour and forty minutes to scan
3173     (this figure included the time for setup, which involves keying in
3174     primary bibliographic data, going into quality control mode to
3175     define page size, establishing front-to-back registration, and
3176     scanning sample pages to identify a default range of settings for
3177     the entire book--functions not dissimilar to those performed by
3178     filmers or those preparing a book for photocopy).
3179
3180     * The final step in the scanning process involved rescans, which
3181     happily were few and far between, representing well under 1 percent
3182     of the total pages scanned.
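
The production figures above can be related to one another with a little
arithmetic.  The sketch below (Python) is illustrative only:  it assumes
that the 300-pages-per-hour rate applies to the average book in the
time-and-cost study, and treats whatever time is left over as setup.
That split is an inference, not a figure KENNEY reported, and the
function names are hypothetical.

```python
PAGES_PER_HOUR = 300          # single-sheet scanning rate, as reported
BOOK_PAGES = 300              # "average 300-page book"
TOTAL_MINUTES_PER_BOOK = 100  # "about an hour and forty minutes"


def scan_minutes(pages, pages_per_hour=PAGES_PER_HOUR):
    """Minutes of pure scanning time at a steady hourly rate."""
    return pages * 60 / pages_per_hour


def implied_setup_minutes(pages=BOOK_PAGES, total=TOTAL_MINUTES_PER_BOOK):
    """Time left for setup if scanning ran at the quoted rate (an inference)."""
    return total - scan_minutes(pages)


print("scanning minutes per book:", scan_minutes(BOOK_PAGES))
print("implied setup minutes:    ", implied_setup_minutes())
print("average images per volume:", 300_000 // 1_000)
```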
3183
In addition to technician time, CXP costed out equipment (amortized over
four years), the cost of storing and refreshing the digital files every
four years, and the cost of printing and binding (in book cloth) a paper
reproduction.  The total amounted to a little under $65 per single
300-page volume, with 30 percent overhead included--a figure competitive
with the prices currently charged by photocopy vendors.
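
To put the per-volume figure in per-page terms, the following sketch
(Python) breaks it down.  It treats $65 as an upper bound ("a little
under") and assumes the 30 percent overhead was applied as a markup on
the underlying cost; the talk does not say how overhead was computed, so
the base and overhead amounts are illustrative only.

```python
def cost_breakdown(total_per_volume=65.0, pages=300, overhead_rate=0.30):
    """Split a per-volume total into per-page cost, base cost, and overhead.

    Assumes overhead was added as a markup: total = base * (1 + rate).
    """
    base = total_per_volume / (1 + overhead_rate)
    return {
        "per_page": total_per_volume / pages,
        "base": base,
        "overhead": total_per_volume - base,
    }


for name, dollars in cost_breakdown().items():
    print(f"{name:>9}: ${dollars:.2f}")
```

On these assumptions the total works out to about 22 cents a page, of
which roughly $50 per volume is direct cost and $15 is overhead.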
3190
3191Of course, with scanning, in addition to the paper facsimile, one is left
3192with a digital file from which subsequent copies of the book can be
3193produced for a fraction of the cost of photocopy, with readers afforded
3194choices in the form of these copies.
3195
3196KENNEY concluded that digital technology offers an electronic means for a
3197library preservation effort to pay for itself.  If a brittle-book program
3198included the means of disseminating reprints of books that are in demand
3199by libraries and researchers alike, the initial investment in capture
3200could be recovered and used to preserve additional but less popular
3201books.  She disclosed that an economic model for a self-sustaining
3202program could be developed for CXP's report to the Commission on
3203Preservation and Access (CPA).
3204
3205KENNEY stressed that the focus of CXP has been on obtaining high quality
3206in a production environment.  The use of digital technology is viewed as
3207an affordable alternative to other reformatting options.
3208
3209                                 ******
3210
3211+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3212ANDRE * Overview and history of NATDP * Various agricultural CD-ROM
3213products created inhouse and by service bureaus * Pilot project on
3214Internet transmission * Additional products in progress *
3215+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3216
Pamela ANDRE, associate director for automation, National Agricultural
Text Digitizing Program (NATDP), National Agricultural Library (NAL),
presented an overview of NATDP, which has been underway at NAL for the
last four years, before Judith ZIDAR discussed the technical details.
ANDRE defined agricultural information as a broad range of material,
from basic and applied research in the hard sciences to the one-page
pamphlets that are distributed by the cooperative state extension
services on such things as how to grow blueberries.
3225
3226NATDP began in late 1986 with a meeting of representatives from the
3227land-grant library community to deal with the issue of electronic
3228information.  NAL and forty-five of these libraries banded together to
3229establish this project--to evaluate the technology for converting what
3230were then source documents in paper form into electronic form, to provide
3231access to that digital information, and then to distribute it.
3232Distributing that material to the community--the university community as
3233well as the extension service community, potentially down to the county
3234level--constituted the group's chief concern.
3235
3236Since January 1988 (when the microcomputer-based scanning system was
3237installed at NAL), NATDP has done a variety of things, concerning which
3238ZIDAR would provide further details.  For example, the first technology
3239considered in the project's discussion phase was digital videodisc, which
3240indicates how long ago it was conceived.
3241
Over the four years of this project, four separate CD-ROM products on
four different agricultural topics were created, two at a
scanning-and-OCR station installed at NAL, and two by service bureaus.
Thus, NATDP has gained comparative information on the relative costs of
the two approaches.  Each of these products contained the full ASCII text
as well as page images of the material, that is, between 4,000 and 6,000
pages of material on these disks.  The topics were aquaculture; food,
agriculture and science (i.e., international agriculture and research);
acid rain; and Agent Orange, which was the final product distributed
(approximately eighteen months before the Workshop).
3252
3253The third phase of NATDP focused on delivery mechanisms other than
3254CD-ROM.  At the suggestion of Clifford LYNCH, who was a technical
3255consultant to the project at this point, NATDP became involved with the
3256Internet and initiated a project with the help of North Carolina State
3257University, in which fourteen of the land-grant university libraries are
3258transmitting digital images over the Internet in response to interlibrary
3259loan requests--a topic for another meeting.  At this point, the pilot
3260project had been completed for about a year and the final report would be
3261available shortly after the Workshop.  In the meantime, the project's
3262success had led to its extension.  (ANDRE noted that one of the first
3263things done under the program title was to select a retrieval package to
3264use with subsequent products; Windows Personal Librarian was the package
3265of choice after a lengthy evaluation.)
3266
3267Three additional products had been planned and were in progress:
3268
3269     1) An arrangement with the American Society of Agronomy--a
3270     professional society that has published the Agronomy Journal since
3271     about 1908--to scan and create bit-mapped images of its journal.
3272     ASA granted permission first to put and then to distribute this
3273     material in electronic form, to hold it at NAL, and to use these
3274     electronic images as a mechanism to deliver documents or print out
3275     material for patrons, among other uses.  Effectively, NAL has the
3276     right to use this material in support of its program.
3277     (Significantly, this arrangement offers a potential cooperative
3278     model for working with other professional societies in agriculture
3279     to try to do the same thing--put the journals of particular interest
3280     to agriculture research into electronic form.)
3281
3282     2) An extension of the earlier product on aquaculture.
3283
3284     3) The George Washington Carver Papers--a joint project with
3285     Tuskegee University to scan and convert from microfilm some 3,500
3286     images of Carver's papers, letters, and drawings.
3287
3288It was anticipated that all of these products would appear no more than
3289six months after the Workshop.
3290
3291                                 ******
3292
3293+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3294ZIDAR * (A separate arena for scanning) * Steps in creating a database *
3295Image capture, with and without performing OCR * Keying in tracking data
3296* Scanning, with electronic and manual tracking * Adjustments during
3297scanning process * Scanning resolutions * Compression * De-skewing and
3298filtering * Image capture from microform:  the papers and letters of
3299George Washington Carver * Equipment used for a scanning system *
3300+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3301
3302Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program
3303(NATDP), National Agricultural Library (NAL), illustrated the technical
3304details of NATDP, including her primary responsibility, scanning and
3305creating databases on a topic and putting them on CD-ROM.
3306
(ZIDAR also noted a separate arena, distinct from the CD-ROM projects
although the processing of the material is nearly identical, in which
NATDP is scanning material and loading it on a NeXT microcomputer, which
in turn is linked to NAL's integrated library system.  Thus, searches in
NAL's bibliographic database will enable people to pull up actual page
images and text for any documents that have been entered.)
3313
3314In accordance with the session's topic, ZIDAR focused her illustrated
3315talk on image capture, offering a primer on the three main steps in the
3316process:  1) assemble the printed publications; 2) design the database
3317(database design occurs in the process of preparing the material for
3318scanning; this step entails reviewing and organizing the material,
3319defining the contents--what will constitute a record, what kinds of
3320fields will be captured in terms of author, title, etc.); 3) perform a
3321certain amount of markup on the paper publications.  NAL performs this
3322task record by record, preparing work sheets or some other sort of
3323tracking material and designing descriptors and other enhancements to be
3324added to the data that will not be captured from the printed publication.
3325Part of this process also involves determining NATDP's file and directory
3326structure:  NATDP attempts to avoid putting more than approximately 100
3327images in a directory, because placing more than that on a CD-ROM would
3328reduce the access speed.
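
The directory rule of thumb described above can be sketched as follows
(Python).  The naming scheme, the function name, and the choice to fill
directories in page order are assumptions for illustration; NATDP's
actual layout, which is organized record by record, is not described in
enough detail to reproduce.

```python
def assign_directories(image_names, max_per_dir=100):
    """Group image file names into numbered directories of limited size."""
    layout = {}
    for index, name in enumerate(image_names):
        # Every block of max_per_dir consecutive images shares a directory.
        directory = f"dir{index // max_per_dir:04d}"
        layout.setdefault(directory, []).append(name)
    return layout


# Hypothetical 6,000-page database with made-up file names.
pages = [f"page{n:05d}.tif" for n in range(1, 6001)]
layout = assign_directories(pages)

print(len(layout), "directories")
print(max(len(files) for files in layout.values()), "images in the fullest")
```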
3329
3330This up-front process takes approximately two weeks for a
33316,000-7,000-page database.  The next step is to capture the page images.
3332How long this process takes is determined by the decision whether or not
3333to perform OCR.  Not performing OCR speeds the process, whereas text
3334capture requires greater care because of the quality of the image:  it
3335has to be straighter and allowance must be made for text on a page, not
3336just for the capture of photographs.
3337
3338NATDP keys in tracking data, that is, a standard bibliographic record
3339including the title of the book and the title of the chapter, which will
3340later either become the access information or will be attached to the
3341front of a full-text record so that it is searchable.
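
One way to picture a tracking record being "attached to the front of a
full-text record" is sketched below (Python).  The field labels, class
name, and sample text are hypothetical; the talk specifies only that the
record includes the title of the book and the title of the chapter.

```python
from dataclasses import dataclass


@dataclass
class TrackingRecord:
    """Minimal bibliographic tracking data keyed in before scanning."""
    book_title: str
    chapter_title: str

    def header(self):
        # Hypothetical labels; any consistent field tagging would serve.
        return f"TITLE: {self.book_title}\nCHAPTER: {self.chapter_title}\n"


def searchable_text(record, ocr_text):
    """Prepend the tracking data so that it is indexed with the full text."""
    return record.header() + "\n" + ocr_text


record = TrackingRecord("Sample Aquaculture Handbook", "Pond Construction")
print(searchable_text(record, "Text captured by OCR would follow here."))
```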
3342
Images are scanned from a bound or unbound publication; in NATDP's case
they come chiefly from bound publications, because these are often the
only copies and the publications are returned to the shelves.  NATDP
usually scans one record at a time, because its database tracking system
tracks the document in that way and does not require further logical
separating of the images.  After performing optical character
recognition, NATDP moves the images off the hard disk and maintains a
volume sheet.  Though the system tracks electronically, all the
processing steps are also tracked manually with a log sheet.
3352
ZIDAR next illustrated the kinds of adjustments that one can make when
scanning from paper and microfilm, for example, redoing images that need
special handling, setting for dithering or gray scale, and adjusting
brightness, either for individual pages or for the whole book at one
time.
3357
3358NATDP is scanning at 300 dots per inch, a standard scanning resolution.
3359Though adequate for capturing text that is all of a standard size, 300
3360dpi is unsuitable for any kind of photographic material or for very small
3361text.  Many scanners allow for different image formats, TIFF, of course,
3362being a de facto standard.  But if one intends to exchange images with
3363other people, the ability to scan other image formats, even if they are
3364less common, becomes highly desirable.
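
Since TIFF is named above as the de facto exchange format, a receiving
site might at least confirm that a file carries a TIFF header before
passing it to display or OCR software.  The sketch below (Python) checks
the first four bytes as the TIFF specification defines them:  a
byte-order mark, "II" or "MM", followed by the number 42.  The function
name is illustrative, and a valid header does not guarantee that a given
program supports the compression scheme or tags used inside the file.

```python
import struct


def tiff_byte_order(header):
    """Return the byte order named by a TIFF header, or None if not TIFF."""
    if len(header) < 4:
        return None
    mark = header[:2]
    if mark == b"II":
        fmt, order = "<H", "little-endian"
    elif mark == b"MM":
        fmt, order = ">H", "big-endian"
    else:
        return None
    # The next two bytes must hold 42, read in the byte order just named.
    (magic,) = struct.unpack(fmt, header[2:4])
    return order if magic == 42 else None


print(tiff_byte_order(b"II*\x00"))
print(tiff_byte_order(b"MM\x00*"))
print(tiff_byte_order(b"GIF8"))
```

In practice one would read the first four bytes of the file in binary
mode and pass them to the function.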
3365
CCITT Group 4 is the standard compression for normal black-and-white
images, JPEG for gray scale or color.  ZIDAR recommended 1) using the
standard compressions, particularly if one attempts to make material
available and to allow users to download images and reuse them from
CD-ROMs; and 2) maintaining the ability to output an uncompressed image,
because in image exchange uncompressed images are more likely to be able
to cross platforms.
3373
3374ZIDAR emphasized the importance of de-skewing and filtering as
3375requirements on NATDP's upgraded system.  For instance, scanning bound
3376books, particularly books published by the federal government whose pages
3377are skewed, and trying to scan them straight if OCR is to be performed,
3378is extremely time-consuming.  The same holds for filtering of
3379poor-quality or older materials.
3380
ZIDAR described image capture from microform, using as an example three
reels from a sixty-seven-reel set of the papers and letters of George
Washington Carver that had been produced by Tuskegee University.  These
resulted in approximately 3,500 images, which NATDP had had scanned by
its service contractor, Science Applications International Corporation
(SAIC).  NATDP also created bibliographic records for access.  (NATDP did
not have such specialized equipment as a microfilm scanner.)
3388
3389Unfortunately, the process of scanning from microfilm was not an
3390unqualified success, ZIDAR reported:  because microfilm frame sizes vary,
3391occasionally some frames were missed, which without spending much time
3392and money could not be recaptured.
3393
OCR could not be performed on the scanned images of the frames.  Because
of bleeding in the text, the output produced when OCR was run could not
even be edited.  NATDP tested negative versus positive images, landscape
versus portrait orientation, and single- versus dual-page microfilm, none
of which seemed to affect the quality of the image; but OCR could not be
performed on any of them.
3400
3401In selecting the microfilm they would use, therefore, NATDP had other
3402factors in mind.  ZIDAR noted two factors that influenced the quality of
3403the images:  1) the inherent quality of the original and 2) the amount of
3404size reduction on the pages.
3405
3406The Carver papers were selected because they are informative and visually
3407interesting, treat a single subject, and are valuable in their own right.
3408The images were scanned and divided into logical records by SAIC, then
3409delivered, and loaded onto NATDP's system, where bibliographic
3410information taken directly from the images was added.  Scanning was
3411completed in summer 1991 and by the end of summer 1992 the disk was
3412scheduled to be published.
3413
3414Problems encountered during processing included the following:  Because
3415the microfilm scanning had to be done in a batch, adjustment for
3416individual page variations was not possible.  The frame size varied on
3417account of the nature of the material, and therefore some of the frames
3418were missed while others were just partial frames.  The only way to go
3419back and capture this material was to print out the page with the
3420microfilm reader from the missing frame and then scan it in from the
3421page, which was extremely time-consuming.  The quality of the images
3422scanned from the printout of the microfilm compared unfavorably with that
3423of the original images captured directly from the microfilm.  The
3424inability to perform OCR also was a major disappointment.  At the time,
3425computer output microfilm was unavailable to test.
3426
The equipment used for a scanning system was the last topic addressed by
ZIDAR.  The type of equipment that one would purchase for a scanning
system included:  a microcomputer, at least a 386, but preferably a 486;
a large hard disk, 380 megabytes at minimum; a multi-tasking operating
system that allows one to run some things in batch in the background
while scanning or doing text editing, for example, Unix or OS/2 and,
theoretically, Windows; a high-speed scanner and scanning software that
allows one to make the various adjustments mentioned earlier; a
high-resolution monitor (150 dpi); OCR software and hardware to perform
text recognition; an optical disk subsystem on which to archive all the
images as the processing is done; and file management and tracking
software.
3438
ZIDAR opined that the software one purchases was more important than the
hardware and might also cost more, and that it was likely to prove
critical to the success or failure of one's system.  In addition to a
stand-alone scanning workstation for image capture, then, text capture
requires one or two editing stations networked to this scanning station
to perform editing.  Editing the text takes two or three times as long as
capturing the images.
3446
3447Finally, ZIDAR stressed the importance of buying an open system that allows
3448for more than one vendor, complies with standards, and can be upgraded.
3449
3450                                 ******
3451
3452+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
WATERS * Yale University Library's master plan to convert microfilm to
3454digital imagery (POB) * The place of electronic tools in the library of
3455the future * The uses of images and an image library * Primary input from
3456preservation microfilm * Features distinguishing POB from CXP and key
3457hypotheses guiding POB * Use of vendor selection process to facilitate
3458organizational work * Criteria for selecting vendor * Finalists and
3459results of process for Yale * Key factor distinguishing vendors *
3460Components, design principles, and some estimated costs of POB * Role of
3461preservation materials in developing imaging market * Factors affecting
3462quality and cost * Factors affecting the usability of complex documents
3463in image form *
3464+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3465
3466Donald WATERS, head of the Systems Office, Yale University Library,
3467reported on the progress of a master plan for a project at Yale to
3468convert microfilm to digital imagery, Project Open Book (POB).  Stating
3469that POB was in an advanced stage of planning, WATERS detailed, in
3470particular, the process of selecting a vendor partner and several key
3471issues under discussion as Yale prepares to move into the project itself.
3472He commented first on the vision that serves as the context of POB and
3473then described its purpose and scope.
3474
3475WATERS sees the library of the future not necessarily as an electronic
3476library but as a place that generates, preserves, and improves for its
3477clients ready access to both intellectual and physical recorded
3478knowledge.  Electronic tools must find a place in the library in the
3479context of this vision.  Several roles for electronic tools include
3480serving as:  indirect sources of electronic knowledge or as "finding"
3481aids (the on-line catalogues, the article-level indices, registers for
3482documents and archives); direct sources of recorded knowledge; full-text
3483images; and various kinds of compound sources of recorded knowledge (the
3484so-called compound documents of Hypertext, mixed text and image,
3485mixed-text image format, and multimedia).
3486
POB is looking particularly at images and an image library:  the uses to
which images will be put (e.g., storage, printing, browsing, and then use
as input for other processes), OCR as a process subsequent to image
capture or to the creation of an image library, and possibly also the
generation of microfilm.
3492
3493While input will come from a variety of sources, POB is considering
3494especially input from preservation microfilm.  A possible outcome is that
3495the film and paper which provide the input for the image library
3496eventually may go off into remote storage, and that the image library may
3497be the primary access tool.
3498
3499The purpose and scope of POB focus on imaging.  Though related to CXP,
3500POB has two features which distinguish it:  1) scale--conversion of
350110,000 volumes into digital image form; and 2) source--conversion from
3502microfilm.  Given these features, several key working hypotheses guide
3503POB, including:  1) Since POB is using microfilm, it is not concerned with
3504the image library as a preservation medium.  2) Digital imagery can improve
3505access to recorded knowledge through printing and network distribution at
3506a modest incremental cost of microfilm.  3) Capturing and storing documents
3507in a digital image form is necessary to further improvements in access.
3508(POB distinguishes between the imaging, digitizing process and OCR,
3509which at this stage it does not plan to perform.)
3510
3511Currently in its first or organizational phase, POB found that it could
3512use a vendor selection process to facilitate a good deal of the
3513organizational work (e.g., creating a project team and advisory board,
3514confirming the validity of the plan, establishing the cost of the project
3515and a budget, selecting the materials to convert, and then raising the
3516necessary funds).
3517
3518POB developed numerous selection criteria, including:  a firm committed
3519to image-document management, the ability to serve as systems integrator
3520in a large-scale project over several years, interest in developing the
3521requisite software as a standard rather than a custom product, and a
3522willingness to invest substantial resources in the project itself.
3523
3524Two vendors, DEC and Xerox, were selected as finalists in October 1991,
3525and with the support of the Commission on Preservation and Access, each
3526was commissioned to generate a detailed requirements analysis for the
3527project and then to submit a formal proposal for the completion of the
3528project, which included a budget and costs. The terms were that POB would
3529pay the loser.  The results for Yale of involving a vendor included:
3530broad involvement of Yale staff across the board at a relatively low
3531cost, which may have long-term significance in carrying out the project
3532(twenty-five to thirty university people are engaged in POB); better
3533understanding of the factors that affect corporate response to markets
3534for imaging products; a competitive proposal; and a more sophisticated
3535view of the imaging markets.
3536
3537The most important factor that distinguished the vendors under
3538consideration was their identification with the customer.  The size and
3539internal complexity of the company also was an important factor.  POB was
3540looking at large companies that had substantial resources.  In the end,
3541the process generated for Yale two competitive proposals, with Xerox's
3542the clear winner.  WATERS then described the components of the proposal,
3543the design principles, and some of the costs estimated for the process.
3544
3545Components are essentially four:  a conversion subsystem, a
3546network-accessible storage subsystem for 10,000 books (and POB expects
3547200 to 600 dpi storage), browsing stations distributed on the campus
3548network, and network access to the image printers.
3549
Among the design principles, POB wanted conversion at the highest
possible resolution.  Assuming TIFF files with Group 4 compression,
TCP/IP, and an Ethernet network on campus, POB wanted a client-server
approach with image documents distributed to the workstations and made
accessible through native workstation interfaces such as Windows.  POB
also insisted on a phased approach to implementation:  1) a stand-alone,
single-user, low-cost entry into the business with a workstation focused
on conversion and allowing POB to explore user access; 2) movement into a
higher-volume conversion with network-accessible storage and multiple
access stations; and 3) a high-volume conversion, full-capacity storage,
and multiple browsing stations distributed throughout the campus.
3562
3563The costs proposed for start-up assumed the existence of the Yale network
3564and its two DocuTech image printers.  Other start-up costs are estimated
3565at $1 million over the three phases.  At the end of the project, the annual
3566operating costs estimated primarily for the software and hardware proposed
3567come to about $60,000, but these exclude costs for labor needed in the
3568conversion process, network and printer usage, and facilities management.
3569
3570Finally, the selection process produced for Yale a more sophisticated
3571view of the imaging markets:  the management of complex documents in
3572image form is not a preservation problem, not a library problem, but a
3573general problem in a broad, general industry.  Preservation materials are
3574useful for developing that market because of the qualities of the
3575material.  For example, much of it is out of copyright.  The resolution
3576of key issues such as the quality of scanning and image browsing also
3577will affect development of that market.
3578
The technology is readily available but changing rapidly.  In this
context of rapid change, several factors affect quality and cost, to
which POB intends to pay particular attention, for example, the various
levels of resolution that can be achieved.  POB believes it can bring
resolution up to 600 dpi, but an interpolation process from 400 to 600 is
more likely.  The variation in the quality of microfilm will prove to be
a highly important factor.  POB may reexamine the standards used to film
in the first place by looking at this process as a follow-on to
microfilming.
3587
3588Other important factors include:  the techniques available to the
3589operator for handling material, the ways of integrating quality control
3590into the digitizing work flow, and a work flow that includes indexing and
3591storage.  POB's requirement was to be able to deal with quality control
3592at the point of scanning.  Thus, thanks to Xerox, POB anticipates having
3593a mechanism which will allow it not only to scan in batch form, but to
3594review the material as it goes through the scanner and control quality
3595from the outset.
3596
3597The standards for measuring quality and costs depend greatly on the uses
3598of the material, including subsequent OCR, storage, printing, and
3599browsing.  But especially at issue for POB is the facility for browsing.
3600This facility, WATERS said, is perhaps the weakest aspect of imaging
3601technology and the most in need of development.
3602
A variety of factors affect the usability of complex documents in image
form, among them:  1) the ability of the system to handle the full range
of document types, not just monographs but serials, multi-part
monographs, and manuscripts; 2) the location of the database of record
for bibliographic information about the image document, which POB wants
to enter once and in the most useful place, the on-line catalog; 3) a
document identifier for referencing the bibliographic information in one
place and the images in another; 4) the technique for making the basic
internal structure of the document accessible to the reader; and finally,
5) the physical presentation on the CRT of those documents.  POB is ready
to complete this phase now.  One last decision remains:  which material
to scan.
3615
3616                                 ******
3617
3618+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3619DISCUSSION * TIFF files constitute de facto standard * NARA's experience
3620with image conversion software and text conversion * RFC 1314 *
3621Considerable flux concerning available hardware and software solutions *
3622NAL through-put rate during scanning * Window management questions *
3623+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3624
3625In the question-and-answer period that followed WATERS's presentation,
3626the following points emerged:
3627
3628     * ZIDAR's statement about using TIFF files as a standard meant de
3629     facto standard.  This is what most people use and typically exchange
3630     with other groups, across platforms, or even occasionally across
3631     display software.
3632
3633     * HOLMES commented on the unsuccessful experience of NARA in
3634     attempting to run image-conversion software or to exchange between
3635     applications:  What are supposedly TIFF files go into other software
3636     that is supposed to be able to accept TIFF but cannot recognize the
3637     format and cannot deal with it, and thus renders the exchange
3638     useless.  Re text conversion, he noted the different recognition
3639     rates obtained by substituting the make and model of scanners in
3640     NARA's recent test of an "intelligent" character-recognition product
3641     for a new company.  In the selection of hardware and software,
3642     HOLMES argued, software no longer constitutes the overriding factor
3643     it did until about a year ago; rather it is perhaps important to
3644     look at both now.
3645
3646     * Danny Cohen and Alan Katz of the University of Southern California
3647     Information Sciences Institute began circulating as an Internet RFC
3648     (RFC 1314) about a month ago a standard for a TIFF interchange
3649     format for Internet distribution of monochrome bit-mapped images,
3650     which LYNCH said he believed would be used as a de facto standard.
3651
3652     * FLEISCHHAUER's impression from hearing these reports and thinking
3653     about AM's experience was that there is considerable flux concerning
3654     available hardware and software solutions.  HOOTON agreed and
3655     commented at the same time on ZIDAR's statement that the equipment
3656     employed affects the results produced.  One cannot draw a complete
3657     conclusion by saying it is difficult or impossible to perform OCR
3658     from scanning microfilm, for example, with that device,  that set of
3659     parameters, and system requirements, because numerous other people
3660     are accomplishing just that, using other components, perhaps.
3661     HOOTON opined that both the hardware and the software were highly
3662     important.  Most of the problems discussed today have been solved in
3663     numerous different ways by other people.  Though it is good to be
3664     cognizant of various experiences, this is not to say that it will
3665     always be thus.
3666
3667     * At NAL, the through-put rate of the scanning process for paper,
3668     page by page, performing OCR, ranges from 300 to 600 pages per day;
3669     not performing OCR is considerably faster, although how much faster
3670     is not known.  This is for scanning from bound books, which is much
3671     slower.
3672
3673     * WATERS commented on window management questions:  DEC proposed an
3674     X-Windows solution which was problematical for two reasons.  One was
3675     POB's requirement to be able to manipulate images on the workstation
3676     and bring them down to the workstation itself and the other was
3677     network usage.
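
HOLMES's point about files that claim to be TIFF but are not recognized
by other software can be illustrated at the lowest level.  A TIFF file
opens with an 8-byte header:  a byte-order mark ("II" or "MM"), the
number 42, and the offset of the first image file directory.  The sketch
below checks only that header; the function name is a hypothetical
helper, and passing the check says nothing about whether a given reader
supports the tags or compression used inside the file.

```python
import struct


def read_tiff_header(data):
    """Return (byte_order, first_ifd_offset), or raise ValueError if not TIFF."""
    if len(data) < 8:
        raise ValueError("too short to be a TIFF file")
    # "II" means little-endian (Intel), "MM" means big-endian (Motorola).
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("missing II/MM byte-order mark")
    magic, ifd_offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("bad magic number")
    return ("little" if order == "<" else "big"), ifd_offset


# Minimal hand-built headers, for illustration only.
print(read_tiff_header(b"II*\x00\x08\x00\x00\x00"))
print(read_tiff_header(b"MM\x00*\x00\x00\x00\x08"))
```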
3678
3679                                 ******
3680
3681+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3682THOMA * Illustration of deficiencies in scanning and storage process *
3683Image quality in this process * Different costs entailed by better image
quality * Techniques for overcoming various deficiencies:  fixed
3685thresholding, dynamic thresholding, dithering, image merge * Page edge
3686effects *
3687+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3688
3689George THOMA, chief, Communications Engineering Branch, National Library
3690of Medicine (NLM), illustrated several of the deficiencies discussed by
3691the previous speakers.  He introduced the topic of special problems by
3692noting the advantages of electronic imaging.  For example, it is regenerable
3693because it is a coded file, and real-time quality control is possible with
3694electronic capture, whereas in photographic capture it is not.
3695
3696One of the difficulties discussed in the scanning and storage process was
3697image quality which, without belaboring the obvious, means different
3698things for maps, medical X-rays, or broadcast television.  In the case of
3699documents, THOMA said, image quality boils down to legibility of the
3700textual parts, and fidelity in the case of gray or color photo print-type
3701material.  Legibility boils down to scan density, the standard in most
3702cases being 300 dpi.  Increasing the resolution with scanners that
3703perform 600 or 1200 dpi, however, comes at a cost.
3704
3705Better image quality entails at least four different kinds of costs:  1)
equipment costs, because the CCD (i.e., charge-coupled device) with a
3707greater number of elements costs more;  2) time costs that translate to
3708the actual capture costs, because manual labor is involved (the time is
3709also dependent on the fact that more data has to be moved around in the
3710machine in the scanning or network devices that perform the scanning as
3711well as the storage);  3) media costs, because at high resolutions larger
3712files have to be stored; and 4) transmission costs, because there is just
3713more data to be transmitted.
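
As a rough illustration of the media and transmission costs, the
uncompressed size of a scanned page grows with the square of the scan
density.  The sketch below assumes an 8.5" by 11" page stored at 1 bit
per pixel with no compression; the page size and the function name are
illustrative assumptions, not figures from THOMA's presentation.

```python
def uncompressed_bytes(dpi, width_in=8.5, height_in=11.0, bits_per_pixel=1):
    """Return the uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


for dpi in (300, 600, 1200):
    size = uncompressed_bytes(dpi)
    print(f"{dpi:>5} dpi: {size / 1_000_000:.1f} MB uncompressed")
```

Doubling the resolution quadruples the number of pixels, so storage and
transmission volumes rise fourfold before any compression is applied.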
3714
3715But while resolution takes care of the issue of legibility in image
3716quality, other deficiencies have to do with contrast and elements on the
3717page scanned or the image that needed to be removed or clarified.  Thus,
3718THOMA proceeded to illustrate various deficiencies, how they are
3719manifested, and several techniques to overcome them.
3720
3721Fixed thresholding was the first technique described, suitable for
3722black-and-white text, when the contrast does not vary over the page.  One
3723can have many different threshold levels in scanning devices.  Thus,
3724THOMA offered an example of extremely poor contrast, which resulted from
3725the fact that the stock was a heavy red.  This is the sort of image that
3726when microfilmed fails to provide any legibility whatsoever.  Fixed
3727thresholding is the way to change the black-to-red contrast to the
3728desired black-to-white contrast.
3729
3730Other examples included material that had been browned or yellowed by
3731age.  This was also a case of contrast deficiency, and correction was
done by fixed thresholding.  A final example came down to the same thing:
the contrast varied slightly, but not significantly.  Fixed thresholding
3734solves this problem as well.  The microfilm equivalent is certainly legible,
3735but it comes with dark areas.  Though THOMA did not have a slide of the
3736microfilm in this case, he did show the reproduced electronic image.
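
Fixed thresholding, as described above, can be sketched in a few lines.
The example assumes an 8-bit gray scale (0 is black, 255 is white) held
as nested lists; the pixel values and the threshold of 80 are made up to
mimic dark text on a dark stock and do not come from THOMA's slides.

```python
def fixed_threshold(gray, threshold=128):
    """Map every gray value to black (0) or white (255) with one cutoff."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


# Hypothetical scan line: text (about 40) on a dark colored stock (about 110).
scan_line = [[110, 40, 110, 45, 110]]
print(fixed_threshold(scan_line, threshold=80))
```

Because a single cutoff is applied to the whole page, the method works
only when the contrast does not vary much from one area to another.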
3737
3738When one has variable contrast over a page or the lighting over the page
3739area varies, especially in the case where a bound volume has light
3740shining on it, the image must be processed by a dynamic thresholding
3741scheme.  One scheme, dynamic averaging, allows the threshold level not to
3742be fixed but to be recomputed for every pixel from the neighboring
3743characteristics.  The neighbors of a pixel determine where the threshold
3744should be set for that pixel.
3745
3746THOMA showed an example of a page that had been made deficient by a
3747variety of techniques, including a burn mark, coffee stains, and a yellow
3748marker.  Application of a fixed-thresholding scheme, THOMA argued, might
3749take care of several deficiencies on the page but not all of them.
3750Performing the calculation for a dynamic threshold setting, however,
3751removes most of the deficiencies so that at least the text is legible.
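
One common way to realize the kind of dynamic averaging described above
is to compare each pixel with the mean of its neighborhood.  The sketch
below is one such local-mean scheme, not necessarily the one NLM used;
the neighborhood radius and the offset are assumed tuning parameters,
and the sample values are invented.

```python
def dynamic_threshold(gray, radius=1, offset=10):
    """Binarize with a per-pixel threshold: the local mean minus an offset."""
    height, width = len(gray), len(gray[0])
    out = []
    for y in range(height):
        row = []
        for x in range(width):
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            neighbors = [gray[j][i] for j in ys for i in xs]
            local_mean = sum(neighbors) / len(neighbors)
            row.append(0 if gray[y][x] < local_mean - offset else 255)
        out.append(row)
    return out


# Hypothetical scan line whose lighting falls off from left to right.
# Text pixels are 150, 130, 70, 40; the background fades from 200 to 90.
shaded = [[200, 150, 200, 180, 130, 180, 120, 70, 120, 90, 40, 90]]
print(dynamic_threshold(shaded))
```

In this invented line the text at the bright end (150) is lighter than
the background at the dark end (120), so no single fixed threshold can
separate them, whereas the local comparison recovers all four text
pixels.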
3752
3753Another problem is representing a gray level with black-and-white pixels
3754by a process known as dithering or electronic screening.  But dithering
3755does not provide good image quality for pure black-and-white textual
3756material.  THOMA illustrated this point with examples. Although its
3757suitability for photoprint is the reason for electronic screening or
3758dithering, it cannot be used for every compound image.  In the document
3759that was distributed by CXP, THOMA noticed that the dithered image of the
3760IEEE test chart evinced some deterioration in the text.  He presented an
3761extreme example of deterioration in the text in which compounded
3762documents had to be set right by other techniques.  The technique
3763illustrated by the present example was an image merge in which the page
3764is scanned twice and the settings go from fixed threshold to the
3765dithering matrix; the resulting images are merged to give the best
3766results with each technique.
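
The image merge can be sketched as two passes over the same gray data,
one thresholded and one dithered, combined through a mask.  The 2 by 2
Bayer matrix is a standard textbook dithering matrix chosen here for
brevity, and the mask marking photographic regions is an assumption:
the source does not say how the regions were selected.

```python
BAYER_2X2 = [[0, 2], [3, 1]]


def fixed_threshold(gray, threshold=128):
    """Binarize with one cutoff; suited to text."""
    return [[0 if px < threshold else 255 for px in row] for row in gray]


def ordered_dither(gray):
    """Binarize with a tiled Bayer matrix so gray becomes a dot pattern."""
    out = []
    for y, row in enumerate(gray):
        out_row = []
        for x, px in enumerate(row):
            level = (BAYER_2X2[y % 2][x % 2] + 0.5) * 255 / 4
            out_row.append(255 if px > level else 0)
        out.append(out_row)
    return out


def merge(gray, is_photo, threshold=128):
    """Take dithered pixels where is_photo is True, thresholded elsewhere."""
    text_pass = fixed_threshold(gray, threshold)
    photo_pass = ordered_dither(gray)
    return [
        [photo_pass[y][x] if is_photo[y][x] else text_pass[y][x]
         for x in range(len(gray[0]))]
        for y in range(len(gray))
    ]


# Hypothetical page: a mid-gray photo on the left, text on the right.
page = [[128, 128, 100, 200],
        [128, 128, 200, 100]]
mask = [[True, True, False, False],
        [True, True, False, False]]
print(merge(page, mask))
```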
3767
3768THOMA illustrated how dithering is also used in nonphotographic or
3769nonprint materials with an example of a grayish page from a medical text,
3770which was reproduced to show all of the gray that appeared in the
3771original.  Dithering provided a reproduction of all the gray in the
3772original of another example from the same text.
3773
3774THOMA finally illustrated the problem of bordering, or page-edge,
3775effects.  Books and bound volumes that are placed on a photocopy machine
3776or a scanner produce page-edge effects that are undesirable for two
3777reasons:  1) the aesthetics of the image; after all, if the image is to
3778be preserved, one does not necessarily want to keep all of its
3779deficiencies; 2) compression (with the bordering problem THOMA
3780illustrated, the compression ratio deteriorated tremendously).  One way
3781to eliminate this more serious problem is to have the operator at the
3782point of scanning window the part of the image that is desirable and
3783automatically turn all of the pixels out of that picture to white.
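
The windowing step amounts to keeping a rectangle and whitening
everything else.  In the sketch below the window coordinates stand in
for whatever the operator selects at the scanner; the function name and
the sample image are hypothetical.

```python
def window_to_white(image, top, left, bottom, right, white=255):
    """Keep pixels with top <= y < bottom and left <= x < right; whiten the rest."""
    return [
        [px if top <= y < bottom and left <= x < right else white
         for x, px in enumerate(row)]
        for y, row in enumerate(image)
    ]


# Hypothetical scan with dark page-edge shadows on both sides and below.
scan = [[0, 200, 210, 0],
        [0, 50, 220, 0],
        [0, 0, 0, 0]]
print(window_to_white(scan, top=0, left=1, bottom=2, right=3))
```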
3784
3785                                 ******
3786
3787+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3788FLEISCHHAUER * AM's experience with scanning bound materials * Dithering
3789*
3790+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3791
3792Carl FLEISCHHAUER, coordinator, American Memory, Library of Congress,
3793reported AM's experience with scanning bound materials, which he likened
3794to the problems involved in using photocopying machines.  Very few
3795devices in the industry offer book-edge scanning, let alone book cradles.
3796The problem may be unsolvable, FLEISCHHAUER said, because a large enough
3797market does not exist for a preservation-quality scanner.  AM is using a
3798Kurzweil scanner, which is a book-edge scanner now sold by Xerox.
3799
3800Devoting the remainder of his brief presentation to dithering,
3801FLEISCHHAUER related AM's experience with a contractor who was using
3802unsophisticated equipment and software to reduce moire patterns from
3803printed halftones.  AM took the same image and used the dithering
3804algorithm that forms part of the same Kurzweil Xerox scanner; it
3805disguised moire patterns much more effectively.
3806
3807FLEISCHHAUER also observed that dithering produces a binary file which is
3808useful for numerous purposes, for example, printing it on a laser printer
3809without having to "re-halftone" it.  But it tends to defeat efficient
3810compression, because the very thing that dithers to reduce moire patterns
3811also tends to work against compression schemes.  AM thought the
3812difference in image quality was worth it.
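
The tension between dithering and compression can be seen by counting
runs of identical pixels.  Run-length-based schemes, such as the G3 and
G4 compression mentioned elsewhere in these proceedings, do best when
scan lines contain a few long runs.  The sketch below is a deliberate
simplification that counts runs in two invented 16-pixel scan lines; it
is not an implementation of any actual compression scheme.

```python
from itertools import groupby


def run_count(scan_line):
    """Count the runs of consecutive identical pixels in one scan line."""
    return sum(1 for _ in groupby(scan_line))


thresholded = [255] * 8 + [0] * 8   # one white run, one black run
dithered = [255, 0] * 8             # alternating dots from a dither pattern

print(run_count(thresholded), run_count(dithered))  # prints "2 16"
```

Each run needs its own code word, so the dithered line offers far less
for a run-length coder to exploit.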
3813
3814                                 ******
3815
3816+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3817DISCUSSION * Relative use as a criterion for POB's selection of books to
3818be converted into digital form *
3819+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3820
3821During the discussion period, WATERS noted that one of the criteria for
3822selecting books among the 10,000 to be converted into digital image form
3823would be how much relative use they would receive--a subject still
3824requiring evaluation.  The challenge will be to understand whether
3825coherent bodies of material will increase usage or whether POB should
3826seek material that is being used, scan that, and make it more accessible.
3827POB might decide to digitize materials that are already heavily used, in
3828order to make them more accessible and decrease wear on them.  Another
3829approach would be to provide a large body of intellectually coherent
3830material that may be used more in digital form than it is currently used
3831in microfilm.  POB would seek material that was out of copyright.
3832
3833                                 ******
3834
3835+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3836BARONAS * Origin and scope of AIIM * Types of documents produced in
3837AIIM's standards program * Domain of AIIM's standardization work * AIIM's
3838structure * TC 171 and MS23 * Electronic image management standards *
3839Categories of EIM standardization where AIIM standards are being
3840developed *
3841+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3842
3843Jean BARONAS, senior manager, Department of Standards and Technology,
3844Association for Information and Image Management (AIIM), described the
3845not-for-profit association and the national and international programs
3846for standardization in which AIIM is active.
3847
3848Accredited for twenty-five years as the nation's standards development
3849organization for document image management, AIIM began life in a library
3850community developing microfilm standards.  Today the association
3851maintains both its library and business-image management standardization
3852activities--and has moved into electronic image-management
3853standardization (EIM).
3854
3855BARONAS defined the program's scope.  AIIM deals with:  1) the
3856terminology of standards and of the technology it uses; 2) methods of
3857measurement for the systems, as well as quality; 3) methodologies for
3858users to evaluate and measure quality; 4) the features of apparatus used
3859to manage and edit images; and 5) the procedures used to manage images.
3860
3861BARONAS noted that three types of documents are produced in the AIIM
3862standards program:  the first two, accredited by the American National
3863Standards Institute (ANSI), are standards and standard recommended
3864practices.  Recommended practices differ from standards in that they
3865contain more tutorial information.  A technical report is not an ANSI
3866standard.  Because AIIM's policies and procedures for developing
3867standards are approved by ANSI, its standards are labeled ANSI/AIIM,
3868followed by the number and title of the standard.
3869
3870BARONAS then illustrated the domain of AIIM's standardization work.  For
3871example, AIIM is the administrator of the U.S. Technical Advisory Group
3872(TAG) to the International Standards Organization's (ISO) technical
committee, TC 171 Micrographics and Optical Memories for Document and
3874Image Recording, Storage, and Use.  AIIM officially works through ANSI in
3875the international standardization process.
3876
3877BARONAS described AIIM's structure, including its board of directors, its
3878standards board of twelve individuals active in the image-management
3879industry, its strategic planning and legal admissibility task forces, and
3880its National Standards Council, which is comprised of the members of a
3881number of organizations who vote on every AIIM standard before it is
3882published.  BARONAS pointed out that AIIM's liaisons deal with numerous
3883other standards developers, including the optical disk community, office
3884and publishing systems, image-codes-and-character set committees, and the
3885National Information Standards Organization (NISO).
3886
BARONAS illustrated the procedures of TC 171, which covers all aspects of
image management.  When AIIM's national program has conceptualized a new
project, it is usually submitted to the international level, so that the
member countries of TC 171 can simultaneously work on the development of
the standard or the technical report.  BARONAS also illustrated a classic
microfilm standard, MS23, which deals with numerous imaging concepts that
apply to electronic imaging.  Originally developed in the 1970s, revised
in the 1980s, and revised again in 1991, this standard is scheduled for
3895another revision.  MS23 is an active standard whereby users may propose
3896new density ranges and new methods of evaluating film images in the
3897standard's revision.
3898
3899BARONAS detailed several electronic image-management standards, for
3900instance, ANSI/AIIM MS44, a quality-control guideline for scanning 8.5"
3901by 11" black-and-white office documents.  This standard is used with the
3902IEEE fax image--a continuous tone photographic image with gray scales,
3903text, and several continuous tone pictures--and AIIM test target number
39042, a representative document used in office document management.
3905
3906BARONAS next outlined the four categories of EIM standardization in which
3907AIIM standards are being developed:  transfer and retrieval, evaluation,
3908optical disc and document scanning applications, and design and
3909conversion of documents.  She detailed several of the main projects of
3910each:  1) in the category of image transfer and retrieval, a bi-level
3911image transfer format, ANSI/AIIM MS53, which is a proposed standard that
3912describes a file header for image transfer between unlike systems when
3913the images are compressed using G3 and G4 compression; 2) the category of
3914image evaluation, which includes the AIIM-proposed TR26 tutorial on image
3915resolution (this technical report will treat the differences and
3916similarities between classical or photographic and electronic imaging);
39173) design and conversion, which includes a proposed technical report
3918called "Forms Design Optimization for EIM" (this report considers how
3919general-purpose business forms can be best designed so that scanning is
3920optimized; reprographic characteristics such as type, rules, background,
tint, and color will likewise be treated in the technical report); and
4) disk and document scanning applications, which includes projects a)
on planning platters and disk management, b) on generating an
application profile for EIM when images are stored and distributed on
CD-ROM, and c) on evaluating SCSI2 and how a common command set can be
generated for SCSI2 so that document scanners are more easily
integrated.  (ANSI/AIIM MS53 will also apply to compressed images.)
3928
3929                                 ******
3930
3931+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3932BATTIN * The implications of standards for preservation * A major
3933obstacle to successful cooperation * A hindrance to access in the digital
3934environment * Standards a double-edged sword for those concerned with the
3935preservation of the human record * Near-term prognosis for reliable
3936archival standards * Preservation concerns for electronic media * Need
3937for reconceptualizing our preservation principles * Standards in the real
3938world and the politics of reproduction * Need to redefine the concept of
3939archival and to begin to think in terms of life cycles * Cooperation and
3940the La Guardia Eight * Concerns generated by discussions on the problems
3941of preserving text and image * General principles to be adopted in a
3942world without standards *
3943+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3944
3945Patricia BATTIN, president, the Commission on Preservation and Access
3946(CPA), addressed the implications of standards for preservation.  She
3947listed several areas where the library profession and the analog world of
3948the printed book had made enormous contributions over the past hundred
3949years--for example, in bibliographic formats, binding standards, and, most
3950important, in determining what constitutes longevity or archival quality.
3951
3952Although standards have lightened the preservation burden through the
3953development of national and international collaborative programs,
3954nevertheless, a pervasive mistrust of other people's standards remains a
3955major obstacle to successful cooperation, BATTIN said.
3956
3957The zeal to achieve perfection, regardless of the cost, has hindered
3958rather than facilitated access in some instances, and in the digital
3959environment, where no real standards exist, has brought an ironically
3960just reward.
3961
3962BATTIN argued that standards are a double-edged sword for those concerned
3963with the preservation of the human record, that is, the provision of
3964access to recorded knowledge in a multitude of media as far into the
3965future as possible.  Standards are essential to facilitate
3966interconnectivity and access, but, BATTIN said, as LYNCH pointed out
3967yesterday, if set too soon they can hinder creativity, expansion of
3968capability, and the broadening of access.  The characteristics of
3969standards for digital imagery differ radically from those for analog
3970imagery.  And the nature of digital technology implies continuing
3971volatility and change.  To reiterate, precipitous standard-setting can
3972inhibit creativity, but delayed standard-setting results in chaos.
3973
Since in BATTIN's opinion the near-term prognosis for reliable archival
3975standards, as defined by librarians in the analog world, is poor, two
3976alternatives remain:  standing pat with the old technology, or
3977reconceptualizing.
3978
Preservation concerns for electronic media fall into two general domains.
The first is the continuing assurance of access to knowledge originally
generated, stored, disseminated, and used in electronic form.  This
domain contains several subdivisions, including the closed, proprietary
systems discussed the previous day, bundled information such as
electronic journals and government agency records, and electronically
produced or captured raw data.  The second is the application of digital
technologies to the reformatting of materials originally published on a
deteriorating analog medium such as acid paper or videotape.
3988
3989The preservation of electronic media requires a reconceptualizing of our
3990preservation principles during a volatile, standardless transition which
3991may last far longer than any of us envision today.  BATTIN urged the
3992necessity of shifting focus from assessing, measuring, and setting
3993standards for the permanence of the medium to the concept of managing
3994continuing access to information stored on a variety of media and
3995requiring a variety of ever-changing hardware and software for access--a
3996fundamental shift for the library profession.
3997
3998BATTIN offered a primer on how to move forward with reasonable confidence
3999in a world without standards.  Her comments fell roughly into two sections:
40001) standards in the real world and 2) the politics of reproduction.
4001
4002In regard to real-world standards, BATTIN argued the need to redefine the
4003concept of archive and to begin to think in terms of life cycles.  In
4004the past, the naive assumption that paper would last forever produced a
4005cavalier attitude toward life cycles.  The transient nature of the
electronic media has compelled people to recognize and accept up front the
4007concept of life cycles in place of permanency.
4008
4009Digital standards have to be developed and set in a cooperative context
4010to ensure efficient exchange of information.  Moreover, during this
4011transition period, greater flexibility concerning how concepts such as
4012backup copies and archival copies in the CXP are defined is necessary,
4013or the opportunity to move forward will be lost.
4014
4015In terms of cooperation, particularly in the university setting, BATTIN
4016also argued the need to avoid going off in a hundred different
4017directions.  The CPA has catalyzed a small group of universities called
4018the La Guardia Eight--because La Guardia Airport is where meetings take
4019place--Harvard, Yale, Cornell, Princeton, Penn State, Tennessee,
4020Stanford, and USC, to develop a digital preservation consortium to look
4021at all these issues and develop de facto standards as we move along,
4022instead of waiting for something that is officially blessed.  Continuing
4023to apply analog values and definitions of standards to the digital
4024environment, BATTIN said, will effectively lead to forfeiture of the
4025benefits of digital technology to research and scholarship.
4026
4027Under the second rubric, the politics of reproduction, BATTIN reiterated
4028an oft-made argument concerning the electronic library, namely, that it
4029is more difficult to transform than to create, and nowhere is that belief
4030expressed more dramatically than in the conversion of brittle books to
4031new media.  Preserving information published in electronic media involves
4032making sure the information remains accessible and that digital
4033information is not lost through reproduction.  In the analog world of
4034photocopies and microfilm, the issue of fidelity to the original becomes
4035paramount, as do issues of "Whose fidelity?" and "Whose original?"
4036
4037BATTIN elaborated these arguments with a few examples from a recent study
4038conducted by the CPA on the problems of preserving text and image.
4039Discussions with scholars, librarians, and curators in a variety of
4040disciplines dependent on text and image generated a variety of concerns,
4041for example:  1) Copy what is, not what the technology is capable of.
4042This is very important for the history of ideas.  Scholars wish to know
4043what the author saw and worked from.  And make available at the
4044workstation the opportunity to erase all the defects and enhance the
4045presentation.  2) The fidelity of reproduction--what is good enough, what
4046can we afford, and the difference it makes--issues of subjective versus
4047objective resolution.  3) The differences between primary and secondary
4048users.  Restricting the definition of primary user to the one in whose
4049discipline the material has been published runs one headlong into the
4050reality that these printed books have had a host of other users from a
4051host of other disciplines, who not only were looking for very different
4052things, but who also shared values very different from those of the
4053primary user.  4) The relationship of the standard of reproduction to new
4054capabilities of scholarship--the browsing standard versus an archival
4055standard.  How good must the archival standard be?  Can a distinction be
4056drawn between potential users in setting standards for reproduction?
4057Archival storage, use copies, browsing copies--ought an attempt to set
4058standards even be made?  5) Finally, costs.  How much are we prepared to
4059pay to capture absolute fidelity?  What are the trade-offs between vastly
4060enhanced access, degrees of fidelity, and costs?
4061
4062These standards, BATTIN concluded, serve to complicate further the
4063reproduction process, and add to the long list of technical standards
4064that are necessary to ensure widespread access.  Ways to articulate and
4065analyze the costs that are attached to the different levels of standards
4066must be found.
4067
4068Given the chaos concerning standards, which promises to linger for the
4069foreseeable future, BATTIN urged adoption of the following general
4070principles:
4071
4072     * Strive to understand the changing information requirements of
4073     scholarly disciplines as more and more technology is integrated into
4074     the process of research and scholarly communication in order to meet
4075     future scholarly needs, not to build for the past.  Capture
4076     deteriorating information at the highest affordable resolution, even
4077     though the dissemination and display technologies will lag.
4078
4079     * Develop cooperative mechanisms to foster agreement on protocols
4080     for document structure and other interchange mechanisms necessary
4081     for widespread dissemination and use before official standards are
4082     set.
4083
4084     * Accept that, in a transition period, de facto standards will have
4085     to be developed.
4086
4087     * Capture information in a way that keeps all options open and
4088     provides for total convertibility:  OCR, scanning of microfilm,
4089     producing microfilm from scanned documents, etc.
4090
4091     * Work closely with the generators of information and the builders
4092     of networks and databases to ensure that continuing accessibility is
4093     a primary concern from the beginning.
4094
4095     * Piggyback on standards under development for the broad market, and
4096     avoid library-specific standards; work with the vendors, in order to
4097     take advantage of that which is being standardized for the rest of
4098     the world.
4099
4100     * Concentrate efforts on managing permanence in the digital world,
4101     rather than perfecting the longevity of a particular medium.
4102
4103                                 ******
4104
4105+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4106DISCUSSION * Additional comments on TIFF *
4107+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4108
4109During the brief discussion period that followed BATTIN's presentation,
4110BARONAS explained that TIFF was not developed in collaboration with or
4111under the auspices of AIIM.  TIFF is a company product, not a standard,
4112is owned by two corporations, and is always changing.  BARONAS also
4113observed that ANSI/AIIM MS53, a bi-level image file transfer format that
4114allows unlike systems to exchange images, is compatible with TIFF as well
4115as with DEC's architecture and IBM's MODCA/IOCA.
4116
4117                                 ******
4118
4119+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4120HOOTON * Several questions to be considered in discussing text conversion
4121*
4122+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4123
4124HOOTON introduced the final topic, text conversion, by noting that it is
4125becoming an increasingly important part of the imaging business.  Many
4126people now realize that it enhances their system to be able to have more
4127and more character data as part of their imaging system.  Re the issue of
4128OCR versus rekeying, HOOTON posed several questions:  How does one get
4129text into computer-readable form?  Does one use automated processes?
4130Does one attempt to eliminate the use of operators where possible?
4131Standards for accuracy, he said, are extremely important:  it makes a
4132major difference in cost and time whether one sets as a standard 98.5
4133percent acceptance or 99.5 percent.  He mentioned outsourcing as a
4134possibility for converting text.  Finally, what one does with the image
4135to prepare it for the recognition process is also important, he said,
4136because such preparation changes how recognition is viewed, as well as
4137facilitates recognition itself.
4138
4139                                 ******
4140
4141+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4142LESK * Roles of participants in CORE * Data flow * The scanning process *
4143The image interface * Results of experiments involving the use of
4144electronic resources and traditional paper copies * Testing the issue of
4145serendipity * Conclusions *
4146+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4147
4148Michael LESK, executive director, Computer Science Research, Bell
4149Communications Research, Inc. (Bellcore), discussed the Chemical Online
4150Retrieval Experiment (CORE), a cooperative project involving Cornell
4151University, OCLC, Bellcore, and the American Chemical Society (ACS).
4152
4153LESK spoke on 1) how the scanning was performed, including the unusual
4154feature of page segmentation, and 2) the use made of the text and the
4155image in experiments.
4156
4157Working with the chemistry journals (because ACS has been saving its
4158typesetting tapes since the mid-1970s and thus has a significant back-run
4159of the most important chemistry journals in the United States), CORE is
4160attempting to create an automated chemical library.  Approximately a
4161quarter of the pages by square inch are made up of images of
4162quasi-pictorial material; dealing with the graphic components of the
4163pages is extremely important.  LESK described the roles of participants
4164in CORE:  1) ACS provides copyright permission, journals on paper,
4165journals on microfilm, and some of the definitions of the files; 2) at
4166Bellcore, LESK chiefly performs the data preparation, while Dennis Egan
4167performs experiments on the users of chemical abstracts, and supplies the
4168indexing and numerous magnetic tapes; 3) Cornell provides the site of the
4169experiment; 4) OCLC develops retrieval software and other user interfaces.
4170Various manufacturers and publishers have furnished other help.
4171
4172Concerning data flow, Bellcore receives microfilm and paper from ACS; the
4173microfilm is scanned by outside vendors, while the paper is scanned
4174inhouse on an Improvision scanner, twenty pages per minute at 300 dpi,
4175which provides sufficient quality for all practical uses.  LESK would
4176prefer to have more gray level, because one of the ACS journals prints on
4177some colored pages, which creates a problem.
4178
4179Bellcore performs all this scanning, creates a page-image file, and also
4180selects from the pages the graphics, to mix with the text file (which is
4181discussed later in the Workshop).  The user is always searching the ASCII
4182file, but she or he may see a display based on the ASCII or a display
4183based on the images.
4184
4185LESK illustrated how the program performs page analysis, and the image
4186interface.  (The user types several words, is presented with a list--
4187usually of the titles of articles contained in an issue--that derives
4188from the ASCII, clicks on an icon and receives an image that mirrors an
4189ACS page.)  LESK also illustrated an alternative interface based on the
4190ASCII text, the so-called SuperBook interface from Bellcore.
4191
4192LESK next presented the results of an experiment conducted by Dennis Egan
4193and involving thirty-six students at Cornell, one third of them
4194undergraduate chemistry majors, one third senior undergraduate chemistry
4195majors, and one third graduate chemistry students.  A third of them
4196received the paper journals, the traditional paper copies and chemical
4197abstracts on paper.  A third received image displays of the pictures of
4198the pages, and a third received the text display with pop-up graphics.
4199
4200The students were given several questions made up by some chemistry
4201professors.  The questions fell into five classes, ranging from very easy
4202to very difficult, and included questions designed to simulate browsing
4203as well as a traditional information retrieval-type task.
4204
4205LESK furnished the following results.  In the straightforward question
4206search--the question being, what is the phosphorus-oxygen bond distance
4207in hydroxy phosphate?--the students were told that they could take
4208fifteen minutes and, then, if they wished, give up.  The students with
4209paper took more than fifteen minutes on average, and yet most of them
4210gave up.  The students with either electronic format, text or image,
4211received good scores in reasonable time, hardly ever had to give up, and
4212usually found the right answer.
4213
4214In the browsing study, the students were given a list of eight topics,
4215told to imagine that an issue of the Journal of the American Chemical
4216Society had just appeared on their desks, and were also told to flip
4217through it and to find topics mentioned in the issue.  The average scores
4218were about the same.  (The students were told to answer yes or no about
4219whether or not particular topics appeared.)  The errors, however, were
4220quite different.  The students with paper rarely said that something
4221appeared when it had not.  But they often failed to find something
4222actually mentioned in the issue.  The students using computers found numerous
4223things, but they also frequently said that a topic was mentioned when it
4224was not.  (The reason, of course, was that they were performing word
4225searches.  They were finding that words were mentioned and they were
4226concluding that they had accomplished their task.)
4227
4228This question also contained a trick to test the issue of serendipity.
4229The students were given another list of eight topics and instructed,
4230without taking a second look at the journal, to recall how many of this
4231new list of eight topics were in this particular issue.  This was an
4232attempt to see if they performed better at remembering what they were not
4233looking for.  They all performed about the same, paper or electronics,
4234about 62 percent accurate.  In short, LESK said, people were not very
4235good when it came to serendipity, but they were no worse at it with
4236computers than they were with paper.
4237
4238(LESK gave a parenthetical illustration of the learning curve of students
4239who used SuperBook.)
4240
4241The students using the electronic systems started off worse than the ones
4242using print, but by the third of the three sessions in the series had
4243caught up to print.  As one might expect, electronics provide a much
4244better means of finding what one wants to read; reading speeds, once the
4245object of the search has been found, are about the same.
4246
4247Almost none of the students could perform the hard task--the analogous
4248transformation.  (It would require the expertise of organic chemists to
4249complete.)  But an interesting result was that the students using the text
4250search performed terribly, while those using the image system did best.
4251That the text search system is driven by text offers the explanation.
4252Everything is focused on the text; to see the pictures, one must press
4253on an icon.  Many students found the right article containing the answer
4254to the question, but they did not click on the icon to bring up the right
4255figure and see it.  They did not know that they had found the right place,
4256and thus got it wrong.
4257
4258The short answer demonstrated by this experiment was that in the event
4259one does not know what to read, one needs the electronic systems; the
4260electronic systems hold no advantage at the moment if one knows what to
4261read, but neither do they impose a penalty.
4262
4263LESK concluded by commenting that, on one hand, the image system was easy
4264to use.  On the other hand, the text display system, which represented
4265twenty man-years of work in programming and polishing, was not winning,
4266because the text was not being read, just searched.  The much easier
4267system is highly competitive as well as remarkably effective for the
4268actual chemists.
4269
4270                                 ******
4271
4272+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4273ERWAY * Most challenging aspect of working on AM * Assumptions guiding
4274AM's approach * Testing different types of service bureaus * AM's
4275requirement for 99.95 percent accuracy * Requirements for text-coding *
4276Additional factors influencing AM's approach to coding * Results of AM's
4277experience with rekeying * Other problems in dealing with service bureaus
4278* Quality control the most time-consuming aspect of contracting out
4279conversion * Long-term outlook uncertain *
4280+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4281
4282To Ricky ERWAY, associate coordinator, American Memory, Library of
4283Congress, the constant variety of conversion projects taking place
4284simultaneously represented perhaps the most challenging aspect of working
4285on AM.  Thus, the challenge was not to find a solution for text
4286conversion but a tool kit of solutions to apply to LC's varied
4287collections that need to be converted.  ERWAY limited her remarks to the
4288process of converting text to machine-readable form, and the variety of
4289LC's text collections, for example, bound volumes, microfilm, and
4290handwritten manuscripts.
4291
4292Two assumptions have guided AM's approach, ERWAY said:  1) A desire not
4293to perform the conversion inhouse.  Because of the variety of formats and
4294types of texts, to capitalize the equipment and have the talents and
4295skills to operate it at LC would be extremely expensive.  Further, the
4296natural inclination to upgrade to newer and better equipment each year
4297made it reasonable for AM to focus on what it did best and seek external
4298conversion services.  Using service bureaus also allowed AM to have
4299several types of operations take place at the same time.  2) AM was not a
4300technology project, but an effort to improve access to library
4301collections.  Hence, whether text was converted using OCR or rekeying
4302mattered little to AM.  What mattered were cost and accuracy of results.
4303
4304AM considered different types of service bureaus and selected three to
4305perform several small tests in order to acquire a sense of the field.
4306The sample collections with which they worked included handwritten
4307correspondence, typewritten manuscripts from the 1940s, and
4308eighteenth-century printed broadsides on microfilm.  On none of these
4309samples was OCR performed; they were all rekeyed.  AM had several special
4310requirements for the three service bureaus it had engaged.  For instance,
4311any errors in the original text were to be retained.  Working from bound
4312volumes or anything that could not be sheet-fed also constituted a factor
4313eliminating companies that would have performed OCR.
4314
4315AM requires 99.95 percent accuracy, which, though it sounds high, often
4316means one or two errors per page.  The initial batch of test samples
4317contained several handwritten materials for which AM did not require
4318text-coding.  The results, ERWAY reported, were in all cases fairly
4319comparable:  for the most part, all three service bureaus achieved 99.95
4320percent accuracy.  AM was satisfied with the work but surprised at the cost.
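
The arithmetic behind the one-or-two-errors figure can be sketched as
follows.  This is an illustration only:  it assumes the average page
length of 2,700 characters that ERWAY cited later in the discussion, and
the function name is hypothetical.

```python
def expected_errors_per_page(accuracy_percent, chars_per_page=2700):
    """Return the expected number of wrong characters on one page.

    chars_per_page defaults to the 2,700-character average page
    mentioned in the discussion; adjust it for other material.
    """
    error_rate = 1.0 - accuracy_percent / 100.0
    return error_rate * chars_per_page


# Compare the accuracy levels mentioned by HOOTON and ERWAY.
for accuracy in (98.5, 99.5, 99.95, 99.99):
    errors = expected_errors_per_page(accuracy)
    print(f"{accuracy}% accuracy: about {errors:.2f} errors per page")
```

Under that assumption, 99.95 percent accuracy leaves roughly 1.35 wrong
characters on an average page, which agrees with the "one or two errors
per page" cited above.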
4321
4322As AM began converting whole collections, it retained the requirement for
432399.95 percent accuracy and added requirements for text-coding.  AM needed
4324to begin performing work more than three years ago, before LC requirements
4325for SGML applications had been established.  Since AM's goal was simply
4326to retain any of the intellectual content represented by the formatting
4327of the document (which would be lost if one performed a straight ASCII
4328conversion), AM used "SGML-like" codes.  These codes resembled SGML tags
4329but were used without the benefit of document-type definitions.  AM found
4330that many service bureaus were not yet SGML-proficient.
4331
4332Additional factors influencing the approach AM took with respect to
4333coding included:  1) the inability of any known microcomputer-based
4334user-retrieval software to take advantage of SGML coding; and 2) the
4335multiple inconsistencies in format of the older documents, which
4336confirmed AM in its desire not to attempt to force the different formats
4337to conform to a single document-type definition (DTD) and thus create the
4338need for a separate DTD for each document.
4339
4340The five text collections that AM has converted or is in the process of
4341converting include a collection of eighteenth-century broadsides, a
4342collection of pamphlets, two typescript document collections, and a
4343collection of 150 books.
4344
4345ERWAY next reviewed the results of AM's experience with rekeying, noting
4346again that because the bulk of AM's materials are historical, the quality
4347of the text often does not lend itself to OCR.  Overseas keyers who do not
4348speak English are less likely to guess, elaborate, or correct typos in the
4349original text, but they are also less able to infer what a native reader
4350would, and they are nearly incapable of converting handwritten text.  Another
4351disadvantage of working with overseas keyers is that they are much less
4352likely to telephone with questions, especially on the coding, with the
4353result that they develop their own rules as they encounter new
4354situations.
4355
4356Government contracting procedures and time frames posed a major challenge
4357to performing the conversion.  Many service bureaus are not accustomed to
4358retaining the image, even if they perform OCR.  Thus, questions of image
4359format and storage media were somewhat novel to many of them.  ERWAY also
4360remarked on other problems in dealing with service bureaus, for example,
4361their inability to perform text conversion from the kind of microfilm
4362that LC uses for preservation purposes.
4363
4364But quality control, in ERWAY's experience, was the most time-consuming
4365aspect of contracting out conversion.  AM has been attempting to perform
4366a 10-percent quality review, looking at either every tenth document or
4367every tenth page to make certain that the service bureaus are maintaining
436899.95 percent accuracy.  But even if they are complying with the
4369requirement for accuracy, finding errors produces a desire to correct
4370them and, in turn, to clean up the whole collection, which defeats the
4371purpose to some extent.  Even a double entry requires a
4372character-by-character comparison to the original to meet the accuracy
4373requirement.  LC is not accustomed to publishing imperfect texts, which
4374makes attempting to deal with the industry standard an emotionally
4375fraught issue for AM.  As was mentioned in the previous day's discussion,
4376going from 99.95 to 99.99 percent accuracy usually doubles costs and
4377means a third keying or another complete run-through of the text.
4378
4379Although AM has learned much from its experiences with various collections
4380and various service bureaus, ERWAY concluded pessimistically that no
4381breakthrough has been achieved.  Incremental improvements have occurred
4382in some of the OCR technology, some of the processes, and some of the
4383standards acceptances, which, though they may lead to somewhat lower costs,
4384do not offer much encouragement to many people who are anxiously awaiting
4385the day that the entire contents of LC are available on-line.
4386
4387                                 ******
4388
4389+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4390ZIDAR * Several answers to why one attempts to perform full-text
4391conversion * Per page cost of performing OCR * Typical problems
4392encountered during editing * Editing poor copy OCR vs. rekeying *
4393+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4394
4395Judith ZIDAR, coordinator, National Agricultural Text Digitizing Program
4396(NATDP), National Agricultural Library (NAL), offered several answers to
4397the question of why one attempts to perform full-text conversion:  1)
4398Text in an image can be read by a human but not by a computer, so of
4399course it is not searchable and there is not much one can do with it.  2)
4400Some material simply requires word-level access.  For instance, the legal
4401profession insists on full-text access to its material; with taxonomic or
4402geographic material, which entails numerous names, one virtually requires
4403word-level access.  3) Full text permits rapid browsing and searching,
4404something that cannot be achieved in an image with today's technology.
44054) Text stored as ASCII and delivered in ASCII is standardized and highly
4406portable.  5) People just want full-text searching, even those who do not
4407know how to do it.  NAL, for the most part, is performing OCR at an
4408actual cost per average-size page of approximately $7.  NAL scans the
4409page to create the electronic image and passes it through the OCR device.
4410
4411ZIDAR next rehearsed several typical problems encountered during editing.
4412Praising the celerity of her student workers, ZIDAR observed that editing
4413requires approximately five to ten minutes per page, assuming that there
4414are no large tables to audit.  Confusion among the three characters I, 1,
4415and l, constitutes perhaps the most common problem encountered.  Zeroes
4416and O's also are frequently confused.  Double M's create a particular
4417problem, even on clean pages.  They are so wide in most fonts that they
4418touch, and the system simply cannot tell where one letter ends and the
4419other begins.  Complex page formats occasionally fail to columnate
4420properly, which entails rescanning as though one were working with a
4421single column, entering the ASCII, and decolumnating for better
4422searching.  With proportionally spaced text, OCR can have difficulty
4423telling the narrow gaps between letters from the true spaces between
4424words, and therefore will merge words or break them up where it should
4425not.
4426
4427ZIDAR said that it can often take longer to edit a poor-copy OCR than to
4428key it from scratch.  NAL has also experimented with partial editing of
4429text, whereby project workers go in and clean up the format, removing
4430stray characters but not running a spell-check.  NAL corrects typos in
4431the title and authors' names, which provides a foothold for searching and
4432browsing.  Even extremely poor-quality OCR (e.g., 60-percent accuracy)
4433can still be searched, because numerous words are correct, while the
4434important words are probably repeated often enough that they are likely
4435to be found correct somewhere.  Librarians, however, cannot tolerate this
4436situation, though end users seem more willing to use this text for
4437searching, provided that NAL indicates that it is unedited.  ZIDAR
4438concluded that rekeying of text may be the best route to take, in spite
4439of numerous problems with quality control and cost.
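
ZIDAR's point about repetition can be made concrete with a little
probability.  The sketch below rests on two assumptions that the source
does not state:  it treats the 60-percent figure as the chance that any
single occurrence of a word is recognized correctly, and it treats
occurrences as independent of one another.  The function name is
hypothetical.

```python
def chance_word_is_findable(word_accuracy, occurrences):
    """Probability that at least one occurrence of a word is correct.

    Assumes each occurrence is recognized correctly with probability
    word_accuracy, independently of the others (an illustrative
    simplification, not a measured property of any OCR system).
    """
    return 1.0 - (1.0 - word_accuracy) ** occurrences


for n in (1, 2, 5, 10):
    p = chance_word_is_findable(0.60, n)
    print(f"{n:2d} occurrence(s): {p:.4f}")
```

Under these assumptions a word that appears five times has about a
99-percent chance of being recognized correctly at least once, which is
why an important, frequently repeated term is likely to be found
somewhere in the text.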
4440
4441                                 ******
4442
4443+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4444DISCUSSION * Modifying an image before performing OCR * NAL's costs per
4445page * AM's costs per page and experience with Federal Prison Industries *
4446Elements comprising NATDP's costs per page * OCR and structured markup *
4447Distinction between the structure of a document and its representation
4448when put on the screen or printed *
4449+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4450
4451HOOTON prefaced the lengthy discussion that followed with several
4452comments about modifying an image before one reaches the point of
4453performing OCR.  For example, in regard to an application containing a
4454significant amount of redundant data, such as form-type data, numerous
4455companies today are working on various kinds of form removal, prior to
4456going through a recognition process, by using dropout colors.  Thus,
4457acquiring access to form design or using electronic means is worth
4458considering.  HOOTON also noted that conversion usually makes or breaks
4459one's imaging system.  It is extremely important, extremely costly in
4460terms of either capital investment or service, and determines the quality
4461of the remainder of one's system, because it determines the character of
4462the raw material used by the system.
4463
4464Concerning the four projects undertaken by NAL, two inside and two
4465performed by outside contractors, ZIDAR revealed that an in-house service
4466bureau executed the first at a cost between $8 and $10 per page for
4467everything, including building of the database.  The project undertaken
4468by the Consultative Group on International Agricultural Research (CGIAR)
4469cost approximately $10 per page for the conversion, plus some expenses
4470for the software and building of the database.  The Acid Rain Project--a
4471two-disk set produced by the University of Vermont, consisting of
4472Canadian publications on acid rain--cost $6.70 per page for everything,
4473including keying of the text, which was double keyed, scanning of the
4474images, and building of the database.  The in-house project offered
4475considerable convenience and greater control of the process.  On
4476the other hand, the service bureaus know their job and perform it
4477expeditiously, because they have more people.
4478
4479As a useful comparison, ERWAY revealed AM's costs as follows:  $0.75
4480to $0.85 per thousand characters, with an average page
4481containing 2,700 characters.  Requirements for coding and imaging
4482increase the costs.  Thus, conversion of the text, including the coding,
4483costs approximately $3 per page.  (This figure does not include the
4484imaging and database-building included in the NAL costs.)  AM also
4485enjoyed a happy experience with Federal Prison Industries, which
4486precluded the necessity of going through the request-for-proposal process
4487to award a contract, because it is another government agency.  The
4488prisoners performed AM's rekeying just as well as other service bureaus
4489and proved handy as well.  AM shipped them the books, which they would
4490photocopy on a book-edge scanner.  They would perform the markup on
4491photocopies, return the books as soon as they were done with them,
4492perform the keying, and return the material to AM on WORM disks.
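
As a rough check on these figures, the keying charge for an average page
can be worked out from the per-thousand-character rate.  The sketch below
uses only the numbers ERWAY gave; the function name is hypothetical, and
the gap between its result and the $3 figure is assumed to reflect the
added cost of coding.

```python
def keying_cost_per_page(rate_per_thousand, chars_per_page=2700):
    """Return the keying charge in dollars for one page.

    rate_per_thousand is the price in dollars per 1,000 characters;
    chars_per_page defaults to the 2,700-character average page.
    """
    return rate_per_thousand * chars_per_page / 1000.0


low = keying_cost_per_page(0.75)
high = keying_cost_per_page(0.85)
print(f"Keying alone: ${low:.2f} to ${high:.2f} per page")
```

Keying alone therefore comes to a little over $2 per page, which is
consistent with a total of approximately $3 once coding is included.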
4493
4494ZIDAR detailed the elements that constitute the previously noted cost of
4495approximately $7 per page.  Most significant are the editing, correction
4496of errors, and spell-checking, which, though they may sound easy to
4497perform, in fact require a great deal of time.  Reformatting text also
4498takes a while, but a significant share of NAL's expenses is for equipment,
4499which was extremely expensive when purchased because it was one of the few
4500systems on the market.  The costs of equipment are being amortized over
4501five years but are still quite high, nearly $2,000 per month.
4502
4503HOCKEY raised a general question concerning OCR and the amount of editing
4504required (substantial in her experience) to generate the kind of
4505structured markup necessary for manipulating the text on the computer or
4506loading it into any retrieval system.  She wondered if the speakers could
4507extend the previous question about the cost-benefit of adding or inserting
4508structured markup.  ERWAY noted that several OCR systems retain italics,
4509bolding, and other spatial formatting.  While the material may not be in
4510the format desired, these systems possess the ability to remove the
4511original materials quickly from the hands of the people performing the
4512conversion, as well as to retain that information so that users can work
4513with it.  HOCKEY rejoined that the current thinking on markup is that one
4514should not say that something is italic or bold so much as why it is that
4515way.  To be sure, one needs to know that something was italicized, but
4516how can one get from one to the other?  One can map from the structure to
4517the typographic representation.
4518
4519FLEISCHHAUER suggested that, given the 100 million items the Library
4520holds, it may not be possible for LC to do more than report that a thing
4521was in italics as opposed to why it was in italics, although that may be
4522desirable in some contexts.  Promising to talk a bit during the afternoon
4523session about several experiments OCLC performed on automatic recognition
4524of document elements, and which they hoped to extend, WEIBEL said that in
4525fact one can recognize the major elements of a document with a fairly
4526high degree of reliability, at least as good as OCR.  STEVENS drew a
4527useful distinction between standard, generalized markup (i.e., defining
4528for a document-type definition the structure of the document), and what
4529he termed a style sheet, which had to do with italics, bolding, and other
4530forms of emphasis.  Thus, two different components are at work, one being
4531the structure of the document itself (its logic), and the other being its
4532representation when it is put on the screen or printed.
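
STEVENS's distinction can be sketched in a few lines:  the document
records only what each element is, and a separate style sheet decides how
it looks.  The tag names, the style table, and the function below are all
hypothetical, chosen for illustration; they are not taken from any DTD
discussed at the Workshop.

```python
# Structure: each element says WHAT the text is, not how it looks.
document = [
    ("title", "Journal of the American Chemical Society"),
    ("foreign", "de novo"),
    ("emphasis", "never"),
]

# Style sheet: maps structural roles to a typographic representation.
style_sheet = {
    "title": "italic",
    "foreign": "italic",
    "emphasis": "bold",
}


def render(elements, styles):
    """Return (text, typeface) pairs for display or printing.

    Roles missing from the style sheet fall back to plain roman type.
    """
    return [(text, styles.get(role, "roman")) for role, text in elements]


for text, typeface in render(document, style_sheet):
    print(f"{text!r} -> {typeface}")
```

Mapping from structure to typeface is a simple lookup.  Going the other
way is ambiguous, because "italic" alone cannot say whether the text was
a title or a foreign phrase, which is the difficulty HOCKEY raised.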
4533
4534                                 ******
4535
4536SESSION V.  APPROACHES TO PREPARING ELECTRONIC TEXTS
4537
4538+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4539HOCKEY * Text in ASCII and the representation of electronic text versus
4540an image * The need to look at ways of using markup to assist retrieval *
4541The need for an encoding format that will be reusable and multifunctional
4542+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4543
4544Susan HOCKEY, director, Center for Electronic Texts in the Humanities
4545(CETH), Rutgers and Princeton Universities, announced that one talk
4546(WEIBEL's) was moved into this session from the morning and that David
4547Packard was unable to attend.  The session would attempt to focus more on
4548what one can do with a text in ASCII and the representation of electronic
4549text rather than just an image, what one can do with a computer that
4550cannot be done with a book or an image.  It would be argued that one can
4551do much more than just read a text, and from that starting point one can
4552use markup and methods of preparing the text to take full advantage of
4553the capability of the computer.  That would lead to a discussion of what
4554the European Community calls REUSABILITY, what may better be termed
4555DURABILITY, that is, how to prepare or make a text that will last a long
4556time and that can be used for as many applications as possible, which
4557would lead to issues of improving intellectual access.
4558
4559HOCKEY urged the need to look at ways of using markup to facilitate retrieval,
4560not just for referencing or for locating an item once it is retrieved, but
4561also for finding the thing sought, by tagging a text with linguistic or
4562interpretive markup.  HOCKEY also argued that little advancement had occurred in
4563the software tools currently available for retrieving and searching text.
4564She pressed the desideratum of going beyond Boolean searches and performing
4565more sophisticated searching, which the insertion of more markup in the text
4566would facilitate.  Thinking about electronic texts as opposed to images means
4567considering material that will never appear in print, or whose primary form
4568will not be print, that is, material that appears only in electronic form.
4569HOCKEY alluded to the history and the need for markup and tagging and
4570electronic text, which was developed through the use of computers in the
4571humanities; as MICHELSON had observed, Father Busa had started in 1949
4572to prepare the first-ever text on the computer.
4573
4574HOCKEY remarked on several large projects, particularly in Europe, for the
4575compilation of dictionaries, language studies, and language analysis, in
4576which people have built up archives of text and have begun to recognize
4577the need for an encoding format that will be reusable and multifunctional,
4578that can be used not just to print the text, which may be assumed to be a
4579byproduct of what one wants to do, but to structure it inside the computer
4580so that it can be searched, built into a Hypertext system, etc.
4581
4582                                 ******
4583
4584+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4585WEIBEL * OCLC's approach to preparing electronic text:  retroconversion,
4586keying of texts, more automated ways of developing data * Project ADAPT
4587and the CORE Project * Intelligent character recognition does not exist *
4588Advantages of SGML * Data should be free of procedural markup;
4589descriptive markup strongly advocated * OCLC's interface illustrated *
4590Storage requirements and costs for putting a lot of information on line *
4591+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4592
4593Stuart WEIBEL, senior research scientist, Online Computer Library Center,
4594Inc. (OCLC), described OCLC's approach to preparing electronic text.  He
4595argued that the electronic world into which we are moving must
4596accommodate not only the future but the past as well, and to some degree
4597even the present.  Thus, starting out at one end with retroconversion and
4598keying of texts, one would like to move toward much more automated ways
4599of developing data.
4600
4601For example, Project ADAPT had to do with automatically converting
4602document images into a structured document database with OCR text as
4603indexing and also a little bit of automatic formatting and tagging of
4604that text.  The CORE project hosted by Cornell University, Bellcore,
4605OCLC, the American Chemical Society, and Chemical Abstracts, constitutes
4606WEIBEL's principal concern at the moment.  This project is an example of
4607converting text for which one already has a machine-readable version into
4608a format more suitable for electronic delivery and database searching.
4609(Since Michael LESK had previously described CORE, WEIBEL would say
4610little concerning it.)  Borrowing a chemical phrase, de novo synthesis,
4611WEIBEL cited the Online Journal of Current Clinical Trials as an example
4612of de novo electronic publishing, that is, publishing in which the primary
4613form of the information is electronic.
4614
4615Project ADAPT, then, which OCLC completed a couple of years ago and in
4616fact is about to resume, is a model in which one takes page images either
4617on paper or microfilm and converts them automatically to a searchable
4618electronic database, either on-line or local.  The operating assumption
4619is that accepting some blemishes in the data, especially for
4620retroconversion of materials, will make it possible to accomplish more.
4621Not enough money is available to support perfect conversion.
4622
4623WEIBEL related several steps taken to perform image preprocessing
4624(processing on the image before performing optical character
4625recognition), as well as image postprocessing.  He denied the existence
4626of intelligent character recognition and asserted that what is wanted is
4627page recognition, which is a long way off.  OCLC has experimented with
4628merging of multiple optical character recognition systems that will
reduce errors from an unacceptable rate of 5 characters out of every
1,000 to an unacceptable rate of 2 characters out of every 1,000, but it
is not good enough.  It will never be perfect.
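
The idea of merging several recognition systems can be sketched as a
per-character majority vote.  This is only an illustration, not a
description of OCLC's method:  it assumes the engines' outputs have
already been aligned to the same length (alignment is the hard part and
is not shown), and the function name and sample strings are hypothetical.

```python
from collections import Counter


def vote(readings):
    """Merge aligned OCR readings by taking the most common character
    at each position.  All readings must have the same length."""
    if len({len(reading) for reading in readings}) != 1:
        raise ValueError("readings must be aligned to the same length")
    merged = []
    for column in zip(*readings):
        # Counter.most_common(1) yields the character seen most often
        # in this column across the engines.
        merged.append(Counter(column).most_common(1)[0][0])
    return "".join(merged)


# Three hypothetical engines, each wrong in a different place.
readings = [
    "electronic text",
    "e1ectronic text",
    "electronic tcxt",
]
print(vote(readings))  # prints "electronic text"
```

Voting helps only where the engines err in different places; an error
shared by a majority of engines survives the merge, which is consistent
with the remark that the result will never be perfect.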
4632
4633Concerning the CORE Project, WEIBEL observed that Bellcore is taking the
typography files, extracting the page images, and converting those
typography files to SGML markup.  LESK hands that data off to OCLC, which
4636builds that data into a Newton database, the same system that underlies
4637the on-line system in virtually all of the reference products at OCLC.
4638The long-term goal is to make the systems interoperable so that not just
4639Bellcore's system and OCLC's system can access this data, but other
4640systems can as well, and the key to that is the Z39.50 common command
4641language and the full-text extension.  Z39.50 is fine for MARC records,
4642but is not enough to do it for full text (that is, make full texts
4643interoperable).
4644
4645WEIBEL next outlined the critical role of SGML for a variety of purposes,
4646for example, as noted by HOCKEY, in the world of extremely large
4647databases, using highly structured data to perform field searches.
4648WEIBEL argued that by building the structure of the data in (i.e., the
4649structure of the data originally on a printed page), it becomes easy to
4650look at a journal article even if one cannot read the characters and know
4651where the title or author is, or what the sections of that document would be.
4652OCLC wants to make that structure explicit in the database, because it will
4653be important for retrieval purposes.
4654
4655The second big advantage of SGML is that it gives one the ability to
4656build structure into the database that can be used for display purposes
4657without contaminating the data with instructions about how to format
4658things.  The distinction lies between procedural markup, which tells one
4659where to put dots on the page, and descriptive markup, which describes
4660the elements of a document.
4661
4662WEIBEL believes that there should be no procedural markup in the data at
4663all, that the data should be completely unsullied by information about
4664italics or boldness.  That should be left up to the display device,
4665whether that display device is a page printer or a screen display device.
4666By keeping one's database free of that kind of contamination, one can
4667make decisions down the road, for example, reorganize the data in ways
4668that are not cramped by built-in notions of what should be italic and
4669what should be bold.  WEIBEL strongly advocated descriptive markup.  As
4670an example, he illustrated the index structure in the CORE data.  With
4671subsequent illustrated examples of markup, WEIBEL acknowledged the common
4672complaint that SGML is hard to read in its native form, although markup
4673decreases considerably once one gets into the body.  Without the markup,
4674however, one would not have the structure in the data.  One can pass
4675markup through a LaTeX processor and convert it relatively easily to a
4676printed version of the document.
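
The separation of descriptive markup from formatting can be sketched in a
few lines of Python.  This is an illustration under stated assumptions,
not the CORE data or OCLC's software:  the element names (title, author,
para), the sample content, and both style tables are hypothetical.

```python
# Descriptive markup: the data records what each element IS, never how
# it should look.  Element names and sample text are hypothetical.
ARTICLE = [
    ("title", "Sample Article"),
    ("author", "A. N. Author"),
    ("para", "First paragraph of the body."),
]

# Formatting lives outside the data, one table per display device
# (both tables are hypothetical).
SCREEN_STYLE = {
    "title": lambda text: text.upper(),
    "author": lambda text: "by " + text,
    "para": lambda text: text,
}
LATEX_STYLE = {
    "title": lambda text: "\\title{" + text + "}",
    "author": lambda text: "\\author{" + text + "}",
    "para": lambda text: text + "\n",
}


def render(document, style):
    """Apply a device-specific style table to descriptively tagged data."""
    return [style[element](text) for element, text in document]


def field_search(document, element):
    """Return the text of every occurrence of one element."""
    return [text for name, text in document if name == element]


print(render(ARTICLE, SCREEN_STYLE))
print(render(ARTICLE, LATEX_STYLE))
print(field_search(ARTICLE, "title"))
```

Because the data carries no formatting instructions, adding a third
display device means adding a third style table; the stored text is not
touched, and the same element names serve field searching.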
4677
4678WEIBEL next illustrated an extremely cluttered screen dump of OCLC's
4679system, in order to show as much as possible the inherent capability on
4680the screen.  (He noted parenthetically that he had become a supporter of
4681X-Windows as a result of the progress of the CORE Project.)  WEIBEL also
illustrated the two major parts of the interface:  1) a control box that
4683allows one to generate lists of items, which resembles a small table of
4684contents based on key words one wishes to search, and 2) a document
4685viewer, which is a separate process in and of itself.  He demonstrated
4686how to follow links through the electronic database simply by selecting
4687the appropriate button and bringing them up.  He also noted problems that
4688remain to be accommodated in the interface (e.g., as pointed out by LESK,
4689what happens when users do not click on the icon for the figure).
4690
4691Given the constraints of time, WEIBEL omitted a large number of ancillary
4692items in order to say a few words concerning storage requirements and
4693what will be required to put a lot of things on line.  Since it is
4694extremely expensive to reconvert all of this data, especially if it is
4695just in paper form (and even if it is in electronic form in typesetting
4696tapes), he advocated building journals electronically from the start.  In
that case, if one only has text, graphics, and indexing (which is all that
4698one needs with de novo electronic publishing, because there is no need to
4699go back and look at bit-maps of pages), one can get 10,000 journals of
4700full text, or almost 6 million pages per year.  These pages can be put in
4701approximately 135 gigabytes of storage, which is not all that much,
4702WEIBEL said.  For twenty years, something less than three terabytes would
4703be required.  WEIBEL calculated the costs of storing this information as
4704follows:  If a gigabyte costs approximately $1,000, then a terabyte costs
4705approximately $1 million to buy in terms of hardware.  One also needs a
4706building to put it in and a staff like OCLC to handle that information.
4707So, to support a terabyte, multiply by five, which gives $5 million per
4708year for a supported terabyte of data.
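
The arithmetic behind these figures can be checked with a short
calculation.  The inputs are the round numbers quoted above; the derived
per-page and per-journal values, and the use of a decimal terabyte
(1,000 gigabytes), are assumptions made for the sketch.

```python
# Figures quoted in the presentation (all approximate).
JOURNALS = 10_000
PAGES_PER_YEAR = 6_000_000       # "almost 6 million pages per year"
GB_PER_YEAR = 135
YEARS = 20
DOLLARS_PER_GB = 1_000
SUPPORT_MULTIPLIER = 5           # building and staff on top of hardware

# Assumption: decimal units, 1 terabyte = 1,000 gigabytes.
GB_PER_TB = 1_000

# Derived values (not stated by the speaker).
bytes_per_page = GB_PER_YEAR * 10**9 / PAGES_PER_YEAR
pages_per_journal = PAGES_PER_YEAR / JOURNALS

# Values that correspond to statements in the text.
total_tb = GB_PER_YEAR * YEARS / GB_PER_TB
hardware_per_tb = DOLLARS_PER_GB * GB_PER_TB
supported_per_tb = hardware_per_tb * SUPPORT_MULTIPLIER

print(f"bytes per page:        {bytes_per_page:,.0f}")
print(f"pages per journal/yr:  {pages_per_journal:,.0f}")
print(f"storage for 20 years:  {total_tb} TB")
print(f"hardware per terabyte: ${hardware_per_tb:,}")
print(f"supported terabyte:    ${supported_per_tb:,}")
```

Twenty years at 135 gigabytes per year comes to 2.7 terabytes, which
matches "something less than three terabytes," and the implied density
is roughly 22,500 bytes per page.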
4709
4710                                 ******
4711
4712+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4713DISCUSSION * Tapes saved by ACS are the typography files originally
4714supporting publication of the journal * Cost of building tagged text into
4715the database *
4716+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4717
4718During the question-and-answer period that followed WEIBEL's
4719presentation, these clarifications emerged.  The tapes saved by the
4720American Chemical Society are the typography files that originally
4721supported the publication of the journal.  Although they are not tagged
in SGML, they are tagged in very fine detail.  Every single sentence is
marked, as are all the registry numbers and all the publication issues,
dates, and volumes.  No cost figures on tagging material on a
per-megabyte basis
4725were available.  Because ACS's typesetting system runs from tagged text,
4726there is no extra cost per article.  It was unknown what it costs ACS to
4727keyboard the tagged text rather than just keyboard the text in the
4728cheapest process.  In other words, since one intends to publish things
4729and will need to build tagged text into a typography system in any case,
4730if one does that in such a way that it can drive not only typography but
4731an electronic system (which is what ACS intends to do--move to SGML
4732publishing), the marginal cost is zero.  The marginal cost represents the
4733cost of building tagged text into the database, which is small.
4734
4735                                 ******
4736
4737+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4738SPERBERG-McQUEEN * Distinction between texts and computers * Implications
4739of recognizing that all representation is encoding * Dealing with
4740complicated representations of text entails the need for a grammar of
4741documents * Variety of forms of formal grammars * Text as a bit-mapped
4742image does not represent a serious attempt to represent text in
4743electronic form * SGML, the TEI, document-type declarations, and the
4744reusability and longevity of data * TEI conformance explicitly allows
4745extension or modification of the TEI tag set * Administrative background
4746of the TEI * Several design goals for the TEI tag set * An absolutely
4747fixed requirement of the TEI Guidelines * Challenges the TEI has
4748attempted to face * Good texts not beyond economic feasibility * The
issue of reproducibility or processability * The issue of images as
4750simulacra for the text redux * One's model of text determines what one's
4751software can do with a text and has economic consequences *
4752+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4753
4754Prior to speaking about SGML and markup, Michael SPERBERG-McQUEEN, editor,
4755Text Encoding Initiative (TEI), University of Illinois-Chicago, first drew
4756a distinction between texts and computers:  Texts are abstract cultural
4757and linguistic objects while computers are complicated physical devices,
4758he said.  Abstract objects cannot be placed inside physical devices; with
4759computers one can only represent text and act upon those representations.
4760
4761The recognition that all representation is encoding, SPERBERG-McQUEEN
4762argued, leads to the recognition of two things:  1) The topic description
4763for this session is slightly misleading, because there can be no discussion
4764of pros and cons of text-coding unless what one means is pros and cons of
4765working with text with computers.  2) No text can be represented in a
4766computer without some sort of encoding; images are one way of encoding text,
4767ASCII is another, SGML yet another.  There is no encoding without some
4768information loss, that is, there is no perfect reproduction of a text that
4769allows one to do away with the original.  Thus, the question becomes,
4770What is the most useful representation of text for a serious work?
4771This depends on what kind of serious work one is talking about.
4772
4773The projects demonstrated the previous day all involved highly complex
4774information and fairly complex manipulation of the textual material.
4775In order to use that complicated information, one has to calculate it
4776slowly or manually and store the result.  It needs to be stored, therefore,
4777as part of one's representation of the text.  Thus, one needs to store the
4778structure in the text.  To deal with complicated representations of text,
4779one needs somehow to control the complexity of the representation of a text;
4780that means one needs a way of finding out whether a document and an
electronic representation of a document are legal or not; and that
4782means one needs a grammar of documents.
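
What a grammar of documents buys can be sketched as follows.  This is a
toy under stated assumptions, not SGML and not the TEI:  the element
names and the grammar are hypothetical, and each rule is written as a
regular expression over the names of an element's children, a rough
analogue of a content model.

```python
import re

# Hypothetical grammar.  Each element maps to a regular expression over
# the sequence of its children's names, each name followed by a space.
GRAMMAR = {
    "article": r"(title )(author )+(section )+",
    "section": r"(heading )(para )+",
    "title": r"",
    "author": r"",
    "heading": r"",
    "para": r"",
}


def is_legal(node):
    """Check a (name, children) tree against GRAMMAR, top down."""
    name, children = node
    if name not in GRAMMAR:
        return False
    sequence = "".join(child[0] + " " for child in children)
    if re.fullmatch(GRAMMAR[name], sequence) is None:
        return False
    return all(is_legal(child) for child in children)


good = ("article", [
    ("title", []),
    ("author", []),
    ("section", [("heading", []), ("para", []), ("para", [])]),
])

# Author before title, and a section without a heading: not legal.
bad = ("article", [
    ("author", []),
    ("title", []),
    ("section", [("para", [])]),
])

print(is_legal(good))  # prints True
print(is_legal(bad))   # prints False
```

The point of the sketch is that legality becomes a mechanical test:
software can reject a malformed representation before anyone tries to
search or display it.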
4783
4784SPERBERG-McQUEEN discussed the variety of forms of formal grammars,
4785implicit and explicit, as applied to text, and their capabilities.  He
4786argued that these grammars correspond to different models of text that
4787different developers have.  For example, one implicit model of the text
4788is that there is no internal structure, but just one thing after another,
4789a few characters and then perhaps a start-title command, and then a few
4790more characters and an end-title command.  SPERBERG-McQUEEN also
4791distinguished several kinds of text that have a sort of hierarchical
4792structure that is not very well defined, which, typically, corresponds
4793to grammars that are not very well defined, as well as hierarchies that
4794are very well defined (e.g., the Thesaurus Linguae Graecae) and extremely
4795complicated things such as SGML, which handle strictly hierarchical data
4796very nicely.
4797
4798SPERBERG-McQUEEN conceded that one other model not illustrated on his two
4799displays was the model of text as a bit-mapped image, an image of a page,
4800and confessed to having been converted to a limited extent by the
4801Workshop to the view that electronic images constitute a promising,
4802probably superior alternative to microfilming.  But he was not convinced
4803that electronic images represent a serious attempt to represent text in
4804electronic form.  Many of their problems stem from the fact that they are
4805not direct attempts to represent the text but attempts to represent the
4806page, thus making them representations of representations.
4807
4808In this situation of increasingly complicated textual information and the
need to control that complexity in a useful way (which raises the question
of the need for good textual grammars), one has the introduction of SGML.
With SGML, one can develop specific document-type declarations
for specific text types or, as with the TEI, attempt to generate
general document-type declarations that can handle all sorts of text.
4814The TEI is an attempt to develop formats for text representation that
4815will ensure the kind of reusability and longevity of data discussed earlier.
4816It offers a way to stay alive in the state of permanent technological
4817revolution.
4818
4819It has been a continuing challenge in the TEI to create document grammars
4820that do some work in controlling the complexity of the textual object but
also allow one to represent the real text that one will find.
4822Fundamental to the notion of the TEI is that TEI conformance allows one
4823the ability to extend or modify the TEI tag set so that it fits the text
4824that one is attempting to represent.
4825
4826SPERBERG-McQUEEN next outlined the administrative background of the TEI.
4827The TEI is an international project to develop and disseminate guidelines
4828for the encoding and interchange of machine-readable text.  It is
4829sponsored by the Association for Computers in the Humanities, the
4830Association for Computational Linguistics, and the Association for
4831Literary and Linguistic Computing.  Representatives of numerous other
4832professional societies sit on its advisory board.  The TEI has a number
4833of affiliated projects that have provided assistance by testing drafts of
4834the guidelines.
4835
4836Among the design goals for the TEI tag set, the scheme first of all must
4837meet the needs of research, because the TEI came out of the research
4838community, which did not feel adequately served by existing tag sets.
4839The tag set must be extensive as well as compatible with existing and
4840emerging standards.  In 1990, version 1.0 of the Guidelines was released
4841(SPERBERG-McQUEEN illustrated their contents).
4842
4843SPERBERG-McQUEEN noted that one problem besetting electronic text has
4844been the lack of adequate internal or external documentation for many
4845existing electronic texts.  The TEI guidelines as currently formulated
4846contain few fixed requirements, but one of them is this:  There must
4847always be a document header, an in-file SGML tag that provides
48481) a bibliographic description of the electronic object one is talking
4849about (that is, who included it, when, what for, and under which title);
4850and 2) the copy text from which it was derived, if any.  If there was
4851no copy text or if the copy text is unknown, then one states as much.
4852Version 2.0 of the Guidelines was scheduled to be completed in fall 1992
4853and a revised third version is to be presented to the TEI advisory board
4854for its endorsement this coming winter.  The TEI itself exists to provide
4855a markup language, not a marked-up text.
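
The header requirement lends itself to a mechanical check.  The sketch
below is an assumption-laden illustration, not a TEI conformance test:
it uses XML syntax rather than SGML so that Python's standard library
can parse it, it borrows the element names teiHeader, fileDesc,
titleStmt, and sourceDesc from later published versions of the
Guidelines, it omits namespaces, and the sample document is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample; element names follow later TEI usage (an assumption).
SAMPLE = """
<TEI>
  <teiHeader>
    <fileDesc>
      <titleStmt><title>Sample text: an electronic version</title></titleStmt>
      <sourceDesc><p>No copy text; composed in electronic form.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body><p>Body of the text.</p></body></text>
</TEI>
"""


def header_problems(xml_text):
    """List what is missing from the document header, if anything."""
    root = ET.fromstring(xml_text)
    header = root.find("teiHeader")
    if header is None:
        return ["no document header"]
    problems = []
    # 1) A bibliographic description of the electronic object.
    if header.find("fileDesc/titleStmt/title") is None:
        problems.append("no bibliographic description (title)")
    # 2) A statement about the copy text, even if only "none".
    if header.find("fileDesc/sourceDesc") is None:
        problems.append("no statement about the copy text")
    return problems


print(header_problems(SAMPLE))                # prints []
print(header_problems("<TEI><text/></TEI>"))  # prints ['no document header']
```

Note that the sample satisfies the second requirement by stating that
there is no copy text, which is what the requirement asks for when none
exists.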
4856
4857Among the challenges the TEI has attempted to face is the need for a
4858markup language that will work for existing projects, that is, handle the
4859level of markup that people are using now to tag only chapter, section,
4860and paragraph divisions and not much else.  At the same time, such a
4861language also will be able to scale up gracefully to handle the highly
4862detailed markup which many people foresee as the future destination of
4863much electronic text, and which is not the future destination but the
4864present home of numerous electronic texts in specialized areas.
4865
4866SPERBERG-McQUEEN dismissed the lowest-common-denominator approach as
4867unable to support the kind of applications that draw people who have
4868never been in the public library regularly before, and make them come
4869back.  He advocated more interesting text and more intelligent text.
4870Asserting that it is not beyond economic feasibility to have good texts,
4871SPERBERG-McQUEEN noted that the TEI Guidelines listing 200-odd tags
4872contains tags that one is expected to enter every time the relevant
4873textual feature occurs.  It contains all the tags that people need now,
4874and it is not expected that everyone will tag things in the same way.
4875
4876The question of how people will tag the text is in large part a function
4877of their reaction to what SPERBERG-McQUEEN termed the issue of
4878reproducibility.  What one needs to be able to reproduce are the things
4879one wants to work with.  Perhaps a more useful concept than that of
4880reproducibility or recoverability is that of processability, that is,
4881what can one get from an electronic text without reading it again
4882in the original.  He illustrated this contention with a page from
4883Jan Comenius's bilingual Introduction to Latin.
4884
SPERBERG-McQUEEN returned at length to the issue of images as simulacra
for the text, in order to reiterate his belief that in the long run more
than images of pages of particular editions of the text are needed.
Just as second-generation photocopies and second-generation microfilm
degenerate, so second-generation representations tend to degenerate.
One also tends to overstress some relatively trivial aspects of the text,
such as its layout on the page (which is not always significant, despite
what the text critics might say), and to slight other pieces of
information, such as the very important lexical ties between the
English and Latin versions of Comenius's bilingual text.
4895Moreover, in many crucial respects it is easy to fool oneself concerning
4896what a scanned image of the text will accomplish.  For example, in order
4897to study the transmission of texts, information concerning the text
4898carrier is necessary, which scanned images simply do not always handle.
Further, even the high-quality materials being produced at Cornell lose
4900much of the information that one would need if studying those books as
4901physical objects.  It is a choice that has been made.  It is an arguably
4902justifiable choice, but one does not know what color those pen strokes in
4903the margin are or whether there was a stain on the page, because it has
4904been filtered out.  One does not know whether there were rips in the page
4905because they do not show up, and on a couple of the marginal marks one
4906loses half of the mark because the pen is very light and the scanner
4907failed to pick it up, and so what is clearly a checkmark in the margin of
4908the original becomes a little scoop in the margin of the facsimile.
These are standard problems for facsimile editions, not new to
electronics but equally true of light-lens photography.  They are
remarked here because it is important that we not fool ourselves:  even
if we produce a very nice image of this page with good contrast, we are
not replacing the manuscript any more than microfilm has replaced the
manuscript.
4914
4915The TEI comes from the research community, where its first allegiance
4916lies, but it is not just an academic exercise.  It has relevance far
4917beyond those who spend all of their time studying text, because one's
4918model of text determines what one's software can do with a text.  Good
4919models lead to good software.  Bad models lead to bad software.  That has
4920economic consequences, and it is these economic consequences that have
4921led the European Community to help support the TEI, and that will lead,
4922SPERBERG-McQUEEN hoped, some software vendors to realize that if they
4923provide software with a better model of the text they can make a killing.
4924
4925                                 ******
4926
4927+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4928DISCUSSION * Implications of different DTDs and tag sets * ODA versus SGML *
4929+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4930
4931During the discussion that followed, several additional points were made.
4932Neither AAP (i.e., Association of American Publishers) nor CALS (i.e.,
4933Computer-aided Acquisition and Logistics Support) has a document-type
4934definition for ancient Greek drama, although the TEI will be able to
4935handle that.  Given this state of affairs and assuming that the
4936technical-journal producers and the commercial vendors decide to use the
4937other two types, then an institution like the Library of Congress, which
4938might receive all of their publications, would have to be able to handle
4939three different types of document definitions and tag sets and be able to
4940distinguish among them.
4941
4942Office Document Architecture (ODA) has some advantages that flow from its
4943tight focus on office documents and clear directions for implementation.
4944Much of the ODA standard is easier to read and clearer at first reading
than the SGML standard, which is extremely general.  What that means is
that if one wants to use TIFF graphics with ODA, one is stuck, because
ODA defines its own graphics formats and TIFF is not among them, whereas
SGML says the world is not waiting for this work group to create another
graphics format.
4949What is needed is an ability to use whatever graphics format one wants.
4950
4951The TEI provides a socket that allows one to connect the SGML document to
4952the graphics.  The notation that the graphics are in is clearly a choice
4953that one needs to make based on her or his environment, and that is one
4954advantage.  SGML is less megalomaniacal in attempting to define formats
4955for all kinds of information, though more megalomaniacal in attempting to
4956cover all sorts of documents.  The other advantage is that the model of
4957text represented by SGML is simply an order of magnitude richer and more
4958flexible than the model of text offered by ODA.  Both offer hierarchical
4959structures, but SGML recognizes that the hierarchical model of the text
4960that one is looking at may not have been in the minds of the designers,
4961whereas ODA does not.
4962
4963ODA is not really aiming for the kind of document that the TEI wants to
4964encompass.  The TEI can handle the kind of material ODA has, as well as a
4965significantly broader range of material.  ODA seems to be very much
4966focused on office documents, which is what it started out being called--
4967office document architecture.
4968
4969                                 ******
4970
4971+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4972CALALUCA * Text-encoding from a publisher's perspective *
4973Responsibilities of a publisher * Reproduction of Migne's Latin series
4974whole and complete with SGML tags based on perceived need and expected
4975use * Particular decisions arising from the general decision to produce
4976and publish PLD *
4977+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
4978
4979The final speaker in this session, Eric CALALUCA, vice president,
Chadwyck-Healey, Inc., spoke about text-encoding from the perspective of
a publisher, rather than as one qualified to discuss methods of
4982encoding data, and observed that the presenters sitting in the room,
4983whether they had chosen to or not, were acting as publishers:  making
4984choices, gathering data, gathering information, and making assessments.
4985CALALUCA offered the hard-won conviction that in publishing very large
4986text files (such as PLD), one cannot avoid making personal judgments of
4987appropriateness and structure.
4988
4989In CALALUCA's view, encoding decisions stem from prior judgments.  Two
4990notions have become axioms for him in the consideration of future sources
4991for electronic publication:  1) electronic text publishing is as personal
4992as any other kind of publishing, and questions of if and how to encode
4993the data are simply a consequence of that prior decision;  2) all
4994personal decisions are open to criticism, which is unavoidable.
4995
4996CALALUCA rehearsed his role as a publisher or, better, as an intermediary
4997between what is viewed as a sound idea and the people who would make use
4998of it.  Finding the specialist to advise in this process is the core of
4999that function.  The publisher must monitor and hug the fine line between
5000giving users what they want and suggesting what they might need.  One
5001responsibility of a publisher is to represent the desires of scholars and
5002research librarians as opposed to bullheadedly forcing them into areas
5003they would not choose to enter.
5004
5005CALALUCA likened the questions being raised today about data structure
5006and standards to the decisions faced by the Abbe Migne himself during
5007production of the Patrologia series in the mid-nineteenth century.
5008Chadwyck-Healey's decision to reproduce Migne's Latin series whole and
5009complete with SGML tags was also based upon a perceived need and an
5010expected use.  In the same way that Migne's work came to be far more than
5011a simple handbook for clerics, PLD is already far more than a database
5012for theologians.  It is a bedrock source for the study of Western
5013civilization, CALALUCA asserted.
5014
5015In regard to the decision to produce and publish PLD, the editorial board
5016offered direct judgments on the question of appropriateness of these
5017texts for conversion, their encoding and their distribution, and
5018concluded that the best possible project was one that avoided overt
5019intrusions or exclusions in so important a resource.  Thus, the general
5020decision to transmit the original collection as clearly as possible with
5021the widest possible avenues for use led to other decisions:  1) To encode
5022the data or not, SGML or not, TEI or not.  Again, the expected user
5023community asserted the need for normative tagging structures of important
5024humanities texts, and the TEI seemed the most appropriate structure for
5025that purpose.  Research librarians, who are trained to view the larger
5026impact of electronic text sources on 80 or 90 or 100 doctoral
5027disciplines, loudly approved the decision to include tagging.  They see
5028what is coming better than the specialist who is completely focused on
5029one edition of Ambrose's De Anima, and they also understand that the
5030potential uses exceed present expectations.  2) What will be tagged and
5031what will not.  Once again, the board realized that one must tag the
5032obvious.  But in no way should one attempt to identify through encoding
5033schemes every single discrete area of a text that might someday be
5034searched.  That was another decision.  Searching by a column number, an
5035author, a word, a volume, permitting combination searches, and tagging
5036notations seemed logical choices as core elements.  3) How does one make
5037the data available?  Tieing it to a CD-ROM edition creates limitations,
5038but a magnetic tape file that is very large, is accompanied by the
5039encoding specifications, and that allows one to make local modifications
5040also allows one to incorporate any changes one may desire within the
5041bounds of private research, though exporting tag files from a CD-ROM
5042could serve just as well.  Since no one on the board could possibly
5043anticipate each and every way in which a scholar might choose to mine
5044this data bank, it was decided to satisfy the basics and make some
5045provisions for what might come.  4) Not to encode the database would rob
5046it of the interchangeability and portability these important texts should
5047accommodate.  For CALALUCA, the extensive options presented by full-text
5048searching require care in text selection and strongly support encoding of
5049data to facilitate the widest possible search strategies.  Better
5050software can always be created, but summoning the resources, the people,
5051and the energy to reconvert the text is another matter.
5052
5053PLD is being encoded, captured, and distributed, because to
5054Chadwyck-Healey and the board it offers the widest possible array of
5055future research applications that can be seen today.  CALALUCA concluded
5056by urging the encoding of all important text sources in whatever way
5057seems most appropriate and durable at the time, without blanching at the
5058thought that one's work may require emendation in the future.  (Thus,
5059Chadwyck-Healey produced a very large humanities text database before the
5060final release of the TEI Guidelines.)
5061
5062                                 ******
5063
5064+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5065DISCUSSION * Creating texts with markup advocated * Trends in encoding *
5066The TEI and the issue of interchangeability of standards * A
5067misconception concerning the TEI * Implications for an institution like
5068LC in the event that a multiplicity of DTDs develops * Producing images
5069as a first step towards possible conversion to full text through
5070character recognition * The AAP tag sets as a common starting point and
5071the need for caution *
5072+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5073
5074HOCKEY prefaced the discussion that followed with several comments in
5075favor of creating texts with markup and on trends in encoding.  In the
5076future, when many more texts are available for on-line searching, real
5077problems in finding what is wanted will develop, if one is faced with
5078millions of words of data.  It therefore becomes important to consider
5079putting markup in texts to help searchers home in on the actual things
5080they wish to retrieve.  Various approaches to refining retrieval methods
5081toward this end include building on a computer version of a dictionary
5082and letting the computer look up words in it to obtain more information
5083about the semantic structure or semantic field of a word, its grammatical
5084structure, and syntactic structure.
5085
5086HOCKEY commented on the present keen interest in the encoding world
5087in creating:  1) machine-readable versions of dictionaries that can be
5088initially tagged in SGML, which gives a structure to the dictionary entry;
5089these entries can then be converted into a more rigid or otherwise
5090different database structure inside the computer, which can be treated as
5091a dynamic tool for searching mechanisms; 2) large bodies of text to study
5092the language.  In order to incorporate more sophisticated mechanisms,
5093more about how words behave needs to be known, which can be learned in
5094part from information in dictionaries.  However, the last ten years have
5095seen much interest in studying the structure of printed dictionaries
5096converted into computer-readable form.  The information one derives about
5097many words from those is only partial, one or two definitions of the
5098common or the usual meaning of a word, and then numerous definitions of
5099unusual usages.  If the computer is using a dictionary to help retrieve
5100words in a text, it needs much more information about the common usages,
5101because those are the ones that occur over and over again.  Hence the
5102current interest in developing large bodies of text in computer-readable
5103form in order to study the language.  Several projects are engaged in
compiling, for example, 100 million words.  HOCKEY described one with
5105which she was associated briefly at Oxford University involving
5106compilation of 100 million words of British English:  about 10 percent of
5107that will contain detailed linguistic tagging encoded in SGML; it will
5108have word class taggings, with words identified as nouns, verbs,
5109adjectives, or other parts of speech.  This tagging can then be used by
programs that will begin to learn a bit more about the structure of the
language and then can go on to tag more text.
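
The cycle of tagging some text, learning from it, and tagging more can
be sketched with a deliberately simple tagger.  This is not the method
of the Oxford project:  it is a most-frequent-tag baseline chosen for
brevity, and the tag names (DET, NOUN, VERB, UNK) and sample words are
hypothetical.

```python
from collections import Counter, defaultdict


def train(tagged_words):
    """Learn, for each word, the tag it most often carries."""
    counts = defaultdict(Counter)
    for word, tag in tagged_words:
        counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag(words, model, unknown="UNK"):
    """Tag new words with the learned model; unseen words get `unknown`."""
    return [(word, model.get(word.lower(), unknown)) for word in words]


# A small hand-tagged seed (hypothetical data).
seed = [
    ("the", "DET"), ("library", "NOUN"), ("stores", "VERB"),
    ("the", "DET"), ("text", "NOUN"),
]
model = train(seed)
print(tag(["The", "text", "survives"], model))
# prints [('The', 'DET'), ('text', 'NOUN'), ('survives', 'UNK')]

# Once a person corrects the unknowns, the corrected text joins the
# seed and the model is retrained on the larger body of tagged text.
corrected = [("survives", "VERB")]
model = train(seed + corrected)
print(tag(["survives"], model))  # prints [('survives', 'VERB')]
```

Each round of correction enlarges the tagged body of text, so the next
round of automatic tagging leaves fewer words unresolved.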
5112
5113HOCKEY said that the more that is tagged accurately, the more one can
5114refine the tagging process and thus the bigger body of text one can build
5115up with linguistic tagging incorporated into it.  Hence, the more tagging
5116or annotation there is in the text, the more one may begin to learn about
5117language and the more it will help accomplish more intelligent OCR.  She
5118recommended the development of software tools that will help one begin to
5119understand more about a text, which can then be applied to scanning
5120images of that text in that format and to using more intelligence to help
5121one interpret or understand the text.
5122
5123HOCKEY posited the need to think about common methods of text-encoding
5124for a long time to come, because building these large bodies of text is
5125extremely expensive and will only be done once.
5126
5127In the more general discussion on approaches to encoding that followed,
5128these points were made:
5129
5130BESSER identified the underlying problem with standards that all have to
5131struggle with in adopting a standard, namely, the tension between a very
5132highly defined standard that is very interchangeable but does not work
5133for everyone because something is lacking, and a standard that is less
5134defined, more open, more adaptable, but less interchangeable.  Contending
5135that the way in which people use SGML is not sufficiently defined, BESSER
5136wondered 1) if people resist the TEI because they think it is too defined
5137in certain things they do not fit into, and 2) how progress with
5138interchangeability can be made without frightening people away.
5139
5140SPERBERG-McQUEEN replied that the published drafts of the TEI had met
5141with surprisingly little objection on the grounds that they do not allow
5142one to handle X or Y or Z.  Particular concerns of the affiliated
5143projects have led, in practice, to discussions of how extensions are to
5144be made; the primary concern of any project has to be how it can be
5145represented locally, thus making interchange secondary.  The TEI has
5146received much criticism based on the notion that everything in it is
5147required or even recommended, which, as it happens, is a misconception
5148from the beginning, because none of it is required and very little is
5149actually actively recommended for all cases, except that one document
5150one's source.
5151
5152SPERBERG-McQUEEN agreed with BESSER about this trade-off:  all the
5153projects in a set of twenty TEI-conformant projects will not necessarily
5154tag the material in the same way.  One result of the TEI will be that the
5155easiest problems will be solved--those dealing with the external form of
5156the information; but the problem that is hardest in interchange is that
5157one is not encoding what another wants, and vice versa.  Thus, after
5158the adoption of a common notation, the differences in the underlying
5159conceptions of what is interesting about texts become more visible.
5160The success of a standard like the TEI will lie in the ability of
5161the recipient of interchanged texts to use some of what it contains
5162and to add the information that was not encoded that one wants, in a
5163layered way, so that texts can be gradually enriched and one does not
5164have to put in everything all at once.  Hence, having a well-behaved
5165markup scheme is important.
5166
5167STEVENS followed up on the paradoxical analogy that BESSER alluded to in
5168the example of the MARC records, namely, the formats that are the same
5169except that they are different.  STEVENS drew a parallel between
5170document-type definitions and MARC records for books and serials and maps,
5171where one has a tagging structure and there is a text-interchange.
5172STEVENS opined that the producers of the information will set the terms
5173for the standard (i.e., develop document-type definitions for the users
5174of their products), creating a situation that will be problematical for
5175an institution like the Library of Congress, which will have to deal with
5176the DTDs in the event that a multiplicity of them develops.  Thus,
5177numerous people are seeking a standard but cannot find the tag set that
5178will be acceptable to them and their clients.  SPERBERG-McQUEEN agreed
5179with this view, and said that the situation was in a way worse:  attempting
5180to unify arbitrary DTDs resembled attempting to unify a MARC record with a
5181bibliographic record done according to the Prussian instructions.
5182According to STEVENS, this situation occurred very early in the process.
5183
5184WATERS recalled from early discussions on Project Open Book the concern
5185of many people that merely by producing images, POB was not really
5186enhancing intellectual access to the material.  Nevertheless, not wishing
5187to overemphasize the opposition between imaging and full text, WATERS
5188stated that POB views getting the images as a first step toward possibly
5189converting to full text through character recognition, if the technology
5190is appropriate.  WATERS also emphasized that encoding is involved even
5191with a set of images.
5192
5193SPERBERG-McQUEEN agreed with WATERS that one can create an SGML document
5194consisting wholly of images.  At first sight, organizing graphic images
5195with an SGML document may not seem to offer great advantages, but the
5196advantages of the scheme WATERS described would be precisely that
5197ability to move into something that is more of a multimedia document:
5198a combination of transcribed text and page images.  WEIBEL concurred in
5199this judgment, offering evidence from Project ADAPT, where a page is
5200divided into text elements and graphic elements, and in fact the text
5201elements are organized by columns and lines.  These lines may be used as
5202the basis for distributing documents in a network environment.  As one
5203develops software intelligent enough to recognize what those elements
5204are, it makes sense to apply SGML to an image initially, that may, in
5205fact, ultimately become more and more text, either through OCR or edited
5206OCR or even just through keying.  For WATERS, the labor of composing the
5207document and saying this set of documents or this set of images belongs
5208to this document constitutes a significant investment.
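
The idea that a document can begin as a set of page images and gradually
acquire transcribed text might be modeled along these lines.  This is a
sketch under stated assumptions, not the structure used by Project Open Book
or Project ADAPT:  the class names, the element names <doc>, <page>, and
<text>, and the file names are all hypothetical.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List, Optional


@dataclass
class Page:
    image_file: str               # scanned page image (hypothetical file name)
    text: Optional[str] = None    # added later by OCR, edited OCR, or keying


@dataclass
class ImageDocument:
    title: str
    pages: List[Page] = field(default_factory=list)

    def to_markup(self) -> str:
        """Emit SGML-style markup; pages without text remain image-only."""
        lines = ["<doc>", f"<title>{escape(self.title)}</title>"]
        for number, page in enumerate(self.pages, start=1):
            lines.append(f'<page n="{number}" image="{escape(page.image_file)}">')
            if page.text is not None:
                lines.append(f"<text>{escape(page.text)}</text>")
            lines.append("</page>")
        lines.append("</doc>")
        return "\n".join(lines)


doc = ImageDocument("Sample pamphlet", [Page("p001.tif"), Page("p002.tif")])
images_only = doc.to_markup()               # a document consisting wholly of images
doc.pages[0].text = "Transcribed first page."
enriched = doc.to_markup()                  # the same document, partly converted to text
print(enriched)
```

The grouping of images into one document is recorded once, and text can be
attached page by page without disturbing that structure.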
5209
5210WEIBEL also made the point that the AAP tag sets, while not excessively
5211prescriptive, offer a common starting point; they do not define the
5212structure of the documents, though.  They have some recommendations about
5213DTDs one could use as examples, but they do just suggest tag sets.  For
5214example, the CORE project attempts to use the AAP markup as much as
5215possible, but there are clearly areas where structure must be added.
5216That in no way contradicts the use of AAP tag sets.
5217
5218SPERBERG-McQUEEN noted that the TEI prepared a long working paper early
5219on about the AAP tag set and what it lacked that the TEI thought it
5220needed, and a fairly long critique of the naming conventions, which has
5221led to a very different style of naming in the TEI.  He stressed the
5222importance of the opposition between prescriptive markup, the kind that a
5223publisher or anybody can do when producing documents de novo, and
5224descriptive markup, in which one has to take what the text carrier
5225provides.  In these particular tag sets it is easy to overemphasize this
5226opposition, because the AAP tag set is extremely flexible.  Even if one
5227just used the DTDs, they allow almost anything to appear almost anywhere.
5228
5229                                 ******
5230
5231SESSION VI.  COPYRIGHT ISSUES
5232
5233+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5234PETERS * Several cautions concerning copyright in an electronic
5235environment * Review of copyright law in the United States * The notion
5236of the public good and the desirability of incentives to promote it *
5237What copyright protects * Works not protected by copyright * The rights
5238of copyright holders * Publishers' concerns in today's electronic
5239environment * Compulsory licenses * The price of copyright in a digital
5240medium and the need for cooperation * Additional clarifications * Rough
5241justice oftentimes the outcome in numerous copyright matters * Copyright
5242in an electronic society * Copyright law always only sets up the
5243boundaries; anything can be changed by contract *
5244+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5245
5246Marybeth PETERS, policy planning adviser to the Register of Copyrights,
5247Library of Congress, made several general comments and then opened the
5248floor to discussion of subjects of interest to the audience.
5249
5250Having attended several sessions in an effort to gain a sense of what
5251people did and where copyright would affect their lives, PETERS expressed
5252the following cautions:
5253
5254     * If one takes and converts materials and puts them in new forms,
5255     then, from a copyright point of view, one is creating something and
5256     will receive some rights.
5257
5258     * However, if what one is converting already exists, a question
5259     immediately arises about the status of the materials in question.
5260
5261     * Putting something in the public domain in the United States offers
5262     some freedom from anxiety, but distributing it throughout the world
5263     on a network is another matter, even if one has put it in the public
5264     domain in the United States.  Re foreign laws, very frequently a
5265     work can be in the public domain in the United States but protected
5266     in other countries.  Thus, one must consider all of the places a
5267     work may reach, lest one unwittingly become liable to being faced
5268     with a suit for copyright infringement, or at least a letter
5269     demanding discussion of what one is doing.
5270
5271PETERS reviewed copyright law in the United States.  The U.S.
5272Constitution effectively states that Congress has the power to enact
5273copyright laws for two purposes:  1) to encourage the creation and
5274dissemination of intellectual works for the good of society as a whole;
5275and, significantly, 2) to give creators and those who package and
5276disseminate materials the economic rewards that are due them.
5277
5278Congress strives to strike a balance, which at times can become an
5279emotional issue.  The United States has never accepted the notion of the
5280natural right of an author so much as it has accepted the notion of the
5281public good and the desirability of incentives to promote it.  This state
5282of affairs, however, has created strains on the international level and
5283is the reason for several of the differences in the laws that we have.
5284Today the United States protects almost every kind of work that can be
5285called an expression of an author.  The standard for gaining copyright
5286protection is simply originality.  This is a low standard and means that
5287a work is not copied from something else, as well as shows a certain
5288minimal amount of authorship.  One can also acquire copyright protection
5289for making a new version of preexisting material, provided it manifests
5290some spark of creativity.
5291
5292However, copyright does not protect ideas, methods, systems--only the way
5293that one expresses those things.  Nor does copyright protect anything
5294that is mechanical, anything that does not involve choice, or criteria
5295concerning whether or not one should do a thing.  For example, the
5296results of a process called declicking, in which one mechanically removes
5297impure sounds from old recordings, are not copyrightable.  On the other
5298hand, choosing to record a song digitally and to increase the sound of
5299the violins or to bring up the tympani yields results of conversion
5300that are copyrightable.  Moreover, if a work is protected by copyright in
5301the United States, one generally needs the permission of the copyright
5302owner to convert it.  Normally, who will own the new--that is,
5303converted--material is a matter of contract.  In the absence of a contract, the
5304person who creates the new material is the author and owner.  But people
5305do not generally think about the copyright implications until after the
5306fact.  PETERS stressed the need when dealing with copyrighted works to
5307think about copyright in advance.  One's bargaining power is much greater
5308up front than it is down the road.
5309
5310PETERS next discussed works not protected by copyright; for example, any
5311work done by a federal employee as part of his or her official duties is
5312in the public domain in the United States.  The issue is not wholly free
5313of doubt concerning whether or not the work is in the public domain
5314outside the United States.  Other materials in the public domain include:
5315any works published more than seventy-five years ago, and any work
5316published in the United States more than twenty-eight years ago, whose
5317copyright was not renewed.  In talking about the new technology and
5318putting material in a digital form to send all over the world, PETERS
5319cautioned, one must keep in mind that while the rights may not be an
5320issue in the United States, they may be in different parts of the world,
5321where most countries previously employed a copyright term of the life of
5322the author plus fifty years.
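
The public-domain rules PETERS summarized can be expressed as a simple check.
The sketch below encodes only the rules as stated at the 1992 Workshop (the
seventy-five-year term, the twenty-eight-year term without renewal for works
published before 1964, and works of federal employees); the function name and
parameters are assumptions made for illustration, and the check says nothing
about protection outside the United States.

```python
WORKSHOP_YEAR = 1992           # the rules below are as summarized in 1992
AUTOMATIC_RENEWAL_FROM = 1964  # works published from 1964 on need no renewal


def likely_public_domain_us(pub_year, renewed, federal_work=False):
    """Apply the U.S. public-domain rules as PETERS stated them in 1992."""
    if federal_work:
        # Work done by a federal employee as part of official duties.
        return True
    age = WORKSHOP_YEAR - pub_year
    if age > 75:
        # Published more than seventy-five years ago.
        return True
    if pub_year < AUTOMATIC_RENEWAL_FROM and age > 28 and not renewed:
        # Published more than twenty-eight years ago and never renewed.
        return True
    return False


print(likely_public_domain_us(1950, renewed=False))
print(likely_public_domain_us(1950, renewed=True))
```

As PETERS cautioned, a result of True here covers the United States only; the
same work may still be protected in countries that use a term based on the
life of the author.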
5323
5324PETERS next reviewed the economics of copyright holding.  Simply,
5325economic rights are the rights to control the reproduction of a work in
5326any form.  They belong to the author, or in the case of a work made for
5327hire, the employer.  The second right, which is critical to conversion,
5328is the right to change a work.  The right to make new versions is perhaps
5329one of the most significant rights of authors, particularly in an
5330electronic world.  The third right is the right to publish the work and
5331the right to disseminate it, something that everyone who deals in an
5332electronic medium needs to know.  The basic rule is if a copy is sold,
5333all rights of distribution are extinguished with the sale of that copy.
5334The key is that it must be sold.  A number of companies overcome this
5335obstacle by leasing or renting their product.  These companies argue that
5336if the material is rented or leased and not sold, they control the uses
5337of a work.  The fourth right, and one very important in a digital world,
5338is a right of public performance, which means the right to show the work
5339sequentially.  For example, copyright owners control the showing of a
5340CD-ROM product in a public place such as a public library.  The reverse
5341side of public performance is something called the right of public
5342display.  Moral rights also exist, which at the federal level apply only
5343to very limited visual works of art, but in theory may apply under
5344contract and other principles.  Moral rights may include the right of an
5345author to have his or her name on a work, the right of attribution, and
5346the right to object to distortion or mutilation--the right of integrity.
5347
5348The way copyright law is worded gives much latitude to activities such as
5349preservation; to use of material for scholarly and research purposes when
5350the user does not make multiple copies; and to the generation of
5351facsimile copies of unpublished works by libraries for themselves and
5352other libraries.  But the law does not allow anyone to become the
5353distributor of the product for the entire world.  In today's electronic
5354environment, publishers are extremely concerned that the entire world is
5355networked and can obtain the information desired from a single copy in a
5356single library.  Hence, if there is to be only one sale, which publishers
5357may choose to live with, they will obtain their money in other ways, for
5358example, from access and use.  Hence, the development of site licenses
5359and other kinds of agreements to cover what publishers believe they
5360should be compensated for.  Any solution that the United States takes
5361today has to consider the international arena.
5362
5363Noting that the United States is a member of the Berne Convention and
5364subscribes to its provisions, PETERS described the permissions process.
5365She also defined compulsory licenses.  A compulsory license, of which the
5366United States has had a few, builds into the law the right to use a work
5367subject to certain terms and conditions.  In the international arena,
5368however, the ability to use compulsory licenses is extremely limited.
5369Thus, clearinghouses and other collectives comprise one option that has
5370succeeded in providing for use of a work.  Often overlooked when one
5371begins to use copyrighted material and put products together is how
5372expensive the permissions process and managing it is.  According to
5373PETERS, the price of copyright in a digital medium, whatever solution is
5374worked out, will include managing and assembling the database.  She
5375strongly recommended that publishers and librarians or people with
5376various backgrounds cooperate to work out administratively feasible
5377systems, in order to produce better results.
5378
5379In the lengthy question-and-answer period that followed PETERS's
5380presentation, the following points emerged:
5381
5382     * The Copyright Office maintains that anything mechanical and
5383     totally exhaustive probably is not protected.  In the event that
5384     what an individual did in developing potentially copyrightable
5385     material is not understood, the Copyright Office will ask about the
5386     creative choices the applicant chose to make or not to make.  As a
5387     practical matter, if one believes she or he has made enough of those
5388     choices, that person has a right to assert a copyright and someone
5389     else must assert that the work is not copyrightable.  The more
5390     mechanical, the more automatic, a thing is, the less likely it is to
5391     be copyrightable.
5392
5393     * Nearly all photographs are deemed to be copyrightable, but no one
5394     worries about them much, because everyone is free to take the same
5395     image.  Thus, a photographic copyright represents what is called a
5396     "thin" copyright.  The photograph itself must be duplicated, in
5397     order for copyright to be violated.
5398
5399     * The Copyright Office takes the position that X-rays are not
5400     copyrightable because they are mechanical.  It can be argued
5401     whether or not image enhancement in scanning can be protected.  One
5402     must exercise care with material created with public funds and
5403     generally in the public domain.  An article written by a federal
5404     employee, if written as part of official duties, is not
5405     copyrightable.  However, control over a scientific article written
5406     by a National Institutes of Health grantee (i.e., someone who
5407     receives money from the U.S. government), depends on NIH policy.  If
5408     the government agency has no policy (and that policy can be
5409     contained in its regulations, the contract, or the grant), the
5410     author retains copyright.  If a provision of the contract, grant, or
5411     regulation states that there will be no copyright, then it does not
5412     exist.  When a work is created, copyright automatically comes into
5413     existence unless something exists that says it does not.
5414
5415     * An enhanced electronic copy of a print copy of an older reference
5416     work in the public domain that does not contain copyrightable new
5417     material is a purely mechanical rendition of the original work, and
5418     is not copyrightable.
5419
5420     * Usually, when a work enters the public domain, nothing can remove
5421     it.  For example, Congress recently passed into law the concept of
5422     automatic renewal, which means that copyright on any work published
5423     between 1964 and 1978 does not have to be renewed in order to
5424     receive a seventy-five-year term.  But any work not renewed before
5425     1964 is in the public domain.
5426
5427     * Concerning whether or not the United States keeps track of when
5428     authors die, nothing was ever done, nor is anything being done at
5429     the moment by the Copyright Office.
5430
5431     * Software that drives a mechanical process is itself copyrightable.
5432     If one changes platforms, the software itself has a copyright.  The
5433     World Intellectual Property Organization will hold a symposium 28
5434     March through 2 April 1993, at Harvard University, on digital
5435     technology, and will study this entire issue.  If one purchases a
5436     computer software package, such as MacPaint, and creates something
5437     new, one receives protection only for that which has been added.
5438
5439PETERS added that often in copyright matters, rough justice is the
5440outcome, for example, in collective licensing, ASCAP (i.e., American
5441Society of Composers, Authors, and Publishers), and BMI (i.e., Broadcast
5442Music, Inc.), where it may seem that the big guys receive more than their
5443due.  Of course, people ought not to copy a creative product without
5444paying for it; there should be some compensation.  But the truth of the
5445world, and it is not a great truth, is that the big guy gets played on
5446the radio more frequently than the little guy, who has to do much more
5447until he becomes a big guy.  That is true of every author, every
5448composer, everyone, and, unfortunately, is part of life.
5449
5450Copyright always originates with the author, except in cases of works
5451made for hire.  (Most software falls into this category.)  When an author
5452sends his article to a journal, he has not relinquished copyright, though
5453he retains the right to relinquish it.  The author receives absolutely
5454everything.  The less prominent the author, the more leverage the
5455publisher will have in contract negotiations.  In order to transfer the
5456rights, the author must sign an agreement giving them away.
5457
5458In an electronic society, it is important to be able to license a writer
5459and work out deals.  With regard to use of a work, it usually is much
5460easier when a publisher holds the rights.  In an electronic era, a real
5461problem arises when one is digitizing and making information available.
5462PETERS referred again to electronic licensing clearinghouses.  Copyright
5463ought to remain with the author, but as one moves forward globally in the
5464electronic arena, a middleman who can handle the various rights becomes
5465increasingly necessary.
5466
5467The notion of copyright law is that it resides with the individual, but
5468in an on-line environment, where a work can be adapted and tinkered with
5469by many individuals, there is concern.  If changes are authorized and
5470there is no agreement to the contrary, the person who changes a work owns
5471the changes.  To put it another way, the person who acquires permission
5472to change a work technically will become the author and the owner, unless
5473some agreement to the contrary has been made.  It is typical for the
5474original publisher to try to control all of the versions and all of the
5475uses.  Copyright law always only sets up the boundaries.  Anything can be
5476changed by contract.
5477
5478                                 ******
5479
5480SESSION VII.  CONCLUSION
5481
5482+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5483GENERAL DISCUSSION * Two questions for discussion * Different emphases in
5484the Workshop * Bringing the text and image partisans together *
5485Desiderata in planning the long-term development of something * Questions
5486surrounding the issue of electronic deposit * Discussion of electronic
5487deposit as an allusion to the issue of standards * Need for a directory
5488of preservation projects in digital form and for access to their
5489digitized files * CETH's catalogue of machine-readable texts in the
5490humanities * What constitutes a publication in the electronic world? *
5491Need for LC to deal with the concept of on-line publishing * LC's Network
5492Development Office exploring the limits of MARC as a standard in terms
5493of handling electronic information * Magnitude of the problem and the
5494need for distributed responsibility in order to maintain and store
5495electronic information * Workshop participants to be viewed as a starting
5496point * Development of a network version of AM urged * A step toward AM's
5497construction of some sort of apparatus for network access * A delicate
5498and agonizing policy question for LC * Re the issue of electronic
5499deposit, LC urged to initiate a catalytic process in terms of distributed
5500responsibility * Suggestions for cooperative ventures * Commercial
5501publishers' fears * Strategic questions for getting the image and text
5502people to think through long-term cooperation * Clarification of the
5503driving force behind both the Perseus and the Cornell Xerox projects *
5504+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
5505
5506In his role as moderator of the concluding session, GIFFORD raised two
5507questions he believed would benefit from discussion:  1) Are there enough
5508commonalities among those of us that have been here for two days so that
5509we can see courses of action that should be taken in the future?  And, if
5510so, what are they and who might take them?  2) Partly derivative from
5511that, but obviously very dangerous to LC as host, do you see a role for
5512the Library of Congress in all this?  Of course, the Library of Congress
5513holds a rather special status in a number of these matters, because it is
5514not perceived as a player with an economic stake in them, but are there
5515roles that LC can play that can help advance us toward where we are heading?
5516
5517Describing himself as an uninformed observer of the technicalities of the
5518last two days, GIFFORD detected three different emphases in the Workshop:
55191) people who are very deeply committed to text; 2) people who are almost
5520passionate about images; and 3) a few people who are very committed to
5521what happens to the networks.  In other words, the new networking
5522dimension, the accessibility of the processability, the portability of
5523all this across the networks.  How do we pull those three together?
5524
5525Adding a question that reflected HOCKEY's comment that this was the
5526fourth workshop she had attended in the previous thirty days, FLEISCHHAUER
5527wondered to what extent this meeting had reinvented the wheel, or if it
5528had contributed anything in the way of bringing together a different group
5529of people from those who normally appear on the workshop circuit.
5530
5531HOCKEY confessed to being struck at this meeting and the one the
5532Electronic Peirce Consortium organized the previous week that this was a
5533coming together of people working on texts and not images.  Attempting to
5534bring the two together is something we ought to be thinking about for the
5535future:  How one can think about working with image material to begin
5536with, but structuring it and digitizing it in such a way that at a later
5537stage it can be interpreted into text, and find a common way of building
5538text and images together so that they can be used jointly in the future,
5539with the network support to begin there because that is how people will
5540want to access it.
5541
5542In planning the long-term development of something, which is what is
5543being done in electronic text, HOCKEY stressed the importance not only
5544of discussing the technical aspects of how one does it but particularly
5545of thinking about what the people who use the stuff will want to do.
5546But conversely, there are numerous things that people start to do with
5547electronic text or material that nobody ever thought of in the beginning.
5548
5549LESK, in response to the question concerning the role of the Library of
5550Congress, remarked on the often-suggested desideratum of having electronic
5551deposit:  Since everything is now computer-typeset, an entire decade of
5552material that was machine-readable exists, but the publishers frequently
5553did not save it; has LC taken any action to have its copyright deposit
5554operation start collecting these machine-readable versions?  In the
5555absence of PETERS, GIFFORD replied that the question was being
5556actively considered but that that was only one dimension of the problem.
5557Another dimension is the whole question of the integrity of the original
5558electronic document.  It becomes highly important in science to prove
5559authorship.  How will that be done?
5560
5561ERWAY explained that, under the old policy, to make a claim for a
5562copyright for works that were published in electronic form, including
5563software, one had to submit a paper copy of the first and last twenty
5564pages of code--something that represented the work but did not include
5565the entire work itself and had little value to anyone.  As a temporary
5566measure, LC has claimed the right to demand electronic versions of
5567electronic publications.  This measure entails a proactive role for the
5568Library to say that it wants a particular electronic version.  Publishers
5569then have perhaps a year to submit it.  But the real problem for LC is
5570what to do with all this material in all these different formats.  Will
5571the Library mount it?  How will it give people access to it?  How does LC
5572keep track of the appropriate computers, software, and media?  The situation
5573is so hard to control, ERWAY said, that it makes sense for each publishing
5574house to maintain its own archive.  But LC cannot enforce that either.
5575
5576GIFFORD acknowledged LESK's suggestion that establishing a priority
5577offered the solution, albeit a fairly complicated one.  But who maintains
5578that register? he asked.  GRABER noted that LC does attempt to collect a
5579Macintosh version and an IBM-compatible version of software.  It does
5580not collect other versions.  But while true for software, BYRUM observed,
5581this reply does not speak to materials, that is, all the materials that
5582were published that were on somebody's microcomputer or driver tapes
5583at a publishing office across the country.  LC does well to acquire
5584specific machine-readable products selectively that were intended to be
5585machine-readable.  Materials that were in machine-readable form at one time,
5586BYRUM said, would be beyond LC's capability at the moment, insofar as
5587attempting to acquire, organize, and preserve them are concerned--and
5588preservation would be the most important consideration.  In this
5589connection, GIFFORD reiterated the need to work out some sense of
5590distributive responsibility for a number of these issues, which
5591inevitably will require significant cooperation and discussion.
5592Nobody can do it all.
5593
5594LESK suggested that some publishers may look with favor on LC beginning
5595to serve as a depository of tapes in an electronic manuscript standard.
5596Publishers may view this as a service that they did not have to perform
5597and they might send in tapes.  However, SPERBERG-McQUEEN countered,
5598although publishers have had equivalent services available to them for a
5599long time, the electronic text archive has never turned tapes away or been
5600flooded with them and is forever sending feedback to the depositor.
5601Some publishers do send in tapes.
5602
5603ANDRE viewed this discussion as an allusion to the issue of standards.
5604She recommended that the AAP standard and the TEI, which has already been
5605somewhat harmonized internationally and which also shares several
5606compatibilities with the AAP, be harmonized to ensure sufficient
5607compatibility in the software.  She drew the line at saying LC ought to
5608be the locus or forum for such harmonization.
5609
5610Taking the group in a slightly different direction, but one where at
least in the near term LC might play a helpful role, LYNCH remarked on the
plans of a number of projects to carry out preservation by creating
digital images that will end up in on-line or near-line storage at some
institution.  Presumably, LC will link this material somehow to its
5615on-line catalog in most cases.  Thus, it is in a digital form.  LYNCH had
5616the impression that many of these institutions would be willing to make
5617those files accessible to other people outside the institution, provided
5618that there is no copyright problem.  This desideratum will require
5619propagating the knowledge that those digitized files exist, so that they
5620can end up in other on-line catalogs.  Although uncertain about the
5621mechanism for achieving this result, LYNCH said that it warranted
5622scrutiny because it seemed to be connected to some of the basic issues of
cataloging and distribution of records.  It would be foolish, given the
5624amount of work that all of us have to do and our meager resources, to
5625discover multiple institutions digitizing the same work.  Re microforms,
5626LYNCH said, we are in pretty good shape.
5627
5628BATTIN called this a big problem and noted that the Cornell people (who
5629had already departed) were working on it.  At issue from the beginning
5630was to learn how to catalog that information into RLIN and then into
5631OCLC, so that it would be accessible.  That issue remains to be resolved.
5632LYNCH rejoined that putting it into OCLC or RLIN was helpful insofar as
5633somebody who is thinking of performing preservation activity on that work
5634could learn about it.  It is not necessarily helpful for institutions to
5635make that available.  BATTIN opined that the idea was that it not only be
5636for preservation purposes but for the convenience of people looking for
5637this material.  She endorsed LYNCH's dictum that duplication of this
5638effort was to be avoided by every means.
5639
5640HOCKEY informed the Workshop about one major current activity of CETH,
5641namely a catalogue of machine-readable texts in the humanities.  Held on
5642RLIN at present, the catalogue has been concentrated on ASCII as opposed
5643to digitized images of text.  She is exploring ways to improve the
5644catalogue and make it more widely available, and welcomed suggestions
5645about these concerns.  CETH owns the records, which are not just
5646restricted to RLIN, and can distribute them however it wishes.
5647
5648Taking up LESK's earlier question, BATTIN inquired whether LC, since it
5649is accepting electronic files and designing a mechanism for dealing with
5650that rather than putting books on shelves, would become responsible for
5651the National Copyright Depository of Electronic Materials.  Of course
5652that could not be accomplished overnight, but it would be something LC
5653could plan for.  GIFFORD acknowledged that much thought was being devoted
5654to that set of problems and returned the discussion to the issue raised
by LYNCH--whether putting the kind of records that both BATTIN and
HOCKEY have been talking about in RLIN would be a satisfactory solution.
5657It seemed to him that RLIN answered LYNCH's original point concerning
5658some kind of directory for these kinds of materials.  In a situation
5659where somebody is attempting to decide whether or not to scan this or
5660film that or to learn whether or not someone has already done so, LYNCH
5661suggested, RLIN is helpful, but it is not helpful in the case of a local,
on-line catalogue.  Further, one would like one's own system to be aware
that a work exists in digital form, so that one can present it to a
patron, even though one did not digitize it, if it is out of copyright.
5665The only way to make those linkages would be to perform a tremendous
5666amount of real-time look-up, which would be awkward at best, or
5667periodically to yank the whole file from RLIN and match it against one's
5668own stuff, which is a nuisance.
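
[The second option LYNCH describes, periodically pulling the whole file
and matching it against local holdings, can be sketched in a few lines of
Python.  This is a minimal illustration only, not a description of any
RLIN interface:  the Record fields, the match_key normalization, the
find_digitized function, and the sample titles are all hypothetical, and
real bibliographic matching would need far more careful keys than title,
author, and year.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A hypothetical, much-simplified bibliographic record."""
    title: str
    author: str
    year: int


def match_key(rec: Record) -> tuple:
    """Normalize case and whitespace so trivially different entries compare equal."""
    def norm(text: str) -> str:
        return " ".join(text.lower().split())
    return (norm(rec.title), norm(rec.author), rec.year)


def find_digitized(local_catalogue, union_file):
    """Return {local record: [institutions]} for works already digitized elsewhere.

    union_file is an iterable of (Record, institution) pairs standing in for
    a periodic extract of a shared file.  It is indexed once, so each local
    record then costs a single dictionary look-up.
    """
    index = {}
    for rec, institution in union_file:
        index.setdefault(match_key(rec), []).append(institution)

    matches = {}
    for rec in local_catalogue:
        holders = index.get(match_key(rec))
        if holders:
            matches[rec] = holders
    return matches


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    local = [
        Record("A Treatise on Algebra", "Smith, J.", 1890),
        Record("Notes on Geometry", "Jones, A.", 1902),
    ]
    extract = [
        (Record("a treatise  on algebra", "SMITH, J.", 1890), "Institution A"),
    ]
    for rec, holders in find_digitized(local, extract).items():
        print(rec.title, "->", holders)
```

[Indexing the extract once keeps each local look-up cheap, which is the
main attraction of batch matching over real-time queries; the price is
that results are only as current as the last extract.]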
5669
5670But where, ERWAY inquired, does one stop including things that are
available via the Internet, for instance, in one's local catalogue?
5672It almost seems that that is LC's means to acquire access to them.
5673That represents LC's new form of library loan.  Perhaps LC's new on-line
5674catalogue is an amalgamation of all these catalogues on line.  LYNCH
5675conceded that perhaps that was true in the very long term, but was not
5676applicable to scanning in the short term.  In his view, the totals cited
5677by Yale, 10,000 books over perhaps a four-year period, and 1,000-1,500
5678books from Cornell, were not big numbers, while searching all over
5679creation for relatively rare occurrences will prove to be less efficient.
5680As GIFFORD wondered if this would not be a separable file on RLIN and
5681could be requested from them, BATTIN interjected that it was easily
5682accessible to an institution.  SEVERTSON pointed out that that file, cum
5683enhancements, was available with reference information on CD-ROM, which
5684makes it a little more available.
5685
5686In HOCKEY's view, the real question facing the Workshop is what to put in
5687this catalogue, because that raises the question of what constitutes a
5688publication in the electronic world.  (WEIBEL interjected that Eric Joule
5689in OCLC's Office of Research is also wrestling with this particular
5690problem, while GIFFORD thought it sounded fairly generic.)  HOCKEY
5691contended that a majority of texts in the humanities are in the hands
5692of either a small number of large research institutions or individuals
5693and are not generally available for anyone else to access at all.
5694She wondered if these texts ought to be catalogued.
5695
5696After argument proceeded back and forth for several minutes over why
5697cataloguing might be a necessary service, LEBRON suggested that this
5698issue involved the responsibility of a publisher.  The fact that someone
5699has created something electronically and keeps it under his or her
5700control does not constitute publication.  Publication implies
5701dissemination.  While it would be important for a scholar to let other
5702people know that this creation exists, in many respects this is no
different from an unpublished manuscript.  That is what is being accessed
here, except that one now looks at it in the electronic rather than the
hard-copy environment.
5706
5707LEBRON expressed puzzlement at the variety of ways electronic publishing
5708has been viewed.  Much of what has been discussed throughout these two
5709days has concerned CD-ROM publishing, whereas in the on-line environment
5710that she confronts, the constraints and challenges are very different.
5711Sooner or later LC will have to deal with the concept of on-line
5712publishing.  Taking up the comment ERWAY made earlier about storing
5713copies, LEBRON gave her own journal as an example.  How would she deposit
5714OJCCT for copyright?, she asked, because the journal will exist in the
5715mainframe at OCLC and people will be able to access it.  Here the
5716situation is different, ownership versus access, and is something that
5717arises with publication in the on-line environment, faster than is
5718sometimes realized.  Lacking clear answers to all of these questions
5719herself, LEBRON did not anticipate that LC would be able to take a role
5720in helping to define some of them for quite a while.
5721
5722GREENFIELD observed that LC's Network Development Office is attempting,
5723among other things, to explore the limits of MARC as a standard in terms
5724of handling electronic information.  GREENFIELD also noted that Rebecca
5725GUENTHER from that office gave a paper to the American Society for
5726Information Science (ASIS) summarizing several of the discussion papers
5727that were coming out of the Network Development Office.  GREENFIELD said
5728he understood that that office had a list-server soliciting just the kind
5729of feedback received today concerning the difficulties of identifying and
5730cataloguing electronic information.  GREENFIELD hoped that everybody
5731would be aware of that and somehow contribute to that conversation.
5732
5733Noting two of LC's roles, first, to act as a repository of record for
5734material that is copyrighted in this country, and second, to make
5735materials it holds available in some limited form to a clientele that
5736goes beyond Congress, BESSER suggested that it was incumbent on LC to
5737extend those responsibilities to all the things being published in
5738electronic form.  This would mean eventually accepting electronic
5739formats.  LC could require that at some point they be in a certain
5740limited set of formats, and then develop mechanisms for allowing people
5741to access those in the same way that other things are accessed.  This
5742does not imply that they are on the network and available to everyone.
5743LC does that with most of its bibliographic records, BESSER said, which
5744end up migrating to the utility (e.g., OCLC) or somewhere else.  But just
5745as most of LC's books are available in some form through interlibrary
5746loan or some other mechanism, so in the same way electronic formats ought
5747to be available to others in some format, though with some copyright
5748considerations.  BESSER was not suggesting that these mechanisms be
5749established tomorrow, only that they seemed to fall within LC's purview,
5750and that there should be long-range plans to establish them.
5751
5752Acknowledging that those from LC in the room agreed with BESSER
5753concerning the need to confront difficult questions, GIFFORD underscored
5754the magnitude of the problem of what to keep and what to select.  GIFFORD
5755noted that LC currently receives some 31,000 items per day, not counting
5756electronic materials, and argued for much more distributed responsibility
5757in order to maintain and store electronic information.
5758
5759BESSER responded that the assembled group could be viewed as a starting
5760point, whose initial operating premise could be helping to move in this
5761direction and defining how LC could do so, for example, in areas of
5762standardization or distribution of responsibility.
5763
5764FLEISCHHAUER added that AM was fully engaged, wrestling with some of the
5765questions that pertain to the conversion of older historical materials,
5766which would be one thing that the Library of Congress might do.  Several
5767points mentioned by BESSER and several others on this question have a
5768much greater impact on those who are concerned with cataloguing and the
5769networking of bibliographic information, as well as preservation itself.
5770
Speaking directly to AM, which he considered a largely uncopyrighted
5772database, LYNCH urged development of a network version of AM, or
5773consideration of making the data in it available to people interested in
5774doing network multimedia.  On account of the current great shortage of
5775digital data that is both appealing and unencumbered by complex rights
5776problems, this course of action could have a significant effect on making
5777network multimedia a reality.
5778
5779In this connection, FLEISCHHAUER reported on a fragmentary prototype in
5780LC's Office of Information Technology Services that attempts to associate
5781digital images of photographs with cataloguing information in ways that
5782work within a local area network--a step, so to say, toward AM's
5783construction of some sort of apparatus for access.  Further, AM has
5784attempted to use standard data forms in order to help make that
5785distinction between the access tools and the underlying data, and thus
5786believes that the database is networkable.
5787
5788A delicate and agonizing policy question for LC, however, which comes
5789back to resources and unfortunately has an impact on this, is to find
5790some appropriate, honorable, and legal cost-recovery possibilities.  A
5791certain skittishness concerning cost-recovery has made people unsure
5792exactly what to do.  AM would be highly receptive to discussing further
5793LYNCH's offer to test or demonstrate its database in a network
5794environment, FLEISCHHAUER said.
5795
5796Returning the discussion to what she viewed as the vital issue of
5797electronic deposit, BATTIN recommended that LC initiate a catalytic
5798process in terms of distributed responsibility, that is, bring together
5799the distributed organizations and set up a study group to look at all
5800these issues and see where we as a nation should move.  The broader
5801issues of how we deal with the management of electronic information will
5802not disappear, but only grow worse.
5803
5804LESK took up this theme and suggested that LC attempt to persuade one
5805major library in each state to deal with its state equivalent publisher,
5806which might produce a cooperative project that would be equitably
5807distributed around the country, and one in which LC would be dealing with
5808a minimal number of publishers and minimal copyright problems.
5809
GRABER remarked on the recent development in the scientific community of a
willingness to use SGML and either deposit or interchange in a fairly
standardized format.  He wondered if a similar movement was taking place
5813in the humanities.  Although the National Library of Medicine found only
5814a few publishers to cooperate in a like venture two or three years ago, a
5815new effort might generate a much larger number willing to cooperate.
5816
5817KIMBALL recounted his unit's (Machine-Readable Collections Reading Room)
5818troubles with the commercial publishers of electronic media in acquiring
5819materials for LC's collections, in particular the publishers' fear that
5820they would not be able to cover their costs and would lose control of
5821their products, that LC would give them away or sell them and make
5822profits from them.  He doubted that the publishing industry was prepared
5823to move into this area at the moment, given its resistance to allowing LC
5824to use its machine-readable materials as the Library would like.
5825
5826The copyright law now addresses compact disk as a medium, and LC can
5827request one copy of that, or two copies if it is the only version, and
5828can request copies of software, but that fails to address magazines or
5829books or anything like that which is in machine-readable form.
5830
5831GIFFORD acknowledged the thorny nature of this issue, which he illustrated
5832with the example of the cumbersome process involved in putting a copy of a
5833scientific database on a LAN in LC's science reading room.  He also
5834acknowledged that LC needs help and could enlist the energies and talents
5835of Workshop participants in thinking through a number of these problems.
5836
5837GIFFORD returned the discussion to getting the image and text people to
5838think through together where they want to go in the long term.  MYLONAS
5839conceded that her experience at the Pierce Symposium the previous week at
5840Georgetown University and this week at LC had forced her to reevaluate
5841her perspective on the usefulness of text as images.  MYLONAS framed the
5842issues in a series of questions:  How do we acquire machine-readable
5843text?  Do we take pictures of it and perform OCR on it later?  Is it
5844important to obtain very high-quality images and text, etc.?
5845FLEISCHHAUER agreed with MYLONAS's framing of strategic questions, adding
5846that a large institution such as LC probably has to do all of those
5847things at different times.  Thus, the trick is to exercise judgment.  The
5848Workshop had added to his and AM's considerations in making those
5849judgments.  Concerning future meetings or discussions, MYLONAS suggested
5850that screening priorities would be helpful.
5851
5852WEIBEL opined that the diversity reflected in this group was a sign both
5853of the health and of the immaturity of the field, and more time would
5854have to pass before we convince one another concerning standards.
5855
5856An exchange between MYLONAS and BATTIN clarified the point that the
5857driving force behind both the Perseus and the Cornell Xerox projects was
5858the preservation of knowledge for the future, not simply for particular
5859research use.  In the case of Perseus, MYLONAS said, the assumption was
5860that the texts would not be entered again into electronically readable
5861form.  SPERBERG-McQUEEN added that a scanned image would not serve as an
5862archival copy for purposes of preservation in the case of, say, the Bill
5863of Rights, in the sense that the scanned images are effectively the
5864archival copies for the Cornell mathematics books.
5865
5866
5867               ***   ***   ***   ******   ***   ***   ***
5868
5869
5870                          Appendix I:  PROGRAM
5871
5872
5873
5874                                WORKSHOP
5875                                   ON
5876                               ELECTRONIC
5877                                  TEXTS
5878
5879
5880
5881                             9-10 June 1992
5882
5883                           Library of Congress
5884                            Washington, D.C.
5885
5886
5887
5888    Supported by a Grant from the David and Lucile Packard Foundation
5889
5890
5891Tuesday, 9 June 1992
5892
5893NATIONAL DEMONSTRATION LAB, ATRIUM, LIBRARY MADISON
5894
58958:30 AM   Coffee and Danish, registration
5896
58979:00 AM   Welcome
5898
5899          Prosser Gifford, Director for Scholarly Programs, and Carl
5900             Fleischhauer, Coordinator, American Memory, Library of
5901             Congress
5902
9:15 AM   Session I.  Content in a New Form:  Who Will Use It and What
5904          Will They Do?
5905
5906          Broad description of the range of electronic information.
5907          Characterization of who uses it and how it is or may be used.
5908          In addition to a look at scholarly uses, this session will
5909          include a presentation on use by students (K-12 and college)
5910          and the general public.
5911
5912          Moderator:  James Daly
5913          Avra Michelson, Archival Research and Evaluation Staff,
5914             National Archives and Records Administration (Overview)
5915          Susan H. Veccia, Team Leader, American Memory, User Evaluation,
5916             and
5917          Joanne Freeman, Associate Coordinator, American Memory, Library
5918             of Congress (Beyond the scholar)
5919
592010:30-
592111:00 AM  Break
5922
592311:00 AM  Session II.  Show and Tell.
5924
5925          Each presentation to consist of a fifteen-minute
5926          statement/show; group discussion will follow lunch.
5927
5928          Moderator:  Jacqueline Hess, Director, National Demonstration
5929             Lab
5930
5931            1.  A classics project, stressing texts and text retrieval
5932                more than multimedia:  Perseus Project, Harvard
5933                University
5934                Elli Mylonas, Managing Editor
5935
5936            2.  Other humanities projects employing the emerging norms of
5937                the Text Encoding Initiative (TEI):  Chadwyck-Healey's
5938                The English Poetry Full Text Database and/or Patrologia
5939                Latina Database
5940                Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc.
5941
5942            3.  American Memory
5943                Carl Fleischhauer, Coordinator, and
5944                Ricky Erway, Associate Coordinator, Library of Congress
5945
5946            4.  Founding Fathers example from Packard Humanities
5947                Institute:  The Papers of George Washington, University
5948                of Virginia
5949                Dorothy Twohig, Managing Editor, and/or
5950                David Woodley Packard
5951
5952            5.  An electronic medical journal offering graphics and
5953                full-text searchability:  The Online Journal of Current
5954                Clinical Trials, American Association for the Advancement
5955                of Science
5956                Maria L. Lebron, Managing Editor
5957
5958            6.  A project that offers facsimile images of pages but omits
5959                searchable text:  Cornell math books
5960                Lynne K. Personius, Assistant Director, Cornell
5961                   Information Technologies for Scholarly Information
5962                   Sources, Cornell University
5963
596412:30 PM  Lunch  (Dining Room A, Library Madison 620.  Exhibits
5965          available.)
5966
59671:30 PM   Session II.  Show and Tell (Cont'd.).
5968
59693:00-
59703:30 PM   Break
5971
59723:30-
59735:30 PM   Session III.  Distribution, Networks, and Networking:  Options
5974          for Dissemination.
5975
5976          Published disks:  University presses and public-sector
5977             publishers, private-sector publishers
5978          Computer networks
5979
5980          Moderator:  Robert G. Zich, Special Assistant to the Associate
5981             Librarian for Special Projects, Library of Congress
5982          Clifford A. Lynch, Director, Library Automation, University of
5983             California
5984          Howard Besser, School of Library and Information Science,
5985             University of Pittsburgh
5986          Ronald L. Larsen, Associate Director of Libraries for
5987             Information Technology, University of Maryland at College
5988             Park
5989          Edwin B. Brownrigg, Executive Director, Memex Research
5990             Institute
5991
59926:30 PM   Reception  (Montpelier Room, Library Madison 619.)
5993
5994                                 ******
5995
5996Wednesday, 10 June 1992
5997
5998DINING ROOM A, LIBRARY MADISON 620
5999
60008:30 AM   Coffee and Danish
6001
60029:00 AM   Session IV.  Image Capture, Text Capture, Overview of Text and
6003          Image Storage Formats.
6004
6005          Moderator:  William L. Hooton, Vice President of Operations,
6006             I-NET
6007
6008          A) Principal Methods for Image Capture of Text:
6009             Direct scanning
6010             Use of microform
6011
6012          Anne R. Kenney, Assistant Director, Department of Preservation
6013             and Conservation, Cornell University
6014          Pamela Q.J. Andre, Associate Director, Automation, and
6015          Judith A. Zidar, Coordinator, National Agricultural Text
6016             Digitizing Program (NATDP), National Agricultural Library
6017             (NAL)
6018          Donald J. Waters, Head, Systems Office, Yale University Library
6019
6020          B) Special Problems:
6021             Bound volumes
6022             Conservation
6023             Reproducing printed halftones
6024
6025          Carl Fleischhauer, Coordinator, American Memory, Library of
6026             Congress
6027          George Thoma, Chief, Communications Engineering Branch,
6028             National Library of Medicine (NLM)
6029
603010:30-
603111:00 AM  Break
6032
603311:00 AM  Session IV.  Image Capture, Text Capture, Overview of Text and
6034          Image Storage Formats (Cont'd.).
6035
6036          C) Image Standards and Implications for Preservation
6037
6038          Jean Baronas, Senior Manager, Department of Standards and
6039             Technology, Association for Information and Image Management
6040             (AIIM)
6041          Patricia Battin, President, The Commission on Preservation and
6042             Access (CPA)
6043
6044          D) Text Conversion:
6045             OCR vs. rekeying
6046             Standards of accuracy and use of imperfect texts
6047             Service bureaus
6048
6049          Stuart Weibel, Senior Research Specialist, Online Computer
6050             Library Center, Inc. (OCLC)
6051          Michael Lesk, Executive Director, Computer Science Research,
6052             Bellcore
6053          Ricky Erway, Associate Coordinator, American Memory, Library of
6054             Congress
6055          Pamela Q.J. Andre, Associate Director, Automation, and
6056          Judith A. Zidar, Coordinator, National Agricultural Text
6057             Digitizing Program (NATDP), National Agricultural Library
6058             (NAL)
6059
606012:30-
60611:30 PM   Lunch
6062
60631:30 PM   Session V.  Approaches to Preparing Electronic Texts.
6064
6065          Discussion of approaches to structuring text for the computer;
6066          pros and cons of text coding, description of methods in
6067          practice, and comparison of text-coding methods.
6068
6069          Moderator:  Susan Hockey, Director, Center for Electronic Texts
6070             in the Humanities (CETH), Rutgers and Princeton Universities
6071          David Woodley Packard
6072          C.M. Sperberg-McQueen, Editor, Text Encoding Initiative (TEI),
6073             University of Illinois-Chicago
6074          Eric M. Calaluca, Vice President, Chadwyck-Healey, Inc.
6075
60763:30-
60774:00 PM   Break
6078
60794:00 PM   Session VI.  Copyright Issues.
6080
6081          Marybeth Peters, Policy Planning Adviser to the Register of
6082             Copyrights, Library of Congress
6083
60845:00 PM   Session VII. Conclusion.
6085
6086          General discussion.
6087          What topics were omitted or given short shrift that anyone
6088             would like to talk about now?
6089          Is there a "group" here?  What should the group do next, if
6090             anything?  What should the Library of Congress do next, if
6091             anything?
6092          Moderator:  Prosser Gifford, Director for Scholarly Programs,
6093             Library of Congress
6094
60956:00 PM   Adjourn
6096
6097
6098               ***   ***   ***   ******   ***   ***   ***
6099
6100
6101                         Appendix II:  ABSTRACTS
6102
6103
6104SESSION I
6105
6106Avra MICHELSON           Forecasting the Use of Electronic Texts by
6107                         Social Sciences and Humanities Scholars
6108
6109This presentation explores the ways in which electronic texts are likely
6110to be used by the non-scientific scholarly community.  Many of the
6111remarks are drawn from a report the speaker coauthored with Jeff
6112Rothenberg, a computer scientist at The RAND Corporation.
6113
6114The speaker assesses 1) current scholarly use of information technology
6115and 2) the key trends in information technology most relevant to the
6116research process, in order to predict how social sciences and humanities
6117scholars are apt to use electronic texts.  In introducing the topic,
6118current use of electronic texts is explored broadly within the context of
6119scholarly communication.  From the perspective of scholarly
6120communication, the work of humanities and social sciences scholars
6121involves five processes:  1) identification of sources, 2) communication
6122with colleagues, 3) interpretation and analysis of data, 4) dissemination
6123of research findings, and 5) curriculum development and instruction.  The
6124extent to which computation currently permeates aspects of scholarly
6125communication represents a viable indicator of the prospects for
6126electronic texts.
6127
6128The discussion of current practice is balanced by an analysis of key
6129trends in the scholarly use of information technology.  These include the
6130trends toward end-user computing and connectivity, which provide a
6131framework for forecasting the use of electronic texts through this
6132millennium.  The presentation concludes with a summary of the ways in
6133which the nonscientific scholarly community can be expected to use
6134electronic texts, and the implications of that use for information
6135providers.
6136
6137Susan VECCIA and Joanne FREEMAN    Electronic Archives for the Public:
6138                                   Use of American Memory in Public and
6139                                   School Libraries
6140
6141This joint discussion focuses on nonscholarly applications of electronic
6142library materials, specifically addressing use of the Library of Congress
6143American Memory (AM) program in a small number of public and school
6144libraries throughout the United States.  AM consists of selected Library
6145of Congress primary archival materials, stored on optical media
6146(CD-ROM/videodisc), and presented with little or no editing.  Many
6147collections are accompanied by electronic introductions and user's guides
6148offering background information and historical context.  Collections
6149represent a variety of formats including photographs, graphic arts,
6150motion pictures, recorded sound, music, broadsides and manuscripts,
6151books, and pamphlets.
6152
6153In 1991, the Library of Congress began a nationwide evaluation of AM in
6154different types of institutions.  Test sites include public libraries,
6155elementary and secondary school libraries, college and university
6156libraries, state libraries, and special libraries.  Susan VECCIA and
6157Joanne FREEMAN will discuss their observations on the use of AM by the
6158nonscholarly community, using evidence gleaned from this ongoing
6159evaluation effort.
6160
6161VECCIA will comment on the overall goals of the evaluation project, and
6162the types of public and school libraries included in this study.  Her
6163comments on nonscholarly use of AM will focus on the public library as a
6164cultural and community institution, often bridging the gap between formal
6165and informal education.  FREEMAN will discuss the use of AM in school
6166libraries.  Use by students and teachers has revealed some broad
6167questions about the use of electronic resources, as well as definite
6168benefits gained by the "nonscholar."  Topics will include the problem of
6169grasping content and context in an electronic environment, the stumbling
6170blocks created by "new" technologies, and the unique skills and interests
6171awakened through use of electronic resources.
6172
6173SESSION II
6174
6175Elli MYLONAS             The Perseus Project:  Interactive Sources and
6176                         Studies in Classical Greece
6177
6178The Perseus Project (5) has just released Perseus 1.0, the first publicly
6179available version of its hypertextual database of multimedia materials on
6180classical Greece.  Perseus is designed to be used by a wide audience,
6181comprised of readers at the student and scholar levels.  As such, it must
6182be able to locate information using different strategies, and it must
6183contain enough detail to serve the different needs of its users.  In
6184addition, it must be delivered so that it is affordable to its target
6185audience.  [These problems and the solutions we chose are described in
6186Mylonas, "An Interface to Classical Greek Civilization," JASIS 43:2,
6187March 1992.]
6188
6189In order to achieve its objective, the project staff decided to make a
6190conscious separation between selecting and converting textual, database,
6191and image data on the one hand, and putting it into a delivery system on
6192the other.  That way, it is possible to create the electronic data
6193without thinking about the restrictions of the delivery system.  We have
6194made a great effort to choose system-independent formats for our data,
6195and to put as much thought and work as possible into structuring it so
6196that the translation from paper to electronic form will enhance the value
6197of the data. [A discussion of these solutions as of two years ago is in
6198Elli Mylonas, Gregory Crane, Kenneth Morrell, and D. Neel Smith, "The
6199Perseus Project:  Data in the Electronic Age," in Accessing Antiquity:
6200The Computerization of Classical Databases, J. Solomon and T. Worthen
6201(eds.),  University of Arizona Press, in press.]
6202
6203Much of the work on Perseus is focused on collecting and converting the
6204data on which the project is based.  At the same time, it is necessary to
6205provide means of access to the information, in order to make it usable,
and then to investigate how it is used.  As we learn more about what
6207students and scholars from different backgrounds do with Perseus, we can
6208adjust our data collection, and also modify the system to accommodate
6209them.  In creating a delivery system for general use, we have tried to
6210avoid favoring any one type of use by allowing multiple forms of access
6211to and navigation through the system.
6212
6213The way text is handled exemplifies some of these principles.  All text
6214in Perseus is tagged using SGML, following the guidelines of the Text
6215Encoding Initiative (TEI).  This markup is used to index the text, and
6216process it so that it can be imported into HyperCard.  No SGML markup
6217remains in the text that reaches the user, because currently it would be
6218too expensive to create a system that acts on SGML in real time.
6219However, the regularity provided by SGML is essential for verifying the
6220content of the texts, and greatly speeds all the processing performed on
6221them.  The fact that the texts exist in SGML ensures that they will be
6222relatively easy to port to different hardware and software, and so will
6223outlast the current delivery platform.  Finally, the SGML markup
6224incorporates existing canonical reference systems (chapter, verse, line,
6225etc.); indexing and navigation are based on these features.  This ensures
6226that the same canonical reference will always resolve to the same point
6227within a text, and that all versions of our texts, regardless of delivery
6228platform (even paper printouts) will function the same way.
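
As an illustration of the canonical-reference idea described above, the
following Python sketch builds an index from a small marked-up fragment and
resolves a reference such as "1.2" (book 1, line 2) to its text.  The element
and attribute names (div, l, n), the reference syntax, and the sample text
are all assumptions made for this example; they are not taken from the
Perseus files or from any particular version of the TEI guidelines.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: tag names, attributes, and text are illustrative only.
SAMPLE = """
<text>
  <div type="book" n="1">
    <l n="1">first line of book one</l>
    <l n="2">second line of book one</l>
  </div>
  <div type="book" n="2">
    <l n="1">first line of book two</l>
  </div>
</text>
"""


def build_index(xml_source):
    """Map canonical references ('book.line') to the text of each line."""
    root = ET.fromstring(xml_source)
    index = {}
    for book in root.iter("div"):
        for line in book.iter("l"):
            ref = f"{book.get('n')}.{line.get('n')}"
            index[ref] = line.text
    return index


def resolve(index, ref):
    """Return the markup-free text that a canonical reference points to."""
    return index[ref]


if __name__ == "__main__":
    idx = build_index(SAMPLE)
    print(resolve(idx, "1.2"))
```

Because the index is derived from the markup rather than from page or screen
position, the same reference resolves to the same passage on any delivery
platform, and the text handed to the reader carries no markup.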
6229
6230In order to provide tools for users, the text is processed by a
6231morphological analyzer, and the results are stored in a database.
6232Together with the index, the Greek-English Lexicon, and the index of all
6233the English words in the definitions of the lexicon, the morphological
6234analyses comprise a set of linguistic tools that allow users of all
6235levels to work with the textual information, and to accomplish different
6236tasks.  For example, students who read no Greek may explore a concept as
6237it appears in Greek texts by using the English-Greek index, and then
looking up words in the texts and translations, or scholars may do
6239detailed morphological studies of word use by using the morphological
6240analyses of the texts.  Because these tools were not designed for any one
6241use, the same tools and the same data can be used by both students and
6242scholars.
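
The English-Greek index mentioned above can be thought of as an inverted
index over the English words in the lexicon's definitions.  The sketch below
is a minimal illustration under that assumption.  The three transliterated
entries and the function name are hypothetical and stand in for the real
lexicon, which is far larger and more richly structured.

```python
from collections import defaultdict

# Tiny illustrative lexicon: transliterated lemma -> English definition.
LEXICON = {
    "logos": "word, speech, reason",
    "mythos": "word, speech, story",
    "polis": "city, state",
}


def english_to_greek(lexicon):
    """Invert the lexicon: each English definition word -> set of lemmas."""
    index = defaultdict(set)
    for lemma, definition in lexicon.items():
        for word in definition.replace(",", " ").split():
            index[word.lower()].add(lemma)
    return index


if __name__ == "__main__":
    index = english_to_greek(LEXICON)
    # A reader with no Greek starts from an English concept.
    print(sorted(index["speech"]))
```

The same inverted structure serves both audiences: a student can start from
an English concept, while a scholar can use the resulting lemmas as the
starting point for a morphological search.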
6243
6244NOTES:
6245     (5)  Perseus is based at Harvard University, with collaborators at
6246     several other universities.  The project has been funded primarily
6247     by the Annenberg/CPB Project, as well as by Harvard University,
6248     Apple Computer, and others.  It is published by Yale University
6249     Press.  Perseus runs on Macintosh computers, under the HyperCard
6250     program.
6251
6252Eric CALALUCA
6253
6254Chadwyck-Healey embarked last year on two distinct yet related full-text
6255humanities database projects.
6256
6257The English Poetry Full-Text Database and the Patrologia Latina Database
6258represent new approaches to linguistic research resources.  The size and
6259complexity of the projects present problems for electronic publishers,
6260but surmountable ones if they remain abreast of the latest possibilities
6261in data capture and retrieval software techniques.
6262
The issues that had to be addressed before the projects began were
legion:
6265
6266     1.   Editorial selection (or exclusion) of materials in each
6267          database
6268
     2.   Should a normative encoding structure be incorporated into
          the databases?
6271               A.  If one is selected, should it be SGML?
6272               B.  If SGML, then the TEI?
6273
6274     3.   Deliver as CD-ROM, magnetic tape, or both?
6275
6276     4.   Can one produce retrieval software advanced enough for the
6277          postdoctoral linguist, yet accessible enough for unattended
6278          general use?  Should one try?
6279
6280     5.   Re fair and liberal networking policies, what are the risks to
6281          an electronic publisher?
6282
6283     6.   How does the emergence of national and international education
6284          networks affect the use and viability of research projects
6285          requiring high investment?  Do the new European Community
6286          directives concerning database protection necessitate two
6287          distinct publishing projects, one for North America and one for
6288          overseas?
6289
6290From new notions of "scholarly fair use" to the future of optical media,
6291virtually every issue related to electronic publishing was aired.  The
result is two projects constructed to provide quality research
resources with the fewest possible encumbrances to their use by teachers
and private scholars.
6295
6296Dorothy TWOHIG
6297
6298In spring 1988 the editors of the papers of George Washington, John
6299Adams, Thomas Jefferson, James Madison, and Benjamin Franklin were
6300approached by classics scholar David Packard on behalf of the Packard
6301Humanities Foundation with a proposal to produce a CD-ROM edition of the
6302complete papers of each of the Founding Fathers.  This electronic edition
6303will supplement the published volumes, making the documents widely
6304available to students and researchers at reasonable cost.  We estimate
6305that our CD-ROM edition of Washington's Papers will be substantially
6306completed within the next two years and ready for publication.  Within
6307the next ten years or so, similar CD-ROM editions of the Franklin, Adams,
6308Jefferson, and Madison papers also will be available.  At the Library of
6309Congress's session on technology, I would like to discuss not only the
6310experience of the Washington Papers in producing the CD-ROM edition, but
6311the impact technology has had on these major editorial projects.
6312Already, we are editing our volumes with an eye to the material that will
6313be readily available in the CD-ROM edition.  The completed electronic
6314edition will provide immense possibilities for the searching of documents
6315for information in a way never possible before.  The kind of technical
6316innovations that are currently available and on the drawing board will
6317soon revolutionize historical research and the production of historical
6318documents.  Unfortunately, much of this new technology is not being used
6319in the planning stages of historical projects, simply because many
6320historians are aware only in the vaguest way of its existence.  At least
6321two major new historical editing projects are considering microfilm
6322editions, simply because they are not aware of the possibilities of
6323electronic alternatives and the advantages of the new technology in terms
6324of flexibility and research potential compared to microfilm.  In fact,
6325too many of us in history and literature are still at the stage of
6326struggling with our PCs.  There are many historical editorial projects in
6327progress presently, and an equal number of literary projects.  While the
6328two fields have somewhat different approaches to textual editing, there
6329are ways in which electronic technology can be of service to both.
6330
6331Since few of the editors involved in the Founding Fathers CD-ROM editions
6332are technical experts in any sense, I hope to point out in my discussion
6333of our experience how many of these electronic innovations can be used
6334successfully by scholars who are novices in the world of new technology.
6335One of the major concerns of the sponsors of the multitude of new
6336scholarly editions is the limited audience reached by the published
6337volumes.  Most of these editions are being published in small quantities
6338and the publishers' price for them puts them out of the reach not only of
6339individual scholars but of most public libraries and all but the largest
6340educational institutions.  However, little attention is being given to
6341ways in which technology can bypass conventional publication to make
6342historical and literary documents more widely available.
6343
6344What attracted us most to the CD-ROM edition of The Papers of George
6345Washington was the fact that David Packard's aim was to make a complete
6346edition of all of the 135,000 documents we have collected available in an
6347inexpensive format that would be placed in public libraries, small
6348colleges, and even high schools.  This would provide an audience far
6349beyond our present 1,000-copy, $45 published edition.  Since the CD-ROM
6350edition will carry none of the explanatory annotation that appears in the
6351published volumes, we also feel that the use of the CD-ROM will lead many
6352researchers to seek out the published volumes.
6353
6354In addition to ignorance of new technical advances, I have found that too
6355many editors--and historians and literary scholars--are resistant and
6356even hostile to suggestions that electronic technology may enhance their
6357work.  I intend to discuss some of the arguments traditionalists are
6358advancing to resist technology, ranging from distrust of the speed with
6359which it changes (we are already wondering what is out there that is
6360better than CD-ROM) to suspicion of the technical language used to
6361describe electronic developments.
6362
6363Maria LEBRON
6364
6365The Online Journal of Current Clinical Trials, a joint venture of the
6366American Association for the Advancement of Science (AAAS) and the Online
6367Computer Library Center, Inc. (OCLC), is the first peer-reviewed journal
6368to provide full text, tabular material, and line illustrations on line.
6369This presentation will discuss the genesis and start-up period of the
6370journal.  Topics of discussion will include historical overview,
6371day-to-day management of the editorial peer review, and manuscript
6372tagging and publication.  A demonstration of the journal and its features
6373will accompany the presentation.
6374
6375Lynne PERSONIUS
6376
6377Cornell University Library, Cornell Information Technologies, and Xerox
6378Corporation, with the support of the Commission on Preservation and
6379Access, and Sun Microsystems, Inc., have been collaborating in a project
6380to test a prototype system for recording brittle books as digital images
6381and producing, on demand, high-quality archival paper replacements.  The
6382project goes beyond that, however, to investigate some of the issues
6383surrounding scanning, storing, retrieving, and providing access to
6384digital images in a network environment.
6385
6386The Joint Study in Digital Preservation began in January 1990.  Xerox
6387provided the College Library Access and Storage System (CLASS) software,
6388a prototype 600-dots-per-inch (dpi) scanner, and the hardware necessary
6389to support network printing on the DocuTech printer housed in Cornell's
6390Computing and Communications Center (CCC).
6391
6392The Cornell staff using the hardware and software became an integral part
6393of the development and testing process for enhancements to the CLASS
6394software system.  The collaborative nature of this relationship is
6395resulting in a system that is specifically tailored to the preservation
6396application.
6397
6398A digital library of 1,000 volumes (or approximately 300,000 images) has
6399been created and is stored on an optical jukebox that resides in CCC.
6400The library includes a collection of select mathematics monographs that
6401provides mathematics faculty with an opportunity to use the electronic
6402library.  The remaining volumes were chosen for the library to test the
6403various capabilities of the scanning system.
6404
6405One project objective is to provide users of the Cornell library and the
6406library staff with the ability to request facsimiles of digitized images
6407or to retrieve the actual electronic image for browsing.  A prototype
6408viewing workstation has been created by Xerox, with input into the design
6409by a committee of Cornell librarians and computer professionals.  This
6410will allow us to experiment with patron access to the images that make up
6411the digital library.  The viewing station provides search, retrieval, and
6412(ultimately) printing functions with enhancements to facilitate
6413navigation through multiple documents.
6414
6415Cornell currently is working to extend access to the digital library to
6416readers using workstations from their offices.  This year is devoted to
the development of a network-resident image conversion and delivery
server, and client software that will support readers who use Apple
Macintosh computers, IBM Windows platforms, and Sun workstations.
6420Equipment for this development was provided by Sun Microsystems with
6421support from the Commission on Preservation and Access.
6422
6423During the show-and-tell session of the Workshop on Electronic Texts, a
6424prototype view station will be demonstrated.  In addition, a display of
6425original library books that have been digitized will be available for
6426review with associated printed copies for comparison.  The fifteen-minute
6427overview of the project will include a slide presentation that
6428constitutes a "tour" of the preservation digitizing process.
6429
6430The final network-connected version of the viewing station will provide
6431library users with another mechanism for accessing the digital library,
6432and will also provide the capability of viewing images directly.  This
6433will not require special software, although a powerful computer with good
6434graphics will be needed.
6435
6436The Joint Study in Digital Preservation has generated a great deal of
6437interest in the library community.  Unfortunately, or perhaps
6438fortunately, this project serves to raise a vast number of other issues
6439surrounding the use of digital technology for the preservation and use of
6440deteriorating library materials, which subsequent projects will need to
6441examine.  Much work remains.
6442
6443SESSION III
6444
6445Howard BESSER                      Networking Multimedia Databases
6446
6447What do we have to consider in building and distributing databases of
6448visual materials in a multi-user environment?  This presentation examines
6449a variety of concerns that need to be addressed before a multimedia
6450database can be set up in a networked environment.
6451
6452In the past it has not been feasible to implement databases of visual
6453materials in shared-user environments because of technological barriers.
6454Each of the two basic models for multi-user multimedia databases has
6455posed its own problem.  The analog multimedia storage model (represented
6456by Project Athena's parallel analog and digital networks) has required an
6457incredibly complex (and expensive) infrastructure.  The economies of
6458scale that make multi-user setups cheaper per user served do not operate
6459in an environment that requires a computer workstation, videodisc player,
6460and two display devices for each user.
6461
6462The digital multimedia storage model has required vast amounts of storage
6463space (as much as one gigabyte per thirty still images).  In the past the
6464cost of such a large amount of storage space made this model a
6465prohibitive choice as well.  But plunging storage costs are finally
6466making this second alternative viable.
6467
6468If storage no longer poses such an impediment, what do we need to
6469consider in building digitally stored multi-user databases of visual
6470materials?  This presentation will examine the networking and
6471telecommunication constraints that must be overcome before such databases
6472can become commonplace and useful to a large number of people.
6473
6474The key problem is the vast size of multimedia documents, and how this
6475affects not only storage but telecommunications transmission time.
6476Anything slower than T-1 speed is impractical for files of 1 megabyte or
6477larger (which is likely to be small for a multimedia document).  For
6478instance, even on a 56 Kb line it would take three minutes to transfer a
64791-megabyte file.  And these figures assume ideal circumstances, and do
6480not take into consideration other users contending for network bandwidth,
6481disk access time, or the time needed for remote display.  Current common
6482telephone transmission rates would be completely impractical; few users
6483would be willing to wait the hour necessary to transmit a single image at
64842400 baud.
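
The transmission times quoted above can be checked with simple arithmetic.
The Python sketch below computes ideal transfer times for a 1-megabyte file.
It assumes a decimal megabyte (1,000,000 bytes), treats 2400 baud as 2,400
bits per second, and takes T-1 as 1.544 Mbps.  It ignores protocol overhead
and contention, so real transfers would be slower.

```python
def transfer_seconds(size_bytes, bits_per_second):
    """Ideal transfer time in seconds, ignoring overhead and contention."""
    return size_bytes * 8 / bits_per_second


# Assumptions: 1 megabyte = 1,000,000 bytes; 2400 baud treated as 2400 bits/s.
ONE_MEGABYTE = 1_000_000
LINE_RATES = {
    "2400 bps modem": 2_400,
    "56 kbps line": 56_000,
    "T-1 (1.544 Mbps)": 1_544_000,
}

if __name__ == "__main__":
    for name, rate in LINE_RATES.items():
        seconds = transfer_seconds(ONE_MEGABYTE, rate)
        print(f"{name}: {seconds:,.0f} s ({seconds / 60:.1f} min)")
```

Under these assumptions the ideal figures come to roughly 56 minutes at 2400
bps, about 2.4 minutes at 56 kbps, and about 5 seconds at T-1, in line with
the rounded figures given in the abstract.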
6485
6486This necessitates compression, which itself raises a number of other
6487issues.  In order to decrease file sizes significantly, we must employ
6488lossy compression algorithms.  But how much quality can we afford to
6489lose?  To date there has been only one significant study done of
6490image-quality needs for a particular user group, and this study did not
6491look at loss resulting from compression.  Only after identifying
6492image-quality needs can we begin to address storage and network bandwidth
6493needs.
6494
6495Experience with X-Windows-based applications (such as Imagequery, the
6496University of California at Berkeley image database) demonstrates the
6497utility of a client-server topology, but also points to the limitation of
6498current software for a distributed environment.  For example,
6499applications like Imagequery can incorporate compression, but current X
6500implementations do not permit decompression at the end user's
6501workstation.  Such decompression at the host computer alleviates storage
6502capacity problems while doing nothing to address problems of
6503telecommunications bandwidth.
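
The point about where decompression happens can be made concrete with a
small sketch.  The Python code below compares the number of bytes that cross
the network when the host decompresses an image before sending it with the
number sent when the workstation does the decompression.  It uses the
standard zlib module, which is lossless, purely as a stand-in for whatever
image compression a real system would use.  The function name and the
synthetic "image" are assumptions for illustration and do not describe
Imagequery or any X implementation.

```python
import zlib


def bytes_on_wire(image, decompress_at):
    """Return how many bytes cross the network for one image request.

    'host':   the server decompresses, then sends the full-size image.
    'client': the server sends compressed data; the workstation decompresses.
    """
    stored = zlib.compress(image)  # what occupies the server's disk
    if decompress_at == "host":
        return len(zlib.decompress(stored))
    if decompress_at == "client":
        return len(stored)
    raise ValueError("decompress_at must be 'host' or 'client'")


if __name__ == "__main__":
    # Stand-in for a 1-megabyte image; all zeros, so it compresses very well.
    image = bytes(1_000_000)
    print("host-side decompression:  ", bytes_on_wire(image, "host"))
    print("client-side decompression:", bytes_on_wire(image, "client"))
```

In both cases the server stores only the compressed form, so storage is
saved either way.  Only client-side decompression also reduces what must
travel over the network.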
6504
We need to examine the effects on network throughput of moving
multimedia documents around on a network.  We need to examine various
topologies that will help us avoid bottlenecks around servers and
gateways.  Experience with applications such as these raises still broader
questions.  How closely is the multimedia document tied to the software
6510for viewing it?  Can it be accessed and viewed from other applications?
6511Experience with the MARC format (and more recently with the Z39.50
6512protocols) shows how useful it can be to store documents in a form in
6513which they can be accessed by a variety of application software.
6514
6515Finally, from an intellectual-access standpoint, we need to address the
6516issue of providing access to these multimedia documents in
6517interdisciplinary environments.  We need to examine terminology and
6518indexing strategies that will allow us to provide access to this material
6519in a cross-disciplinary way.
6520
6521Ronald LARSEN            Directions in High-Performance Networking for
6522                         Libraries
6523
6524The pace at which computing technology has advanced over the past forty
6525years shows no sign of abating.  Roughly speaking, each five-year period
6526has yielded an order-of-magnitude improvement in price and performance of
6527computing equipment.  No fundamental hurdles are likely to prevent this
6528pace from continuing for at least the next decade.  It is only in the
6529past five years, though, that computing has become ubiquitous in
6530libraries, affecting all staff and patrons, directly or indirectly.
6531
6532During these same five years, communications rates on the Internet, the
6533principal academic computing network, have grown from 56 kbps to 1.5
6534Mbps, and the NSFNet backbone is now running 45 Mbps.  Over the next five
6535years, communication rates on the backbone are expected to exceed 1 Gbps.
Both the population of network users and the volume of network traffic
have continued to grow geometrically, at rates approaching 15
percent per month.  This flood of capacity and use, likened by some to
6539"drinking from a firehose,"  creates immense opportunities and challenges
6540for libraries.  Libraries must anticipate the future implications of this
6541technology, participate in its development, and deploy it to ensure
6542access to the world's information resources.
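
To put the growth figures quoted in this abstract on a common footing, the
short Python sketch below converts each to an equivalent annual factor.  It
assumes steady compounding over the whole period, which is a simplification,
and the function names are introduced here only for illustration.

```python
def annual_factor_from_monthly(monthly_rate):
    """Annual growth factor implied by a compounding monthly rate."""
    return (1 + monthly_rate) ** 12


def annual_factor_from_period(total_factor, years):
    """Annual growth factor implied by a total factor over several years."""
    return total_factor ** (1 / years)


if __name__ == "__main__":
    # Network use: about 15 percent per month.
    print(f"{annual_factor_from_monthly(0.15):.2f}x per year")
    # Computing price/performance: tenfold every five years.
    print(f"{annual_factor_from_period(10, 5):.2f}x per year")
```

Under this assumption, 15 percent per month compounds to a factor of about
5.35 per year, while an order of magnitude every five years corresponds to
about 1.58 per year.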
6543
6544The infrastructure for the information age is being put in place.
6545Libraries face strategic decisions about their role in the development,
6546deployment, and use of this infrastructure.  The emerging infrastructure
6547is much more than computers and communication lines.  It is more than the
6548ability to compute at a remote site, send electronic mail to a peer
6549across the country, or move a file from one library to another.  The next
6550five years will witness substantial development of the information
6551infrastructure of the network.
6552
6553In order to provide appropriate leadership, library professionals must
6554have a fundamental understanding of and appreciation for computer
6555networking, from local area networks to the National Research and
6556Education Network (NREN).  This presentation addresses these
6557fundamentals, and how they relate to libraries today and in the near
6558future.
6559
6560Edwin BROWNRIGG               Electronic Library Visions and Realities
6561
6562The electronic library has been a vision desired by many--and rejected by
6563some--since Vannevar Bush coined the term memex to describe an automated,
6564intelligent, personal information system.  Variations on this vision have
included Ted Nelson's Xanadu, Alan Kay's Dynabook, and Lancaster's
"paperless library," with the most recent incarnation being the
"Knowledge Navigator" described by John Sculley of Apple.  But the reality
6568of library service has been less visionary and the leap to the electronic
6569library has eluded universities, publishers, and information technology
firms.
6571
6572The Memex Research Institute (MemRI), an independent, nonprofit research
6573and development organization, has created an Electronic Library Program
6574of shared research and development in order to make the collective vision
6575more concrete.  The program is working toward the creation of large,
indexed, publicly available electronic image collections of published
6577documents in academic, special, and public libraries.  This strategic
6578plan is the result of the first stage of the program, which has been an
6579investigation of the information technologies available to support such
6580an effort, the economic parameters of electronic service compared to
6581traditional library operations, and the business and political factors
6582affecting the shift from print distribution to electronic networked
6583access.
6584
6585The strategic plan envisions a combination of publicly searchable access
6586databases, image (and text) document collections stored on network "file
6587servers," local and remote network access, and an intellectual property
6588management-control system.  This combination of technology and
6589information content is defined in this plan as an E-library or E-library
6590collection.  Some participating sponsors are already developing projects
6591based on MemRI's recommended directions.
6592
6593The E-library strategy projected in this plan is a visionary one that can
6594enable major changes and improvements in academic, public, and special
6595library service.  This vision is, though, one that can be realized with
6596today's technology.  At the same time, it will challenge the political
6597and social structure within which libraries operate:  in academic
6598libraries, the traditional emphasis on local collections, extending to
6599accreditation issues; in public libraries, the potential of electronic
6600branch and central libraries fully available to the public; and for
6601special libraries, new opportunities for shared collections and networks.
6602
6603The environment in which this strategic plan has been developed is, at
6604the moment, dominated by a sense of library limits.  The continued
6605expansion and rapid growth of local academic library collections is now
6606clearly at an end.  Corporate libraries, and even law libraries, are
6607faced with operating within a difficult economic climate, as well as with
6608very active competition from commercial information sources.  For
6609example, public libraries may be seen as a desirable but not critical
6610municipal service in a time when the budgets of safety and health
6611agencies are being cut back.
6612
6613Further, libraries in general have a very high labor-to-cost ratio in
6614their budgets, and labor costs are still increasing, notwithstanding
6615automation investments.  It is difficult for libraries to obtain capital,
6616startup, or seed funding for innovative activities, and those
6617technology-intensive initiatives that offer the potential of decreased
6618labor costs can provoke the opposition of library staff.
6619
6620However, libraries have achieved some considerable successes in the past
6621two decades by improving both their service and their credibility within
6622their organizations--and these positive changes have been accomplished
6623mostly with judicious use of information technologies.  The advances in
6624computing and information technology have been well-chronicled:  the
6625continuing precipitous drop in computing costs, the growth of the
6626Internet and private networks, and the explosive increase in publicly
6627available information databases.
6628
6629For example, OCLC has become one of the largest computer network
6630organizations in the world by creating a cooperative cataloging network
6631of more than 6,000 libraries worldwide.  On-line public access catalogs
6632now serve millions of users on more than 50,000 dedicated terminals in
6633the United States alone.  The University of California MELVYL on-line
6634catalog system has now expanded into an index database reference service
and supports more than six million searches a year.  And libraries have
become the largest group of customers of CD-ROM publishing technology;
U.S. libraries subscribe to more than 30,000 optical media publications,
such as those offered by InfoTrac and SilverPlatter.
6639
6640This march of technology continues and in the next decade will result in
6641further innovations that are extremely difficult to predict.  What is
6642clear is that libraries can now go beyond automation of their order files
6643and catalogs to automation of their collections themselves--and it is
6644possible to circumvent the fiscal limitations that appear to obtain
6645today.
6646
6647This Electronic Library Strategic Plan recommends a paradigm shift in
6648library service, and demonstrates the steps necessary to provide improved
6649library services with limited capacities and operating investments.
6650
6651SESSION IV-A
6652
6653Anne KENNEY
6654
6655The Cornell/Xerox Joint Study in Digital Preservation resulted in the
6656recording of 1,000 brittle books as 600-dpi digital images and the
6657production, on demand, of high-quality and archivally sound paper
6658replacements.  The project, which was supported by the Commission on
6659Preservation and Access, also investigated some of the issues surrounding
6660scanning, storing, retrieving, and providing access to digital images in
6661a network environment.
6662
6663Anne Kenney will focus on some of the issues surrounding direct scanning
as identified in the Cornell/Xerox Project.  Among those to be discussed
6665are:  image versus text capture; indexing and access; image-capture
6666capabilities; a comparison to photocopy and microfilm; production and
6667cost analysis; storage formats, protocols, and standards; and the use of
6668this scanning technology for preservation purposes.
6669
The 600-dpi digital images produced in the Cornell/Xerox Project proved
6671highly acceptable for creating paper replacements of deteriorating
6672originals.  The 1,000 scanned volumes provided an array of image-capture
6673challenges that are common to nineteenth-century printing techniques and
6674embrittled material, and that defy the use of text-conversion processes.
6675These challenges include diminished contrast between text and background,
6676fragile and deteriorated pages, uneven printing, elaborate type faces,
faint and bold text adjacency, handwritten text and annotations, non-Roman
6678languages, and a proliferation of illustrated material embedded in text.
6679The latter category included high-frequency and low-frequency halftones,
6680continuous tone photographs, intricate mathematical drawings, maps,
6681etchings, reverse-polarity drawings, and engravings.
6682
6683The Xerox prototype scanning system provided a number of important
6684features for capturing this diverse material.  Technicians used multiple
6685threshold settings, filters, line art and halftone definitions,
6686autosegmentation, windowing, and software-editing programs to optimize
6687image capture.  At the same time, this project focused on production.
6688The goal was to make scanning as affordable and acceptable as
6689photocopying and microfilming for preservation reformatting.  A
6690time-and-cost study conducted during the last three months of this
6691project confirmed the economic viability of digital scanning, and these
6692findings will be discussed here.
6693
From the outset, the Cornell/Xerox Project was predicated on the use of
6695nonproprietary standards and the use of common protocols when standards
6696did not exist.  Digital files were created as TIFF images which were
6697compressed prior to storage using Group 4 CCITT compression.  The Xerox
software is MS-DOS-based and utilizes off-the-shelf programs such as
6699Microsoft Windows and Wang Image Wizard.  The digital library is designed
6700to be hardware-independent and to provide interchangeability with other
6701institutions through network connections.  Access to the digital files
themselves is two-tiered:  bibliographic records for the computer files
are created in RLIN and Cornell's local system, and access to the actual
digital images comprising a book is provided through a document control
6705structure and a networked image file-server, both of which will be
6706described.
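
To show why compression before storage matters at this resolution, the
Python sketch below estimates the uncompressed size of a 600-dpi bitonal
page image.  One bit per pixel follows from the use of Group 4 compression,
which applies to black-and-white images.  The 6-by-9-inch page size is an
assumption made for the example, and the figure of 300,000 images is the
approximate total reported for the project elsewhere in these abstracts.

```python
def raw_page_bytes(width_in, height_in, dpi, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page image."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


if __name__ == "__main__":
    # Assumption: a 6 x 9 inch page; actual book sizes vary.
    per_page = raw_page_bytes(6, 9, 600)
    total = per_page * 300_000  # approximate image count for the project
    print(f"per page: {per_page:,} bytes")
    print(f"300,000 pages: {total / 1e9:,.0f} GB uncompressed")
```

Under these assumptions each page is about 2.4 megabytes uncompressed, and
300,000 such pages would come to about 729 gigabytes before compression.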
6707
6708The presentation will conclude with a discussion of some of the issues
6709surrounding the use of this technology as a preservation tool (storage,
6710refreshing, backup).
6711
6712Pamela ANDRE and Judith ZIDAR
6713
6714The National Agricultural Library (NAL) has had extensive experience with
6715raster scanning of printed materials.  Since 1987, the Library has
participated in the National Agricultural Text Digitizing Project (NATDP),
a cooperative effort between NAL and forty-five land grant university
6718libraries.  An overview of the project will be presented, giving its
6719history and NAL's strategy for the future.
6720
6721An in-depth discussion of NATDP will follow, including a description of
6722the scanning process, from the gathering of the printed materials to the
6723archiving of the electronic pages.  The type of equipment required for a
6724stand-alone scanning workstation and the importance of file management
6725software will be discussed.  Issues concerning the images themselves will
6726be addressed briefly, such as image format; black and white versus color;
6727gray scale versus dithering; and resolution.
6728
6729Also described will be a study currently in progress by NAL to evaluate
6730the usefulness of converting microfilm to electronic images in order to
6731improve access.  With the cooperation of Tuskegee University, NAL has
6732selected three reels of microfilm from a collection of sixty-seven reels
6733containing the papers, letters, and drawings of George Washington Carver.
6734The three reels were converted into 3,500 electronic images using a
6735specialized microfilm scanner.  The selection, filming, and indexing of
6736this material will be discussed.
6737
6738Donald WATERS
6739
6740Project Open Book, the Yale University Library's effort to convert 10,000
6741books from microfilm to digital imagery, is currently in an advanced
6742state of planning and organization.  The Yale Library has selected a
6743major vendor to serve as a partner in the project and as systems
6744integrator.  In its proposal, the successful vendor helped isolate areas
6745of risk and uncertainty as well as key issues to be addressed during the
6746life of the project.  The Yale Library is now poised to decide what
6747material it will convert to digital image form and to seek funding,
6748initially for the first phase and then for the entire project.
6749
6750The proposal that Yale accepted for the implementation of Project Open
6751Book will provide at the end of three phases a conversion subsystem,
6752browsing stations distributed on the campus network within the Yale
6753Library, a subsystem for storing 10,000 books at 200 and 600 dots per
6754inch, and network access to the image printers.  Pricing for the system
6755implementation assumes the existence of Yale's campus Ethernet network
6756and its high-speed image printers, and includes other requisite hardware
6757and software, as well as system integration services.  Proposed operating
6758costs include hardware and software maintenance, but do not include
6759estimates for the facilities management of the storage devices and image
6760servers.
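
As a rough illustration of what storing 10,000 books at 200 and 600 dots
per inch implies, the following Python sketch estimates the storage needed
for bitonal page images.  The page size (6 x 9 inches), the 300 pages per
book, and the 15:1 compression ratio are assumptions chosen only for
illustration; none of these figures comes from the Yale proposal.

```python
def bitonal_page_bytes(width_in, height_in, dpi):
    """Uncompressed size of a 1-bit-per-pixel page image, in bytes."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels // 8  # eight bitonal pixels per byte


def library_gigabytes(books, pages_per_book, dpi,
                      width_in=6.0, height_in=9.0, compression_ratio=15.0):
    """Estimated compressed storage for a collection, in gigabytes (10**9 bytes).

    The default page size and compression ratio are illustrative assumptions.
    """
    raw = books * pages_per_book * bitonal_page_bytes(width_in, height_in, dpi)
    return raw / compression_ratio / 1e9


for dpi in (200, 600):
    print(f"{dpi} dpi: {library_gigabytes(10_000, 300, dpi):.1f} GB")
```

Because the pixel count grows with the square of the resolution, the 600
dpi images need nine times the raw storage of the 200 dpi images, whatever
page size is assumed.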
6761
6762Yale selected its vendor partner in a formal process, partly funded by
6763the Commission for Preservation and Access.  Following a request for
6764proposal, the Yale Library selected two vendors as finalists to work with
6765Yale staff to generate a detailed analysis of requirements for Project
6766Open Book.  Each vendor used the results of the requirements analysis to
6767generate and submit a formal proposal for the entire project.  This
6768competitive process not only enabled the Yale Library to select its
6769primary vendor partner but also revealed much about the state of the
6770imaging industry, about the varying corporate commitments to the markets
6771for imaging technology, and about the varying organizational dynamics
6772through which major companies are responding to and seeking to develop
6773these markets.
6774
6775Project Open Book is focused specifically on the conversion of images
6776from microfilm to digital form.  The technology for scanning microfilm is
6777readily available but is changing rapidly.  In its project requirements,
6778the Yale Library emphasized features of the technology that affect the
6779technical quality of digital image production and the costs of creating
6780and storing the image library:  What levels of digital resolution can be
6781achieved by scanning microfilm?  How does variation in the quality of
6782microfilm, particularly in film produced to preservation standards,
6783affect the quality of the digital images?  What technologies can an
6784operator effectively and economically apply when scanning film to
6785separate two-up images and to control for and correct image
6786imperfections?  How can quality control best be integrated into a
6787digitizing work flow that includes document indexing and storage?
6788
6789The actual and expected uses of digital images--storage, browsing,
6790printing, and OCR--help determine the standards for measuring their
6791quality.  Browsing is especially important, but the facilities available
6792for readers to browse image documents are perhaps the weakest aspect of
6793imaging technology and most in need of development.  As it defined its
6794requirements, the Yale Library concentrated on some fundamental aspects
6795of usability for image documents:  Does the system have sufficient
6796flexibility to handle the full range of document types, including
6797monographs, multi-part and multivolume sets, and serials, as well as
6798manuscript collections?  What conventions are necessary to identify a
6799document uniquely for storage and retrieval?  Where is the database of
6800record for storing bibliographic information about the image document?
6801How are basic internal structures of documents, such as pagination, made
6802accessible to the reader?  How are the image documents physically
6803presented on the screen to the reader?
6804
6805The Yale Library designed Project Open Book on the assumption that
6806microfilm is more than adequate as a medium for preserving the content of
6807deteriorated library materials.  As planning in the project has advanced,
6808it is increasingly clear that the challenge of digital image technology
6809and the key to the success of efforts like Project Open Book is to
6810provide a means of both preserving and improving access to those
6811deteriorated materials.
6812
6813SESSION IV-B
6814
6815George THOMA
6816
6817In the use of electronic imaging for document preservation, there are
6818several issues to consider, such as:  ensuring adequate image quality,
6819maintaining substantial conversion rates (throughput), providing unique
6820identification for automated access and retrieval, and accommodating
6821bound volumes and fragile material.
6822
6823To maintain high image quality, image processing functions are required
6824to correct the deficiencies in the scanned image.  Some commercially
6825available systems include these functions, while some do not.  The
6826scanned raw image must be processed to correct contrast deficiencies--
6827both poor overall contrast resulting from light print and/or dark
6828background, and variable contrast resulting from stains and
6829bleed-through.  Furthermore, the scan density must be adequate to allow
6830legibility of print and sufficient fidelity in the pseudo-halftoned gray
6831material.  Borders or page-edge effects must be removed for both
6832compactibility and aesthetics.  Page skew must be corrected for aesthetic
6833reasons and to enable accurate character recognition if desired.
6834Compound images consisting of both two-toned text and gray-scale
6835illustrations must be processed appropriately to retain the quality of
6836each.
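
The correction of poor overall contrast can be sketched in a few lines of
Python.  This is a minimal illustration, not the processing used in any
system described here:  it assumes a gray-scale scan held as a list of rows
of 0-255 values, applies a simple linear contrast stretch, and then applies
a fixed threshold of 128 (an arbitrary choice).  Variable contrast from
stains or bleed-through would need a locally adaptive method instead.

```python
def stretch_contrast(gray):
    """Linearly rescale pixel values so the darkest becomes 0 and the lightest 255."""
    lo = min(min(row) for row in gray)
    hi = max(max(row) for row in gray)
    if hi == lo:  # flat image: nothing to stretch
        return [row[:] for row in gray]
    scale = 255 / (hi - lo)
    return [[round((p - lo) * scale) for p in row] for row in gray]


def to_bitonal(gray, threshold=128):
    """Map each pixel to 1 (ink) if darker than the threshold, else 0 (paper)."""
    return [[1 if p < threshold else 0 for p in row] for row in gray]


# Hypothetical 3 x 4 scan: light print (value 150) on a dark background (value 200).
scan = [
    [200, 150, 150, 200],
    [200, 150, 200, 200],
    [200, 200, 200, 200],
]
print(to_bitonal(scan))                    # no ink found: every pixel is above 128
print(to_bitonal(stretch_contrast(scan)))  # print separated from background
```

Without the stretch, the fixed threshold loses the light print entirely;
after the stretch, the same threshold separates print from background.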
6837
6838SESSION IV-C
6839
6840Jean BARONAS
6841
6842Standards publications being developed by scientists, engineers, and
6843business managers in Association for Information and Image Management
6844(AIIM) standards committees can be applied to electronic image management
6845(EIM) processes including:  document (image) transfer, retrieval and
6846evaluation; optical disk and document scanning; and document design and
6847conversion.  When combined with EIM system planning and operations,
6848standards can assist in generating image databases that are
6849interchangeable among a variety of systems.  The applications of
6850different approaches for image-tagging, indexing, compression, and
6851transfer often cause uncertainty concerning EIM system compatibility,
6852calibration, performance, and upward compatibility, until standard
6853implementation parameters are established.  The AIIM standards that are
6854being developed for these applications can be used to decrease the
6855uncertainty, successfully integrate imaging processes, and promote "open
6856systems."  AIIM is an accredited American National Standards Institute
6857(ANSI) standards developer with more than twenty committees composed of
6858300 volunteers representing users, vendors, and manufacturers.  The
6859standards publications that are developed in these committees have
6860national acceptance and provide the basis for international harmonization
6861in the development of new International Organization for Standardization
6862(ISO) standards.
6863
6864This presentation describes the development of AIIM's EIM standards and a
6865new effort at AIIM, a database on standards projects in a wide framework
6866of imaging industries including capture, recording, processing,
6867duplication, distribution, display, evaluation, and preservation.  The
6868AIIM Imagery Database will cover imaging standards being developed by
6869many organizations in many different countries.  It will contain
6870standards publications' dates, origins, related national and
6871international projects, status, key words, and abstracts.  The ANSI Image
6872Technology Standards Board requested that such a database be established,
6873as did the ISO/International Electrotechnical Commission Joint Task Force
6874on Imagery.  AIIM will take on the leadership role for the database and
6875coordinate its development with several standards developers.
6876
6877Patricia BATTIN
6878
6879     Characteristics of standards for digital imagery:
6880
6881          * Nature of digital technology implies continuing volatility.
6882
6883          * Precipitous standard-setting not possible and probably not
6884          desirable.
6885
6886          * Standards are a complex issue involving the medium, the
6887          hardware, the software, and the technical capacity for
6888          reproductive fidelity and clarity.
6889
6890          * The prognosis for reliable archival standards (as defined by
6891          librarians) in the foreseeable future is poor.
6892
6893     Significant potential and attractiveness of digital technology as a
6894     preservation medium and access mechanism.
6895
6896     Productive use of digital imagery for preservation requires a
6897     reconceptualizing of preservation principles in a volatile,
6898     standardless world.
6899
6900     Concept of managing continuing access in the digital environment
6901     rather than focusing on the permanence of the medium and long-term
6902     archival standards developed for the analog world.
6903
6904     Transition period:  How long and what to do?
6905
6906          *  Redefine "archival."
6907
6908          *  Remove the burden of "archival copy" from paper artifacts.
6909
6910          *  Use digital technology for storage; develop management
6911          strategies for refreshing the medium, hardware, and software.
6912
6913          *  Create acid-free paper copies for transition period backup
6914          until we develop reliable procedures for ensuring continuing
6915          access to digital files.
6916
6917SESSION IV-D
6918
6919Stuart WEIBEL            The Role of SGML Markup in the CORE Project (6)
6920
6921The emergence of high-speed telecommunications networks as a basic
6922feature of the scholarly workplace is driving the demand for electronic
6923document delivery.  Three distinct categories of electronic
6924publishing/republishing are necessary to support access demands in this
6925emerging environment:
6926
6927     1.)  Conversion of paper or microfilm archives to electronic format
6928     2.)  Conversion of electronic files to formats tailored to
6929          electronic retrieval and display
6930     3.)  Primary electronic publishing (materials for which the
6931          electronic version is the primary format)
6932
6933OCLC has experimental or product development activities in each of these
6934areas.  Among the challenges that lie ahead is the integration of these
6935three types of information stores in coherent distributed systems.
6936
6937The CORE (Chemistry Online Retrieval Experiment) Project is a model for
6938the conversion of large text and graphics collections for which
6939electronic typesetting files are available (category 2).  The American
6940Chemical Society has made available computer typography files dating from
69411980 for its twenty journals.  This collection of some 250 journal-years
6942is being converted to an electronic format that will be accessible
6943through several end-user applications.
6944
6945The use of Standard Generalized Markup Language (SGML) offers the means
6946to capture the structural richness of the original articles in a way that
6947will support a variety of retrieval, navigation, and display options
6948necessary to navigate effectively in very large text databases.
6949
6950An SGML document consists of text that is marked up with descriptive tags
6951that specify the function of a given element within the document.  As a
6952formal language construct, an SGML document can be parsed against a
6953document-type definition (DTD) that unambiguously defines what elements
6954are allowed and where in the document they can (or must) occur.  This
6955formalized map of article structure allows the user interface design to
6956be uncoupled from the underlying database system, an important step
6957toward interoperability.  Demonstration of this separability is a part of
6958the CORE project, wherein user interface designs born of very different
6959philosophies will access the same database.
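
The idea of checking a document against a definition of allowed elements
can be sketched in Python.  This is a toy under stated assumptions, not an
SGML parser and not the CORE DTD:  the element names are hypothetical, a
document is given as nested (name, children) tuples rather than tagged
text, and only simple ordered content models are supported (no
alternatives, attributes, or tag minimization).

```python
# Toy content models: for each element, the children allowed, in order, with an
# occurrence indicator ("1" exactly once, "?" optional, "*" any number, "+" one
# or more).  All element names are hypothetical; "#text" stands for character data.
CONTENT_MODEL = {
    "article": [("title", "1"), ("author", "+"), ("abstract", "?"), ("section", "+")],
    "section": [("title", "1"), ("para", "*")],
    "title": [("#text", "1")],
    "author": [("#text", "1")],
    "abstract": [("para", "+")],
    "para": [("#text", "1")],
}


def validate(element):
    """Return a list of error strings for an element given as (name, children).

    Character data is represented as a plain string among the children.
    """
    name, children = element
    if name not in CONTENT_MODEL:
        return [f"undeclared element <{name}>"]
    errors = []
    child_names = ["#text" if isinstance(c, str) else c[0] for c in children]
    pos = 0
    for allowed, occurs in CONTENT_MODEL[name]:
        count = 0
        while pos < len(child_names) and child_names[pos] == allowed:
            count += 1
            pos += 1
        if count == 0 and occurs in ("1", "+"):
            errors.append(f"<{name}>: missing required <{allowed}>")
        if count > 1 and occurs in ("1", "?"):
            errors.append(f"<{name}>: <{allowed}> may occur only once")
    if pos < len(child_names):
        errors.append(f"<{name}>: unexpected <{child_names[pos]}>")
    for child in children:
        if not isinstance(child, str):
            errors.extend(validate(child))
    return errors


# Made-up sample documents for illustration.
good = ("article", [
    ("title", ["Ring Strain"]),
    ("author", ["A. Chemist"]),
    ("section", [("title", ["Results"]), ("para", ["..."])]),
])
bad = ("article", [("author", ["A. Chemist"]), ("title", ["Ring Strain"])])

print(validate(good))  # an empty list: the structure conforms
print(validate(bad))   # errors for the misplaced title and the missing section
```

Because the rules live in a table separate from the checking code, any
interface that understands the table can rely on the same guarantees about
document structure, which is the separability the abstract describes.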
6960
6961NOTES:
6962     (6)  The CORE project is a collaboration among Cornell University's
6963     Mann Library, Bell Communications Research (Bellcore), the American
6964     Chemical Society (ACS), the Chemical Abstracts Service (CAS), and
6965     OCLC.
6966
6967Michael LESK                  The CORE Electronic Chemistry Library
6968
6969A major on-line file of chemical journal literature complete with
6970graphics is being developed to test the usability of fully electronic
6971access to documents, as a joint project of Cornell University, the
6972American Chemical Society, the Chemical Abstracts Service, OCLC, and
6973Bellcore (with additional support from Sun Microsystems, Springer-Verlag,
6974Digital Equipment Corporation, Sony Corporation of America, and Apple
6975Computers).  Our file contains the American Chemical Society's on-line
6976journals, supplemented with the graphics from the paper publication and the
6977indexing of the articles from Chemical Abstracts.  Documents are available
6978in both image and text format, and several different interfaces can be
6979used.  Our goals are (1) to assess the effectiveness and acceptability of
6980electronic access to primary journals as compared with paper, and (2) to
6981identify the most desirable functions of the user interface to an
6982electronic system of journals, including in particular a comparison of
6983page-image display with ASCII display interfaces.  Early experiments with
6984chemistry students on a variety of tasks suggest that searching tasks are
6985completed much faster with any electronic system than with paper, but
6986that for reading, all versions of the articles are roughly equivalent.
6987
6988Pamela ANDRE and Judith ZIDAR
6989
6990Text conversion is far more expensive and time-consuming than image
6991capture alone.  NAL's experience with optical character recognition (OCR)
6992will be related and compared with the experience of having text rekeyed.
6993What factors affect OCR accuracy?  How accurate does full text have to be
6994in order to be useful?  How do different users react to imperfect text?
6995These are questions that will be explored.  For many, a service bureau
6996may be a better solution than performing the work in-house; this will also
6997be discussed.
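
One common way to put a number on OCR accuracy is character accuracy:  one
minus the edit distance between the OCR output and a trusted reference
transcription, divided by the length of the reference.  The Python sketch
below illustrates that measure; it is offered as an assumption about how
such accuracy might be computed, not as a description of NAL's method, and
the sample strings are made up.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def character_accuracy(reference, ocr_output):
    """Fraction of reference characters recovered, floored at zero."""
    if not reference:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(reference, ocr_output)
    return max(0.0, 1 - errors / len(reference))


reference = "National Agricultural Library"
ocr_output = "Nationa1 Agricultura1 Library"  # letter l misread as digit 1, twice
print(f"{character_accuracy(reference, ocr_output):.3f}")
```

Even 99 percent character accuracy leaves about 20 errors on a
2,000-character page, which is why the question of how accurate text must
be in order to be useful deserves attention.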
6998
6999SESSION VI
7000
7001Marybeth PETERS
7002
7003Copyright law protects creative works.  Protection granted by the law to
7004authors and disseminators of works includes the right to do or authorize
7005the following:  reproduce the work, prepare derivative works, distribute
7006the work to the public, and publicly perform or display the work.  In
7007addition, copyright owners of sound recordings and computer programs have
7008the right to control rental of their works.  These rights are not
7009unlimited; there are a number of exceptions and limitations.
7010
7011An electronic environment places strains on the copyright system.
7012Copyright owners want to control uses of their work and be paid for any
7013use; the public wants quick and easy access at little or no cost.  The
7014marketplace is working in this area.  Contracts, guidelines on electronic
7015use, and collective licensing are in use and being refined.
7016
7017Issues concerning the ability to change works without detection are more
7018difficult to deal with.  Questions concerning the integrity of the work
7019and the status of the changed version under the copyright law are to be
7020addressed.  These are public policy issues which require informed
7021dialogue.
7022
7023
7024               ***   ***   ***   ******   ***   ***   ***
7025
7026
7027                Appendix III:  DIRECTORY OF PARTICIPANTS
7028
7029
7030PRESENTERS:
7031
7032     Pamela Q.J. Andre
7033     Associate Director, Automation
7034     National Agricultural Library
7035     10301 Baltimore Boulevard
7036     Beltsville, MD 20705-2351
7037     Phone:  (301) 504-6813
7038     Fax:  (301) 504-7473
7039     E-mail:  INTERNET:  PANDRE@ASRR.ARSUSDA.GOV
7040
7041     Jean Baronas, Senior Manager
7042     Department of Standards and Technology
7043     Association for Information and Image Management (AIIM)
7044     1100 Wayne Avenue, Suite 1100
7045     Silver Spring, MD 20910
7046     Phone:  (301) 587-8202
7047     Fax:  (301) 587-2711
7048
7049     Patricia Battin, President
7050     The Commission on Preservation and Access
7051     1400 16th Street, N.W.
7052     Suite 740
7053     Washington, DC 20036-2217
7054     Phone:  (202) 939-3400
7055     Fax:  (202) 939-3407
7056     E-mail:  CPA@GWUVM.BITNET
7057
7058     Howard Besser
7059     Centre Canadien d'Architecture
7060     (Canadian Center for Architecture)
7061     1920, rue Baile
7062     Montreal, Quebec H3H 2S6
7063     CANADA
7064     Phone:  (514) 939-7001
7065     Fax:  (514) 939-7020
7066     E-mail:  howard@lis.pitt.edu
7067
7068     Edwin B. Brownrigg, Executive Director
7069     Memex Research Institute
7070     422 Bonita Avenue
7071     Roseville, CA 95678
7072     Phone:  (916) 784-2298
7073     Fax:  (916) 786-7559
7074     E-mail:  BITNET:  MEMEX@CALSTATE.2
7075
7076     Eric M. Calaluca, Vice President
7077     Chadwyck-Healey, Inc.
7078     1101 King Street
7079     Alexandria, VA 22314
7080     Phone:  (800) 752-0515
7081     Fax:  (703) 683-7589
7082
7083     James Daly
7084     4015 Deepwood Road
7085     Baltimore, MD 21218-1404
7086     Phone:  (410) 235-0763
7087
7088     Ricky Erway, Associate Coordinator
7089     American Memory
7090     Library of Congress
7091     Phone:  (202) 707-6233
7092     Fax:  (202) 707-3764
7093
7094     Carl Fleischhauer, Coordinator
7095     American Memory
7096     Library of Congress
7097     Phone:  (202) 707-6233
7098     Fax:  (202) 707-3764
7099
7100     Joanne Freeman
7101     2000 Jefferson Park Avenue, No. 7
7102     Charlottesville, VA  22903
7103
7104     Prosser Gifford
7105     Director for Scholarly Programs
7106     Library of Congress
7107     Phone:  (202) 707-1517
7108     Fax:  (202) 707-9898
7109     E-mail:  pgif@seq1.loc.gov
7110
7111     Jacqueline Hess, Director
7112     National Demonstration Laboratory
7113       for Interactive Information Technologies
7114     Library of Congress
7115     Phone:  (202) 707-4157
7116     Fax:  (202) 707-2829
7117
7118     Susan Hockey, Director
7119     Center for Electronic Texts in the Humanities (CETH)
7120     Alexander Library
7121     Rutgers University
7122     169 College Avenue
7123     New Brunswick, NJ 08903
7124     Phone:  (908) 932-1384
7125     Fax:  (908) 932-1386
7126     E-mail:  hockey@zodiac.rutgers.edu
7127
7128     William L. Hooton, Vice President
7129     Business & Technical Development
7130       Imaging & Information Systems Group
7131     I-NET
7132     6430 Rockledge Drive, Suite 400
7133     Bethesda, MD 20817
7134     Phone:  (301) 564-6750
7135     Fax:  (513) 564-6867
7136
7137     Anne R. Kenney, Associate Director
7138     Department of Preservation and Conservation
7139     701 Olin Library
7140     Cornell University
7141     Ithaca, NY 14853
7142     Phone:  (607) 255-6875
7143     Fax:  (607) 255-9346
7144     E-mail:  LYDY@CORNELLA.BITNET
7145
7146     Ronald L. Larsen
7147     Associate Director for Information Technology
7148     University of Maryland at College Park
7149     Room B0224, McKeldin Library
7150     College Park, MD 20742-7011
7151     Phone:  (301) 405-9194
7152     Fax:  (301) 314-9865
7153     E-mail:  rlarsen@libr.umd.edu
7154
7155     Maria L. Lebron, Managing Editor
7156     The Online Journal of Current Clinical Trials
7157     1333 H Street, N.W.
7158     Washington, DC 20005
7159     Phone:  (202) 326-6735
7160     Fax:  (202) 842-2868
7161     E-mail:  PUBSAAAS@GWUVM.BITNET
7162
7163     Michael Lesk, Executive Director
7164     Computer Science Research
7165     Bell Communications Research, Inc.
7166     Rm 2A-385
7167     445 South Street
7168     Morristown, NJ 07960-1910
7169     Phone:  (201) 829-4070
7170     Fax:  (201) 829-5981
7171     E-mail:  lesk@bellcore.com (Internet) or bellcore!lesk (uucp)
7172
7173     Clifford A. Lynch
7174     Director, Library Automation
7175     University of California,
7176        Office of the President
7177     300 Lakeside Drive, 8th Floor
7178     Oakland, CA 94612-3350
7179     Phone:  (510) 987-0522
7180     Fax:  (510) 839-3573
7181     E-mail:  calur@uccmvsa
7182
7183     Avra Michelson
7184     National Archives and Records Administration
7185     NSZ Rm. 14N
7186     7th & Pennsylvania, N.W.
7187     Washington, D.C. 20408
7188     Phone:  (202) 501-5544
7189     Fax:  (202) 501-5533
7190     E-mail:  tmi@cu.nih.gov
7191
7192     Elli Mylonas, Managing Editor
7193     Perseus Project
7194     Department of the Classics
7195     Harvard University
7196     319 Boylston Hall
7197     Cambridge, MA 02138
7198     Phone:  (617) 495-9025, (617) 495-0456 (direct)
7199     Fax:  (617) 496-8886
7200     E-mail:  Elli@IKAROS.Harvard.EDU or elli@wjh12.harvard.edu
7201
7202     David Woodley Packard
7203     Packard Humanities Institute
7204     300 Second Street, Suite 201
7205     Los Altos, CA 94002
7206     Phone:  (415) 948-0150 (PHI)
7207     Fax:  (415) 948-5793
7208
7209     Lynne K. Personius, Assistant Director
7210     Cornell Information Technologies for
7211      Scholarly Information Sources
7212     502 Olin Library
7213     Cornell University
7214     Ithaca, NY 14853
7215     Phone:  (607) 255-3393
7216     Fax:  (607) 255-9346
7217     E-mail:  JRN@CORNELLC.BITNET
7218
7219     Marybeth Peters
7220     Policy Planning Adviser to the
7221       Register of Copyrights
7222     Library of Congress
7223     Office LM 403
7224     Phone:  (202) 707-8350
7225     Fax:  (202) 707-8366
7226
7227     C. Michael Sperberg-McQueen
7228     Editor, Text Encoding Initiative
7229     Computer Center (M/C 135)
7230     University of Illinois at Chicago
7231     Box 6998
7232     Chicago, IL 60680
7233     Phone:  (312) 413-0317
7234     Fax:  (312) 996-6834
7235     E-mail:  u35395@uicvm.cc.uic.edu or u35395@uicvm.bitnet
7236
7237     George R. Thoma, Chief
7238     Communications Engineering Branch
7239     National Library of Medicine
7240     8600 Rockville Pike
7241     Bethesda, MD 20894
7242     Phone:  (301) 496-4496
7243     Fax:  (301) 402-0341
7244     E-mail:  thoma@lhc.nlm.nih.gov
7245
7246     Dorothy Twohig, Editor
7247     The Papers of George Washington
7248     504 Alderman Library
7249     University of Virginia
7250     Charlottesville, VA 22903-2498
7251     Phone:  (804) 924-0523
7252     Fax:  (804) 924-4337
7253
7254     Susan H. Veccia, Team leader
7255     American Memory, User Evaluation
7256     Library of Congress
7257     American Memory Evaluation Project
7258     Phone:  (202) 707-9104
7259     Fax:  (202) 707-3764
7260     E-mail:  svec@seq1.loc.gov
7261
7262     Donald J. Waters, Head
7263     Systems Office
7264     Yale University Library
7265     New Haven, CT 06520
7266     Phone:  (203) 432-4889
7267     Fax:  (203) 432-7231
7268     E-mail:  DWATERS@YALEVM.BITNET or DWATERS@YALEVM.YCC.YALE.EDU
7269
7270     Stuart Weibel, Senior Research Scientist
7271     OCLC
7272     6565 Frantz Road
7273     Dublin, OH 43017
7274     Phone:  (614) 764-6081
7275     Fax:  (614) 764-2344
7276     E-mail:  INTERNET:  Stu@rsch.oclc.org
7277
7278     Robert G. Zich
7279     Special Assistant to the Associate Librarian
7280       for Special Projects
7281     Library of Congress
7282     Phone:  (202) 707-6233
7283     Fax:  (202) 707-3764
7284     E-mail:  rzic@seq1.loc.gov
7285
7286     Judith A. Zidar, Coordinator
7287     National Agricultural Text Digitizing Program
7288     Information Systems Division
7289     National Agricultural Library
7290     10301 Baltimore Boulevard
7291     Beltsville, MD 20705-2351
7292     Phone:  (301) 504-6813 or 504-5853
7293     Fax:  (301) 504-7473
7294     E-mail:  INTERNET:  JZIDAR@ASRR.ARSUSDA.GOV
7295
7296
7297OBSERVERS:
7298
7299     Helen Aguera, Program Officer
7300     Division of Research
7301     Room 318
7302     National Endowment for the Humanities
7303     1100 Pennsylvania Avenue, N.W.
7304     Washington, D.C. 20506
7305     Phone:  (202) 786-0358
7306     Fax:  (202) 786-0243
7307
7308     M. Ellyn Blanton, Deputy Director
7309     National Demonstration Laboratory
7310       for Interactive Information Technologies
7311     Library of Congress
7312     Phone:  (202) 707-4157
7313     Fax:  (202) 707-2829
7314
7315     Charles M. Dollar
7316     National Archives and Records Administration
7317     NSZ Rm. 14N
7318     7th & Pennsylvania, N.W.
7319     Washington, DC 20408
7320     Phone:  (202) 501-5532
7321     Fax:  (202) 501-5512
7322
7323     Jeffrey Field, Deputy to the Director
7324     Division of Preservation and Access
7325     Room 802
7326     National Endowment for the Humanities
7327     1100 Pennsylvania Avenue, N.W.
7328     Washington, DC 20506
7329     Phone:  (202) 786-0570
7330     Fax:  (202) 786-0243
7331
7332     Lorrin Garson
7333     American Chemical Society
7334     Research and Development Department
7335     1155 16th Street, N.W.
7336     Washington, D.C. 20036
7337     Phone:  (202) 872-4541
7338     E-mail:  INTERNET:  LRG96@ACS.ORG
7339
7340     William M. Holmes, Jr.
7341     National Archives and Records Administration
7342     NSZ Rm. 14N
7343     7th & Pennsylvania, N.W.
7344     Washington, DC 20408
7345     Phone:  (202) 501-5540
7346     Fax:  (202) 501-5512
7347     E-mail:  WHOLMES@AMERICAN.EDU
7348
7349     Sperling Martin
7350     Information Resource Management
7351     20030 Doolittle Street
7352     Gaithersburg, MD 20879
7353     Phone:  (301) 924-1803
7354
7355     Michael Neuman, Director
7356     The Center for Text and Technology
7357     Academic Computing Center
7358     238 Reiss Science Building
7359     Georgetown University
7360     Washington, DC 20057
7361     Phone:  (202) 687-6096
7362     Fax:  (202) 687-6003
7363     E-mail:  neuman@guvax.bitnet, neuman@guvax.georgetown.edu
7364
7365     Barbara Paulson, Program Officer
7366     Division of Preservation and Access
7367     Room 802
7368     National Endowment for the Humanities
7369     1100 Pennsylvania Avenue, N.W.
7370     Washington, DC 20506
7371     Phone:  (202) 786-0577
7372     Fax:  (202) 786-0243
7373
7374     Allen H. Renear
7375     Senior Academic Planning Analyst
7376     Brown University Computing and Information Services
7377     115 Waterman Street
7378     Campus Box 1885
7379     Providence, R.I. 02912
7380     Phone:  (401) 863-7312
7381     Fax:  (401) 863-7329
7382     E-mail:  BITNET:  Allen@BROWNVM or
7383     INTERNET:  Allen@brownvm.brown.edu
7384
7385     Susan M. Severtson, President
7386     Chadwyck-Healey, Inc.
7387     1101 King Street
7388     Alexandria, VA 22314
7389     Phone:  (800) 752-0515
7390     Fax:  (703) 683-7589
7391
7392     Frank Withrow
7393     U.S. Department of Education
7394     555 New Jersey Avenue, N.W.
7395     Washington, DC 20208-5644
7396     Phone:  (202) 219-2200
7397     Fax:  (202) 219-2106
7398
7399
7400(LC STAFF)
7401
7402     Linda L. Arret
7403     Machine-Readable Collections Reading Room LJ 132
7404     (202) 707-1490
7405
7406     John D. Byrum, Jr.
7407     Descriptive Cataloging Division LM 540
7408     (202) 707-5194
7409
7410     Mary Jane Cavallo
7411     Science and Technology Division LA 5210
7412     (202) 707-1219
7413
7414     Susan Thea David
7415     Congressional Research Service LM 226
7416     (202) 707-7169
7417
7418     Robert Dierker
7419     Senior Adviser for Multimedia Activities LM 608
7420     (202) 707-6151
7421
7422     William W. Ellis
7423     Associate Librarian for Science and Technology LM 611
7424     (202) 707-6928
7425
7426     Ronald Gephart
7427     Manuscript Division LM 102
7428     (202) 707-5097
7429
7430     James Graber
7431     Information Technology Services LM G51
7432     (202) 707-9628
7433
7434     Rich Greenfield
7435     American Memory LM 603
7436     (202) 707-6233
7437
7438     Rebecca Guenther
7439     Network Development LM 639
7440     (202) 707-5092
7441
7442     Kenneth E. Harris
7443     Preservation LM G21
7444     (202) 707-5213
7445
7446     Staley Hitchcock
7447     Manuscript Division LM 102
7448     (202) 707-5383
7449
7450     Bohdan Kantor
7451     Office of Special Projects LM 612
7452     (202) 707-0180
7453
7454     John W. Kimball, Jr.
7455     Machine-Readable Collections Reading Room LJ 132
7456     (202) 707-6560
7457
7458     Basil Manns
7459     Information Technology Services LM G51
7460     (202) 707-8345
7461
7462     Sally Hart McCallum
7463     Network Development LM 639
7464     (202) 707-6237
7465
7466     Dana J. Pratt
7467     Publishing Office LM 602
7468     (202) 707-6027
7469
7470     Jane Riefenhauser
7471     American Memory LM 603
7472     (202) 707-6233
7473
7474     William Z. Schenck
7475     Collections Development LM 650
7476     (202) 707-7706
7477
7478     Chandru J. Shahani
7479     Preservation Research and Testing Office (R&T) LM G38
7480     (202) 707-5607
7481
7482     William J. Sittig
7483     Collections Development LM 650
7484     (202) 707-7050
7485
7486     Paul Smith
7487     Manuscript Division LM 102
7488     (202) 707-5097
7489
7490     James L. Stevens
7491     Information Technology Services LM G51
7492     (202) 707-9688
7493
7494     Karen Stuart
7495     Manuscript Division LM 130
7496     (202) 707-5389
7497
7498     Tamara Swora
7499     Preservation Microfilming Office LM G05
7500     (202) 707-6293
7501
7502     Sarah Thomas
7503     Collections Cataloging LM 642
7504     (202) 707-5333
7505
7506
7507                                   END
7508      *************************************************************
7509
7510Note:  This file has been edited for use on computer networks.  This
7511editing required the removal of diacritics, underlining, and fonts such
7512as italics and bold.
7513
7514kde 11/92
7515
7516[A few of the italics (when used for emphasis) were replaced by CAPS mh]
7517
7518*End of The Project Gutenberg Etext of LOC WORKSHOP ON ELECTRONIC TEXTS
7519
7520