Texts and Topics

Working Group 3

Abstract

 

WG 3 will complement the strategies for assembling and exploring data on correspondence and correspondents pursued in WGs 1–2 with study of digital means of engaging with the texts of letters themselves. This requires agreeing standards for the presentation of texts both as images (of manuscripts or printed books) and in digital form (I). It also requires developing tools to aid transcription, annotation, and collaborative editing, and for transforming digital editions into print (II). For exploring vast quantities of highly fragmentary textual material, text mining and topic modelling must be deployed (III). Since tools useful to scholars can only be developed by studying scholarly working methods, the Action’s Training Schools will be adapted to the purpose of such study, alongside the induction of a new generation into emerging tools and techniques (IV). WG 3 is also the place for broader discussion of how the tools developed in all the WGs can be assembled to create an integrated and user-friendly ‘Virtual Research Environment’ (V).

 

WG 3 is led by Charles van den Heuvel, Professor of Digital Method in Historical Disciplines, University of Amsterdam; Head of the Research Group ‘History of Science and Scholarship’ at the Huygens Institute for the History of the Netherlands in The Hague; and Senior Researcher in the Virtual Knowledge Studio for the Humanities and Social Sciences at the Royal Netherlands Academy for Arts and Sciences (KNAW). He also plays a central role in the pioneering DH project ‘Circulation of Knowledge and Learned Practices in the 17th-century Dutch Republic’, which is responsible for the most advanced experiment with the application of IT to topic modelling of large and multilingual corpora of correspondence: the ePistolarium.


Agenda

I. Fundamentals: Images of documents, text coding


  1. The point of entry of many letters into a digital resource is as images of manuscripts and printed books. Shared infrastructure requires standard means not only of publishing such images but also of handling them. Early modern letters normally consisted of one or more sheets of paper, folded to provide their own addressed envelope, sometimes tied with ribbon and often sealed with wax. Due to the cost of paper and postage, writing was often densely packed into every available area, including vertically in margins and even at right angles across previously written text. In allowing users to manipulate images of folded and double-sided paper easily, Shared Canvas promises to provide a valuable tool worthy of close study.
  2. Another basic precondition of shared texts and tools is uniformity in the presentation and encoding of digital text.

II. Transcription, Annotation, and Editing


  1. For generating digital text from printed text, Optical Character Recognition is of steadily increasing utility. WG 3 will seek to pool knowledge on current capabilities and future refinements of OCR. It will also explore the possibility of using crowd-sourced corrections to inform intelligent computing solutions that enhance OCR performance on major early modern typography. A status report on plans to adapt such techniques to manuscript transcription is also needed.
  2. For manuscript text, a variety of transcription tools have been developed. WG 3 will collect, study, compare, and report on the best options for future use and further development. Means for dealing with multiple transcription standards also need to be devised.
  3. Given the extremely allusive nature of communication between frequent correspondents, annotation is needed to make learned letters readily intelligible. Annotation tools suitable for individual and collaborative research therefore need to be assembled and assessed as the basis for further development.
  4. Annotated transcriptions evolve into critical editions. An assessment of the existing state and future prospects of editorial platforms is also needed.
  5. Although searchable and changeable digital text has many advantages, the fixity, stability, and permanence of print offer countervailing attractions. A study should also be made of packages designed to ease the transition from digital to print media, including both pedagogical applications and print on demand.

III. Textual analytics: data mining and topic modelling

  1. One of the most valuable features of letters to the scholar is their tendency to contain information on a huge diversity of topics, much of it unavailable elsewhere. The corresponding difficulty is that letters can lurch unpredictably from one topic to another, making the identification of relevant material difficult. One means of rendering large quantities of letters more readily navigable is via text mining, which applies a variety of analytical strategies — including lexical analysis, pattern recognition, association analysis, and visualization — to extract information from unstructured texts.
  2. Even more closely adapted to scholarly needs is topic modelling. In theory, this technology should allow researchers to identify material on specific topics and general areas, even in the largest and most fragmentary collections of correspondence, based on the frequency with which certain terminology pertaining to the topic sought occurs. In practice, corpora in several languages, each with its varying early modern orthography, pose significant challenges. In addressing these difficulties, WG 3 will build on the work already undertaken in the ePistolarium.
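
The frequency-based idea described above, together with the orthography problem, can be illustrated with a minimal Python sketch. It is not the ePistolarium's method and not a full topic model: it is a much simpler keyword-frequency heuristic that ranks letters by the share of their words belonging to a hand-picked topic vocabulary. The spelling rules (u/v, i/j, long s), the function names, and the two sample letters are all assumptions invented for illustration.

```python
import re
from collections import Counter

# Hypothetical normalisation rules (an assumption for illustration only):
# collapse the long s and the u/v and i/j pairs into a single form each.
VARIANT_MAP = str.maketrans({"ſ": "s", "v": "u", "j": "i"})


def normalise(word: str) -> str:
    """Lower-case a word and collapse a few orthographic variants."""
    return word.lower().translate(VARIANT_MAP)


def tokenise(text: str) -> list[str]:
    """Split text into normalised alphabetic tokens."""
    return [normalise(w) for w in re.findall(r"[^\W\d_]+", text)]


def topic_score(text: str, topic_terms: set[str]) -> float:
    """Return the share of tokens in `text` that belong to `topic_terms`."""
    tokens = tokenise(text)
    if not tokens:
        return 0.0
    # Normalise the topic vocabulary the same way as the letter text.
    wanted = {normalise(term) for term in topic_terms}
    counts = Counter(tokens)
    return sum(counts[term] for term in wanted) / len(tokens)


def rank_letters(letters: dict[str, str], topic_terms: set[str]) -> list[tuple[str, float]]:
    """Rank letter identifiers by topic score, highest first."""
    scored = [(letter_id, topic_score(body, topic_terms))
              for letter_id, body in letters.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Invented sample data, not taken from any real corpus.
SAMPLE_LETTERS = {
    "letter-a": "I haue obserued the comet with my new teleſcope.",
    "letter-b": "The price of paper in Amsterdam rises again.",
}
ASTRONOMY_TERMS = {"comet", "telescope", "observed"}

if __name__ == "__main__":
    for letter_id, score in rank_letters(SAMPLE_LETTERS, ASTRONOMY_TERMS):
        print(letter_id, round(score, 3))
```

Because the topic terms are normalised in the same way as the letters, the modern spelling "observed" still matches the variant "obserued". A real multilingual corpus would need far richer, language-specific normalisation than these three character rules, which is the kind of difficulty the paragraph above describes.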

IV. Training Schools


  1. Reassembling the republic of letters requires broad-based collaboration. Broad-based collaboration requires the training of a network of scholars in the use of emerging tools, techniques, and methods. Preparing for future developments requires that this training be directed above all to the emerging generation of scholars in early career. One of the central objectives of WG 3 is therefore to coordinate the COST-funded training schools devoted to this purpose.
  2. Developing tools which scholars need and will actually use, on the other hand, requires intensive communication between users and system developers. With this necessity in mind, Training Schools in this Action will also be devoted to assessing scholars’ interaction with relevant digital tools. The first Training School (March 2015) was designed to monitor and assess the needs and experiences of scholars learning to use recently created tools. The second Training School (in 2016/17) will also attempt to assess the future needs of scholarly users with regard to new tools being developed within and outside the community represented in the Action.

Download the first Training School Programme here.

V. Toward a Virtual Research Environment

Ultimately, all the tools envisaged and developed throughout all the WGs for dealing with the temporal, spatial, prosopographical, social, textual, topical, and physical aspects of correspondence need to be brought together to create a Virtual Research Environment, that is, an integrated online interface designed to help a distributed community of researchers collaborate in assembling and exploring this vast and fragmentary literary heritage, in publishing the fruits of their work in a variety of formats, and in projecting the results into the classroom and into the broader public domain. Work on planning such a VRE, one of the culminating objectives of the Action as a whole, will also be coordinated by WG 3.