Building the Theological Commons at Princeton Theological Seminary
In early 2008 Princeton Theological Seminary (PTS) entered into an agreement with Microsoft to digitize public domain (pre-1923) print materials. Project funding would come from Microsoft and materials from PTS’s print collections, and digitization would be performed by the Internet Archive in a scanning center located in PTS’s library. The goal was to digitize thousands of volumes on theology and religion for inclusion in Microsoft’s Live Search Books service, which was Microsoft’s answer to Google Books. This project aligned perfectly with PTS’s vision of extending worldwide access to its historical collections as a means of contributing to the shape of a global digital library in which theological disciplines would be represented. Other institutions joining in Microsoft’s ambitious program included the British Library, Columbia University, Cornell University, the New York Public Library, the University of California, the University of Toronto, and Yale University. After only a few months, Microsoft abruptly ended the program, leaving PTS and the other participants to consider how or whether to proceed without outside financial support. Recognizing the importance of the broader effort and remaining committed to its vision for a globally accessible digital library that includes theology and related fields, Princeton Theological Seminary decided to move its digital efforts forward by retaining its relationship with the Internet Archive, a nonprofit organization dedicated to building a free, open, and well-preserved digital library. Thus began a fruitful partnership that continues to the present time. The Internet Archive continues to operate a regional scanning center housed in PTS’s library. Several other institutions that had joined the Microsoft initiative took a similar approach and have continued to partner with the Internet Archive to digitize large quantities of materials from their public domain holdings.
Through ongoing funding from Princeton Theological Seminary, the library routinely submits public domain print materials for digitization through the Internet Archive. No digitization is performed by PTS staff or on PTS equipment; instead, all digitization is performed in the onsite scanning center staffed, equipped, and operated independently by the Internet Archive. Specifically, PTS librarians select content for digitization in alignment with the library’s collection development policy, gather the selected physical volumes, and deliver both the volumes and their accompanying catalog records to the Internet Archive scanning center. After the nondestructive scanning process, library staff re-shelve the physical volumes. When the Internet Archive completes its processing, which includes rigorous quality assurance procedures, each item enters the vast Internet Archive online library and becomes fully discoverable and searchable through its website. Each volume can be read online using the Internet Archive’s BookReader, which provides a familiar reading experience using full-color images of each page of the volume, from cover to cover. Through this partnership with the Internet Archive, Princeton Theological Seminary has thus far digitized over 30,000 volumes (predominantly books, along with historical periodicals, copyright-cleared theses and dissertations, and a small number of manuscripts), totaling 25,000,000 pages of text. Because the Internet Archive has been so successful in carrying out its mission of building and preserving an online digital library, the sheer size of the library is as daunting as it is impressive. For the student of theology and related fields, navigating these seas of data to find relevant material can be challenging, because they include every conceivable subject matter, both academic and popular. In late 2010, Dr.
Iain Torrance, then the president of Princeton Theological Seminary, asked a subset of library staff members, known informally within the library as ‘the digital team,’ to consider how to maximize discoverability and access to the thousands of volumes PTS had digitized, with a focus on the needs of students, scholars, pastors, church leaders, interested laypersons, and other researchers, locally and worldwide, who would benefit from content particular to their domains of knowledge and practice. Starting from this seed, the digital library team began building an information system with these goals in mind. The first step toward realizing this vision would be content selection. In phase one of this endeavor, we harvested the metadata and full text of every item in the Internet Archive that had originated from Princeton Theological Seminary Library, and we imported the data into our own database. In phase two, we took a detailed list of Library of Congress subject headings provided by our Collection Development Librarian and performed searches in the Internet Archive system for digital books with those subjects, irrespective of their library of origin; we then imported those items into our database in the same manner. This procedure soon amassed 50,000 digital texts. Even this targeted subset, however, is large enough to require tools for finding works of interest to an individual researcher. First and foremost is the ability to search the digital library for specific words. More than any other factor, this is the principal, revolutionary advantage of the digital representation of text as opposed to its physical embodiment, whether stone tablet, scroll, or printed book. This advantage is greatest if the digital book incorporates not only its metadata (its bibliographic description, as in a library catalog) but also its entire textual content.
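To make concrete why a full-text transcription matters for word searching, the following is a minimal sketch of an inverted index, the general data structure behind keyword search. It is not a description of the Theological Commons’ actual implementation, which the text does not specify; the function names, item identifiers, and sample texts are all hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(texts):
    """Map each word to the set of item identifiers whose text contains it."""
    index = defaultdict(set)
    for item_id, text in texts.items():
        for word in tokenize(text):
            index[word].add(item_id)
    return index


def search(index, query):
    """Return the identifiers of items that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    # Start from the first word's matches, then intersect with the rest.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Hypothetical identifiers and transcription snippets, for illustration only.
texts = {
    "item-a": "A history of missions in Korea",
    "item-b": "Hymns and hymnology of the Reformed church",
    "item-c": "Missions and the church in Latin America",
}
index = build_index(texts)
print(sorted(search(index, "missions church")))
```

Because the index is built only from the texts loaded into it, a search touches only the targeted subset rather than the whole of a much larger library.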
Knowing this, the Internet Archive not only produces a digital photograph of each page of a given volume but also runs OCR (optical character recognition) software on the page images to produce a digital text transcription of the book’s intellectual content. This step is indispensable, since a digital image of a page is no more searchable by keyword than ink on paper in a manuscript or book. Because the Theological Commons is a repurposing of Internet Archive data, it inherits this advantage; however, by downloading the textual transcriptions into our own database, we can provide very efficient searching across only our targeted subset of digital texts. Another important instrument for finding materials of interest is the use of facets, by which search results can be honed and refined according to predefined categories, such as date of publication, format of the physical item, language, and subject. The subject facet in the Theological Commons is noteworthy in that it represents a balance between complexity and simplicity. The terms for this facet are derived not from the seemingly infinite array of Library of Congress subject headings, but rather from the Library of Congress classification scheme. In this approach, each digital resource in the Theological Commons is assigned to a broad taxonomy of knowledge containing about 40 entries, a size that is neither overly complex nor overly simple and is therefore practical for browsing. With these core features, 50,000 digital texts, and the name Theological Commons, this resource was publicly released in March 2012. That milestone provided an opportunity to enhance and refine the system in ways that would further fulfill the original vision. Princeton Theological Seminary purchased new hardware, accompanied by software upgrades; the content expanded to more than 75,000 digital texts; and the digital library team undertook a series of functionality improvements to increase the usability of the Theological Commons in various ways.
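The idea of deriving a broad subject facet from the Library of Congress classification can be sketched as follows. This is only an illustration under stated assumptions: the mapping table is a small hypothetical sample, not the roughly 40-entry taxonomy actually used in the Theological Commons, and the facet labels and the `broad_subject` function are invented for the example.

```python
import re

# Hypothetical sample mapping from LC class letters to broad facet labels.
# The real taxonomy has about 40 entries and may use different labels.
BROAD_SUBJECTS = {
    "BL": "Religions",
    "BM": "Judaism",
    "BR": "Christianity",
    "BS": "Bible",
    "BT": "Doctrinal theology",
    "BV": "Practical theology",
    "BX": "Christian denominations",
    "D": "History",
    "P": "Language and literature",
    "Z": "Bibliography",
}


def broad_subject(call_number):
    """Assign a broad subject label based on the leading LC class letters."""
    match = re.match(r"\s*([A-Za-z]{1,3})", call_number)
    if not match:
        return "Unclassified"
    letters = match.group(1).upper()
    # Try the most specific prefix first, then fall back to shorter ones.
    for length in range(len(letters), 0, -1):
        label = BROAD_SUBJECTS.get(letters[:length])
        if label:
            return label
    return "Unclassified"


for number in ["BX9211 .P7", "PA6105 .C5", "Z7751 .B5"]:
    print(number, "->", broad_subject(number))
```

The design choice illustrated here is that a short, fixed table keyed on class letters yields a small, predictable set of facet values, whereas free-form subject headings would produce an open-ended list.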
If refining search results using facets is good, giving a researcher more flexibility in using those facets would be even better. One important enhancement to the original system allows the user to select multiple values from each facet. For example, rather than having to choose one option (either missions or bibliography as the subject, Latin or Italian as the language), the user can select all of these, or any other combination of classifications. Such combinations can even include a custom date range for the date of publication, such as 1700 to 1799, to see only those items published in the 18th century. This feature greatly multiplies the paths by which a researcher can navigate this digital library, carving out multiple, overlapping slices of the database for searching or browsing. Another enhancement is the integration of multi-volume sets. In the Internet Archive system, each volume of a multi-volume set is a discrete unit, without linkages to the other volumes comprising the set. For the Theological Commons, library staff have exerted considerable effort to reconstitute multi-volume sets, whether monographic or periodical. Each volume in the Theological Commons ‘knows’ the parent set to which it belongs, allowing the user to call up the full set on demand. The Theological Commons also allows PTS librarians to define collections within the broader database, providing yet another way for researchers to divide and recombine the data to suit their research needs. At present, two specialized collections have been defined: the Benson Collection of Hymnals and Hymnology and the T. F. Torrance Collection of Antiquarian Books.
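The multi-value facet selection described above can be sketched in a few lines. The text does not state how selections are combined, so this sketch assumes a common convention: values chosen within one facet are alternatives (OR), while different facets and the date range must all be satisfied (AND). The records, field names, and functions are hypothetical.

```python
def matches(item, selections, year_range=None):
    """Check one item: OR within each facet, AND across facets and dates."""
    for facet, chosen in selections.items():
        # An empty selection for a facet places no restriction on it.
        if chosen and item.get(facet) not in chosen:
            return False
    if year_range is not None:
        start, end = year_range
        year = item.get("year")
        if year is None or not (start <= year <= end):
            return False
    return True


def filter_items(items, selections, year_range=None):
    """Return the identifiers of all items that satisfy the selections."""
    return [item["id"] for item in items
            if matches(item, selections, year_range)]


# Hypothetical records, for illustration only.
items = [
    {"id": "vol-1", "subject": "Missions", "language": "Latin", "year": 1742},
    {"id": "vol-2", "subject": "Bibliography", "language": "Italian", "year": 1788},
    {"id": "vol-3", "subject": "Missions", "language": "English", "year": 1755},
    {"id": "vol-4", "subject": "Bibliography", "language": "Latin", "year": 1850},
]
selections = {
    "subject": {"Missions", "Bibliography"},
    "language": {"Latin", "Italian"},
}
print(filter_items(items, selections, year_range=(1700, 1799)))
```

Under these assumptions, dropping the date range widens the result to include the 1850 volume, which shows how each added restriction carves a narrower slice out of the same set of records.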
When viewing a given collection, the user can perform searches and utilize facets as usual, but the search results are restricted to works from that collection. The system allows any number of collections to be defined, and we are currently developing additional collections. In these ways, the Theological Commons represents a blending of the aims and methods of mass digitization, in the manner of the Internet Archive or Google Books, with the informed curation and accurate resource description that are the hallmarks of librarianship. The goal is to walk the tightrope between quantity and quality to achieve a balance that maximizes the reach and usefulness of these digital resources for research and ministry. The Theological Commons aspires to live up to its name as a ‘commons’ (a central location or shared resource available to an entire community), in part by incorporating content from outside PTS’s own library collections. As described above, we have imported into the Theological Commons tens of thousands of digital books from the collections of other libraries. Currently, just over one third of the content in the Theological Commons originated from and was digitized by PTS; the remaining two thirds originated from other libraries and archives that also digitize their materials through the Internet Archive. We have deliberately set up multiple places in the Theological Commons where the contributor of the digital resource is acknowledged: the ‘Frequently Asked Questions’ page; the ‘Browse by Contributor’ feature, which lists all contributing institutions alphabetically; and the main page for any given item in the system, which indicates the contributor name. In addition, we plan to utilize the ‘collections’ feature to organize and highlight material contributed by other institutions, possibly including digital collections developed collaboratively.
For example, PTS Library is currently working with the Presbyterian Historical Society in Philadelphia to identify subject areas whereby we could combine materials from both institutions into a coherent collection. Although still in the planning stage, one promising idea is to build a digital collection of materials related to Korean missions, combining parts of the Moffett collection at PTS with the wide array of related materials in the archives of the Presbyterian Historical Society. A similar model could be applied to develop collections with other organizations. Another means by which materials outside PTS’s collections can enter the Theological Commons, and thereby become accessible worldwide, is for institutions or individuals to take their own digitized or born-digital content, upload it to the Internet Archive, and notify the library at PTS, so that we may incorporate it into the Theological Commons. For example, if a researcher has a set of oral history interviews recorded using a smartphone or other portable digital device, because the recordings are already in digital form, the researcher can directly upload them to the Internet Archive. PTS library staff could then import them into the Theological Commons, most likely as a distinct collection. In mid-2013, the Henry Luce Foundation awarded Princeton Theological Seminary a $1.5 million grant for the expansion of the Theological Commons in two important directions: (1) To move beyond text: Digitizing and providing access to audio/visual materials — photographs, audio recordings, video recordings, and possibly 3-D material objects — will allow resources in advanced digital media to become an integral part of the study of theology and related fields. (2) To move beyond the West: Digitizing and providing access to theological material published in or focused on Africa, Asia and Latin America will provide resources of relevance to diverse communities of faith around the globe. 
For the endeavor to move beyond text, our first project is to digitize, curate, and make accessible PTS’s archive of audio recordings. Since the early 1950s, PTS has maintained a consistent practice of recording the many public lecture series and institutes held at the Seminary. Many of the speakers are accomplished scholars, renowned preachers, or other prominent figures, and such recordings carry considerable scholarly and pastoral value. An estimated 5,500 of these recordings exist only on aging, obsolete reel-to-reel tapes, many of which are deteriorating and all of which will become unplayable in the foreseeable future if they are not reconditioned and transferred to other media. Thanks to the Luce Foundation’s funding, PTS has contracted with an external vendor specializing in the restoration and digitization of archival audio. This effort is currently well underway. Meanwhile, PTS library staff are actively coordinating and normalizing the metadata that identifies and contextualizes each recording, including title, author, date, and the series or event from which the recording originated. Over time, these digital audio recordings will be incorporated into the Theological Commons for public access. A future phase of this project will take a similar approach to PTS’s archive of video recordings. In addition, PTS Library is currently experimenting with methods for producing textual transcriptions of these recordings. If such transcriptions can be achieved at a reasonable level of effort and expense, they will greatly enhance the research value of these recordings, because not only the identifying metadata but also the intellectual content would be searchable. Manual transcription is exceedingly expensive and with an archive of this size we could not consider ourselves good stewards of the Luce Foundation’s funds, however generous, if we were to take this path. 
Software-generated transcriptions are affordable, but the state of the art is not yet sufficient to produce the accuracy this project requires, especially for the specialized fields of knowledge that are the domain of any seminary curriculum. For these reasons, PTS’s digital library team is currently working with a Silicon Valley startup company that is developing custom speech-to-text software. PTS has provided machine-readable texts from a variety of digitized books, and the company is using those texts to train the software in recognizing the vocabulary of theology, Biblical studies, and related fields. Though this work is still in the research and development stage, we are hopeful that the resulting speech-to-text software will serve a purpose analogous to OCR software for printed materials. Simultaneously, as a way to foray into visual content, PTS Library is currently working with the Internet Archive to digitize a large collection of postcards depicting church architecture throughout the United States. Collected over a lifetime and recently donated to PTS by Dr. James R. Tanis, these postcards will become the first visual image collection in the Theological Commons. In future, we fully expect to develop additional image collections. Equally important to the move into multiple media is the effort to shift the content of the Theological Commons beyond the West alone and toward a place of relevance to the majority world. This dimension of the Luce Foundation grant arose from PTS’s current Strategic Plan, adopted by the Board of Trustees in September 2012, which aims to reorient the Seminary as an institution in a global context and which emphasizes the importance of engagement with the world church. The expansion of the Theological Commons with content from the majority world, in its own languages, is one of the main ways in which PTS’s library is responding to this new orientation.
In the first phase of this effort, Princeton Theological Seminary Library is beginning by working with materials in its own collections. Because PTS’s library holds a rich collection of Latin American periodicals, an obvious first step is to undertake digitization of as many of these materials as possible. The principal challenge here is that much of the content is recent enough to remain under copyright. For this reason, PTS’s library is actively seeking legal permissions from publishers in Latin America to allow us to digitize these materials and make them freely available online. Such permission does not require a transfer of copyright to PTS, only a written statement of permission to make them available globally in digital form. Meanwhile, digitization of public domain materials from PTS’s Latin American collection has already begun and is ongoing. This complex and multifaceted effort to shift the Theological Commons, however gradually, away from its current center of gravity — which is so heavily weighted by resources in English from North America and Western Europe — will not be easy and it will not be rapid. This undertaking is still evolving and remains in its first phase of development. Looking toward the future, however, through the Luce Foundation grant, the ultimate goal is to digitize and provide access to copyright-cleared materials in any language that are published in, focused on, or otherwise relevant to theological study or pastoral ministry in Africa, Asia, and Latin America.