This is a guest post from Andrea Thomer and Rob Guralnick from Notes From Nature. See more about the authors below. Find more citizen science projects about words.
Notes From Nature (official site) is a citizen science transcription project that launched in April 2013 and is part of the Zooniverse stable of projects. The goal behind Notes from Nature is to make a dent in a major endeavor – digitizing the information contained in biocollections scattered over the nation and world. More information about the project can be found here. Conservative estimates place the number of biology and paleontology museum specimens at over a billion in the United States alone. The task of digitizing these records cannot be completed without help, and Notes from Nature is one place where we can ask for that help. Because we know some specimen labels are hard to read, we don’t simply ask for one transcription per record. Instead, each record is seen by at least four pairs of eyes. This has its own challenges; how do we take those 4+ transcriptions and get a reconciled or ‘canonical’ version?
In a previous post on our blog “So You Think You Can Digitize,” we went through the mechanics of how to find consensus from a set of independently created transcriptions by citizen scientists — this involved a mash-up of bioinformatics tools for sequence alignment (repurposed for use with text strings) and natural language processing tools to find tokens and perform some word synonymizing. In the end, the informatics blender did indeed churn out a consensus — but this attempt at automation led us to realize that there’s more than one kind of consensus. In this post we want to to explore that issue a bit more.
So, lets return to our example text:
Some volunteers spelled out abbreviations (changing “SE” to “Southeast”) or corrected errors on the original label (changing “Biv” to “River”); but others did their best to transcribe each label verbatim – typos and all.
These differences in transcription style led us to ask — when we build “consensus,” what kind do we want? Do we want a verbatim transcription of each label (thus preserving a more accurate, historical record)? Or do we want to take advantage of our volunteers’ clever human brains, and preserve the far more legible, more georeferenceable strings that they (and the text clean-up algorithms described in our last post) were able to produce? Which string is more ‘canonical’?
Others have asked these questions before us — in fact, after doing a bit of research (read: googling and reading wikipedia), we realized we were essentially reinventing the wheel that is textual criticism, “the branch ofliterary criticism that is concerned with the identification and removal of transcription errors in the texts of manuscripts” (thanks, wikipedia!). Remember, before there were printing presses there were scribes: individuals tasked with transcribing sometimes messy, sometimes error-ridden texts by hand — sometimes introducing new errors in the process. Scholars studying these older, hand-duplicated texts often must resolve discrepancies across different copies of a manuscripts (or “witnesses”) in order to create either:
- a “critical edition” of the text, one which “most closely approximates the original”, or
- a “copy-text” edition, which “the critic examines the base text and makes corrections (called emendations) in places where the base text appears wrong” (thanks again, wikipedia).
Granted, the distinction between a “critical edition” and a “copy-text edition” may be a little unwieldy when applied to something like a specimen label as opposed to a manuscript. And while existing biodiversity data standards developers have recognized the issue — Darwin Core, for example, has “verbatim” and “interpreted” fields (e.g. dwc:verbatimLatitude) — those existing terms don’t necessarily capture the complexity of multiple interpretations, done multiple times, by multiple people and algorithms and then a further interpretation to compute some final “copy text”. Citizen science approaches place us right between existing standards-oriented thinking in biodiversity informatics and edition-oriented thinking in the humanities. This middle spot is a challenging but fascinating one — and another confirmation of the clear, and increasing, interdisciplinarity of fields like biodiversity informatics and the digital humanities.
In prior posts, we’ve talked about finding links between the sciences and humanities — what better example of cross-discipline-pollination than this? Before, we mentioned we’re not the first to meditate on the meaning of “consensus” — we’re also not the first to repurpose tools originally designed for phylogenetic analysis for use with general text; linguists and others in the field of phylomemetics (h/t to Nic Weber for the linked paper) have been doing the same for years. While the sciences and humanities may still have very different research questions and epistemologies, our informatics tools have much in common. Being aware of, if not making use of, one another’s conceptual frameworks may be a first step to sharing informatics tools, and building towards new, interesting collaborations.
Finally, back to our question about what we mean by “consensus”: we can now see that our volunteers and algorithms are currently better suited to creating “copy-text” editions, or interpreted versions of the specimen labels — which makes sense, given the many levels of human and machine interpretation that each label goes through. Changes to the NfN transcription workflow would need to be made if museums want a “critical edition,” or verbatim version of each label as well. Whether this is necessary is up for debate, however — would the preserved image, on which transcriptions were based be enough for museum curators’ and collection managers’ purposes? Could that be our most “canonical” representation of the label, to which we link later interpretations? More (interdisciplinary) work and discussion is clearly necessary — but we hope this first attempt to link a few disparate fields and methods will help open the door for future exchange of ideas and methods.
Image: Notes From Nature
References and links of potential interest:
If you’re interested in learning more about DH tools relevant to this kind of work, check out Juxta, an open source software package designed to support collation and comparison of different “witnesses” (or texts).
Howe, C. J., & Windram, H. F. (2011). Phylomemetics–evolutionary analysis beyond the gene. PLoS biology, 9(5), e1001069. doi:10.1371/journal.pbio.1001069
Andrea Thomer is a Ph.D. student in Library and Information Science at the University of Illinois at Urbana-Champaign, and is supported by the Center for Informatics Research in Science and Scholarship. Her research interests include text mining; scholarly communication; data curation; biodiversity, phylogenetic and natural history museum informatics; and mining and making available undiscovered public knowledge. She is particularly interested in information extraction from natural history field notes and texts, and improving methods of digitizing and publishing data about the world’s 3–4 billion museum specimen records so they can be used to better model evolutionary and ecological processes.