The potential for collaboration between data science and other disciplines to develop new research methods is increasingly being recognised. This is particularly evident in the development of cultural analytics in the field of Digital Humanities, where available datasets and other digital resources for humanities research have expanded rapidly in the last decade. Since 2011 there has been an active collaboration between the School of Computer Science and the School of English, Drama and Film at University College Dublin. Using advanced data science techniques, we have approached literary sources with new questions and come up with some surprising findings.
Initially, our work focused on developing network analysis methods to represent the associations between characters in 19th and early 20th century Irish and British fiction. More recently, as part of the IRC-funded Contagion project, our focus has shifted to analysing historical trends at a larger scale. Through a collaboration with British Library Labs, we have access to a much larger digital corpus from the British Library, covering 35,918 English-language fiction and non-fiction books dating from 1700 to 1899. This is equivalent to over 12 million individual pages of printed text. For this project, we particularly wanted to explore historical understandings of disease, contagion, public health and migration, in order to better understand contemporary attitudes.
A project like this, where a huge text corpus is available, presents a wealth of possibilities for new research in the humanities. The scale and diversity of such collections, however, present fresh challenges in identifying and extracting relevant content, particularly for humanities scholars who are interested in studying highly specific themes. For instance, researchers working with the British Library corpus have previously attempted to curate smaller sub-corpora related to specific topics or interests. This has often been a painstaking task requiring considerable manual effort to inspect the corpus.
To address these challenges, we have developed the Curatr platform, a web-based user interface designed to make the British Library corpus more accessible and useful to a wider group of researchers. The platform indexes all of this text and the associated metadata, allowing the corpus to be browsed, searched, and filtered by author, title, and year. The interface also incorporates a digitised version of the topical classification index of volumes used by the British Library from 1823 to 1985, which allows the texts to be further filtered by categories such as “fiction”, “drama”, and “geography”. The system supports a corpus curation workflow that addresses the requirements of scholars in the humanities, who are increasingly working with large collections of unstructured text. Once a smaller curated sub-corpus of texts has been identified, the associated texts and metadata can be easily exported to other platforms for further research and for more traditional “close reading”.
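To make the curation workflow concrete, the following is a minimal sketch of filtering volume metadata and exporting the resulting sub-corpus. It is not Curatr's actual implementation: the `Volume` record, the `filter_volumes` and `export_csv` functions, and the placeholder records are all hypothetical names and values invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Volume:
    """Hypothetical metadata record for one digitised volume."""
    title: str
    author: str
    year: int
    category: str


def filter_volumes(
    volumes: Iterable[Volume],
    author: Optional[str] = None,
    title_contains: Optional[str] = None,
    year_from: Optional[int] = None,
    year_to: Optional[int] = None,
    category: Optional[str] = None,
) -> List[Volume]:
    """Return the volumes that match every filter that was supplied."""
    selected = []
    for v in volumes:
        if author is not None and v.author.lower() != author.lower():
            continue
        if title_contains is not None and title_contains.lower() not in v.title.lower():
            continue
        if year_from is not None and v.year < year_from:
            continue
        if year_to is not None and v.year > year_to:
            continue
        if category is not None and v.category.lower() != category.lower():
            continue
        selected.append(v)
    return selected


def export_csv(volumes: Iterable[Volume]) -> str:
    """Serialise a curated sub-corpus to CSV text for use in other tools."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["title", "author", "year", "category"])
    for v in volumes:
        writer.writerow([v.title, v.author, v.year, v.category])
    return buffer.getvalue()


# Placeholder records invented for this sketch; not real catalogue entries.
SAMPLE_VOLUMES = [
    Volume("Placeholder Novel A", "Author One", 1848, "fiction"),
    Volume("Placeholder Play B", "Author Two", 1795, "drama"),
    Volume("Placeholder Survey C", "Author One", 1872, "geography"),
]

if __name__ == "__main__":
    sub_corpus = filter_volumes(SAMPLE_VOLUMES, year_from=1800, category="fiction")
    print(export_csv(sub_corpus))
```

Each filter is optional and the filters combine with a logical AND, which mirrors the idea of progressively narrowing a large corpus down to a small sub-corpus before exporting it for close reading.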
Curatr incorporates a range of data analytics methodologies. For instance, the platform includes functionality to build word lexicons. These are lists of thematically related keywords, which are used to locate niche research topics within little-known or long, unwieldy texts. To reduce the manual effort required to build new word lexicons, we provide users with automatic keyword recommendations, generated using word embeddings. Word embeddings refer to a set of machine learning techniques, based on neural networks, which “map” the words in a corpus vocabulary to a numeric representation. In this new representation, words which frequently appear together or in similar contexts in the original corpus will appear to be similar to one another, while words which do not will be dissimilar. So, for example, for the input word “influenza”, one could automatically recommend similar words such as “pneumonia” and “bronchitis”. In this way, researchers can quickly build lexicons of related words, to assist them in identifying original texts relevant to their work for consultation in situ in the library.
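The recommendation idea can be illustrated with a small sketch. Note that this is a deliberate simplification: instead of a neural word embedding, it uses raw co-occurrence counts within a context window as the numeric representation and ranks candidate words by cosine similarity. The toy sentences and the function names `build_context_vectors`, `cosine`, and `recommend` are assumptions made for illustration and are not part of Curatr.

```python
from collections import Counter, defaultdict
from math import sqrt

# Toy sentences invented for illustration; not drawn from the British Library corpus.
TOY_CORPUS = [
    "the patient suffered from influenza and fever",
    "the patient suffered from pneumonia and fever",
    "the patient suffered from bronchitis and cough",
    "the ship sailed from the harbour at dawn",
    "the ship returned to the harbour at dusk",
]


def build_context_vectors(sentences, window=2):
    """Map each word to a count of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = sqrt(sum(count * count for count in a.values()))
    norm_b = sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(seed, vectors, top_n=3):
    """Return the `top_n` words whose context vectors are most similar to the seed's."""
    if seed not in vectors:
        return []
    scores = [
        (word, cosine(vectors[seed], vectors[word]))
        for word in vectors
        if word != seed
    ]
    # Sort by descending similarity, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[1], pair[0]))
    return scores[:top_n]


if __name__ == "__main__":
    context_vectors = build_context_vectors(TOY_CORPUS)
    for word, score in recommend("influenza", context_vectors, top_n=2):
        print(f"{word}\t{score:.2f}")
```

On this toy data the two highest-ranked suggestions for “influenza” are “pneumonia” and “bronchitis”, because all three words occur in near-identical contexts. A trained embedding model applies the same similarity ranking, but over dense vectors learned from a far larger corpus.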
In the next phase of our work, we plan to make the Curatr platform publicly available for wider research use in late 2019, and potentially to extend the platform to include other humanities text corpora and different types of content, such as texts from historical newspaper archives.
For more details on the Curatr platform, see our MTSR 2019 paper [PDF].
This project is funded by the Irish Research Council (IRC), and is being undertaken by members of the UCD School of English, in collaboration with researchers from the SFI Insight Centre for Data Analytics at UCD.