The screenshot above documents an exciting moment for an ongoing collaboration between the HathiTrust Research Center and the Program Era Project. It is a screenshot of my computer, remotely connected to a HathiTrust machine, running my Program Era Project text-mining tools on sample HathiTrust text data. In short, it’s a proof-of-concept, confirmation that the PEP tools are ready to begin collecting data on thousands of works produced by Workshop-affiliated authors and that the PEP team can begin to join that data with the wealth of institutional, biographical, and demographic data they have already collected on the Iowa Writers’ Workshop and its authors.
This screenshot is a result of HathiTrust’s selection of the Program Era Project as a 2017 Advanced Collaborative Support (ACS) award winner. HathiTrust is a collaboration between partner universities that houses an expansive digital library of written works. As HathiTrust’s site explains, the ACS program is:
a scholarly service at the HathiTrust Research Center (HTRC) offering collaboration between external scholars and HTRC staff to solve challenging problems related to computational analysis. By working together with scholars, we facilitate computational-oriented analytical access to HathiTrust based on individual scholarly or educational need.
For the HathiTrust/PEP collaboration, the approach chosen was to establish a “Data Capsule,” a machine maintained and secured by HathiTrust that PEP team members can access remotely to run text mining experiments on Workshop-affiliated works held in HathiTrust’s collections. The Data Capsule approach is crucial because most works authored by Iowa Writers’ Workshop writers remain in copyright and so are not otherwise accessible in digital form for large-scale data collection. The Data Capsule configuration allows full texts of HathiTrust works to be measured by text mining software, but only the metrics collected by the tools can be moved off the Data Capsule machine. In PEP’s case, this means .csv spreadsheets of data on individual Workshop texts.
Thanks to the HathiTrust/PEP collaboration, the tools I created for the Program Era Project (described a bit more here) can now be employed on a large volume of digital texts. They can be used not just for experiments, but to begin building a database of metrics on features of Workshop writing. For the Data Capsule collection, the Program Era Project team assembled a list of roughly 400 authors considered to be the most prominent writers or instructors associated with the Workshop. Since receiving this list, the HathiTrust team has worked to find all the works held by HTRC that are associated with these authors. At present, over 2000 volumes have been connected to the PEP author list. These items are then made accessible on the Data Capsule and, using the PEP tools, converted into metrics that are stored in the PEP database.
So, what data is the Program Era Project collecting? Currently, I’ve built two text mining tools for the Program Era Project. Both are written in Python and draw on the Natural Language Toolkit (NLTK). We call them Style Card and LitMap.
Style Card is a text analysis tool that enables the project team to measure formal features of Workshop writers’ literary styles, such as vocabulary size, sentence length, adverb and adjective usage, and even the frequency of male and female pronouns. The last metric is particularly interesting, as it can give you a quick impression of gender representation trends in a work or a large collection of works. Additionally, by collecting the same metrics from multiple authors, or from multiple works by one author, we can make stylistic comparisons: between the earlier and later works of a single author, for example, or between the complete corpora of two authors. It is, in short, like creating baseball cards for authors and literary works, snapshots of information that can be evaluated at a glance.
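To make these metrics concrete, here is a minimal sketch of how a few of them could be computed. It is not the Style Card code itself: it uses only Python’s standard library rather than NLTK, the function name and pronoun lists are assumptions made for illustration, and the sentence splitting is deliberately naive. Adverb and adjective counts are left out because they require part-of-speech tagging.

```python
import re

# Illustrative pronoun lists (an assumption, not the lists Style Card uses).
MALE_PRONOUNS = {"he", "him", "his", "himself"}
FEMALE_PRONOUNS = {"she", "her", "hers", "herself"}


def style_metrics(text):
    """Return a few simple style metrics for a piece of text."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Lowercased word tokens made of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    return {
        "word_count": len(words),
        "vocabulary_size": len(set(words)),
        "avg_sentence_length": len(words) / len(sentences) if sentences else 0.0,
        "male_pronouns": sum(w in MALE_PRONOUNS for w in words),
        "female_pronouns": sum(w in FEMALE_PRONOUNS for w in words),
    }


sample = "She walked to the river. He followed her slowly. The river was cold!"
metrics = style_metrics(sample)
print(metrics)
```

Even on a three-sentence sample, the output reads like a small “card”: total words, distinct words, average words per sentence, and the two pronoun counts.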
LitMap is a software package that tracks location references in literary corpora, making it possible for the project to analyze regional representation trends in literary works. This allows us to see the influence of an author’s biography on their literary output, and to measure how authors’ moves to and from creative writing programs influence the settings of their writing. Using LitMap, we’ve already made some interesting discoveries about how frequently works written by authors who attended the Iowa Writers’ Workshop mention the state of Iowa. The team is looking forward to sharing more with you on that in the future.
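As a rough illustration of the idea behind tracking location references, the sketch below counts mentions of place names from a small, hand-made list. This is an assumption-laden simplification, not LitMap’s actual method: the place list, the function name, and the sample sentence are all invented for the example, and a real tool would need a far larger gazetteer and some way of handling ambiguous names.

```python
import re
from collections import Counter

# Hypothetical, tiny gazetteer used only for this example.
PLACES = ["Iowa City", "Iowa", "New York", "Chicago"]


def count_places(text, places=PLACES):
    """Count whole-word mentions of each place name in the text."""
    counts = Counter()
    remaining = text
    # Match longer names first so "Iowa City" is not also counted as "Iowa".
    for place in sorted(places, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(place) + r"\b")
        # Blank out each match so shorter names cannot match inside it.
        remaining, found = pattern.subn(" ", remaining)
        if found:
            counts[place] += found
    return counts


sample = ("They left Iowa City for New York, but Iowa stayed with them. "
          "Iowa winters were long.")
place_counts = count_places(sample)
print(place_counts.most_common())
```

Counts like these, gathered per work and joined with each author’s biography, are the kind of raw material a map visualization could be built from.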
What’s truly significant (and truly promising) about the data these tools collect is that it will be stored in the PEP team’s database. Once the ACS data is incorporated into the PEP database and made available through the PEP web presence, users will be able to rank and compare features of Workshop writing at a glance.
The images below represent a proposal for the eventual look and feel of the Program Era Project web presence. The numbers used are drawn from data already collected with the PEP text mining tools. As the first figure shows, a user could rank Workshop writers by average sentence length, learning, at a glance, which authors typically create sprawling (or terse) sentences. The second figure ranks authors based on the ratio of male pronouns to female pronouns found in their corpora. The larger the number, the more often male pronouns appear compared to female pronouns.
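The two rankings described above boil down to sorting authors by a single number. The sketch below shows one way that could work. The author names and figures are placeholders invented for the example, not PEP data, and the helper function is hypothetical.

```python
def pronoun_ratio(male, female):
    """Male-to-female pronoun ratio; infinite if no female pronouns occur."""
    return male / female if female else float("inf")


# Placeholder records (made-up values, not drawn from the PEP database).
records = [
    {"author": "Author A", "avg_sentence_length": 14.2,
     "male_pronouns": 300, "female_pronouns": 150},
    {"author": "Author B", "avg_sentence_length": 22.7,
     "male_pronouns": 120, "female_pronouns": 240},
    {"author": "Author C", "avg_sentence_length": 18.0,
     "male_pronouns": 200, "female_pronouns": 200},
]

# Rank from longest to shortest average sentence.
by_length = sorted(records, key=lambda r: r["avg_sentence_length"], reverse=True)

# Rank from highest to lowest male-to-female pronoun ratio.
by_ratio = sorted(
    records,
    key=lambda r: pronoun_ratio(r["male_pronouns"], r["female_pronouns"]),
    reverse=True,
)

for row in by_ratio:
    ratio = pronoun_ratio(row["male_pronouns"], row["female_pronouns"])
    print(row["author"], round(ratio, 2))
```

A ratio of 2.0 means male pronouns appear twice as often as female pronouns, 1.0 means they appear equally often, and values below 1.0 mean female pronouns are more frequent.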
Users could also compare two authors (or an individual author and a control corpus) and look at differences such as first-person and third-person pronoun use (a potential indicator of narration) or adverb and adjective ratios (which can index spare or detailed prose). Scholars could see at a glance how an author’s stylistic features compare to those of their Workshop advisor, to those of Workshop writers as a whole, or even to a baseline corpus of writing in English.
The following image, produced using Plotly, the platform we currently use to visualize LitMap data, shows another way the PEP text tools will provide new insights into literary corpora. The image shows the strong representation of Iowa in a literary corpus comprising 75 novels by Workshop-affiliated writers, documenting how their time at Iowa has left a mark on the places they write about.
The idea behind offering these metrics to future users of the Program Era Project website is that access to this information will prompt curiosity and exploration. Moreover, when users find an interesting pattern or phenomenon in the data, we hope it will prompt a direct investigation of the works included in the data. In short, beyond just presenting this information, we believe that the ability to skim over these metrics will inspire scholars to dive deep into the texts the data is drawn from.
These objectives of encouraging emergent research and driving curiosity are at the heart of Style Card and LitMap’s other principal innovation: the use of clear, easy-to-understand metrics. The fields of stylometry and text analysis have developed techniques that allow for astounding technical and scholarly achievements, author attribution being a notable example. However, understanding how a piece of software or a quantitative approach arrived at its conclusions can be difficult for users not familiar with the theoretical foundations or technologies employed. For this reason, the metrics tracked by Style Card were selected to offer users information that is transparent and easy to understand. By using simple numbers, Style Card metrics allow any scholar, whatever their experience and training with quantitative analysis, to benefit from the Program Era Project website, broadening the range of academic projects that might be inspired by quantitative analysis.
Even better, both the Style Card and LitMap tools were developed in such a way that anyone can use them. You simply click the program file, type in the name of the author and work you are scanning, and select the name you want for the output file you will create. The tool does everything else. This matters for two reasons. First, it allows more collaboration in building our database of text metrics. If a team member can access the Data Capsule, they can easily run the software to collect metrics. Second, these tools will eventually be made freely available online. Any other project team that wishes to collect the same metrics the PEP team has will therefore be able to do so. Because the tools will be open source, users will also have the option to modify, adjust, and tweak the technology to their own needs. Moreover, any school that might be interested in learning more about its own history of Creative Writing, or any school that might wish to establish a partner project to the one here at the University of Iowa, will have the necessary technology.
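For readers curious what the final step of that workflow might look like, here is a hedged sketch of writing one row of metrics per work to a .csv file with Python’s standard `csv` module. The column names, the function name, and the sample row are assumptions for illustration only; the actual PEP tools may organize their output differently.

```python
import csv
import io

# Hypothetical column layout for the output spreadsheet.
FIELDS = ["author", "title", "word_count", "vocabulary_size"]


def write_metrics(rows, out_file):
    """Write a header line and one line per work to an open text file."""
    writer = csv.DictWriter(out_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


# Placeholder row (invented values, not real Workshop data).
rows = [{"author": "Author A", "title": "Sample Work",
         "word_count": 13, "vocabulary_size": 11}]

# An in-memory buffer stands in for the output file here; on disk one
# would use open("output.csv", "w", newline="") instead.
buffer = io.StringIO()
write_metrics(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Because only files like this leave the Data Capsule, each row carries measurements about a work rather than any of its text.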
All that said, I hope you can see why I and the Program Era Project team are so excited about the screenshot of our text tools running in conjunction with the HathiTrust Data Capsule. The image represents another significant step along the way to our goal of providing students and scholars of literature the ability to explore the history of the Iowa Writers’ Workshop—and the history of Creative Writing programs—in a way that was never before possible.