Design
This is the original design document for Winnow, the subcorpora tool.
Hilary Sun ([email protected])
Last revised 5/29/2019
When faced with hundreds (or even thousands!) of oral history transcripts, it's hard to know where to look for relevant information. Thus, we created Python scripts that build subcorpora (subsets of transcripts containing relevant information) from our corpus, cutting down the number of transcripts we need to look at. Using these scripts, we can easily filter transcripts by metadata, by keywords, etc.
The original scripts that we wrote to process and create subcorpora for the Oral History Project have to be run in the terminal. For many oral historians, this may not be an easy or intuitive way to analyze transcripts.
The subcorpora tool will be created based on the pain points of the OHTAP team, which include:
- Current tools not being able to handle the size of our collection. Specifically, we were using tools hosted online that only allowed us to upload a certain number of transcripts at a time, which limited the amount of analysis we could do and added a lot of manual labor as our collection continued to grow.
- Current tools not allowing for needed functionality, including flagging keywords in context, organizing our keywords into categories, generating aggregate statistics based on our metadata, easily storing and sharing information from past analyses, etc.
- Needing an easier and more intuitive way to run our scripts. Only the software engineers were running the subcorpora tool through the terminal (each execution is called a "run"), which wasn't intuitive for someone without a technical background.
- Needing a better data management methodology to keep track of changes to keyword lists.
Thus, the subcorpora tool will be a front-end application that oral historians can run locally on their machines. We decided not to host it online because we cannot support a server that can store and process large numbers of transcript files; running locally also allows OHTAP to keep using the tool as our corpus grows. Additionally, we will not have a database (as traditional web applications do); instead, we will store all of our data in a JSON file, which can be more easily shared (as a single file) between members of our team.
For this current implementation, we hope to first make it specific to OHTAP. As we continue implementation, we hope to expand the tool to a more general use case so that anyone can analyze any text files based on a set of keywords.
We will use React.js for the front end and Node.js for the back end. We chose these two technologies because they are among the more popular technologies for web applications.
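As a rough illustration of how these two layers might communicate on a user's machine, the React front end could send the chosen run settings to a local Node process. This is only a sketch; the use of Express, the endpoint name, and the port are assumptions, not decisions from this design.

```typescript
// Hypothetical local back end (Express and the route name are assumptions, not part of this design).
import express from "express";

const app = express();
app.use(express.json());

// The React front end would POST the user's chosen collections, keyword lists,
// and metadata file here to kick off a run.
app.post("/api/run", (req, res) => {
  const { collections, keywordLists, metadataFile } = req.body;
  // ...start the subcorpora run and record progress for the UI to display...
  res.json({ status: "started" });
});

// Listens locally only, since the tool is not hosted on the web.
app.listen(3001);
```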
The tool will consist of three front-end components for the user to interact with as well as additional functionality for users to export their session and reload it later.
We split the web application into three front-end components: run subcorpora tool, keyword lists, and past runs.
Figure 1: Basic navigation of different views in the web application.
These sections are explained in more detail below.
This section will run the subcorpora tool to simulate running it in the terminal.
Figure 2: Workflow of running the subcorpora tool.
Essentially, the user will first choose the corpus collections they want to use, the keyword lists, and the metadata file, and then run the subcorpora tool to generate the results.
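To make this concrete, the inputs to a run might be represented roughly as follows (the field names are illustrative, not final):

```typescript
// Illustrative shape of a single run's inputs; all names are placeholders.
interface RunConfiguration {
  collections: string[];   // corpus collections to include in the run
  keywordLists: string[];  // keyword lists to search the transcripts for
  metadataFile: string;    // path to the metadata file for these collections
}
```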
Additional functionalities in the future will include:
- Ability to upload any sort of metadata file and have the tool parse the relevant columns. For our initial OHTAP use case, we will hard-code the column names into our code.
- Ability to flag keywords in context through checkboxes, such as false hits. In the future, there will also be automatic and manual distinctions for this (i.e., should the tool exclude all occurrences of a certain phrase, or only this one?).
- Saving sessions of all of the above in the JSON file so you can access and look at past runs.
This section will contain the list of keyword lists that the user has.
Figure 3: List of keyword lists.
It will contain the ability to edit, add, and delete keyword lists.
The user will also be able to look up past runs that they did.
Figure 4: List of past runs.
The data for each session will be saved into a JSON file. This JSON file can be shared among users so that they can initialize their session instead of starting from scratch. We chose the JSON file for these reasons:
- One file is easier to share and manage than a giant database.
- It doesn't require any additional set-up of databases, which is helpful for non-technical members.
- You can manually edit the JSON file if needed.
- We are not hosting the application on the web.
Granted, this method of storing data is probably not scalable, but for our specific use case, it is sufficient.
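For illustration only, the session file might hold something like the following (the exact schema is not yet decided; everything below is a placeholder):

```typescript
// Placeholder sketch of the shared session data; the real schema is still to be designed.
// This object mirrors what would be written to the single JSON file.
const exampleSession = {
  keywordLists: [
    { name: "example-list", lastEdited: "2019-05-29", keywords: ["keyword one", "keyword two"] }
  ],
  pastRuns: [
    {
      id: "2019-05-29-run-1",
      collections: ["example-collection"],
      keywordLists: ["example-list"],
      metadataFile: "metadata.csv",
      flaggedHits: []  // keyword hits the user marked as false hits
    }
  ]
};
```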
We have some report layouts that the user can look at after they conduct a run (running the subcorpora tool).
Users will need to be able to navigate through different reports. The navigation will be at the top.
Figure 5: Navigation through summary and individual reports (selected by collection and keyword list). When the "Summary" type is selected, "Collection" and "List" will be grayed out.
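One possible way to model this selection in the front end (the names here are ours, for illustration only):

```typescript
// Either the run-wide summary, or one report per (collection, keyword list) pair.
// When "summary" is selected, the collection and list pickers are grayed out.
type ReportSelection =
  | { kind: "summary" }
  | { kind: "individual"; collection: string; keywordList: string };
```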
We will have one report that will be a summary of the entire run (all collections and all keywords).
Figure 6: Sample summary report with cards of graphs and basic information at the very top.
We will have one report per collection per keyword list.
Figure 7: The individual reports will have a similar layout of graphs and basic information, but they'll also include keyword contexts at the bottom where users can flag incorrect keyword hits.
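A rough sketch of what one keyword-in-context entry in an individual report might carry (field names are illustrative):

```typescript
// Illustrative shape of a keyword hit shown at the bottom of an individual report.
interface KeywordHit {
  transcriptId: string;
  keyword: string;
  context: string;             // surrounding text shown to the user
  falseHit: boolean;           // set via the checkbox when the match is incorrect
  excludeEverywhere?: boolean; // future: exclude every occurrence of the phrase vs. just this one
}
```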
We will also be adding a functionality to be able to compare statistics across reports. This will require careful thought as we expand the tool, since some forms of statistics aren’t comparable depending on the metadata being shown.
We are using Google Material UI for our front-end components. We created a quick design using Figma showcasing the flow of screens.
Figure 8: Figma layout of screens.
We can see how the menu will be on the side.
Figure 9: Figma layout of screens, showing the side menu.
We can see the format of the run workflow.
Figure 10: Figma layout of choosing collections, keywords, and metadata for the run.
There is a progress bar that shows what is going on in the backend.
Figure 11: Figma layout of progress bar.
We can see an idea of what the reports will look like.
Figure 12: Figma layout of the reports.
In general, this is also how the editing of collections and keyword lists will go.
Figure 13: Figma layout of editing collections and keyword lists.
As we continue on with the subcorpora tool, we can think about licensing, permissions, and restrictions on what sorts of data people would be able to access. For example, when we generate reports and want to share them, we could choose to exclude excerpts from certain transcripts. We haven't designed any of this into the current version, but it could be something to think about for the future.
This code belongs to the Stanford Oral History Text Analysis Project and is licensed under The MIT License.