Apply natural language processing tools to raw text data (OCR) from Gale Primary Sources in a single research platform. By integrating an unmatched depth and breadth of digital primary source matter with the most popular Digital Humanities (DH) tools, Gale Digital Scholar Lab provides a new lens to explore history and empowers researchers to generate world-altering conclusions and outcomes. The Digital Scholar Lab offers advanced humanities computing tools that make natural language processing (NLP) for historical texts accessible, more efficient, and impactful, thus expanding the footprint of digital humanities across campus.
Build describes the initial creation and ongoing curation of a Content Set. A Content Set is a user-defined corpus or dataset of documents. Users can build their Content Set by searching Gale Primary Sources, by uploading .txt files, and by creating documents in the Lab. In this section, we provide a synopsis of how to build a Content Set by searching Gale’s Primary Sources and by uploading .txt files in Gale Digital Scholar Lab.
Search
To start building your Content Set, you can search for documents using Gale Primary Sources available within your institution’s library. You can search for keywords or terms that characterize or best relate to your research interests and questions. You can search using words that appear in a document and/or the various metadata fields describing a document.
Review and Curate Results
Once the search results are presented, you can review the information of each document in order to determine if it’s appropriate for your Content Set.
Results show each document's title, author, OCR confidence, text snippets, and relevant metadata. You can use filters to further refine your search results.
If you are uncertain whether to add a document to your Content Set, you can view it by clicking on its title. An image of the original document appears side by side with its OCR text, so you can examine the text in relation to the original document scan.
Add Documents to Content Sets
Once you have found and curated the documents you want to analyze, it’s time to add them to your Content Set.
To add documents to a Content Set, select any and all relevant documents. Once all documents are selected, click the “Add to Content Set” button in the toolbar. You can add the documents to either a new Content Set or a pre-existing Content Set. Additionally, you can select all results on the page or all of your search results (up to 10,000 documents) to add in bulk.
You can also add a document to your Content Set from the document view, or you can return to the Results page and select the document there. After you have selected all the documents you want in the results list, click “Add to Content Set” in the toolbar and select the appropriate Content Set or create a new Content Set.
In addition to searching Gale’s Primary Sources, a user can upload .txt files to a Content Set. To upload files, drag a .txt file to the designated region, or click Browse to navigate your local disk. Once you have prepared document(s) for Upload, you can add the necessary metadata associated with the file and then add the files to a pre-existing Content Set or create a new one.
Configurations & Replication / Method
The Clean feature is designed as part of the broad commitment to transparency and method within Gale Digital Scholar Lab. We want you to be able to replicate or reproduce your results with the same Content Set or compare similar methods across different Content Sets. Clean allows you to build a Configuration you can reuse and offers a space to record the cleaning choices that you made, so you can return to your analysis easily after being away from Gale Digital Scholar Lab for a period of time. In short, a Configuration creates a kind of standardized method for preparing documents that you can send for Analysis, and lets you use that standardized preparation, like a cookie cutter, for any of your Content Sets, combined with any Analysis Tool.
Default & Custom Configurations
Clean allows you to create Configurations that you can reuse or associate with a specific analysis job for an Analysis Tool. Default Configurations can be used directly, or they can act as a template or starting point for the creation of new Custom Configurations. You can save Configurations at any time by providing a new name and description. An appropriately named Configuration will help you pick the correct Configuration from a dropdown menu when you choose configuration options for Analysis Tools.
Select the Text Cleaning Configurations
Each Configuration consists of a series of modification, removal and replacement, or substitution options, alongside a preliminary Stop Word List which can be edited or deselected entirely. In theory, you can have a Configuration that is empty. It will not do anything, but the DSLab would still use it before running an Analysis Job.
Corrections
The only text modification currently available is case correction, which converts all text to lowercase. This is useful when an algorithm or tool is case sensitive, for example Named Entity Recognition (NER), and offers no internal option to alter case.
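The effect of lowercasing on token counts can be illustrated with a minimal Python sketch. It is not the Lab's own implementation: the function name token_counts and the whitespace tokenization are assumptions made for illustration only.

```python
from collections import Counter


def token_counts(text: str, lowercase: bool) -> Counter:
    """Count whitespace-separated tokens, optionally lowercasing first.

    Hypothetical helper; whitespace splitting is a simplification.
    """
    if lowercase:
        text = text.lower()
    return Counter(text.split())


sample = "The Queen met the queen"
raw = token_counts(sample, lowercase=False)    # "The" and "the" stay distinct
folded = token_counts(sample, lowercase=True)  # both forms merge into "the"
```

Without lowercasing, a case-sensitive tool would count “The” and “the” as two different tokens; after lowercasing they are counted together.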
Removal
Document Sections
You can select which sections of a document you would like to segment for analysis or download.
Replacement
Replace _____ with _____ allows users to define what kinds of replacements they would like to make on the spot. This function is useful for controlling orthography or spelling differences, e.g. all instances of “colour” can be altered to “color”, to make sure that the Analysis Tools treat them as the same token, not distinct words.
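As a rough illustration of what such a replacement does to the text, here is a short Python sketch. The helper replace_word is hypothetical, and whole-word matching is an assumption: the Lab may match on a different basis (for example, plain substrings).

```python
import re


def replace_word(text: str, old: str, new: str) -> str:
    """Replace whole-word occurrences of `old` with `new`.

    Hypothetical helper; whole-word matching is an assumption.
    """
    pattern = re.compile(rf"\b{re.escape(old)}\b")
    # A function replacement avoids treating backslashes in `new` as escapes.
    return pattern.sub(lambda match: new, text)


sample = "The colour of the colourful flag; another colour."
cleaned = replace_word(sample, "colour", "color")
```

Under the whole-word assumption, “colourful” is left untouched, so a separate replacement would be needed for each variant spelling you want to normalize.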
Stop words
You can also decide to use a Stop Word List as part of your configuration. The default Stop Word List contains English words. However, you can select from additional Starter Lists (many in different languages), and you can edit the list as you see fit. Each Stop Word should be listed on a separate line. You could also cut and paste an entirely new Stop Word List here.
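The one-word-per-line format can be pictured with a small Python sketch. The function names, the whitespace tokenization, and the case-insensitive comparison are all assumptions for illustration, not a description of how the Lab processes the list internally.

```python
from __future__ import annotations


def parse_stop_words(raw_list: str) -> set[str]:
    """Read a stop word list with one word per line, skipping blank lines."""
    return {line.strip().lower() for line in raw_list.splitlines() if line.strip()}


def remove_stop_words(text: str, stop_words: set[str]) -> list[str]:
    """Drop tokens that appear in the stop word list (case-insensitive)."""
    return [tok for tok in text.split() if tok.lower() not in stop_words]


stop_list = "the\nof\nand\n"  # illustrative list, not the Lab's default
tokens = remove_stop_words(
    "The history of the press and the public",
    parse_stop_words(stop_list),
)
```

Here only “history”, “press”, and “public” remain, which is why the contents of the list have a direct effect on what the Analysis Tools count.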
Name and save the cleaning configurations
When you first open the Clean feature you will see the “Default Configuration”, which contains an English Stop Word List and has the “Remove all tabs”, “Remove all line breaks”, “Reduce multiple spaces to one space” and “Remove all non-body content” options checked. You can alter these default selections and save the updated configuration by providing a new name. There is a space for Configuration Notes, where you can record your reasoning as you customize your clean configuration. Each time you view and alter a Configuration, you can either save it, overwriting the existing Configuration, or save it as a new setup, using “save as”. This allows you to create any number of customized Configurations for your work. A useful tip is to name your configurations so you can remember the settings you have chosen, e.g. “No Punctuation - Lower case - English” for a Configuration which removes all punctuation, transforms all text to lowercase, and uses an English Stop Word List.
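To make the idea of a named, reusable Configuration concrete, the following Python sketch models one as a small data class. Everything here is hypothetical: the class CleanConfig, its field names, and the order in which steps run are assumptions, tabs and line breaks are replaced by a space rather than deleted, and the “Remove all non-body content” option is not modeled.

```python
import re
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CleanConfig:
    """Hypothetical stand-in for a Clean Configuration."""
    name: str
    notes: str = ""
    remove_tabs: bool = True
    remove_line_breaks: bool = True
    collapse_spaces: bool = True
    lowercase: bool = False
    stop_words: frozenset = frozenset()

    def apply(self, text: str) -> str:
        """Run the enabled cleaning steps in a fixed (assumed) order."""
        if self.remove_tabs:
            text = text.replace("\t", " ")
        if self.remove_line_breaks:
            text = re.sub(r"[\r\n]+", " ", text)
        if self.collapse_spaces:
            text = re.sub(r" {2,}", " ", text)
        if self.lowercase:
            text = text.lower()
        if self.stop_words:
            kept = [t for t in text.split(" ") if t.lower() not in self.stop_words]
            text = " ".join(kept)
        return text


default = CleanConfig(name="Default", stop_words=frozenset({"the", "of"}))
# "Save as": copy the existing settings under a new, descriptive name.
custom = replace(
    default,
    name="Lower case - English",
    notes="Lowercased for a case-sensitive tool.",
    lowercase=True,
)

sample = "The\tHistory  of\nthe Press"
default_out = default.apply(sample)
custom_out = custom.apply(sample)
```

Because the same object can be applied to any text, the sketch mirrors how one Configuration can be reused across Content Sets, while the name and notes fields record which choices were made and why.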
Once you have created, curated, and cleaned the OCR texts in your Content Set, you are ready to move on to the Analyze phase. The Analysis tools allow you to take hundreds or thousands of documents and use digital tools to interrogate them in ways that would have been too time consuming without the help of computational algorithms. In this section, learn how to choose tools, run them, and interpret their output.
It’s important to know what tools are available and what they can do. In this section, you will learn about the questions you should ask yourself as you choose the right tool for your analysis.
Each tool has settings you can use to refine the results you get. It is important to know how to use these options in order to return the best possible visualization results.