
Gale Primary Sources and Digital Scholar Lab

Getting Started with Gale Digital Scholar Lab

Build describes the initial creation and ongoing curation of a Content Set. A Content Set is a user-defined corpus or dataset of documents. Users can build their Content Set by searching Gale Primary Sources, by uploading .txt files, and by creating documents in the Lab. In this section, we provide a synopsis of how to build a Content Set by searching Gale’s Primary Sources and by uploading .txt files in Gale Digital Scholar Lab.


Search

To start building your Content Set, you can search for documents using Gale Primary Sources available within your institution’s library. You can search for keywords or terms that characterize or best relate to your research interests and questions. You can search using words that appear in a document and/or the various metadata fields describing a document.

Review and Curate Results

Once the search results are presented, you can review each document’s information to determine whether it is appropriate for your Content Set.

Results show each document’s title, author, OCR confidence, text snippets, and relevant metadata. You can use filters to further refine your search results.

If you are uncertain whether to add a document to your Content Set, you can view it by clicking on its title. An image of the original document appears side-by-side with its OCR text, so you can examine the text in relation to the original document scan.

Add Documents to Content Sets

Once you have found and curated the documents you want to analyze, it’s time to add them to your Content Set.

To add documents to a Content Set, select any and all relevant documents. Once all documents are selected, click the “Add to Content Set” button in the toolbar. You can add the documents to either a new Content Set or a pre-existing Content Set. Additionally, you can select all results on the page or all of your search results (up to 10,000 documents) to add in bulk.

You can also add a document to your Content Set directly from the document view, or you can return to the Results page and select the document there. After you have selected all the documents you want in the results list, click “Add to Content Set” in the toolbar and select the appropriate Content Set or create a new Content Set.

In addition to searching Gale’s Primary Sources, a user can upload .txt files to a Content Set. To upload files, drag a .txt file to the designated region, or click Browse to navigate your local disk. Once you have prepared document(s) for Upload, you can add the necessary metadata associated with the file and then add the files to a pre-existing Content Set or create a new one.

Configurations & Replication / Method

The Clean feature is part of the broad commitment to transparency and method within Gale Digital Scholar Lab. We want you to be able to replicate or reproduce your results with the same Content Set, or compare similar methods across different Content Sets. Clean lets you build a Configuration that you can reuse and gives you a space to record the cleaning choices you made, so you can return to your analysis easily after being away from Gale Digital Scholar Lab for a period of time. In short, a Configuration creates a kind of standardized method for preparing documents before you send them for Analysis, and lets you use that standardized preparation - like a cookie cutter - for any of your Content Sets, combined with any Analysis Tool.

Default & Custom Configurations

Clean allows you to create Configurations that you can reuse or associate with a specific analysis job for an Analysis Tool. Default Configurations can be used directly, or they can act as a template or starting point for the creation of new Custom Configurations. You can save Configurations at any time by providing a new name and description. An appropriately named Configuration will help you pick the correct Configuration from a dropdown menu when you choose configuration options for Analysis Tools.

Select the Text Cleaning Configurations

Each Configuration consists of a series of correction, removal, and replacement (substitution) options, alongside a preliminary Stop Word List which can be edited or deselected entirely. In theory, you can have a Configuration that is empty. It would not change anything, but the DSLab would still apply it before running an Analysis Job.

Corrections

The only text modification currently available is case correction, which alters all text to lowercase. This is useful in contexts where an algorithm or tool might be case sensitive, for example Named Entity Recognition (NER), and offers no internal option to alter case.

Removal

  • Remove all extended ASCII characters
  • Remove all number characters
  • Remove all special characters. Users can set specific special characters to remove, such as currency symbols, slashes, underscores, etc.
  • Remove all punctuation. Users can set specific punctuation to remove
  • Remove all tabs
  • Remove all line breaks
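
The effect of these removal options can be illustrated outside the Lab with a minimal Python sketch. This is not the Lab's implementation: the function names are hypothetical, "extended ASCII" is assumed here to mean any character outside 7-bit ASCII, and tabs and line breaks are assumed to be replaced by a space so that neighboring words do not run together.

```python
import re
import string


def remove_non_ascii(text: str) -> str:
    # Assumption: "extended ASCII" means any character outside 7-bit ASCII.
    return "".join(ch for ch in text if ord(ch) < 128)


def remove_numbers(text: str) -> str:
    # Delete every digit character.
    return re.sub(r"\d", "", text)


def remove_characters(text: str, chars: str) -> str:
    # Delete only the characters the user listed, e.g. "$/_".
    return text.translate(str.maketrans("", "", chars))


def remove_punctuation(text: str, chars: str = string.punctuation) -> str:
    # Defaults to all ASCII punctuation; pass a narrower string to keep some.
    return remove_characters(text, chars)


def remove_tabs_and_line_breaks(text: str) -> str:
    # Assumption: each tab or line break becomes a single space.
    return re.sub(r"[\t\r\n]", " ", text)


sample = "Paid 25 dollars,\tthen left.\nDone!"
print(remove_numbers(sample))
print(remove_punctuation(sample))
print(remove_tabs_and_line_breaks(sample))  # Paid 25 dollars, then left. Done!
```

Trying the same options on a short sample like this before applying them to a whole Content Set can show whether a removal step discards something your research question depends on, such as dates or prices.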

Document Sections

You can select which sections of a document you would like to segment for analysis or download.

  • Remove Body Text enables users to strip the main narrative text from monograph and/or manuscript documents prior to analysis or download.
  • Remove non-Body Text enables users to strip the segments listed below from monograph and/or manuscript documents prior to analysis or download.
    • Front Matter
    • Title Page
    • Preface
    • Table of Contents
    • Back Matter
    • Index

Replacement

  • Reduce multiple spaces to one space (ex: “hello    there” becomes “hello there”)

Replace _____ with _____ allows users to define, on the spot, what kinds of replacements they would like to make. This function is useful for controlling orthography or spelling differences, e.g. all instances of “colour” can be altered to “color”, to make sure that the Analysis Tools treat them as the same token, not distinct words.
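
As a rough illustration of the two replacement options, the following Python sketch collapses repeated spaces and applies user-defined replacements. The function names are hypothetical, and matching whole words only is an assumption made for this example; the Lab's own matching rules may differ.

```python
import re


def reduce_spaces(text: str) -> str:
    # Collapse any run of two or more spaces into a single space.
    return re.sub(r" {2,}", " ", text)


def replace_terms(text: str, replacements: dict[str, str]) -> str:
    # Assumption: replacements are case sensitive and match whole words only.
    for old, new in replacements.items():
        text = re.sub(rf"\b{re.escape(old)}\b", new, text)
    return text


print(reduce_spaces("hello    there"))  # hello there
print(replace_terms("The colour was a dull colour.", {"colour": "color"}))
```

After the replacement, both spellings count as the same token, which is the point of normalizing orthography before analysis.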

Stop words

You can also decide to use a Stop Word List as part of your configuration. The default Stop Word List contains English words. However, there are additional Starter Lists (many in different languages) that you can select, and you can edit the chosen list as you see fit. Each Stop Word should be listed on a separate line. You could also cut and paste an entirely new Stop Word List here.
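
To show what a one-word-per-line Stop Word List does to a text, here is a minimal Python sketch. It is an illustration only: the function names are hypothetical, tokens are split on whitespace, and matching is assumed to be case insensitive.

```python
def parse_stop_words(raw: str) -> set[str]:
    # One stop word per line; blank lines and surrounding spaces are ignored.
    return {line.strip().lower() for line in raw.splitlines() if line.strip()}


def remove_stop_words(text: str, stop_words: set[str]) -> list[str]:
    # Assumption: tokens are whitespace-separated and compared in lowercase.
    return [token for token in text.split() if token.lower() not in stop_words]


stop_list = parse_stop_words("the\nof\nand\n")
print(remove_stop_words("The history of the press", stop_list))  # ['history', 'press']
```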

Name and save the cleaning configurations

When you first open the Clean feature you will see the “Default Configuration”, which contains an English Stop Word List and has the “Remove all tabs”, “Remove all line breaks”, “Reduce multiple spaces to one space” and “Remove all non-body content” options checked. You can alter these default selections and save the updated configuration by providing a new name. There is a space for Configuration Notes, where you can keep a record of your reasoning as you customize your clean configuration. Each time you view and alter a Configuration, you can either save it, overwriting the existing Configuration, or save it as a new setup using “save as”. This allows you to create any number of customized Configurations for your work. A useful tip is to name your configurations so you can remember the settings you have chosen, e.g. “No Punctuation - Lower case - English” for a Configuration which removes all punctuation, transforms all text to lowercase, and uses an English Stop Word List.
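
If you also want to document your cleaning choices outside the Lab, for example in a methods appendix, a Configuration can be modeled as a small record. The Python sketch below is one hypothetical way to do that: the class and field names are invented for illustration and do not correspond to any Lab export format, and only the defaults described above are encoded.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class CleanConfiguration:
    # Hypothetical record of cleaning choices; field names are illustrative.
    name: str
    notes: str = ""
    lowercase: bool = False
    remove_punctuation: bool = False
    remove_tabs: bool = True
    remove_line_breaks: bool = True
    reduce_spaces: bool = True
    remove_non_body: bool = True
    stop_word_list: str = "English"


default = CleanConfiguration(name="Default Configuration")

# Equivalent of "save as": copy the defaults under a new, descriptive name.
custom = replace(
    default,
    name="No Punctuation - Lower case - English",
    notes="Punctuation and case are not relevant to this analysis.",
    lowercase=True,
    remove_punctuation=True,
)

print(custom)
```

Because the record is frozen, the original defaults stay unchanged, which mirrors keeping the Default Configuration intact while saving a customized copy.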

Once you have created, curated, and cleaned the OCR texts in your Content Set, you are ready to move on to the Analyze phase. The Analysis tools allow you to take hundreds or thousands of documents and use digital tools to interrogate them in ways that would have been too time consuming without the help of computational algorithms. In this section, learn how to choose tools, run them, and interpret their output.

    1. Selecting the Right Tool

It’s important to know what tools are available and what they can do. In this section, you will learn about the questions you should ask yourself as you choose the right tool for your analysis.

    2. Setting Up and Running

Each tool has settings you can use to refine your results. It is important to know how to use these options in order to produce the best possible visualizations.