
Text and Data Mining: Text and Data Mining Resources

This guide will help faculty and students with text mining questions, especially in relation to our databases.

Free Resources and Data Mining Tools

Online Sources

The American Presidency Project is a database of documents that includes materials from Washington's inauguration as well as speeches and other documents of former American presidents.

This is a database of historical American newspapers created by the National Digital Newspaper Program. It has over 2,000 full-text searchable digitized newspapers and its own free-to-use API.

This project provided the first free e-books. Project Gutenberg now contains over 50,000 works on a wide range of topics. Most works were published before 1923 and are out of copyright. The works include books and periodicals.

This database is "the largest freely-available corpus of English, and the only large and balanced corpus of American English. It contains more than 520 million words from fiction, popular magazines, newspapers, academic texts and spoken word." 

This archive is an online non-profit founded in 1996.  It contains over 10,000,000 texts and books.  The materials come in a variety of formats to download. 

This is a list of APIs created by the New York Times. All users must request an API key for each API. The APIs allow retrieval of headlines, lead paragraphs, and comments by registered users of the New York Times.

This digital library began in 2010 with funding provided by the Harvard Berkman Center and the Alfred P. Sloan Foundation. The API provides access to items and collections. Use of the API requires requesting an API key from the Digital Public Library of America.

PLOS is an open access publisher of scientific research focusing on the sciences and medicine. There are two APIs: one works with the terms of the PLOS search function, and the other works with article-level metrics.

Current Online Sources

PubMed Central is a free archive of biomedical and life science literature from the U.S. National Institutes of Health's National Library of Medicine. PubMed Central began in 2000 with only two journals and now holds millions of articles.

Web-Based Tools

This interface works with the Google Books corpus. It allows comparison across time periods, with functions that include frequencies, synonyms, collocates, and speech.

This open access tool focuses on text and network analysis of social media data.

TAPorWare is a web-based analysis tool.  Functions include word frequency, concordance, collocation, and tokenization.
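To make the functions named above more concrete, here is a minimal Python sketch of tokenization, word frequency, and concordance using only the standard library. It is an illustration of the general techniques, not TAPorWare's own implementation; the function names, the sample sentence, and the simple letters-only tokenizer are all assumptions made for this example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(tokens, top_n=5):
    """Return the top_n most common tokens with their counts."""
    return Counter(tokens).most_common(top_n)

def concordance(tokens, keyword, width=3):
    """Return each occurrence of keyword with up to `width` tokens of context per side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

# Hypothetical sample text; replace with your own document.
sample = "The quick brown fox jumps over the lazy dog. The dog barks; the fox runs."
tokens = tokenize(sample)
print(word_frequencies(tokens, top_n=3))
for line in concordance(tokens, "fox"):
    print(line)
```

A real analysis would usually need a more careful tokenizer (for hyphenated words, numbers, and non-English text) and a stop-word list so that words like "the" do not dominate the frequency table.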

This tool was designed to work with the user's own texts, which can be copied and pasted into the software. Initial results include a list of words and their frequencies. This tool can be used to discover how science and technology topics are covered in popular magazines and newspapers.

Desktop Tools

TextSTAT is an analysis program found on the Free University Berlin website.  This program allows users to build a corpus and find word frequencies.  

This program is an add-on for the citation management software Zotero that includes features like topic modeling and word clouds.

App Tools

This app is free for the iOS system. It allows analysis of websites, tweets, and documents. This app does require a connection to a Twitter account.

Programming

This platform is for working with human language data in Python. Functions include word counts, word frequency, collocations, named entity identification, and more. This is an open source project and can be downloaded for free.
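As an illustration of what collocation analysis involves, the sketch below finds adjacent word pairs that recur in a text and ranks them by pointwise mutual information (PMI). It deliberately uses only the Python standard library rather than the platform described above, so it should be read as a stand-in for the general technique and not as that platform's API. The function name, the `min_count` threshold, and the sample text are assumptions made for this example.

```python
import math
import re
from collections import Counter

def bigram_collocations(text, min_count=2):
    """Rank adjacent word pairs seen at least min_count times by PMI, highest first."""
    tokens = re.findall(r"[a-z]+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_tokens = len(tokens)
    total_bigrams = max(total_tokens - 1, 1)

    scored = []
    for (w1, w2), n in bigrams.items():
        if n < min_count:
            continue
        # PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )
        p_pair = n / total_bigrams
        p_w1 = unigrams[w1] / total_tokens
        p_w2 = unigrams[w2] / total_tokens
        scored.append(((w1, w2), n, math.log2(p_pair / (p_w1 * p_w2))))
    return sorted(scored, key=lambda item: item[2], reverse=True)

# Hypothetical sample text; replace with your own corpus.
sample = ("Text mining and data mining are related. "
          "Text mining works on documents; data mining works on tables.")
for pair, count, pmi in bigram_collocations(sample):
    print(pair, count, round(pmi, 2))
```

PMI rewards pairs that occur together more often than their individual frequencies would predict. On small texts it is unstable, which is why the sketch filters out pairs seen fewer than `min_count` times.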

This is open source software for topic modeling. This program allows users to gain insights on topic trends.

This open source programming language can be used for data manipulation, calculations, and graphing. It can be adapted to fit a user's needs. It is compatible with platforms like Linux, Windows, and Mac OS.

Visualization

This is a word cloud generating web site. Users can customize fonts, layouts and color schemes of the word cloud.  Wordle uses the Java browser plug-in.  It does require Java to be installed and the browser to be configured to use it.

This website is used for creating graphs and charts. There are over 30 different chart types to choose from. The data can be entered into the website or imported from various places.

 

Text Mining Library Databases and Journals

These library resources allow users to data mine or text mine their materials.  If you have questions about mining these resources, please contact the Research and Scholarly Communication Department at scholarlycomm@ecu.edu.

 

  • Brill Online Content
  • EBL E-Books
  • SAGE Publications
  • University of California Press

Free Text Analysis Tools