Text Data Mining (TDM)

Data mining uses computer software to analyze data sets systematically and identify patterns. Text mining applies the same kind of systematic analysis to textual documents. It is increasingly used across academic fields, as well as in commerce and business, to streamline processes.

This research guide is intended to help with identifying public and licensed text corpora for text data mining as well as tools for textual analysis.

At the Leddy Library, text and data mining can be conducted using APIs provided either by open access providers or by the publishers of content to which the Library subscribes. The Library is actively negotiating text mining rights with database vendors, but unauthorized TDM can violate our licenses and result in a loss of access to the material for all Leddy Library users. For help with using APIs, or to determine which materials in the Library’s collection are available for data mining beyond what is listed below, please contact us at XXXX.

Textual data from Leddy's licensed collections

APIs for Scholarly Resources

An API, or Application Programming Interface, is a tool used to share content and data between computers. Researchers can use software to communicate with an API provided by a publisher or content provider to collect, extract, and analyze data and text in a variety of formats. Some publishers limit how much content each search returns, and some require registration using your university email. Below is a list of APIs available for the licensed scholarly resources subscribed to by the Leddy Library, with information on what content is available, the resulting format, and whether registration is required. For a list of freely available API products, please consult MIT Library’s Resources and Tools for Computational Research page. If you have any questions or would like to suggest a missing resource for this list, please contact us at XXXX.

Adam Matthew Digital

What does it do?: The API can be used to return metadata and full-text from documents, images and sections for Adam Matthew Collections.  
Access: RESTful interface
Format: JSON
Registration: An API key is required, and requests are reviewed on a case-by-case basis for each researcher. Contact us at xxx to receive the Adam Matthew Text and Data Mining Information and Permission Request form. Once it is complete, return it to us and we will submit the form on your behalf to authenticate your affiliation with the University of Windsor.
Limitations: 180 requests per 15 minutes per collection.
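
A script that loops over many documents can exceed the limit above without intending to. The following is a minimal Python sketch of a sliding-window limiter sized to 180 requests per 15 minutes. The class name and its parameters are illustrative and are not part of the Adam Matthew API; because the limit applies per collection, the sketch assumes you create one limiter for each collection you query.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_requests calls within any window_seconds span."""

    def __init__(self, max_requests=180, window_seconds=15 * 60,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock      # injectable so the class can be tested
        self._sleep = sleep
        self._sent = deque()     # timestamps of recent requests, oldest first

    def wait(self):
        """Block until another request is allowed, then record it."""
        now = self._clock()
        # Forget requests that have already left the window.
        while self._sent and now - self._sent[0] >= self.window_seconds:
            self._sent.popleft()
        if len(self._sent) >= self.max_requests:
            # Sleep until the oldest recorded request leaves the window.
            self._sleep(self.window_seconds - (now - self._sent[0]))
            now = self._clock()
            self._sent.popleft()
        self._sent.append(now)
```

Call wait() immediately before each request; the call returns at once while you are under the limit and pauses only when the window is full.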

Constellate 

What does it do?: Not a true API but a text and data analytics service that queries the JSTOR, Project MUSE, and SAGE databases. It can be used for learning and performing text analysis, building datasets, and exporting large amounts of data for TDM purposes.
Access: Via the Constellate platform rather than a programmatic interface
Format: JSON 
Registration:  Required, must use/create a JSTOR account. Free but with restrictions. Form available here: https://docs.google.com/forms/d/e/1FAIpQLSfSfrJCMpp6MZaHE7vPxmDafQHLNKmBCYCwqGvPaN-aeqhe-Q/viewform?pli=1  
Limitations: 25,000 items per dataset

CrossRef

What does it do?: Access to the metadata records for over 75 million scholarly works with CrossRef DOIs (Digital Object Identifiers) through various databases that contribute their data and make it available for scholarly use.
Access: REST API, XML API, OAI-PMH, OpenURL 
Format: JSON, Text, XML 
Registration: Not required. However, if you use HTTPS and include appropriate contact information in your query, you will be directed to a special pool of machines reserved for “polite” users.  
Limitations: None stated, but CrossRef reserves the right to impose limits and/or block clients that disrupt the public service.
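
As an illustration of including contact information in a query, the sketch below builds an HTTPS request URL for the public REST API. It assumes the api.crossref.org works endpoint and its query, rows, and mailto parameters as described in CrossRef's public documentation; the function names and the example email address are placeholders.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def crossref_query_url(query, mailto, rows=20):
    """Build an HTTPS works query that identifies the requester by email."""
    params = {"query": query, "rows": rows, "mailto": mailto}
    return f"{CROSSREF_WORKS}?{urlencode(params)}"


def fetch_works(query, mailto, rows=20):
    """Send the query and return the decoded JSON response as a dict."""
    with urlopen(crossref_query_url(query, mailto, rows), timeout=30) as resp:
        return json.load(resp)
```

For example, crossref_query_url("text mining", "you@example.edu", rows=5) produces a URL that can be pasted into a browser to inspect the JSON response before any code is written around it.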

Elsevier

What does it do?: Access to metadata from scholarly journals indexed by Scopus, full-text journals and books published by Elsevier on the ScienceDirect platform, research metrics on SciVal, and engineering resources from Engineering Village. 
Access: HTTP requests using structured queries. 
Format: JSON, XML 
Registration: Required, but free. Register for an API key using the Elsevier Developer Portal. Once registered, request an institutional token from apisupport@elsevier.com using your university email address, and include your API key in the request. Full access is only available to researchers affiliated with organizations that have subscriptions to Elsevier products.
Limitations: 200 results per query, API key valid for 6 months.
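
Because each query returns at most 200 results, larger result sets have to be collected in pages. The sketch below shows one way to plan those pages and to attach both credentials to a request. It assumes the start and count paging parameters and the X-ELS-APIKey and X-ELS-Insttoken headers described in Elsevier's developer documentation; the function names are illustrative, and the details should be checked against the documentation for the specific API you use.

```python
def page_windows(total_wanted, page_size=200):
    """Yield (start, count) pairs that together cover total_wanted results."""
    for start in range(0, total_wanted, page_size):
        yield start, min(page_size, total_wanted - start)


def elsevier_headers(api_key, inst_token):
    """Return request headers carrying the API key and institutional token."""
    return {
        "X-ELS-APIKey": api_key,
        "X-ELS-Insttoken": inst_token,
        "Accept": "application/json",
    }
```

Each (start, count) pair would be sent as the paging parameters of one request, so collecting 450 results takes three requests.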

Érudit

What does it do?: Access to the full text of Érudit’s collection of articles from over 200 scholarly journals and 39 cultural journals in the humanities and social sciences and the arts and letters from Québec and Canada.
Format: PDF, XML, JPG, PNG
Registration: Required, but free for teaching or non-commercial scientific research purposes. Contact Érudit at corpus@erudit.org to request access. They will email you a Project Description Form to complete and send back. They will review your request and, if it is approved, require a completed Research Corpora License Agreement before providing access.
Limitations: None listed.

IEEE Xplore

What does it do?: Query and retrieve metadata records including abstracts for more than 5 million documents in IEEE Xplore including journals, conference proceedings, books, courses, and technical standards. The Open Access API queries articles designated open access and the Digital Object Identifier (DOI) API queries up to 25 DOI numbers to retrieve metadata records including abstracts.  
Access: HTTP requests using structured URL queries.
Format: JSON, XML 
Registration: Required, must be affiliated with an institution that has an eligible subscription. Once registered, you will get an API key that must be used with every query.  
Limitations: A maximum of 200 results may be retrieved in a single query. A query term can contain at most ten words.
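
The ten-word limit on query terms and the 25-DOI limit of the DOI API can both be checked before a request is sent. The following is a small Python sketch of such checks; the function names are illustrative and are not part of the IEEE Xplore API.

```python
def check_query_term(term, max_words=10):
    """Return the term with whitespace normalized, or raise if it is too long."""
    words = term.split()
    if len(words) > max_words:
        raise ValueError(
            f"query term has {len(words)} words; the limit is {max_words}"
        )
    return " ".join(words)


def doi_batches(dois, batch_size=25):
    """Split a list of DOIs into consecutive batches of at most batch_size."""
    return [dois[i:i + batch_size] for i in range(0, len(dois), batch_size)]
```

A list of 60 DOIs, for instance, would be split into batches of 25, 25, and 10, each suitable for one DOI API request.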

Scholars Portal Journals

Subscribed databases available for querying:
- Scholars Portal Journals 
- Scholars Portal Books 
- Taylor & Francis 
What does it do?: Gives programmatic access to the metadata and full text of over 65 million journal articles in the Scholars Portal Journals collection. Articles are licensed from a variety of vendors, with major sources including Springer Nature, Taylor & Francis, and Wiley.
Access: RESTful interface; queries are made as HTTP GET requests. Sample Python scripts for harvesting or generating a corpus are available.
Format: JSONL 
Registration: Not required. Access is restricted to Ontario university IP addresses, so you must be on campus or using a university VPN.  
Limitations: Only articles licensed by your university are accessible.
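
JSONL (JSON Lines) output stores one JSON record per line, so results can be read one record at a time without loading a whole file into memory. The sketch below is a generic reader and is not one of the sample scripts mentioned above; the "title" field in the usage note is a hypothetical field name used only for illustration.

```python
import json


def read_jsonl(lines):
    """Yield one parsed record per non-empty line of JSONL input."""
    for number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        try:
            yield json.loads(line)
        except json.JSONDecodeError as err:
            raise ValueError(f"line {number} is not valid JSON: {err}") from err
```

With a saved response file, this could be used as: with open("results.jsonl", encoding="utf-8") as f: titles = [record.get("title") for record in read_jsonl(f)].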

Springer Nature API Portal

What does it do?: Meta API and Metadata API provide metadata for over 14 million online documents including journal articles, book chapters, and protocols. Open Access API provides metadata and full-text content where available for more than 649,000 online documents from Springer Nature open access XML.  
Access: RESTful interface, using structured URL requests 
Format: XML, JSON 
Registration: Required, but free; an API key is needed. If you are affiliated with a subscribing institution, results can include the full text of subscribed materials.
Limitations: A single query returns at most 100 results for metadata queries and 20 for open access queries.
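
Since the two kinds of query have different ceilings, a script can clamp the requested page size before building a request. The sketch below assumes the q (query), s (start position), p (page size), and api_key parameter names used in Springer Nature's API documentation; treat these names and the function itself as assumptions to verify on the API Portal.

```python
# Page-size ceilings taken from the Limitations line above.
MAX_PAGE_SIZE = {"meta": 100, "openaccess": 20}


def springer_params(query, api_key, kind="meta", start=1, page_size=None):
    """Return query parameters with the page size clamped to the API's ceiling."""
    if kind not in MAX_PAGE_SIZE:
        raise ValueError(f"unknown API kind: {kind!r}")
    limit = MAX_PAGE_SIZE[kind]
    # Default to the largest allowed page; never exceed it or drop below 1.
    size = limit if page_size is None else max(1, min(page_size, limit))
    return {"q": query, "s": start, "p": size, "api_key": api_key}
```

Requesting 50 results from the open access API, for example, yields a page size of 20, so the remaining results would be fetched by increasing the start position on later requests.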

Web of Science API LITE

What does it do?: Supports simple searching across the Web of Science to retrieve core item-level metadata. Primarily for populating an institutional repository.    
Access: SOAP protocol 
Format: JSON, XML 
Registration: Free but required, must be part of a subscribing institution.  
Limitations: 2 requests per second.

Library resources & training

Library Resources

Bramer, M.A. 2013. Principles of data mining. London: Springer. https://uwindsor.primo.exlibrisgroup.com/permalink/01UTON_UW/sgtkuc/alma99868221002181.

Lamba, M. & Madhusudhan, M. 2022. Text Mining for Information Professionals: An Uncharted Territory. Cham: Springer International Publishing. https://uwindsor.primo.exlibrisgroup.com/permalink/01UTON_UW/sgtkuc/alma991057071302181.

Ignatow, G. & Mihalcea, R. 2018. An introduction to text mining: Research design, collection, and analysis. SAGE Publications. https://uwindsor.primo.exlibrisgroup.com/permalink/01UTON_UW/sgtkuc/alma991256433902181.

Ignatow, G. & Mihalcea, R. 2017. Text mining: A guidebook for the social sciences. SAGE Publications. https://uwindsor.primo.exlibrisgroup.com/permalink/01UTON_UW/sgtkuc/alma991256717302181.

Contributors

Annie Kavanagh

Berenica Vejvoda

Roger Reka

Dave Johnston
