Answered By: Bobray Bordelon Last Updated: Dec 05, 2024
First see the Text Mining Guide. Scraping is often not permitted.
- RavenPack News Analytics: used for financial and economic analysis.
- Voxgov: Provides access to real-time documents, publications, press releases, and social media posts from all branches, offices, agencies, and elected officials of the U.S. Federal Government. Extensive search and filtering options; most content is from 2000-present.
- ProQuest TDM Studio. Access to many historical newspapers and other content from ProQuest.
- Wall Street Journal (1889-1933) XML files
- United States Congress
- Detroit Free Press (1831 - 1999), Philadelphia Inquirer (1860 - 2001), HNP Pittsburgh Post Gazette (1786 - 2003) (ProQuest XML files)
- (ProQuest XML files: Ethnic Newspapers): Atlanta Daily World (1931-2003), Baltimore Afro-American (1893-1988), Chicago Defender (1910-1975), Cleveland Call & Post (1934-1991), Los Angeles Sentinel (1934-2005), New York Amsterdam News (1922-1993), Norfolk Journal & Guide (1921-2003), Philadelphia Tribune (1912-2001), and Pittsburgh Courier (1911-2002).
- Princeton also has XML files for Louisville Courier (1830-2000), Minneapolis Star Tribune (1867-2001), and St. Louis Post Dispatch (1874-2003) on a hard drive. For access, please email Bobray Bordelon.
- The New York Times Annotated Corpus. Contains over 1.8 million articles written and published by the New York Times between January 1, 1987 and June 19, 2007.
- English Gigaword Fifth Edition. Comprehensive archive of newswire text data acquired over several years by the Linguistic Data Consortium (LDC). Includes:
- Agence France-Presse, English Service (afp_eng) (1994-2010)
- Associated Press Worldstream, English Service (apw_eng) (1994-2010)
- Central News Agency of Taiwan, English Service (cna_eng) (1997-2010)
- Los Angeles Times/Washington Post Newswire Service (ltw_eng) (1994-1998, 2003-2009)
- Washington Post/Bloomberg Newswire Service (wpb_eng) (1995-2010)
- New York Times Newswire Service (nyt_eng) (1994-2010)
- Xinhua News Agency, English Service (xin_eng) (2003, 2005, 2007, 2009, 2011)
- Lexis Nexis Web Services Kit: Mediated service for bulk downloads of Nexis Uni content (formerly Lexis Nexis Academic). Bulk downloads are performed with the assistance of engineers from Lexis Nexis, so allow time to initiate contact and to coordinate search time and parameters. Click Access Resource to learn more about using the Web Services Kit and about contacting your subject librarian to initiate contact with Lexis Nexis.
- Constellate. Text analytics service from ITHAKA (JSTOR and Portico). It is a platform for teaching, learning, and performing text analysis using archival repositories of scholarly and primary source content. Note: this product is sunsetting July 1, 2025.
- Newspaper Navigator (Library of Congress - Chronicling America): select pre- and post-WWI coverage, with most coverage from 1900-1925. Also see https://huggingface.co/datasets/dell-research-harvard/AmericanStories, a collection of full article texts extracted from historical U.S. newspaper images. It includes nearly 20 million scans from the public domain Chronicling America collection maintained by the Library of Congress.
- North American News Text, Complete. Includes:
- Los Angeles Times & Washington Post May 1994-August 1997
- New York Times News & Syndicate July 1994-December 1996
- Reuters News Service (General & Financial) April 1994-December 1996
- Wall Street Journal (not in General Release) July 1994-December 1996
- News API: API service that allows querying online news sources from the past month, including major publications such as the New York Times, ABC News, and Al Jazeera. Register for a free API key to get started.
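As a starting point once you have a key, the sketch below shows one way to query News API from Python using only the standard library. It assumes the service's `v2/everything` endpoint and its `q`, `from`, `to`, `pageSize`, and `apiKey` parameters; the function names (`build_newsapi_url`, `fetch_articles`) and the User-Agent string are hypothetical choices for illustration, not part of the service. Check the provider's documentation and terms of use before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for keyword search across indexed sources.
NEWSAPI_EVERYTHING = "https://newsapi.org/v2/everything"


def build_newsapi_url(query, api_key, from_date=None, to_date=None, page_size=20):
    """Build a request URL; dates are ISO strings such as '2024-11-01'."""
    params = {"q": query, "pageSize": page_size, "apiKey": api_key}
    if from_date:
        params["from"] = from_date
    if to_date:
        params["to"] = to_date
    return NEWSAPI_EVERYTHING + "?" + urllib.parse.urlencode(params)


def fetch_articles(query, api_key, **kwargs):
    """Return the list under the 'articles' key of the JSON response."""
    url = build_newsapi_url(query, api_key, **kwargs)
    # Hypothetical User-Agent value; the service may reject requests without one.
    request = urllib.request.Request(url, headers={"User-Agent": "news-text-demo/0.1"})
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    return payload.get("articles", [])


# Example usage (requires a real key and network access):
# for article in fetch_articles("inflation", "YOUR_API_KEY", page_size=5):
#     print(article.get("publishedAt"), article.get("title"))
```

Keeping URL construction separate from the network call makes it easy to inspect the exact request before spending any of a free key's quota.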
- NY Times APIs. The Article Search API provides access to headlines, abstracts, lead paragraphs, and more (but NOT full-text articles) from the New York Times, 1851 to the present.
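Because the Article Search API returns metadata rather than full text, a typical first step is to flatten each result into the fields you need. The sketch below assumes the `articlesearch.json` endpoint, its `q`, `begin_date`, `end_date` (YYYYMMDD), `page`, and `api-key` parameters, and a response shaped as `response.docs[]`; the helper names (`build_nyt_url`, `summarize_docs`) are hypothetical, and the API's own documentation is the authority on field names and rate limits.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint for the Article Search API.
NYT_SEARCH = "https://api.nytimes.com/svc/search/v2/articlesearch.json"


def build_nyt_url(query, api_key, begin_date=None, end_date=None, page=0):
    """Build a search URL; dates use the YYYYMMDD form, e.g. '19870101'."""
    params = {"q": query, "page": page, "api-key": api_key}
    if begin_date:
        params["begin_date"] = begin_date
    if end_date:
        params["end_date"] = end_date
    return NYT_SEARCH + "?" + urllib.parse.urlencode(params)


def summarize_docs(payload):
    """Flatten each result into headline, abstract, lead paragraph, date, URL."""
    docs = (payload.get("response") or {}).get("docs") or []
    rows = []
    for doc in docs:
        rows.append({
            "headline": (doc.get("headline") or {}).get("main"),
            "abstract": doc.get("abstract"),
            "lead_paragraph": doc.get("lead_paragraph"),
            "pub_date": doc.get("pub_date"),
            "web_url": doc.get("web_url"),
        })
    return rows


def search(query, api_key, **kwargs):
    """Fetch one page of results and return the flattened rows."""
    with urllib.request.urlopen(build_nyt_url(query, api_key, **kwargs), timeout=30) as response:
        return summarize_docs(json.load(response))


# Example usage (requires a real key and network access):
# for row in search("steel strike", "YOUR_API_KEY", begin_date="19590101", end_date="19591231"):
#     print(row["pub_date"], row["headline"])
```

Results are paginated, so collecting a larger set means looping over `page` and pausing between requests to stay within the service's limits.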
- Integrum World Wide. Digital archive of the most influential Russian information sources, along with a range of analytical services for monitoring mass media and social networks.
- Newswire. Contains 2.7 million unique public domain U.S. news wire articles, written between 1878 and 1977. Locations in these articles are georeferenced, topics are tagged using customized neural topic classification, named entities are recognized, and individuals are disambiguated to Wikipedia using a novel entity disambiguation model.
- Stanford Cable TV News Analyzer. Includes near-24/7 recordings of CNN, Fox News, and MSNBC from January 1, 2010 onward. The dataset updates daily, with a lag of approximately 24-36 hours from the original content's air date. In total, the dataset consists of over 370,000 hours of video and includes both TV news programming and commercial segments.