Data for Computer Scientists to play with
Any machine-readable data can be used as fodder for computer programming. This page merely attempts to list some datasets of particular interest to computer science researchers: data on computer hardware performance and reliability, data on networks, datasets developed to investigate problems in computer science and artificial intelligence, and so on. Some of these datasets may also be of interest to researchers in mathematics.
Note that some of these datasets come in non-standard / custom formats and ADC staff may not always be able to provide assistance, though we are always willing to try.
Internet / Network Traffic Data
The Internet Traffic Archive
Repository of traces of Internet network traffic
Web Service Reliability and Quality of Service datasets
GPL-licensed web service datasets from Zibin Zheng's PhD research
Knowledge Discovery Lab: Can-o-sleep database
Data on MP3 shares and transfers collected from a campus P2P file-sharing network (the can-o-sleep)
Knowledge Discovery Lab: Mobile social networks
Data taken from a series of experiments in wireless mobile connections
Social Network Analysis
University of California - Irvine: Network Data Repository
Interpersonal network datasets (fraternity students, southern women, etc.), ham radio interactions, an email communication network derived from the USC Enron data, and several others.
Data from the International Conference on Weblogs and Social Media (ICWSM): 2013, 2012, 2011.
Free after signing a license agreement. Includes data on Facebook, Twitter, wiki edits, weblog comments, etc.
The Facebook100 Data Set
A snapshot, taken in September 2005, showing the complete set of people and friendships from the Facebook networks of 100 different colleges and universities. The earlier Facebook5 (5 colleges/universities) is also available.
The Twitter Project Page at MPI-SWS
An anonymized topology of the Twitter social network as of August 2009.
Computer and Component Operations and Failure Data
Los Alamos National Laboratory: Computer Science Research Data
Filesystem statistics data, workloads, traces and other operational data.
Delft University of Technology: Grid Workloads Archive
Anonymized workload traces from grid environments.
The USENIX Computer Failure Data Repository
Detailed component failure data from a variety of systems.
University of Western Sydney: Failure Trace Archive
Repository of availability traces of parallel and distributed systems.
ACM KDD Cup Data
Archive of ACM's annual Data Mining and Knowledge Discovery competition.
Causality Workbench: Challenges in Machine Learning Data Repository
Data from another annual challenge to test machine learning and causal discovery algorithms.