This page contains links to data sets available online. It has been copied from the Fall 2011 edition of the Stanford class CS448B.
Public Data Repositories
In recent years, a number of web sites hosting public data repositories have been created. The available data sets include both user-generated content and official data from various organizations.
NationMaster and StateMaster statistics repositories
The Sunlight Foundation maintains a list of resources for political transparency.
NIST (National Institute of Standards and Technology) Scientific and Technical Databases
Statistical Science Data Sets - Large index of data sets from fully processed to raw.
LexisNexis Statistical Universe - Just about everything. Be sure to check the box that says "Limit to Documents with Excel Spreadsheets".
The Journalists Database of Databases - A good collection of interesting data, mostly government, social, and economic.
Fathom Data Sets - Various nice data sets meant for use with the visualization program Fathom.
Computer Security
Computer Network Traffic Data - A ~500K CSV file summarizing real network traffic data from the past. The dataset has ~21K rows and covers 10 local workstation IPs over a three-month period. Half of these local IPs were compromised at some point during this period and became members of various botnets. Can you discover when a compromise occurred from a change in the pattern of communication?
- Each row consists of four columns:
- date: yyyy-mm-dd (from 2006-07-01 through 2006-09-30)
- l_ipn: local IP (coded as an integer from 0-9)
- r_asn: remote ASN (an integer which identifies the remote ISP)
- f: flows (count of connections for that day)
- Reports of "odd" activity or suspicions about a machine's behavior triggered investigations on the following days (although the machine might have been compromised earlier):
- Date : IP
- 08-24 : 1
- 09-04 : 5
- 09-18 : 4
- 09-26 : 3 6
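One way to start on the question above is to total the flows per local IP per day and flag days that jump well above the recent baseline. The following is a minimal sketch, not a definitive detector. It assumes the CSV has a header row using the column names listed above (date, l_ipn, r_asn, f); the function names, the 7-day window, and the 5x threshold are illustrative choices, not part of the dataset description.

```python
import csv
import io
from collections import defaultdict
from statistics import median


def daily_flows(rows):
    """Sum flows per (local IP, date) across all remote ASNs."""
    totals = defaultdict(int)
    for row in rows:
        totals[(int(row["l_ipn"]), row["date"])] += int(row["f"])
    return totals


def flag_spikes(totals, window=7, factor=5.0):
    """Return {ip: [dates]} where a day's total flows exceed
    `factor` times the median of the previous `window` observed days."""
    by_ip = defaultdict(list)
    # yyyy-mm-dd strings sort chronologically, so a plain sort is enough.
    for (ip, date), flows in sorted(totals.items()):
        by_ip[ip].append((date, flows))
    flagged = defaultdict(list)
    for ip, series in by_ip.items():
        for i in range(window, len(series)):
            baseline = median(f for _, f in series[i - window:i])
            date, flows = series[i]
            if baseline > 0 and flows > factor * baseline:
                flagged[ip].append(date)
    return dict(flagged)


if __name__ == "__main__":
    # Tiny made-up sample in the assumed layout; replace with the real file,
    # e.g. csv.DictReader(open("traffic.csv", newline="")) (hypothetical name).
    sample = "date,l_ipn,r_asn,f\n" + "".join(
        f"2006-07-{day:02d},1,100,10\n" for day in range(1, 8)
    ) + "2006-07-08,1,100,60\n2006-07-08,1,200,40\n"
    rows = csv.DictReader(io.StringIO(sample))
    print(flag_spikes(daily_flows(rows)))
```

Days on which an IP has no rows are simply absent from its series, so the window counts observed days rather than calendar days. Flagged dates can then be compared against the investigation dates listed above. A change in which remote ASNs a machine talks to may be as informative as a change in volume, which this sketch does not look at.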
Agriculture, Food and Nutrition
World wine statistics - Information on worldwide wine production and consumption.
USDA food nutrient data - Information about the nutrients contained in a number of different foods and food groups.
USDA PLANTS Database - The PLANTS Database provides standardized information about the vascular plants, mosses, liverworts, hornworts, and lichens of the U.S. and its territories. It includes names, plant symbols, checklists, distributional data, species abstracts, characteristics, images, plant links, references, crop information, and automated tools.
Frequently occurring first and last names - U.S. Census Bureau genealogical data on names.
Popular baby names - Social Security Administration data on distributions of given names.
DHS Yearbook of Immigration Statistics - "The Yearbook of Immigration Statistics is a compendium of tables that provides data on foreign nationals who, during a fiscal year, were granted lawful permanent residence (i.e., admitted as immigrants or became legal permanent residents), were admitted into the United States on a temporary basis (e.g., tourists, students, or workers), applied for asylum or refugee status, or were naturalized. The Yearbook also presents data on immigration law enforcement actions, including alien apprehensions, removals, and prosecutions."
Human Mortality Database - The Human Mortality Database (HMD) was created to provide detailed mortality and population data to researchers, students, journalists, policy analysts, and others interested in the history of human longevity.
National Surveys of 8th Graders
A nationally representative sample of eighth-graders was first surveyed in the spring of 1988. A sample of these respondents was then resurveyed through four follow-ups in 1990, 1992, 1994, and 2000. On the questionnaire, students reported on a range of topics, including school, work, and home experiences; educational resources and support; the role of their parents and peers in education; neighborhood characteristics; educational and occupational aspirations; and other student perceptions.
The .xls file contains 2000 records of students' responses to a variety of questions at different points in time. The codebook explains the question and answer codes.
Bureau of Labor Statistics - From the Department of Labor.
King County department of assessments - Data on housing and properties in King County, Washington state.
Baseball Statistics - The Lahman baseball database, 1871-present.
Google Trends - Track the average worldwide traffic of any search term. Once you get the results, scroll to the bottom of the page and look for "Export this page as a CSV file". You must be logged into Google for the feature to work.
Politics and Government
Florida 2000 Ballot Data
This data set is Florida election data from the CMU Statistical Data Repository. (Note: when downloading these files, be sure to use the correct "save-file" operation for your browser. IE tends to add extra characters that confuse the programs.)
U.S. House of Representatives Roll Call Data
This data set contains roll call data from the 108th House of Representatives: data about 1218 bills introduced in the House and how each of its 439 members voted on each bill. The data covers the years 2003 and 2004. The individual columns are a mix of information about the bills and about the legislators, so there's quite a bit of redundancy in the file for the sake of easier processing in Tableau.
Government Spending Data
Have you ever wanted to find more information on government spending? Have you ever wondered where federal contracting dollars and grant awards go? Or perhaps you would just like to know, as a citizen, what the government is really doing with your money.