—The best-known source of datasets for
machine learning is the University of California at Irvine. We used fewer
than 10 data sets in this book, but there are more than 200 datasets in this repository.
Many of these datasets are used to compare the performance of algorithms
so that researchers can have an objective comparison of performance.
—If you’re a big data cowboy, then
this is the link for you. Amazon has some really big datasets, including the
U.S. census data, the annotated human genome data, a 150 GB log of Wikipedia’s
page traffic, and a 500 GB database of Wikipedia’s link data.
—Data.gov is a website launched in 2009 to increase the
public’s access to government datasets. The site was intended to make all
government data public as long as the data was not private or restricted for
security reasons. In 2010, the site had over 250,000 datasets. It’s uncertain
how long the site will remain active. In 2011, the federal government
reduced funding for the Electronic Government Fund, which pays for
Data.gov. The datasets range from products recalled to a list of failed banks.
—Data.gov has a list of U.S. states, cities,
and countries that hold similar open data sites.
—Infochimps is a company that aims to give
everyone access to every dataset in the world. Currently, they have more
than 14,000 datasets available to download. Unlike other listed sites, some
of the datasets on Infochimps are for sale. You can sell your own datasets
here as well.
refer:《Machine Learning in Action.pdf》