I recently acquired an Nvidia GTX 980 graphics card, so naturally I’m looking for fun datasets to explore and small projects involving small-to-medium-scale data mining. In my exploration I’ve come across a few that seem promising for personal perusal:

  1. Internet Census 2012 - In 2012 a lone researcher took advantage of insecure embedded devices (e.g. routers) to perform a scan of the entire internet. This researcher’s paper is well worth a read. The methodology is really interesting and the data is quite beautiful. There seems to be a lot to explore with this dataset.

  2. Reddit Public Comments 2007 October - 2015 May - Working in abuse, I think this is an amazing dataset for building for-fun spam / trolly-score / general comment quality classifiers. It could also be interesting to apply a variety of out-of-the-box algos to this set; it definitely makes me feel creative & is something I’d like to explore!

  3. Common Crawl - Apparently you can look at entire crawls of the web! Cool to think anyone with a little money and some free time could have a decent search engine (…crawling infrastructure is another thing).

  4. ImageNet - Obviously ImageNet, but I never realized you had to go download all the images yourself. Damn copyright restrictions! Sometimes the lack of industrial-quality datasets in open source is a little depressing.

I think I will keep adding corpora I’m interested in to this post as time goes on.
