If you’re interested in small-to-medium-scale data mining, here are some interesting datasets you might not know about:

  1. Internet Census 2012 - In 2012 a lone researcher took advantage of insecure embedded devices (e.g. routers) to perform a scan of the entire internet. The researcher’s paper is well worth a read: the methodology is really interesting and the data is quite beautiful. There seems to be a lot to explore in this dataset.

  2. Reddit Public Comments 2007 October - 2015 May - Working in abuse, I think this is an amazing dataset for building for-fun spam / trolly-score / general comment quality classifiers. It could also be interesting to run a variety of out-of-the-box algorithms on this set.

  3. Common Crawl - You can look at entire crawls of the web! It’s cool to think that anyone with a little money and some free time could have a decent search engine (…crawling infrastructure is another thing; if you do try to index the web, be ready for the amount of porn your crawler will find).
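
As a starting point for poking at the Reddit comments, here is a minimal sketch of a first pass over the data. It assumes the dump is one JSON object per line and that each comment has `subreddit` and `score` fields; those field names, the `summarize_comments` helper, and the sample rows are my assumptions for illustration, so check them against the real files before relying on this.

```python
import json
from collections import defaultdict


def summarize_comments(lines):
    """Return {subreddit: (comment_count, mean_score)} from JSON-lines input.

    Assumes one JSON object per line with "subreddit" and "score" keys
    (hypothetical field names; verify against the actual dump).
    """
    totals = defaultdict(lambda: [0, 0])  # subreddit -> [count, score_sum]
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            comment = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed rows rather than abort a long scan
        if not isinstance(comment, dict):
            continue
        subreddit = comment.get("subreddit")
        score = comment.get("score")
        if subreddit is None or not isinstance(score, int):
            continue
        totals[subreddit][0] += 1
        totals[subreddit][1] += score
    return {sub: (count, total / count) for sub, (count, total) in totals.items()}


# Made-up sample rows standing in for lines read from a dump file.
sample = [
    '{"subreddit": "python", "score": 3, "body": "nice"}',
    '{"subreddit": "python", "score": 1, "body": "meh"}',
    '{"subreddit": "news", "score": -2, "body": "spam spam"}',
    "not json",
]
summary = summarize_comments(sample)
```

Because the function takes any iterable of lines, you could pass it an open file handle and stream through a large dump without loading it all into memory. Per-subreddit counts and mean scores are only a sanity check, but they are a cheap way to get a feel for the data before trying any classifier.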