Monday, October 10, 2011

Scrapy, Google Scholar and MongoDB

I have been deeply involved in text mining problems lately. If you watched my presentation at EuroSciPy this year (http://www.slideshare.net/fccoelho/mining-legal-texts-with-python) you have an idea of some of the things I am up to.

Well, part of the problem of text mining is getting hold of the text you want to analyze in the first place. For many projects I am involved with, I already have mountains of text to analyze. For some, however, I have to go get it. On the web...

This requires a technique (or should I say an art?) known as web scraping. Of course, there are tons of sites which are happy to provide you with an API for all your data-feeding needs. Some "evil" sites, however, unnecessarily withhold even public data. To be explicit, I am talking about Google here, which, despite a massive number of requests, has yet to provide an API for the consumption of their database of scientific literature. This database powers one of their search services called Google Scholar, which you are not likely to have heard about unless you are a researcher or a graduate student.

I am pretty sure that this database was built by scraping other literature-indexing sites and publishers' websites. None of the information in it is subject to copyright, since it is composed mainly of references to articles, books and similar resources hosted elsewhere on the web. Nevertheless, they insist on making the life of their fellow scrapers harder than it needs to be. But enough ranting about Google's evil policies.

My research involves analyzing the evolution of the scientific literature on a given subject, and to achieve this goal, I need to download thousands of articles and analyze their content by means of natural language processing techniques (NLTK).
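
To give a flavour of what that analysis involves, here is a minimal NLTK sketch of the kind of pass one can run over a downloaded abstract; the example text is made up, and the real pipeline does considerably more than counting words:

import nltk
from nltk.corpus import stopwords

def term_frequencies(text):
    """Tokenize a text and count its content words."""
    tokens = nltk.word_tokenize(text.lower())
    stop = set(stopwords.words('english'))
    words = [t for t in tokens if t.isalpha() and t not in stop]
    return nltk.FreqDist(words)

# toy input; in practice this would be the abstract or full text of an article
abstract = "Mathematical models of epidemics describe how infectious diseases spread in a population."
print(term_frequencies(abstract).most_common(5))

(The punkt and stopwords corpora need to be fetched with nltk.download() before this will run.)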

So I wrote a Google Scholar scraping tool (https://bitbucket.org/fccoelho/scholarscrap - pun intended) which is available to others facing the same challenge. It is a challenge indeed, given the number of stealth techniques necessary to avoid early detection by Google's bot watchdogs. My crawler is still being detected, meaning I can only make a limited number of requests each day before being banned. I welcome new ideas to further evade detection.
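
For those curious about the plumbing, here is a rough sketch - not the actual scholarscrap code - of the two pieces a Scrapy project like this needs besides the spider itself: an item pipeline that stores each scraped reference in MongoDB, and settings that throttle the crawl so it looks less bot-like. The database and collection names, delay values and pipeline path below are illustrative choices of mine, not taken from the repository, and the calls assume reasonably recent Scrapy and pymongo versions:

import pymongo

class MongoPipeline(object):
    """Store every scraped bibliographic item in a MongoDB collection."""

    def __init__(self):
        # assumes a local MongoDB server and a recent pymongo
        client = pymongo.MongoClient("localhost", 27017)
        self.collection = client["scholar_db"]["articles"]

    def process_item(self, item, spider):
        # items arrive as dict-like objects with fields such as title, authors, year
        self.collection.insert_one(dict(item))
        return item

# settings.py fragments (values are illustrative):
# DOWNLOAD_DELAY = 10               # seconds to wait between requests
# RANDOMIZE_DOWNLOAD_DELAY = True   # add jitter so requests are not perfectly periodic
# CONCURRENT_REQUESTS = 1           # one request at a time
# USER_AGENT = "Mozilla/5.0 ..."    # a browser-like user agent string
# ITEM_PIPELINES = {'myproject.pipelines.MongoPipeline': 300}

Slowing down and randomizing the request rate is no silver bullet - as I said, the crawler still gets spotted eventually - but it does stretch the number of requests I get through each day before the ban kicks in.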

I wish Google would give something back to us researchers after making so much money from our work.







