Contributed by Nathan Stevens. He enrolled in the NYC Data Science Academy 12-week full time Data Science Bootcamp program taking place between September 23, 2016 and December 23, 2016. The original article can be found here.
There a numerous use cases for having a searchable database of active STEM (Science Technology Engineering and Math) researchers. For example, targeted marketing by companies interested in selling services and products, or helping students select the best research institution and mentors based on their interest. Further, it can to provide an additional means of getting an overview of active research areas. Unfortunately, no so such database is readily available even though most of the needed information is readily on various websites i.e. faculty profiles, journal sites, etc.
An obvious approach for creating this database is of course manually searching through the above mentioned websites. However, this would be very time consuming. A more efficient method would be to use web scraping to automate the data extraction process. Unfortunately, a small scale test indicated that there is just too much format variation in the relevant webpages to make web scraping practical. For example, even within departments at the same institution, there is often different formats used for faculty profiles. Moreover, since these profiles are infrequently updated, the information they contain is often out of date.
To overcomes the challenges listed above, a much better approach is to Follow The Money by scraping grant award information from research funding agencies. Not only is award data provided in a consistent format, more importantly, such information provides a direct measure of a researcher's activity level. For this project, the award page of the NSF was scraped. The process, using the Scrapy Framework, is outlined in the figure below.
Briefly, a web spider visits links on the wards page, downloading zip files containing award information which is stored in XML files. After unzipping, relevant information is extracted by the processing pipeline then stored in a MongoDB. Use of MongoDB allows for a scalable, fulltext searching to be done.
As mentioned above, one use case for such a database is direct marketing. Imagine if you will, a small company, Acme AFMs, who developed a new instrument and want to know if it’s worth going into full-scale production. The first thing they do is establish there is active demand by searching for the grants which are related to AFMs for the last sixteen years. After establishing there is amply demand based on the number of grants awarded, the next step is to get a list of the most active faculty/institutions who use AFMs. Again, this information that's easily obtained from the database. The figures below provides a visualization of these search results.
Web scrapping of grant awards information to create a searchable database of active STEM researchers was successfully done using the Scrapy Framework and MongoDB. Test use cases clearly show the value of this system and its potential. The next steps for this project are to: