How to Scrape Data From Pfizer’s Doctor Payments Records
By David Xia
Update: If you just want to download the data, click here.
I was bored this weekend, so I decided to collect some data from the Internet, aka web scraping.
Not just any data, though. I worked through a fun tutorial written by Dan Nguyen, a news application
developer at the non-profit investigative journalism unit ProPublica. The tutorial
teaches you how to collect the data Pfizer discloses about its payments to doctors, a disclosure required by its
$2.3 billion settlement over allegations that it illegally promoted drugs for unapproved uses.
Why do this if the data’s already publicly available? The settlement, the “largest health care
fraud settlement in U.S. history” according to ProPublica, required Pfizer to disclose its
payments to doctors. But the website Pfizer set up is only good for the most superficial
examination. Sure, you can look up doctors alphabetically and see how much they were paid and why.
But what if you wanted to know whom Pfizer paid the most and why? The data is not
downloadable or easily aggregated for further analysis. To get all the yummy data, you’ll need to
bust out your favorite text editor and apply some elbow grease.
I’ve updated Nguyen’s original Ruby script because Pfizer’s webpages have changed, which broke the
original. I’ve also consolidated the three separate script files into one. ProPublica
scraped data from Pfizer’s 2010 full-year report; as of the date I’m blogging this, the unmodified
script will get data from the company’s 2011 Q1 report. To run the script, you’ll need a computer
with Ruby and the Nokogiri gem installed.
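If you want to make sure your environment is ready first, here’s a tiny sanity check of my own. It only uses the same requires as the scraper; if it prints “hello”, you should be good to go.

# Quick environment check (not part of the scraper itself):
# confirms open-uri and Nokogiri load and can parse a trivial document.
require 'open-uri'
require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML('<html><body><p>hello</p></body></html>')
puts doc.css('p').first.text # should print "hello"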
Objective
Download all of Pfizer’s doctor payment disclosure data by individual doctor names.
Two Steps to Data
1. Copy the scraper script below.
2. Run ruby pfizer_payment_scraper.rb in your command line.
To scrape a different reporting period, simply change the report query parameter in the
BASE_LIST_URL constant:
payments_report.jsp?report=22&enPdNm=All&iPageNo=
report=22 will get you 2010, and report=21 will get you 2009. Be sure to also update
LAST_PAGE_NUMBER to the correct number of pages by clicking “Last” on Pfizer’s website; for 2010,
for example, there are 19,994 pages.
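For example, here’s roughly what the constants near the top of the script would look like for the
2010 full-year report (the 19,994-page count comes from clicking “Last” on Pfizer’s site, so
double-check it before kicking off a long scrape):

# Sketch of the constants for the 2010 full-year report.
BASE_LIST_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/' +
                'payments_report.jsp?report=22&enPdNm=All&iPageNo='
LAST_PAGE_NUMBER = 19994 # per the "Last" link on Pfizer's site for 2010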
Nguyen wrote a related article documenting the technical motivation or “coder’s cause”
behind ProPublica’s Dollars for Docs project and other cool ways to get the delicious
digital data you covet.
This is also my first time using any sort of license. I just thought, “What the hell? It looks
cool. Let’s slap it on.” I’ve read only a bit about the difference between the
GNU General Public License and the MIT license. I went with the GPL because it ensures
derivative works remain free and open source too. Seems like this would prevent people from taking
the scraper and trying to make money off of it. Cuz that would be contrary to the whole point of
making this data freely available, right? Let me know if I picked the wrong one.
## Pfizer doctor payment records web scraper
# Copyright (C) 2011 David Xia david@davidxia.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <http://www.gnu.org/licenses/>.
#
# davidxia.com
#
# We call the necessary Ruby modules that allow us to
# retrieve webpages and parse the HTML.
require 'open-uri'
require 'rubygems'
require 'nokogiri'

# A helper function to convert tabs, carriage returns, and nbsp into spaces.
class String
  def astrip
    self.gsub(/([\302|\240|\s|\n|\t])|(&nbsp;?){1,}/, ' ').strip
  end
end

###############################################################################
# We set constants here for file and directory names.
###############################################################################

# This is the base address of Pfizer's list of payment records.
# Appending the "iPageNo" parameter will get us from page to page.
BASE_LIST_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/' +
                'payments_report.jsp?report=32&enPdNm=All&iPageNo='

# This is the base address of Pfizer's disclosure website.
BASE_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/'

# We got this number from going to the last page on Pfizer's site.
LAST_PAGE_NUMBER = 9551

# Store links to individual doctors in this file.
DOC_URL_FILE = 'doc-pages-urls.txt'

# Since a doctor's name could appear on different pages, we use a temporary
# array to hold all the links as we iterate through each page and then call
# its uniq method to filter out repeated URLs before writing to file.
all_links = []

# The final file to which we save aggregated data.
COMPILED_FILE_NAME = 'all-payments.txt'

###############################################################################
# Loop through every Pfizer record page and extract unique doctor links.
###############################################################################

# Go from page 1 to LAST_PAGE_NUMBER.
for page_number in 1..LAST_PAGE_NUMBER
  begin
    puts "Parsing page #{page_number}"
    n_page = Nokogiri::HTML(open("#{BASE_LIST_URL}#{page_number}"))
    all_links += n_page.css('table#hcpPayments td:nth-child(4) a').map { |link|
      link['href']
    }.select { |href| href.match('hcpDisplayName') != nil }

    # Wait 2 seconds before hitting Pfizer servers so they won't think you're
    # attacking or DOSing their boxes.
    sleep 2
  rescue
    puts "Couldn't get page #{page_number}"
  end
end

File.open(DOC_URL_FILE, 'w') { |f| f.puts all_links.uniq }

###############################################################################
# Now we get the data from the pages of individual doctors and save to a file.
###############################################################################

compiled_file = File.open(COMPILED_FILE_NAME, 'w')

File.open(DOC_URL_FILE).readlines.each do |url|
  doc_name = url.match(/hcpDisplayName=(.+)/)[1]
  begin
    # Strip the trailing newline left by readlines and encode spaces.
    url = url.strip.gsub(' ', '%20')
    puts "Retrieving #{doc_name}"
    html = Nokogiri::HTML(open("#{BASE_URL}#{url}")).css('#hcpPayments')
  rescue Exception => e
    puts "Error trying to retrieve page #{url}"
    puts "Sleeping..."
    sleep 100
    retry # go back to the begin clause
  else
    # All paid entities are in rows in which the first cell is filled.
    # Each entity's associated payments have a blank first cell.
    rows = html.css('tr')
    if rows.length > 0
      entity_paid, city, state = nil

      # Iterate through each row. Skip row 0, the header row.
      rows[1..-1].each do |row|
        # Rows that have a name in the first cell (entity paid)
        # denote distinct entities that were paid.
        cells = row.css('td')
        if !cells[0].text.astrip.empty?
          # We have a new entity, city, and state here.
          entity_paid, city, state = cells[0..2].map { |c| c.text.astrip }
        end

        # The fourth column should always have at least
        # one link that corresponds to the doctor's name.
        doc_link = cells[3].css('a').select { |a|
          a['href'] == "payments_report.jsp?hcpDisplayName=#{doc_name}"
        }

        if doc_link.length == 1
          role = cells[3].css('span').text.sub(':', '')

          # This is a cell that contains a doctor's name and a payment.
          # It should also contain exactly one link that describes
          # the service provided in a tooltip.
          service_link = cells[3].css('a').select { |a|
            a.attributes['onmouseover'] &&
              a.attributes['onmouseover'].value.match(/showTooltip/)
          }
          if service_link.length != 1
            raise "Not exactly one service link for payee #{entity_paid}: #{url}"
          end

          # Now capture the cash or non-cash amounts.
          if cells[4].text.strip != '---'
            amount = cells[4].text
            payment_type = 'cash'
          else
            amount = cells[5].text
            payment_type = 'non-cash'
          end

          # Write this row to the file.
          compiled_file.puts([doc_link[0].text, entity_paid, city, state,
                              service_link[0].text, role, amount,
                              payment_type].map { |t| t.astrip }.join('; '))
        else
          ## This means that none or more than one doctor's name was found.
          ## So the cell was either blank or contained an unexpected name.
          ## You should write some test conditions here and log the records
          ## that don't fit your assumptions about the data.
        end # end of if doc_link.length == 1
      end # end of rows.each
    end # end of if rows.length > 0
  ensure
    sleep 4
  end # end of begin
end # end of File.open(DOC_URL_FILE).readlines

compiled_file.close
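Once all-payments.txt exists, answering the “whom did Pfizer pay the most?” question from earlier
takes only a few more lines of Ruby. This is just a rough sketch under my own assumptions about the
output (eight semicolon-delimited fields per line, with the amount scraped as display text like a
dollar figure); adjust it to whatever your scrape actually produces.

# Rough aggregation sketch, not part of the scraper itself.
# Assumes each line of all-payments.txt has eight fields joined by '; ':
# doctor; entity paid; city; state; service; role; amount; payment type.
totals = Hash.new(0.0)

File.open('all-payments.txt').readlines.each do |line|
  fields = line.chomp.split('; ')
  next if fields.length < 8
  doctor = fields[0]
  # Strip everything but digits and the decimal point before converting,
  # since the amount field is scraped as display text.
  amount = fields[6].gsub(/[^0-9.]/, '').to_f
  totals[doctor] += amount
end

# Print the ten biggest totals.
totals.sort_by { |_, total| -total }.first(10).each do |doctor, total|
  puts "#{doctor}: #{'%.2f' % total}"
end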