How to Scrape Data From Pfizer’s Doctor Payments Records
By David Xia
Update: If you just want to download the data, click here.
I was bored this weekend, so I decided to collect some data from the Internet, aka web scraping.
Not just any data, though. I worked through a fun tutorial written by Dan Nguyen, a news application
developer at the non-profit investigative journalism unit ProPublica. The tutorial
teaches you how to collect the data Pfizer discloses about its payments to doctors, a disclosure required by its
$2.3 billion settlement over allegations that it illegally promoted drugs for unapproved uses.
Why do this if the data’s already publicly available? The settlement, the “largest health care
fraud settlement in U.S. history” according to ProPublica, required Pfizer to disclose its
payments to doctors. But the website Pfizer set up is only good for the most superficial
examination. Sure, you can look up doctors alphabetically and see how much they were paid and why.
But what if you wanted to know whom Pfizer paid the most and why? The data is not
downloadable or easily aggregated for further analysis. To get all the yummy data, you’ll need to
bust out your favorite text editor and apply some elbow grease.
I’ve updated Nguyen’s original Ruby script because Pfizer’s webpages have changed, which broke the
original. I’ve also consolidated the three separate script files into one. ProPublica
scraped data from Pfizer’s 2010 full-year report; as of the date I’m blogging this, the unmodified
script will get data from the company’s 2011 Q1 report. To run the script, you’ll need a computer
with Ruby and the Nokogiri gem installed.
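If you want to make sure your environment is ready first, here’s a tiny sanity check of my own. It only uses the same requires as the scraper; if it prints “hello”, you should be good to go.

# Quick environment check (not part of the scraper itself):
# confirms open-uri and Nokogiri load and can parse a trivial document.
require 'open-uri'
require 'rubygems'
require 'nokogiri'

doc = Nokogiri::HTML('<html><body><p>hello</p></body></html>')
puts doc.css('p').first.text # should print "hello"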
Objective
Download all of Pfizer’s doctor payment disclosure data by individual doctor names.
Two Steps to Data
1. Copy the scraper script below.
2. Run ruby pfizer_payment_scraper.rb in your command line.
To scrape a different reporting period, simply change the report query parameter in the
BASE_LIST_URL constant:
payments_report.jsp?report=22&enPdNm=All&iPageNo=
report=22 will get you 2010, and report=21 will get you 2009. Be sure to also update
LAST_PAGE_NUMBER to the correct number of pages by clicking “Last” on Pfizer’s website; for 2010,
for example, there are 19,994 pages.
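For example, here’s roughly what the constants near the top of the script would look like for the
2010 full-year report (the 19,994-page count comes from clicking “Last” on Pfizer’s site, so
double-check it before kicking off a long scrape):

# Sketch of the constants for the 2010 full-year report.
BASE_LIST_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/' +
                'payments_report.jsp?report=22&enPdNm=All&iPageNo='
LAST_PAGE_NUMBER = 19994 # per the "Last" link on Pfizer's site for 2010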
Nguyen wrote a related article documenting the technical motivation or “coder’s cause”
behind ProPublica’s Dollars for Docs project and other cool ways to get the delicious
digital data you covet.
This is also my first time using any sort of license. I just thought, “What the hell? It looks
cool. Let’s slap it on.” I’ve read only a bit about the difference between the
GNU General Public License and the MIT license. I went with the GPL because it ensures
derivative works remain free and open source too. Seems like this would prevent people from taking
the scraper and trying to make money off of it. Cuz that would be contrary to the whole point of
making this data freely available, right? Let me know if I picked the wrong one.
## Pfizer doctor payment records web scraper
# Copyright (C) 2011 David Xia david@davidxia.com
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, see <http://www.gnu.org/licenses/>.
#
# davidxia.com
#
# We call the necessary Ruby modules that allow us to
# retrieve webpages and parse the HTML.
require 'open-uri'
require 'rubygems'
require 'nokogiri'

# A helper function to convert tabs, carriage returns, and nbsp into spaces.
class String
  def astrip
    self.gsub(/([\302|\240|\s|\n|\t])|(&nbsp;?){1,}/, ' ').strip
  end
end

###############################################################################
# We set constants here for file and directory names.
###############################################################################

# This is the base address of Pfizer's list of payment records.
# Appending the "iPageNo" parameter will get us from page to page.
BASE_LIST_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/' +
                'payments_report.jsp?report=32&enPdNm=All&iPageNo='

# This is the base address of Pfizer's disclosure website.
BASE_URL = 'http://pfizer.com/HCPDisclosureWebApplication/jsp/'

# We got this number from going to the last page on Pfizer's site.
LAST_PAGE_NUMBER = 9551

# Store links to individual doctors in this file.
DOC_URL_FILE = 'doc-pages-urls.txt'

# Since a doctor's name could appear on different pages, we use a temporary
# array to hold all the links as we iterate through each page and then call
# its uniq method to filter out repeated URLs before writing to file.
all_links = []

# The final file to which we save aggregated data.
COMPILED_FILE_NAME = 'all-payments.txt'

###############################################################################
# Loop through every Pfizer record page and extract unique doctor links.
###############################################################################

# Go from page 1 to LAST_PAGE_NUMBER.
for page_number in 1..LAST_PAGE_NUMBER
  begin
    puts "Parsing page #{page_number}"
    n_page = Nokogiri::HTML(open("#{BASE_LIST_URL}#{page_number}"))
    all_links += n_page.css('table#hcpPayments td:nth-child(4) a').map { |link|
      link['href']
    }.select { |href| href.match('hcpDisplayName') != nil }

    # Wait 2 seconds before hitting Pfizer servers so they won't think you're
    # attacking or DOSing their boxes.
    sleep 2
  rescue
    puts "Couldn't get page #{page_number}"
  end
end

File.open(DOC_URL_FILE, 'w') { |f| f.puts all_links.uniq }

###############################################################################
# Now we get the data from the pages of individual doctors and save to a file.
###############################################################################

compiled_file = File.open(COMPILED_FILE_NAME, 'w')

File.open(DOC_URL_FILE).readlines.each do |url|
  doc_name = url.match(/hcpDisplayName=(.+)/)[1]
  begin
    # Strip the trailing newline left by readlines and encode spaces.
    url = url.strip.gsub(' ', '%20')
    puts "Retrieving #{doc_name}"
    html = Nokogiri::HTML(open("#{BASE_URL}#{url}")).css('#hcpPayments')
  rescue Exception => e
    puts "Error trying to retrieve page #{url}"
    puts "Sleeping..."
    sleep 100
    retry # go back to the begin clause
  else
    # All paid entities are in rows in which the first cell is filled.
    # Each entity's associated payments have a blank first cell.
    rows = html.css('tr')
    if rows.length > 0
      entity_paid, city, state = nil

      # Iterate through each row. Skip row 0, the header row.
      rows[1..-1].each do |row|
        # Rows that have a name in the first cell (entity paid)
        # denote distinct entities that were paid.
        cells = row.css('td')
        if !cells[0].text.astrip.empty?
          # We have a new entity, city, and state here.
          entity_paid, city, state = cells[0..2].map { |c| c.text.astrip }
        end

        # The fourth column should always have at least
        # one link that corresponds to the doctor's name.
        doc_link = cells[3].css('a').select { |a|
          a['href'] == "payments_report.jsp?hcpDisplayName=#{doc_name}"
        }

        if doc_link.length == 1
          role = cells[3].css('span').text.sub(':', '')

          # This is a cell that contains a doctor's name and a payment.
          # It should also contain exactly one link that describes
          # the service provided in a tooltip.
          service_link = cells[3].css('a').select { |a|
            a.attributes['onmouseover'] &&
              a.attributes['onmouseover'].value.match(/showTooltip/)
          }
          if service_link.length != 1
            raise "Not exactly one service link for payee #{entity_paid}: #{url}"
          end

          # Now capture the cash or non-cash amounts.
          if cells[4].text.strip != '---'
            amount = cells[4].text
            payment_type = 'cash'
          else
            amount = cells[5].text
            payment_type = 'non-cash'
          end

          # Write this row to the file.
          compiled_file.puts([doc_link[0].text, entity_paid, city, state,
                              service_link[0].text, role, amount,
                              payment_type].map { |t| t.astrip }.join('; '))
        else
          ## This means that none or more than one doctor's name was found.
          ## So the cell was either blank or contained an unexpected name.
          ## You should write some test conditions here and log the records
          ## that don't fit your assumptions about the data.
        end # end of if doc_link.length == 1
      end # end of rows.each
    end # end of if rows.length > 0
  ensure
    sleep 4
  end # end of begin
end # end of File.open(DOC_URL_FILE).readlines

compiled_file.close
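Once all-payments.txt exists, answering the “whom did Pfizer pay the most?” question from earlier
takes only a few more lines of Ruby. This is just a rough sketch under my own assumptions about the
output (eight semicolon-delimited fields per line, with the amount scraped as display text like a
dollar figure); adjust it to whatever your scrape actually produces.

# Rough aggregation sketch, not part of the scraper itself.
# Assumes each line of all-payments.txt has eight fields joined by '; ':
# doctor; entity paid; city; state; service; role; amount; payment type.
totals = Hash.new(0.0)

File.open('all-payments.txt').readlines.each do |line|
  fields = line.chomp.split('; ')
  next if fields.length < 8
  doctor = fields[0]
  # Strip everything but digits and the decimal point before converting,
  # since the amount field is scraped as display text.
  amount = fields[6].gsub(/[^0-9.]/, '').to_f
  totals[doctor] += amount
end

# Print the ten biggest totals.
totals.sort_by { |_, total| -total }.first(10).each do |doctor, total|
  puts "#{doctor}: #{'%.2f' % total}"
end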