I’ve improved my Python New York Times web scraper, which extracts the top articles from the global home page. The latest version no longer clumsily replaces HTML character codes like “&eacute;” with “é” by hand. I wondered if there was a way for Python to convert them. It turns out there is.
Here’s the trick:
- encode the raw HTML in UTF-8
- unescape the HTML special characters with a helper function (I’m not sure how it works yet)
- and then create a BeautifulSoup object from this unescaped UTF-8 raw HTML
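
The steps above can be sketched as follows. This is a minimal illustration assuming Python 3, not the function from the script itself. The names `unescape_entities`, `raw_bytes`, and `page`, as well as the sample markup, are hypothetical. It also assumes the downloaded page arrives as UTF-8 bytes that are decoded into text, and it shows one plausible way an unescape helper can work: a regular expression finds each character reference and a lookup turns it into the character it names.

```python
import html
import re
from html.entities import name2codepoint

# Matches numeric (&#233; or &#xE9;) and named (&eacute;) character references.
_ENTITY_RE = re.compile(r"&(#[xX]?[0-9a-fA-F]+|\w+);")


def unescape_entities(text):
    """Replace HTML character references in text with the characters they name."""

    def replace(match):
        body = match.group(1)
        try:
            if body.startswith(("#x", "#X")):
                return chr(int(body[2:], 16))   # hexadecimal reference
            if body.startswith("#"):
                return chr(int(body[1:]))       # decimal reference
            return chr(name2codepoint[body])    # named reference
        except (KeyError, ValueError, OverflowError):
            return match.group(0)               # unknown reference: leave as-is

    return _ENTITY_RE.sub(replace, text)


# Hypothetical stand-in for the bytes a page download would return.
raw_bytes = b"<p>caf&eacute; &#8212; &ldquo;na&#xEF;ve&rdquo;</p>"

page = raw_bytes.decode("utf-8")        # step 1: UTF-8 bytes -> text
clean = unescape_entities(page)         # step 2: character references -> characters

# The standard library's html.unescape (Python 3.4+) does the same job.
print(clean == html.unescape(page))
```

The resulting `clean` string is what would be handed to BeautifulSoup in the third step. One caveat with this order of operations: unescaping the whole page before parsing also turns `&lt;` and `&amp;` into literal `<` and `&`, which can change the markup the parser sees, so unescaping only the extracted strings may be the safer choice.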
Voilà! Any string returned by a BeautifulSoup method will now render smart quotation marks, em-dashes, and accented letters correctly. My updated script is here.