Fixing Non-Standard HTML That Cannot Be Parsed
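When a crawler requests a page whose markup is not standard HTML, the usual parsing approaches can fail to build a DOM. In the snippets below, res is the response object returned by the HTTP request.

Incorrect examples

1. This returns None:

    from lxml import etree
    tree = etree.HTML(res.text)

2. This also returns None:

    from lxml import etree
    parser = etree.HTMLParser()
    tree = etree.fromstring(res.text, parser)

3. This did not yield a usable tree either:

    from bs4 import BeautifulSoup
    tree = BeautifulSoup(res.text, 'html.parser')

Correct examples

1. Solution: parse with lxml's soupparser, which delegates to BeautifulSoup (the beautifulsoup4 package must be installed):

    from lxml.html import soupparser
    tree = soupparser.fromstring(res.text)

2. Solution: replace every non-ASCII character with a numeric character reference before handing the text to lxml:

    from lxml import etree
    tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))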
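The two solutions can be combined into one fallback helper. What follows is only a minimal sketch, not a definitive implementation: the function names escape_non_ascii and parse_lenient are hypothetical and do not come from lxml. It assumes lxml is installed, and beautifulsoup4 as well for the last fallback.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every non-ASCII character with a numeric character reference.

    Uses only the standard "xmlcharrefreplace" codec error handler, so the
    result is pure ASCII and ASCII input passes through unchanged.
    """
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def parse_lenient(text: str):
    """Hypothetical helper: try lxml first, then fall back to soupparser."""
    # Imported lazily so escape_non_ascii stays usable without lxml installed.
    from lxml import etree

    try:
        tree = etree.HTML(escape_non_ascii(text))
    except (etree.LxmlError, ValueError):
        # lxml rejected the input outright; treat it like a None result.
        tree = None
    if tree is not None:
        return tree

    # Last resort: the BeautifulSoup-backed parser (needs beautifulsoup4).
    from lxml.html import soupparser
    return soupparser.fromstring(text)
```

A call such as parse_lenient(res.text) would then take the place of the individual attempts. Whether the escaped text parses on the first try still depends on why the page failed in the first place, so the soupparser fallback is kept as a safety net.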