Fixing Non-Standard HTML That Cannot Be Parsed

RMAG news

When a crawler requests a page whose HTML is non-standard (malformed), the usual approaches often fail to parse it into a DOM tree. For example:

Incorrect examples

1. This approach returns None

# res is the HTTP response for the page (for example, from requests.get)
from lxml import etree
tree = etree.HTML(res.text)

2. This approach also returns None

from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

3. This approach also fails to produce a usable tree

from bs4 import BeautifulSoup
tree = BeautifulSoup(res.text, 'html.parser')
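Because these failures show up as a silent None rather than an exception, it can help to check the result explicitly and move on to the next parser. The following is a minimal sketch of that idea, not part of the original post: `parse_with_fallback` is a hypothetical helper name, and the stand-in parsers at the bottom are made up so the example runs without any third-party package.

```python
from typing import Any, Callable, Iterable, Optional


def parse_with_fallback(text: str, parsers: Iterable[Callable[[str], Any]]) -> Optional[Any]:
    """Try each parser in order and return the first result that is not None."""
    for parse in parsers:
        try:
            tree = parse(text)
        except Exception:
            # A parser that raises is treated like one that returned None.
            continue
        # Compare with None explicitly; do not rely on the truthiness of a tree.
        if tree is not None:
            return tree
    return None


# Stand-in parsers (hypothetical) that mimic "returns None" and "works".
def strict_parser(text: str) -> Optional[str]:
    return None


def lenient_parser(text: str) -> Optional[str]:
    return text.strip()


result = parse_with_fallback("  <p>hello  ", [strict_parser, lenient_parser])
print(result)
```

In a real crawler the list could hold callables such as `etree.HTML` and `soupparser.fromstring`, ordered from strictest to most lenient.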

Working examples

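1. Solution: parse with lxml's soupparser, which uses BeautifulSoup to read the markup and builds an lxml tree from it (requires the beautifulsoup4 package)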

from lxml.html import soupparser
tree = soupparser.fromstring(res.text)

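2. Solution: replace every non-ASCII character with a numeric character reference before handing the text to lxml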

from lxml import etree
tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
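To see what the encode/decode step in the second solution does to the text, here is a small sketch that uses only the standard library. The helper name `to_ascii_refs` and the sample string are made up for illustration. The post does not say why this step fixes the parse; presumably the parser no longer has to deal with any non-ASCII characters.

```python
import html


def to_ascii_refs(text: str) -> str:
    """Replace each non-ASCII character with a numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>价格: 5€</p>"  # made-up input containing non-ASCII characters
converted = to_ascii_refs(sample)

print(converted)
print(converted.isascii())                 # the markup is now pure ASCII
print(html.unescape(converted) == sample)  # the references decode back to the original text
```

ASCII characters, including the tags themselves, pass through unchanged, so the structure of the markup is untouched and an HTML parser resolves the references back to the original characters.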
