Fixing non-standard HTML that cannot be parsed

Claudio Ctin, 2 weeks ago, 1 min read

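When a crawler requests a web page whose HTML is non-standard, the usual approaches fail to parse it into a DOM. For example:

Failing examples

1. This approach yields None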

from lxml import etree

tree = etree.HTML(res.text)

2. This approach also yields None

from lxml import etree

parser = etree.HTMLParser()
tree = etree.fromstring(res.text, parser)

3. This approach does not yield a usable tree either

from bs4 import BeautifulSoup

tree = BeautifulSoup(res.text, 'html.parser')

Working examples

1. Solution: parse with lxml's BeautifulSoup-backed soupparser

from lxml.html import soupparser

tree = soupparser.fromstring(res.text)
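One way to combine a strict parser with a lenient one is to try them in order and keep the first result that is not None. The following is a minimal sketch of that idea, not something taken from the original post: the helper name `parse_with_fallback` is hypothetical, and it accepts any callables so it can be exercised without lxml installed.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def parse_with_fallback(
    text: str,
    parsers: Iterable[Callable[[str], Optional[T]]],
) -> Optional[T]:
    """Return the first non-None tree produced by the given parsers.

    Hypothetical helper: each parser is a callable that takes the raw
    markup and returns a tree, returns None, or raises on failure.
    """
    for parse in parsers:
        try:
            tree = parse(text)
        except Exception:
            # Broad catch is deliberate in this sketch: any parser error
            # just means "try the next parser".
            continue
        if tree is not None:
            return tree
    # Every parser failed or returned None.
    return None
```

Assuming lxml and BeautifulSoup are installed, a call such as `parse_with_fallback(res.text, [etree.HTML, soupparser.fromstring])` would try the fast parser first and fall back to soupparser only when needed.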

2. Solution: replace non-ASCII characters with character references before parsing

from lxml import etree

tree = etree.HTML(res.text.encode("ascii", "xmlcharrefreplace").decode("ascii"))
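The second solution rewrites every non-ASCII character as a numeric character reference, so the parser receives pure ASCII input. Here is a small sketch of that step in isolation, using only the standard library. The helper name `to_ascii_charrefs` and the sample string are illustrative and do not come from the original post.

```python
import html


def to_ascii_charrefs(text: str) -> str:
    """Replace each non-ASCII character with a reference like &#NNNN;."""
    # "xmlcharrefreplace" is a built-in error handler for str.encode:
    # characters that cannot be encoded as ASCII become &#<codepoint>;.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "<p>非标准 HTML</p>"
escaped = to_ascii_charrefs(sample)

# The markup itself is untouched; only the non-ASCII text is rewritten,
# and unescaping the references recovers the original characters.
restored = html.unescape(escaped)
```

Because tags and attributes are already ASCII, only the text content changes, and an HTML parser decodes the references back to the original characters when it builds the tree.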
