XPATH - Html With A Lot Of Children
Consider the HTML in the page variable. How do I access the tds? I want to access them with something like xpath('/table/tr/td/text()'), without having to spell out the other trs. Unfortunately this
Solution 1:
Use xpath //td/text():
things = tree.xpath('//td/text()')
The //td means "find any td element at any depth".
Works for me.
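As a minimal sketch of what "any depth" means, the snippet below compares a fully spelled-out path with //td. The markup is a made-up stand-in, since the original page variable is not shown, and the names exact and any_depth are only illustrative.

```python
from lxml import html

# Hypothetical markup (assumption): a table nested inside a cell of another table.
page = (
    "<html><body>"
    "<table><tr><td>outer"
    "<table><tr><td>inner</td></tr></table>"
    "</td></tr></table>"
    "</body></html>"
)

tree = html.fromstring(page)

# Every level spelled out: only cells of the top-level table match.
exact = tree.xpath("/html/body/table/tr/td/text()")

# // matches td elements at any depth, including the nested table.
any_depth = tree.xpath("//td/text()")

print(exact)
print(any_depth)
```

With this sample, the spelled-out path should return only the outer cell's text, while //td also picks up the nested one.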
Printing td elements grouped per table:
from lxml import html

doc = html.fromstring(page)
for table_elm in doc.xpath("//table"):
    print("another table")
    # the leading "." keeps the search inside this table
    things = table_elm.xpath('.//td/text()')
    print(things)
Note that the leading . in the XPath is significant here: it makes the search relative to the current table element instead of the whole document.
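To make the difference concrete, here is a small sketch that runs both forms from the same table element. The sample markup and the variable names relative and absolute are assumptions for illustration, not taken from the question.

```python
from lxml import html

# Hypothetical markup (assumption): two simple tables.
page = (
    "<html><body>"
    "<table><tr><td>table1 td1</td><td>table1 td2</td></tr></table>"
    "<table><tr><td>table2 td1</td><td>table2 td2</td></tr></table>"
    "</body></html>"
)

doc = html.fromstring(page)
first_table = doc.xpath("//table")[0]

# Relative path: only cells below first_table.
relative = first_table.xpath(".//td/text()")

# Absolute path: evaluated from the document root, even though it is
# called on first_table, so cells from every table are returned.
absolute = first_table.xpath("//td/text()")

print(relative)
print(absolute)
```

Without the dot, each loop iteration would print the cells of all tables rather than just the current one.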
Solution 2:
You don't have to convert the BeautifulSoup object to str:
soup = str(BeautifulSoup(page, 'html.parser'))
You can use something like this:
>>> soup = BeautifulSoup(page, 'html.parser')
>>> for td in soup.find_all('td'):
...     print(td)
...
<td>table1 td1</td>
<td>table1 td2</td>
<td>table2 td1</td>
<td>table2 td2</td>
<td>table3 td1</td>
<td>table3 td2</td>
Or, you can also use print(td.text) if you want the text inside the element.
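If the goal is a list of cell texts rather than printed output, something along these lines might work. The markup is again a hypothetical stand-in for page, and cell_texts is just an illustrative name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup (assumption): the real page content is not shown.
page = (
    "<table><tr><td>table1 td1</td><td>table1 td2</td></tr></table>"
    "<table><tr><td>table2 td1</td><td>table2 td2</td></tr></table>"
)

soup = BeautifulSoup(page, "html.parser")

# find_all("td") walks the whole tree, so nesting depth does not matter.
cell_texts = [td.get_text(strip=True) for td in soup.find_all("td")]

print(cell_texts)
```

get_text(strip=True) behaves like td.text but also trims surrounding whitespace, which can help when the markup is indented.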
Solution 3:
A tr nested directly inside another tr is invalid HTML, and the html.fromstring() parser appears to "fix" it while parsing.
You can test this with this xpath:
things = tree.xpath('//table/tr/*')
And output with:
for thing in things:
    print(thing.tag)
Which generates:
td
td
td
td
td
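One way to check what the parser did with a nested tr is to print each row's parent and children. This is only a sketch on made-up markup, and the exact repaired structure may depend on the lxml/libxml2 version installed, so no output is shown here.

```python
from lxml import html

# Hypothetical markup (assumption): a tr placed directly inside another tr.
page = (
    "<html><body><table>"
    "<tr><tr><td>a</td><td>b</td></tr></tr>"
    "</table></body></html>"
)

tree = html.fromstring(page)

# Show where each tr ended up after parsing and which tags it contains.
for row in tree.xpath("//tr"):
    print(row.getparent().tag, [child.tag for child in row])

# Whatever the repaired structure looks like, //td still finds the cells.
cell_texts = tree.xpath("//td/text()")
print(cell_texts)
```

If the rows are reported as children of table rather than of another tr, the parser has flattened the invalid nesting.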