
Extract Data Between Html Tags Using BeautifulSoup In Python

I want to extract the text inside the HTML 'title' tag. From the 'meta' tag, I want the value of the URL attribute, and only the part of it just before the '?'.

Solution 1:

Note that the single quotes only appear because you are asking the interactive interpreter to display a string value. You will find that

>>> print(soup.title.contents[0])

displays

" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "

and that is actually the contents of the title tag. You will observe that Beautiful Soup has converted the &quot; HTML entities into the required double-quote characters. To lose the quotes and adjacent spaces you can use

soup.title.contents[0][2:-2]
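The slice above relies on the title having exactly one quote and one space at each end. As a minimal sketch, assuming the title string is the one printed earlier (the variable names `raw_title`, `sliced` and `cleaned` are made up for illustration), chained `strip` calls give the same result without hard-coding the widths:

```python
# Value as printed by soup.title.contents[0] in the session above,
# copied here as a plain string so the sketch runs on its own.
raw_title = '" CARA CEPAT BELAJAR BAHASA INGGRIS MUDAH DAN MENYENANGKAN "'

# The fixed-width slice from the answer: drop two characters from each end.
sliced = raw_title[2:-2]

# Alternative: strip outer whitespace, then the quotes, then the inner spaces.
cleaned = raw_title.strip().strip('"').strip()

print(sliced)
print(cleaned)
```

Both lines print the bare title. The `strip` version also copes with titles that have extra or missing spaces around the quotes.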

The meta tag is a little trickier. I make the assumption that there is only one <meta> tag with an http-equiv attribute whose value is "refresh", so the retrieval returns a list of one element. You retrieve that element like so:

>>> meta = soup.findAll("meta", {"http-equiv": "refresh"})[0]
>>> meta
<meta content="0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1" http-equiv="refresh"/>

Note, by the way, that meta isn't a string but a soup element:

>>> type(meta)
<class 'bs4.element.Tag'>

You can retrieve attributes of a soup element using indexing just like Python dicts, so you can get the value of the content attribute as follows:

>>> content = meta["content"]
>>> content
u'0; URL=/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'

In order to extract the URL value you could just look for the first equals sign and take the rest of the string. I prefer to use a rather more disciplined approach, splitting at the semicolon and then splitting the right-hand element of that split on (only one) equals sign.

>>> url = content.split(";")[1].split("=", 1)[1]
>>> url
u'/notes/kursus-belajar-bahasa-inggris/bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1'
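The question also asked for only the text before the '?'. One way to finish the job, sketched here on the content value shown above (the names `path` and `query` are illustrative, not from the original answer), is `str.partition`, which splits on the first '?' and still works if no '?' is present:

```python
# The content attribute value from the session above, as a plain string.
content = ("0; URL=/notes/kursus-belajar-bahasa-inggris/"
           "bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1")

# Same steps as the answer: split at the semicolon, then at the first '='.
url = content.split(";")[1].split("=", 1)[1]

# partition returns (before, separator, after); if there is no '?',
# 'before' is the whole string and the other two parts are empty.
path, _, query = url.partition("?")

print(path)   # the part before the '?'
print(query)  # the query string after it
```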

Solution 2:

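To get a substring from the URL in the meta tag, you can use a regular expression. Try this:

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(<your html string>)
# assumes the <meta> tag sits inside a <noscript> element
meta_url = soup.noscript.meta['content']
# capture everything between the first "-/" and the last "?"
url = re.search(r'-/(.*)\?', meta_url).group(1)
print(url)
print(soup.title.text)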

Hope the above code solves your problem.
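It may help to see what the regular expression in Solution 2 captures on the content value shown in Solution 1. The sketch below runs the same pattern on that string, and then a second pattern, `URL=([^?]*)`, which is my own assumption about what the asker wants (the whole path before the '?'), not part of either answer:

```python
import re

# The content attribute value from Solution 1, as a plain string.
meta_url = ("0; URL=/notes/kursus-belajar-bahasa-inggris/"
            "bahasa-inggris-siapa-takut-/685004288208871?_fb_noscript=1")

# Pattern from Solution 2: text between the first "-/" and the last "?".
# On this input that is only the numeric final path segment.
last_segment = re.search(r'-/(.*)\?', meta_url).group(1)

# Hypothetical alternative: everything after "URL=" up to, but not
# including, the first "?".
full_path = re.search(r'URL=([^?]*)', meta_url).group(1)

print(last_segment)
print(full_path)
```

Note that `re.search` returns `None` when nothing matches, so real code should check the result before calling `.group(1)`.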

