How To Remove <br> Tag But Keep Everything Within The Same Paragraph
It's my first time posting so hopefully I'm able to make this as clear as possible.
For an assignment I have to use BeautifulSoup to crawl a made-up webpage and extract all the titles and abstracts from each publication page. In general I've been able to do this by finding the abstract paragraph on each page and just appending that to an empty list. However, one of the pages has broken the abstract into several small chunks separated by <br> tags.
This is annoying because instead of considering it as one abstract, it is considered as 5 different ones, so it affects all the following publications and the title-abstracts don't match up.
I've tried extracting the <br> tags with the following code:
    # text from abstract
    abstracttext = []
    for url in final_list:
        page = requests.get(url)
        soup = BeautifulSoup(page.content, "html.parser")
        # print(soup.prettify())
        necessarytext = soup.find("p")
        for e in soup.find_all('br'):
            e.extract()
        # print(necessarytext)
        for text in necessarytext:
            abstracttext.append(text)
    # print(abstracttext)
If I look at 'necessarytext' now it appears the problem is solved as all the sentences are within the same paragraph. However once I go ahead and append everything to the empty list, the sentences are again separated out as if they were different paragraphs and throw everything off.
Does anybody have any ideas why this might be? Is there any way of removing the <br> tags but ensuring everything stays within the same paragraph, or is there a general-purpose way of concatenating all these sentences together? Sorry if I've been a bit unclear and I appreciate any help you can send my way.
EDIT: The 'url' in the code comes from a previous web-scraping I did. The publications are grouped by topic so I was able to go through each of the topics and extract the publication pages from there. All the unique URLs were added to a list called 'final_list' and so this for loop is supposed to go through each of these publication pages to extract the abstract. Hope that is clearer.
To remove the <br>s from the <p> you could use:
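    ...
    necessarytext = soup.find("p")
    # Loop over a copy of the children, because extract() changes the
    # child list and would otherwise skip a <br> that directly follows another.
    for x in list(necessarytext.children):
        if x.name == 'br':
            x.extract()
            ## or
            ## x.decompose()
    # Append the whole <p> once, not each of its text nodes
    abstracttext.append(necessarytext)
    ...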
Note: Because it was not that clear in the question: if you do not need the <p> at all, just call abstracttext.append(soup.find("p").text). This gives you the plain text of the <p>, so the <br>s do not have to be removed first.
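One thing to watch with the plain-text route: .text glues the chunks together with nothing in between, so sentences that were only separated by a <br> end up touching. A minimal sketch of the difference, assuming a made-up paragraph (the markup below is only a stand-in for one of your publication pages):

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for one publication page.
html = "<p>First sentence.<br/>Second sentence.<br/>Third sentence.</p>"

soup = BeautifulSoup(html, "html.parser")
paragraph = soup.find("p")

# .text concatenates the text nodes with no separator.
plain = paragraph.text

# get_text() accepts a separator and can strip each chunk before joining.
spaced = paragraph.get_text(" ", strip=True)

print(plain)   # First sentence.Second sentence.Third sentence.
print(spaced)  # First sentence. Second sentence. Third sentence.
```

If the chunks on your page already end with a space, .text alone is probably enough.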
    from bs4 import BeautifulSoup

    abstracttext = []
    html = '''<p>a <br/> b <br/> c</p>'''
    soup = BeautifulSoup(html, "html.parser")

    necessarytext = soup.find("p")
    # Loop over a copy of the children, because decompose() changes the child list
    for x in list(necessarytext.children):
        if x.name == 'br':
            x.decompose()

    abstracttext.append(necessarytext)
    print(abstracttext)
[<p>a b c</p>]
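As for why the list in the question still ends up with separate entries: as far as I can tell, removing a <br> does not merge the text nodes around it, so looping over the <p> and appending each child still adds one entry per chunk. A rough sketch of that, plus a per-page helper that always returns exactly one string (extract_abstract and the two sample pages are made up for illustration, they are not from the question):

```python
from bs4 import BeautifulSoup


def extract_abstract(html):
    """Return the first <p> of a page as a single string (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    paragraph = soup.find("p")
    if paragraph is None:
        return ""  # page without a paragraph
    return paragraph.get_text(" ", strip=True)


# Made-up pages: one normal abstract and one split up by <br> tags.
pages = [
    "<p>Only one chunk.</p>",
    "<p>Chunk one.<br/>Chunk two.<br/>Chunk three.</p>",
]

# What happens in the question: the <br>s are gone, but the <p> still
# holds three separate text nodes, and iterating yields each of them.
soup = BeautifulSoup(pages[1], "html.parser")
for br in soup.find_all("br"):
    br.extract()
pieces = [child for child in soup.find("p")]
print(len(pieces))  # 3

# One entry per page, so titles and abstracts stay lined up.
abstracts = [extract_abstract(page) for page in pages]
print(abstracts)
```

In the real loop you would pass page.content from requests.get(url) to the helper instead of the sample strings.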