Scraping <a> along with <p> using BeautifulSoup

Problem Description:

I am a journalist by profession and learned Python for scraping news articles. Using BeautifulSoup I am able to get paragraphs from a news website; however, if a paragraph contains a hyperlink, my code does not scrape that line of text. Is there any way I can get that line of text too?

```python
!pip3 install requests
from bs4 import BeautifulSoup as BS
import requests as req
import pandas as pd  # needed for the DataFrame at the end

url = "https://www.geo.tv/latest/456848-ronaldo-eyes-world-cup-quarters-as-morocco-dare-to-dream"
webpage = req.get(url)
trav = BS(webpage.content, "html.parser")
attributes_container = []

for link in trav.find_all('p'):
    # Keep only paragraphs whose sole child is a plain NavigableString
    # and that are longer than 35 characters
    if (str(type(link.string)) == "<class 'bs4.element.NavigableString'>"
            and len(link.string) > 35):
        x = str(link.string)
        print(x)
        attributes_container.append(x)

text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df
```

In this case, for example, the news article has "Cristiano Ronaldo" as a hyperlink, so that line does not get scraped.
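The skip happens because `.string` returns `None` whenever a tag has more than one child, which is exactly what a `<p>` containing an `<a>` looks like, so the type check in the loop fails. A minimal sketch (the HTML string here is made up for illustration):

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: one plain <p> and one <p> containing a hyperlink
html = ("<p>Plain paragraph.</p>"
        "<p>Paragraph naming <a href='#'>Cristiano Ronaldo</a> inline.</p>")
soup = BeautifulSoup(html, "html.parser")
plain, linked = soup.find_all("p")

print(plain.string)   # Plain paragraph.  (single NavigableString child)
print(linked.string)  # None  (the <p> has several children, so .string gives up)
```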

Solution – 1

Simply use `.text`. It considers every piece of text under that tag, including text inside nested tags such as `<a>`. Also, to ensure that irrelevant text is not included in your list, restrict the search to the specific div that holds the article (i.e., class `content-area`).

```python
content_area = trav.find("div", {"class": "content-area"})
attributes_container = []

for link in content_area.find_all('p'):
    # .text gathers all text in the tag, including text inside <a> tags
    text = link.text
    if len(text) > 35:
        print(text)
        attributes_container.append(text)

text_df = pd.DataFrame(attributes_container, columns=["Text"])
text_df
```

Output:

```
                                                 Text
0   DOHA: Cristiano Ronaldo will aim to fire Portu...
1   Just two last-eight slots remain to be filled ...
2   Ronaldo was hogging the headlines at the tourn...
3   Following an exit by "mutual agreement" he is ...
4   The 37-year-old superstar forward, who is appe...
5   After scoring a penalty in his team's opening ...
6                                                ....
```
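By contrast with `.string`, the `.text` property (an alias for `get_text()`) joins every string under the tag, so the hyperlinked name is included. A quick sketch with a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# Hypothetical paragraph containing a hyperlink
html = "<p>Paragraph naming <a href='#'>Cristiano Ronaldo</a> inline.</p>"
p = BeautifulSoup(html, "html.parser").p

print(p.text)  # Paragraph naming Cristiano Ronaldo inline.
# get_text() also accepts a separator and strip flag if the page's
# whitespace is messy:
print(p.get_text(" ", strip=True))
```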