How can I get BeautifulSoup running within another for loop?

Problem Description:

I’m currently trying to put together an article scraper for a website, but I’m running into an issue that I don’t know how to solve. This is the code:

import newspaper
from newspaper import Article
import pandas as pd
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
import re

urls = open("urls_test.txt").readlines()

final_df = pd.DataFrame()

for url in urls:
    article = newspaper.Article(url="%s" % (url), language='en')
    article.download()
    article.parse()
    article.nlp()

    # scrape html part
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    texts = results.find_all("div", class_="component article-body-text")
    paragraphs = []
    for snippet in texts:
        paragraphs.append(str(snippet))
    CLEANR = re.compile('<.*?>')
    def remove_html(input):
        cleantext = re.sub(CLEANR, '', input)
        return cleantext
    paragraphs_string = ' '.join(paragraphs)
    paragraphs_clean = remove_html(paragraphs_string)
    # end of scrape html part

    temp_df = pd.DataFrame(columns=['Title', 'Authors', 'Text', 'Summary', 'published_date', 'URL'])

    temp_df['Authors'] = article.authors
    temp_df['Title'] = article.title
    temp_df['Text'] = paragraphs_clean
    temp_df['Summary'] = article.meta_description
    publish_date = article.publish_date
    publish_date = publish_date.replace(tzinfo=None)
    temp_df['published_date'] = publish_date
    temp_df['URL'] = article.url

    final_df = pd.concat([final_df, temp_df], ignore_index=True)

final_df.to_excel('Telegraph_test.xlsx')

My problem appears in the # scrape html part. Both pieces of code (the main code without the # scrape html part, and the # scrape html part on its own) run fine in isolation. More specifically, the combined code runs up to results = soup.find(id="main-content"), which returns results as a bs4.element.Tag containing the scraped material, but as the loop continues results turns into NoneType. This is the error message I get:

AttributeError: 'NoneType' object has no attribute 'find_all'
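
A quick way to narrow this down is to test every page before parsing any further. The following is a minimal diagnostic sketch (an assumption-based addition, reusing urls_test.txt from the code above): it reports every URL whose page has no element with id="main-content". It also strips the trailing newline that readlines() keeps on each line, which is worth checking in the main script as well.

import requests
from bs4 import BeautifulSoup

# strip the trailing "\n" that readlines() leaves on each URL
urls = [line.strip() for line in open("urls_test.txt")]

for url in urls:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    if soup.find(id="main-content") is None:
        # this is a page that would later trigger the AttributeError
        print(f"no id='main-content' on {url} (status {page.status_code})")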

Solution – 1

Without knowing the URLs or the structure of the HTML, I would say that at least one page has no element with the attribute id="main-content". So you should always check whether the element you are looking for is available:

soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="main-content")
if results:
    text = ' '.join([e.get_text(strip=True) for e in results.find_all("div", class_="component article-body-text")])
else:
    text = ''
    

There is no need for your remove_html() function: simply use .get_text() to extract the text from your element(s).
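
Put together, a trimmed sketch of the loop with that check in place could look like the following. This is one possible way to wire it in, not the only one: skipping pages without the container is an assumption (you could log them instead), and building each row as a one-row DataFrame is generally more predictable than assigning column-by-column to an empty frame.

import newspaper
import pandas as pd
import requests
from bs4 import BeautifulSoup

urls = [line.strip() for line in open("urls_test.txt")]

final_df = pd.DataFrame()

for url in urls:
    article = newspaper.Article(url=url, language='en')
    article.download()
    article.parse()

    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    if results is None:
        continue  # skip (or log) pages without the expected container

    # .get_text() strips the tags, so no regex clean-up is needed
    text = ' '.join(e.get_text(strip=True)
                    for e in results.find_all("div", class_="component article-body-text"))

    # one row per article; the remaining columns follow the same pattern
    temp_df = pd.DataFrame([{'Title': article.title, 'Text': text, 'URL': url}])
    final_df = pd.concat([final_df, temp_df], ignore_index=True)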
