How can I get BeautifulSoup running within another for loop?
I’m currently trying to put together an article scraper for a website, but I’m running into an issue that I don’t know how to solve. This is the code:
```python
import newspaper
from newspaper import Article
import pandas as pd
import datetime
from datetime import datetime, timezone
import requests
from bs4 import BeautifulSoup
import re

urls = open("urls_test.txt").readlines()

final_df = pd.DataFrame()

for url in urls:
    article = newspaper.Article(url="%s" % (url), language='en')
    article.download()
    article.parse()
    article.nlp()

    # scrape html part
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")
    results = soup.find(id="main-content")
    texts = results.find_all("div", class_="component article-body-text")

    paragraphs = []
    for snippet in texts:
        paragraphs.append(str(snippet))

    CLEANR = re.compile('<.*?>')

    def remove_html(input):
        cleantext = re.sub(CLEANR, '', input)
        return cleantext

    paragraphs_string = ' '.join(paragraphs)
    paragraphs_clean = remove_html(paragraphs_string)

    temp_df = pd.DataFrame(columns=['Title', 'Authors', 'Text', 'Summary', 'published_date', 'URL'])
    temp_df['Authors'] = article.authors
    temp_df['Title'] = article.title
    temp_df['Text'] = paragraphs_clean
    temp_df['Summary'] = article.meta_description
    publish_date = article.publish_date
    publish_date = publish_date.replace(tzinfo=None)
    temp_df['published_date'] = publish_date
    temp_df['URL'] = article.url

    final_df = pd.concat([final_df, temp_df], ignore_index=True)

final_df.to_excel('Telegraph_test.xlsx')
```
My problem appears in the `# scrape html part`. Both pieces of code (the main code without the `# scrape html part`, and the `# scrape html part` on its own) run fine separately. More specifically, the code as a whole runs until the line `results = soup.find(id="main-content")` (returning the `results` variable as a `bs4.element.Tag` containing the scraped material), but as the loop continues, the `results` variable turns into `NoneType`. This is the error message I get:
AttributeError: 'NoneType' object has no attribute 'find_all'
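The failure can be reproduced in isolation: `find()` returns `None` when no element matches, and calling `find_all` on that `None` raises exactly this error. A minimal sketch, using made-up HTML for illustration:

```python
from bs4 import BeautifulSoup

# Hypothetical page with no id="main-content" element
html = "<html><body><div id='other'><p>text</p></div></body></html>"
soup = BeautifulSoup(html, "html.parser")

results = soup.find(id="main-content")
print(results)  # None -- find() returns None when nothing matches

# Calling a method on that None is what raises the error:
# results.find_all("div")  # AttributeError: 'NoneType' object has no attribute 'find_all'
```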
Solution – 1
Without knowing the URLs or the structure of the HTML, I would guess that one of the pages has no element with the attribute `id="main-content"`. So you should always check whether the element you are looking for is actually present:
```python
soup = BeautifulSoup(page.content, "html.parser")
results = soup.find(id="main-content")
if results:
    text = ' '.join([e.get_text(strip=True) for e in results.find_all("div", class_="component article-body-text")])
else:
    text = ''
```
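Wrapped in a small helper, the same guard can be exercised against both a matching and a non-matching page. The HTML snippets below are invented for illustration; the selectors match the question's code:

```python
from bs4 import BeautifulSoup

def extract_body_text(html):
    """Return the joined article text, or '' if the page lacks id='main-content'."""
    soup = BeautifulSoup(html, "html.parser")
    results = soup.find(id="main-content")
    if results is None:
        return ''
    return ' '.join(
        e.get_text(strip=True)
        for e in results.find_all("div", class_="component article-body-text")
    )

# One page matching the expected structure, one without it (both hypothetical)
good = "<div id='main-content'><div class='component article-body-text'>Hello</div></div>"
bad = "<div id='other'>no article here</div>"

print(extract_body_text(good))  # Hello
print(repr(extract_body_text(bad)))  # ''
```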
There is no need for your `remove_html()`; simply use `.get_text()` to extract the text from your element(s).
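As a small comparison, `.get_text()` walks the parse tree and decodes HTML entities, while stripping tags with a regex leaves entities untouched (the snippet here is made up to show the difference):

```python
import re
from bs4 import BeautifulSoup

snippet = "<div>Price is &lt;100 <b>euros</b></div>"

# Regex tag-stripping, as in the question's remove_html(): entities survive as-is
regex_clean = re.sub(r'<.*?>', '', snippet)
print(regex_clean)  # Price is &lt;100 euros

# get_text() extracts the text content and decodes &lt; into <
soup_clean = BeautifulSoup(snippet, "html.parser").get_text()
print(soup_clean)  # Price is <100 euros
```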