Experiencing data loss while utilizing BeautifulSoup

Question

I am working through the project 'Project: “I’m Feeling Lucky” Google Search' from the book 'Automate the Boring Stuff with Python'.

Unfortunately, the CSS selector used in the project returns no results:

import requests, sys, webbrowser, bs4, pyperclip

# Use the command-line arguments as the search term, else the clipboard contents
if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
# Open the first five results in the browser
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

The same code previously ran successfully in the IDLE shell.

Now, however,

linkElems = soup.select('.r')

returns an empty list.

When I inspected the value returned by Beautiful Soup,

soup = bs4.BeautifulSoup(res.text,"html.parser")

I found that all elements with class='r' and class='rc' were missing, even though they are present in the raw HTML file.

Why does this happen, and how can I prevent similar problems in the future?

css python-3.x beautifulsoup css-selectors

Answer №1

To receive the version of the HTML that contains the class r, set the User-Agent in the request headers:

import requests
from bs4 import BeautifulSoup

address = 'linux'

headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text,"html.parser")

linkElems = soup.select('.r a')

for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)

Output:

Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux

...and so forth.
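To see whether a selector is missing from the HTML that the script received, rather than being lost by the parser, it can help to count the matches in each copy of the page. The following is a minimal sketch, not part of the original answer: the helper name `count_matches` and the two hand-written HTML strings are assumptions that stand in for a response fetched with the default user-agent and one fetched with a browser user-agent.

```python
from bs4 import BeautifulSoup


def count_matches(variants, selector):
    """Return {label: number of elements matching selector} for each HTML string."""
    counts = {}
    for label, html in variants.items():
        soup = BeautifulSoup(html, "html.parser")
        counts[label] = len(soup.select(selector))
    return counts


# Tiny hand-written stand-ins for real responses (illustrative only).
variants = {
    "default user-agent": '<div class="g"><a href="/url?q=x">x</a></div>',
    "browser user-agent": '<div class="r"><a href="https://example.com/">x</a></div>',
}

print(count_matches(variants, ".r a"))
```

In practice the strings would be `res.text` from two requests, or `res.text` and a page saved from the browser. If the count is zero for the fetched copy, the class was never in that response.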

Answer №2

Google may be blocking your request, or serving it different HTML, because the default user-agent of the requests library identifies itself as python-requests. Set a browser-like user-agent so that the request is less likely to be blocked. Note that the HTML Google returns can vary with the user-agent, so the elements and selectors may differ from what you see in your browser.

It helps to read up on the user-agent and HTTP request headers in general.
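To check which user-agent a request would carry without actually sending it, the request can be prepared through a session and its headers inspected. This is a minimal sketch; the helper name `effective_user_agent` and the example query are assumptions made for illustration.

```python
import requests


def effective_user_agent(custom_headers=None):
    """Return the User-Agent header a GET request would be sent with."""
    session = requests.Session()
    request = requests.Request(
        "GET",
        "https://www.google.com/search",
        params={"q": "linux"},
        headers=custom_headers,
    )
    # prepare_request merges the session defaults with the request headers;
    # nothing is sent over the network here.
    prepared = session.prepare_request(request)
    return prepared.headers["User-Agent"]


print(effective_user_agent())  # the library default, starting with "python-requests/"
print(effective_user_agent({"User-agent": "Mozilla/5.0 (X11; Linux x86_64)"}))
```

Header names are matched case-insensitively, so 'User-agent' and 'User-Agent' both replace the default.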

Include the user-agent in the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)

Consider using the lxml parser, which is faster than html.parser; install it with pip install lxml.
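If lxml may not be installed everywhere the script runs, one option is to fall back to the built-in parser. This is a minimal sketch under that assumption; `make_soup` is a hypothetical helper name, not part of Beautiful Soup.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def make_soup(html):
    """Parse with lxml when available, otherwise with the built-in html.parser."""
    try:
        return BeautifulSoup(html, "lxml")
    except FeatureNotFound:
        # Raised by Beautiful Soup when the requested parser is not installed.
        return BeautifulSoup(html, "html.parser")


soup = make_soup('<div class="r"><a href="/x">t</a></div>')
print(soup.select_one(".r a")["href"])
```

The two parsers can build slightly different trees for malformed markup, so it is worth testing selectors against whichever parser is actually in use.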

Code and a complete example:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

-----

'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
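The class names differ between the two answers ('.r' versus '.tF2Cxc'), which suggests that selectors for this page do not stay stable. A script can at least fail with a clear message when a selector stops matching, instead of raising an IndexError later. The following is a minimal sketch; `first_links` and the sample markup are assumptions made for illustration, not Google's actual HTML.

```python
from bs4 import BeautifulSoup


def first_links(html, selector, limit=5):
    """Return up to `limit` href values from anchors matching `selector`."""
    soup = BeautifulSoup(html, "html.parser")
    anchors = soup.select(selector)
    if not anchors:
        # Make a changed or blocked page obvious instead of failing later.
        raise LookupError(
            f"selector {selector!r} matched nothing; the markup may have changed"
        )
    hrefs = [a["href"] for a in anchors if a.has_attr("href")]
    return hrefs[:limit]


# Hand-written sample that mimics the structure used in the example above.
sample = (
    '<div class="tF2Cxc"><div class="yuRUbf">'
    '<a href="https://example.com/a">A</a></div></div>'
    '<div class="tF2Cxc"><div class="yuRUbf">'
    '<a href="https://example.com/b">B</a></div></div>'
)

print(first_links(sample, ".tF2Cxc .yuRUbf a"))
```

Slicing the list rather than indexing with range(5) also avoids an error when fewer than five results are present.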

Another approach is to utilize the Google Organic Results API from SerpApi. This is a paid API with a free plan available.

The benefit in this case is that you only need to extract the desired data from a JSON string without worrying about handling or circumventing Google blocks.

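Code to integrate:

import os
from serpapi import GoogleSearch  # provided by the google-search-results package

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),  # API key read from the environment
}

search = GoogleSearch(params)
results = search.get_dict()

# Print the link of each organic result
for result in results["organic_results"]:
    print(result['link'])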

-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''

Full Disclosure: I am associated with SerpApi.
