Experiencing data loss while utilizing BeautifulSoup

I am working through the book 'Automate the Boring Stuff with Python' and practicing the project titled 'Project: “I’m Feeling Lucky” Google Search'.

Unfortunately, the CSS selector used in the project is not returning any results.

import requests, sys, webbrowser, bs4, pyperclip

# Take the search term from the command line, or from the clipboard if none is given
if len(sys.argv) > 1:
    address = ' '.join(sys.argv[1:])
else:
    address = pyperclip.paste()

res = requests.get('http://google.com/search?q=' + str(address))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, "html.parser")
linkElems = soup.select('.r a')
# Open the first five results in the browser
for i in range(5):
    webbrowser.open('http://google.com' + linkElems[i].get('href'))

Previously, I ran the same code successfully in the IDLE shell.

Now, however,

linkElems = soup.select('.r')

returns an empty list.

When I inspected the value returned by Beautiful Soup,

soup = bs4.BeautifulSoup(res.text,"html.parser")

I noticed that all elements with class='r' and class='rc' have disappeared inexplicably, even though they were present in the raw HTML file.

Could you please advise on the reasons behind this issue and how to prevent similar problems in the future?

Answer №1

To receive the version of the HTML that contains the class r, set a browser-like User-Agent in the request headers:

import requests
from bs4 import BeautifulSoup

address = 'linux'

headers={'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0'}

res = requests.get('http://google.com/search?q=' + str(address), headers=headers)
res.raise_for_status()
soup = BeautifulSoup(res.text,"html.parser")

linkElems = soup.select('.r a')

for a in linkElems:
    if a.text.strip() == '':
        continue
    print(a.text)

Output:

Linux.orghttps://www.linux.org/
Puhverdatud
Tõlgi see leht
Linux – Vikipeediahttps://et.wikipedia.org/wiki/Linux
Puhverdatud
Sarnased
Linux - Wikipediahttps://en.wikipedia.org/wiki/Linux

...and so forth.
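Whichever HTML version comes back, the loop in the question indexes linkElems[0] through linkElems[4] and raises IndexError as soon as the selector matches fewer than five elements. Below is a minimal sketch of a guard. The helper name first_hrefs and the hand-written HTML sample are assumptions for illustration; they stand in for a live Google response and are not part of bs4.

```python
import bs4


def first_hrefs(html, selector=".r a", limit=5):
    """Return up to `limit` href values from elements matching `selector`.

    `first_hrefs` is an illustrative helper, not part of bs4.
    """
    soup = bs4.BeautifulSoup(html, "html.parser")
    # Keep only anchors that actually carry an href attribute.
    hrefs = [a.get("href") for a in soup.select(selector) if a.get("href")]
    return hrefs[:limit]


# Hand-written sample standing in for a search results page (assumption).
sample = """
<div class="r"><a href="/url?q=https://www.linux.org/">Linux.org</a></div>
<div class="r"><a href="/url?q=https://en.wikipedia.org/wiki/Linux">Linux - Wikipedia</a></div>
"""

links = first_hrefs(sample)
if not links:
    print("Selector matched nothing; inspect res.text to see what was served.")
for href in links:
    print("http://google.com" + href)
```

Slicing the list instead of indexing a fixed range means that an empty or short result produces a message rather than a traceback, which makes a changed page layout easier to spot.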

Answer №2

Google may be blocking your request because the requests library identifies itself with a default user-agent of python-requests. Send a real browser user-agent so that the request is not blocked and does not receive different HTML with different elements and selectors. Keep in mind that changing the user-agent can itself lead to a different HTML response with different selectors.

It is beneficial to understand more about the user-agent and HTTP request headers.
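To see what the library sends by default, a request can be prepared without being sent. The sketch below assumes the search URL used elsewhere on this page and the Firefox user-agent string from Answer №1; it only inspects the prepared request and makes no network call.

```python
import requests

SEARCH_URL = "https://www.google.com/search"

# Build the request the way requests.get would, but do not send it.
session = requests.Session()
prepared = session.prepare_request(
    requests.Request("GET", SEARCH_URL, params={"q": "linux"})
)

print(prepared.url)                    # final URL including the encoded query string
print(prepared.headers["User-Agent"])  # library default, starts with "python-requests/"

# The same request with an explicit browser-like User-Agent.
browser_headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:68.0) Gecko/20100101 Firefox/68.0"
}
prepared_custom = session.prepare_request(
    requests.Request("GET", SEARCH_URL, params={"q": "linux"}, headers=browser_headers)
)

print(prepared_custom.headers["User-Agent"])
```

Passing the query through params rather than concatenating it into the URL also lets the library handle the encoding of spaces and special characters.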

Include the user-agent in the request headers:

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)

Consider using the lxml parser for faster parsing. It is a separate package and has to be installed first (for example with pip install lxml).

Code and a full example:

from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
  "q": "My query goes here"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
  link = result.select_one('.yuRUbf a')['href']
  print(link)

-----

'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''
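Class names such as .r in the question and .tF2Cxc above are not stable, so a script that depends on a single selector can silently stop matching. One possible mitigation, sketched here under the assumption that you maintain your own list of candidate selectors, is to try them in order and report which one matched. The helper select_with_fallback and the sample HTML are hypothetical and only illustrate the idea.

```python
from bs4 import BeautifulSoup


def select_with_fallback(soup, selectors):
    """Return (selector, elements) for the first selector that matches anything.

    Illustrative helper, not part of bs4. Returns (None, []) if nothing matches.
    """
    for selector in selectors:
        elements = soup.select(selector)
        if elements:
            return selector, elements
    return None, []


# Hand-written sample standing in for a results page (assumption).
sample = (
    '<div class="tF2Cxc"><div class="yuRUbf">'
    '<a href="https://example.com/">Example</a>'
    "</div></div>"
)
soup = BeautifulSoup(sample, "html.parser")

# Older selector first, newer one second; both come from this page.
used, anchors = select_with_fallback(soup, [".r a", ".tF2Cxc .yuRUbf a"])
print("matched selector:", used)
for a in anchors:
    print(a["href"])
```

When no selector matches, the function returns None for the selector, which is a clearer signal than an empty loop that the page layout or the served HTML has changed.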

Another approach is to utilize the Google Organic Results API from SerpApi. This is a paid API with a free plan available.

The benefit in this case is that you only need to extract the desired data from a JSON string without worrying about handling or circumventing Google blocks.

Code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "My query goes here",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
  print(result['link'])

-------
'''
https://dev.mysql.com/doc/refman/8.0/en/entering-queries.html
https://www.benlcollins.com/spreadsheets/google-sheets-query-sql/
https://www.exoscale.com/syslog/explaining-mysql-queries/
https://blog.hubspot.com/marketing/sql-tutorial-introduction
https://mode.com/sql-tutorial/sql-sub-queries/
https://www.mssqltips.com/sqlservertip/1255/getting-io-and-time-statistics-for-sql-server-queries/
https://stackoverflow.com/questions/2698401/how-to-store-mysql-query-results-in-another-table
https://www.khanacademy.org/computing/computer-programming/sql/relational-queries-in-sql/a/more-efficient-sql-with-query-planning-and-optimization
http://cidrdb.org/cidr2011/Papers/CIDR11_Paper7.pdf
https://www.sommarskog.se/query-plan-mysteries.html
'''

Full Disclosure: I am associated with SerpApi.
