My goal is to automate web scraping with RSelenium in R. I've managed to find and click a button on a webpage, but I'm struggling to extract the href attributes from the page after the click.
My full list has 4,000 species, but here is a small example:
Species <- c("Abies balsamea", "Alchemilla glomerulans", "Antennaria dioica",
             "Atriplex glabriuscula", "Brachythecium salebrosum")
Here's my current code:
library(RSelenium)
remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()
remDr$navigate("https://ser-sid.org/")
webElem <- remDr$findElement(using = "class", "flex")
# Find the input field and button within webElem
input_element <- webElem$findChildElement(using = "css selector", value = "input[type='text']")
button_element <- webElem$findChildElement(using = "css selector", value = "button")
# Input species name into the input field
input_element$sendKeysToElement(list("Abies balsamea"))
# Click the button to submit the form
button_element$clickElement()
Sys.sleep(5)
# Locate all <a> elements with species information
species_links <- remDr$findElements(using = "css selector", value = "a[href^='/species/']")
# Extract href attributes from the species links
hrefs <- sapply(species_links, function(link) {
  link$getElementAttribute("href")
})
# Drop missing values (in case some links lack an href attribute)
hrefs <- hrefs[!is.na(hrefs)]
# Print the extracted hrefs
print(hrefs)
The code doesn't throw any errors, but species_links ends up empty, which tells me the elements containing the species information are not being found.
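One thing I could check (a sketch along these lines, assuming the remDr session above is still open after the click) is the page source as Selenium sees it, to confirm whether any "/species/" hrefs are present in the rendered DOM at all. If the links are written as absolute URLs in the HTML, the a[href^='/species/'] selector would not match them.
# Diagnostic sketch: inspect the rendered DOM after the click
page_source <- remDr$getPageSource()[[1]]
# Does any "/species/" href exist in the rendered HTML at all?
grepl("/species/", page_source, fixed = TRUE)
# Save the HTML for manual inspection (relative vs. absolute hrefs, iframes, etc.)
writeLines(page_source, "page_after_click.html")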
I tried waiting for the page to load after clicking the button, but the content still doesn't seem to be fully loaded, or at least not rendered the way I expect.
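As a next step I'm considering replacing the fixed Sys.sleep(5) with a small polling helper along these lines (a sketch; the 20-second timeout and 1-second interval are arbitrary choices), which keeps retrying findElements() until the links show up or the timeout is reached:
# Sketch: poll for the links instead of relying on a single fixed sleep
wait_for_links <- function(remDr, selector, timeout = 20, interval = 1) {
  elapsed <- 0
  repeat {
    links <- remDr$findElements(using = "css selector", value = selector)
    if (length(links) > 0 || elapsed >= timeout) return(links)
    Sys.sleep(interval)
    elapsed <- elapsed + interval
  }
}
species_links <- wait_for_links(remDr, "a[href^='/species/']")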
When I manually search for 'Abies balsamea' on the webpage, I find this:
https://i.sstatic.net/lFnLP.png
From there, I'd like to retrieve at least that one link. Inspecting it in the webpage brings me to the image below:
https://i.sstatic.net/ZcNXU.png
Any suggestions on how to troubleshoot this issue and ensure successful extraction of hrefs after button clicks?
My end goal is to iterate through a species list like Species and build a data.frame containing the link(s) for each species.
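Roughly, I'm aiming for something like this (a sketch; get_species_hrefs() is a hypothetical wrapper around the search/click/extract steps above for a single species name):
# Sketch of the end goal: one row per species/link pair
results <- lapply(Species, function(sp) {
  hrefs <- get_species_hrefs(remDr, sp)   # hypothetical per-species wrapper
  if (length(hrefs) == 0) hrefs <- NA_character_
  data.frame(species = sp, href = unlist(hrefs), stringsAsFactors = FALSE)
})
species_links_df <- do.call(rbind, results)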
Edit based on Brett Donald's answer
Brett's solution seems better, but I haven't located the API documentation yet.
This is what I've tried:
library(httr)
# Define the API endpoint URL
url <- "https://fyxheguykvewpdeysvoh.supabase.co/rest/v1/species_summary"
# Define query parameters
params <- list(
  select = "*",
  or = "(has_germination.eq.true,has_oil.eq.true,has_protein.eq.true,has_dispersal.eq.true,has_seed_weights.eq.true,has_storage_behaviour.eq.true,has_morphology.eq.true)",
  genus = "ilike.Abies%",
  epithet = "ilike.balsamea%",
  order = "genus.asc.nullslast,epithet.asc.nullslast"
)
# Set request headers with the correct API key
headers <- add_headers(
  `Content-Type` = "application/json",
  Authorization = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImZ5eGhlZ3V5a3Zld3BkZXlzdm9oIiwicm9sZSI6ImFub24iLCJpYXQiOjE2NDc0MTY1MzQsImV4cCI6MTk2Mjk5MjUzNH0.XhJKVijhMUidqeTbH62zQ6r8cS6j22TYAKfbbRHMTZ8"
)
# Make a GET request
response <- GET(url, query = params, headers = headers)
# Check if the request was successful
if (http_type(response) == "application/json") {
  # Parse JSON response
  data <- content(response, "parsed")
  print(data)
} else {
  print("Error: Failed to retrieve data")
}
But I receive:
$message
[1] "No API key found in request"
$hint
[1] "No `apikey` request header or url param was found."