RSelenium: Issue with extracting hyperlinks from webpage following button click

My goal is to automate web scraping with RSelenium in R. I've managed to find and click a button on a webpage using RSelenium, but I'm struggling to extract href attributes from the page after clicking the button.

Although I have a list of 4000 species, here is an example:

Species <- c("Abies balsamea", "Alchemilla glomerulans", "Antennaria dioica",
"Atriplex glabriuscula", "Brachythecium salebrosum")

Here's my current code:

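library(RSelenium)

remDr <- remoteDriver(
  remoteServerAddr = "localhost",
  port = 4445L,
  browserName = "firefox"
)
remDr$open()

# Navigate to the search page before locating elements
# remDr$navigate(<search page URL>)
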
webElem <- remDr$findElement(using = "class", "flex")

# Find the input field and button within webElem
input_element <- webElem$findChildElement(using = "css selector", value = "input[type='text']")
button_element <- webElem$findChildElement(using = "css selector", value = "button")

# Input species name into the input field

input_element$sendKeysToElement(list("Abies balsamea"))

# Click the button to submit the form
button_element$clickElement()


# Locate all <a> elements with species information
species_links <- remDr$findElements(using = "css selector", value = "a[href^='/species/']")

# Extract href attributes from the species links
hrefs <- sapply(species_links, function(link) {
  link$getElementAttribute("href")
})

# Remove NULL values (in case some links don't have href attributes)
hrefs <- hrefs[!vapply(hrefs, is.null, logical(1))]

# Print the extracted hrefs
print(hrefs)

The code doesn't throw any errors but species_links ends up empty, indicating that the elements with species information are not being found.

I tried waiting for the page to load after clicking the button, but the page content still doesn't seem to load fully, or not in the form I expect.
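A likely cause of an empty species_links is that the results are rendered by JavaScript some time after the click, so findElements() runs before the links exist. Below is a minimal sketch of a polling wait. The helper wait_for_elements() is hypothetical (it is not part of RSelenium), and the timeout and interval values are arbitrary assumptions.

```r
# Hypothetical helper (not part of RSelenium): call `finder` repeatedly
# until it returns a non-empty list or `timeout` seconds have elapsed.
wait_for_elements <- function(finder, timeout = 10, interval = 0.5) {
  deadline <- Sys.time() + timeout
  repeat {
    # Treat lookup errors as "nothing found yet"
    found <- tryCatch(finder(), error = function(e) list())
    if (length(found) > 0 || Sys.time() >= deadline) {
      return(found)
    }
    Sys.sleep(interval)
  }
}

# Usage with the session from the code above:
# species_links <- wait_for_elements(function() {
#   remDr$findElements(using = "css selector", value = "a[href^='/species/']")
# })
```

If the result is still empty after the timeout, the selector itself is probably the thing to check, rather than the timing.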

When I manually search for 'Abies balsamea' on the webpage, I find this:

From there, I aim to retrieve this link at least:

Inspecting the element in the browser shows this:

Any suggestions on how to troubleshoot this issue and ensure successful extraction of hrefs after button clicks?

My end goal is to iterate through a species list like Species and create a data.frame containing the links for each species.

Edit based on Brett Donald's answer

Brett's solution seems better, but I haven't located the API documentation yet.

This is what I've tried:


library(httr)

# Define the API endpoint URL
url <- ""

# Define query parameters
params <- list(
  select = "*",
  or = "(has_germination.eq.true,has_oil.eq.true,has_protein.eq.true,has_dispersal.eq.true,has_seed_weights.eq.true,has_storage_behaviour.eq.true,has_morphology.eq.true)",
  genus = "ilike.Abies%",
  epithet = "ilike.balsamea%",
  order = "genus.asc.nullslast,epithet.asc.nullslast"
)

# Set request headers with the correct API key
headers <- add_headers(
  `Content-Type` = "application/json",
  Authorization = "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJzdXBhYmFzZSIsInJlZiI6ImZ5eGhlZ3V5a3Zld3BkZXlzdm9oIiwicm9sZSI6ImFub24iLCJpYXQiOjE2NDc0MTY1MzQsImV4cCI6MTk2Mjk5MjUzNH0.XhJKVijhMUidqeTbH62zQ6r8cS6j22TYAKfbbRHMTZ8"
)

# Make a GET request
response <- GET(url, query = params, headers = headers)

# Check if the request was successful
if (http_type(response) == "application/json") {
  # Parse JSON response
  data <- content(response, "parsed")
} else {
  print("Error: Failed to retrieve data")
}

But I receive:

[1] "No API key found in request"

[1] "No `apikey` request header or url param was found."
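The error message suggests the cause: the service looks for the key in a header (or URL parameter) named apikey, and the attempt above sends only Authorization. In addition, httr's documentation says that only unnamed arguments to GET() are combined into the request configuration, so a named `headers = headers` argument does not appear to be applied as request headers. Here is a sketch of one way to build the headers. The helper build_api_headers() and the variable api_key are hypothetical, and sending the same key as a Bearer token is an assumption about this service.

```r
# Hypothetical helper: build the header set as a named character vector.
# Assumption: the service accepts the same anonymous key in the `apikey`
# header and as a Bearer token in `Authorization`.
build_api_headers <- function(api_key) {
  c(
    apikey = api_key,
    Authorization = paste("Bearer", api_key)
  )
}

# Usage (requires httr; `url` and `params` as defined in the attempt above,
# `api_key` holding the key string):
# response <- httr::GET(
#   url,
#   query = params,
#   httr::add_headers(.headers = build_api_headers(api_key))  # unnamed argument
# )
```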

Answer №1

Upon reviewing your code, I see no issues with it, although I am not well-versed in RSelenium.

If I were in your shoes, I might consider obtaining the data differently by mimicking the website's API calls rather than scraping it with a robotic browser.

By watching the network tab of your browser inspector while running a search on the site, you can find both the API endpoint URL being called and the API key.

API endpoint URL (with parameters included):

*&or=%28has_germination.eq.true%2Chas_oil.eq.true%2Chas_protein.eq.true%2Chas_dispersal.eq.true%2Chas_seed_weights.eq.true%2Chas_storage_behaviour.eq.true%2Chas_morphology.eq.true%29&genus=ilike.Abies%25&epithet=ilike.balsamea%25&order=genus.asc.nullslast%2Cepithet.asc.nullslast

API key (found in request headers)


After replicating these details in a new Postman GET request, I received a JSON response like this:

[
  {
    "genus": "Abies",
    "epithet": "balsamea",
    "id": "ef741ce8-6911-4286-b79e-3ff0804520fb",
    "infraspecies_rank": null,
    "infraspecies_epithet": null,
    "has_germination": false,
    "has_oil": true,
    "has_protein": false,
    "has_dispersal": true,
    "has_seed_weights": true,
    "has_storage_behaviour": true,
    "has_morphology": false
  },
  {
    "genus": "Abies",
    "epithet": "balsamea",
    "id": "024cde5f-7cc5-48b7-89fd-be95638c8f2a",
    "infraspecies_rank": "var.",
    "infraspecies_epithet": "balsamea",
    "has_germination": true,
    "has_oil": false,
    "has_protein": false,
    "has_dispersal": false,
    "has_seed_weights": true,
    "has_storage_behaviour": true,
    "has_morphology": false
  }
]

You could easily automate these requests using any language of your choice. Personally, I would opt for Node.js. Wouldn't that be simpler than resorting to web scraping with a robotic browser?
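Since the question is written in R, the same automation could be sketched there instead. The two helpers below, species_query() and records_to_df(), are hypothetical names. They assume each species is a plain "Genus epithet" binomial and that the parsed response is a list of records shaped like the JSON above.

```r
# Hypothetical helper: turn "Genus epithet" into the query parameters
# used in the request above. Assumes a two-word binomial name.
species_query <- function(species) {
  parts <- strsplit(trimws(species), "\\s+")[[1]]
  list(
    select  = "*",
    or      = "(has_germination.eq.true,has_oil.eq.true,has_protein.eq.true,has_dispersal.eq.true,has_seed_weights.eq.true,has_storage_behaviour.eq.true,has_morphology.eq.true)",
    genus   = paste0("ilike.", parts[1], "%"),
    epithet = paste0("ilike.", parts[2], "%"),
    order   = "genus.asc.nullslast,epithet.asc.nullslast"
  )
}

# Hypothetical helper: flatten parsed JSON records (a list of named lists)
# into a data.frame with one row per record.
records_to_df <- function(species, records) {
  if (length(records) == 0) {
    return(data.frame(species = character(), id = character(),
                      stringsAsFactors = FALSE))
  }
  data.frame(
    species = species,
    id = vapply(records, function(r) r$id, character(1)),
    stringsAsFactors = FALSE
  )
}

# Usage (requires httr; `url` and `headers` must be supplied by the caller):
# results <- do.call(rbind, lapply(Species, function(sp) {
#   resp <- httr::GET(url, query = species_query(sp), headers)
#   records_to_df(sp, httr::content(resp, "parsed"))
# }))
```

Whether a species page link can be built from the returned id is an assumption. The question only shows that the hrefs start with /species/, so that would need checking against one known link first.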

PS. Since the data in this database is reportedly under a Creative Commons License, you might have luck contacting the Society for Ecological Restoration to access the data directly instead of having to extract it species by species.

Answer №2

If you want to utilize the rsDriver launch and xpath for searching with contains(), follow these steps:


library(RSelenium)

port <- 4567
#to terminate port for reuse use system(paste0("sudo kill -9 $(lsof -t -i:",port," -sTCP:LISTEN)"))

#my standard launch function with enhanced privacy features and image blocking for quicker loading
eCaps = list(`moz:firefoxOptions` = list(
  args = list("--disable-gpu","--no-sandbox","--disable-application-cache","--disable-dev-shm-usage", "--disable-extensions"),
  prefs =list(
    "browser.cache.disk.enable" = FALSE,
    "browser.cache.memory.enable" = FALSE,
    "browser.cache.offline.enable" = FALSE,
    "browser.sessionstore.max_tabs_undo" = 0,
    "network.http.use-cache" = FALSE,
    "permissions.default.image"= 2,
    "privacy.clearOnShutdown.cache" = TRUE,
    "privacy.clearOnShutdown.cookies" = TRUE)
))

rD <- rsDriver(browser = "firefox", extraCapabilities = eCaps, port = as.integer(port), check = FALSE)
remDr <- rD$client

# Navigate to the search page before locating elements
# remDr$navigate(<search page URL>)


webElem <- remDr$findElement(using = "class", "flex")

# Locate the input field and button within webElem
input_element <- webElem$findChildElement(using = "css selector", value = "input[type='text']")
button_element <- webElem$findChildElement(using = "css selector", value = "button")

# Input the species name into the input field

input_element$sendKeysToElement(list("Abies balsamea"))

# Click the button to submit the form
button_element$clickElement()


# Find all <a> elements with species information
species_links <- remDr$findElements(using = "xpath", "//a[contains(@href,'species')]")

# Extract the href attributes from the species links
hrefs <- sapply(species_links, function(link) {
  link$getElementAttribute("href")
})

# Filter out NULL values (in case some links don't have href attributes)
hrefs <- hrefs[!vapply(hrefs, is.null, logical(1))]

# Display the extracted hrefs
print(hrefs)

[1] ""

[1] ""

