Using rvest to extract picture data

I've been grappling with this issue for a few weeks now without success. My ultimate goal is to extract every image from the website provided (link:). To start, I am trying to retrieve just one image path, which is stored in the alt attribute of the img tag in the HTML.

The snippet of HTML code looks like this:

<div class="l-grid__item l-grid__item--3/12 l-grid__item--12/12@mobile--sm l-grid__item--4/12@desktop l-grid__item--6/12@tablet"><div tabindex="0" class="c-card u-flex u-flex--column u-height--100% u-cursor--pointer u-bxs--dark-lg:hover c-card--@print"><div class="u-height--100% u-width--100% u-p u-flex u-flex--centered u-mb--auto"><div aria-hidden="true" class="u-max-width--80% u-max-height--250px"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" class="u-max-height--250px u-height--auto u-width--auto u-block" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div><div class="u-flex u-flex--column u-flex--no-shrink u-p u-bg--off-white u-fw--bold u-color--primary u-text--center u-bt--light-gray"><div class="u-cursor--pointer u-mb--xs">AANDAHL, Fred George</div><div class="u-fz--sm u-fw--semibold">1897 – 1966</div></div></div></div>

I have tried using the R code below, but I keep getting character(0):

library(httr)
library(rvest)

# Fetch the HTML content with a custom User-Agent
response <- GET("https://bioguide.congress.gov/search", 
                user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"))

# Parse the content
page <- read_html(content(response, as = "text", encoding = "UTF-8"))

# Navigate to the div with class starting with 'l-grid__item' and extract img alt attributes
img_alt_values <- page %>%
  html_nodes(xpath = "//div[starts-with(@class, 'l-grid__item')]") %>
  html_nodes(xpath = ".//img") %>
  html_attr("alt")

Does anyone have any suggestions on how to overcome this hurdle?

Answer №1

Examining the network traffic shows that the data is fetched from an API: the search function on the page sends a POST request with a JSON payload. With httr2 you can send these requests yourself and retrieve up to 100 records per request. In the code below, each request is limited to 3 records to keep the example small.
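As a quick check that the XPath in the question is not the culprit, the sketch below runs the same selectors on a trimmed copy of the markup quoted in the question (class lists shortened for readability). It uses html_elements(), the current name for html_nodes() in rvest 1.0 and later. Getting the alt value back here suggests that the card markup is simply absent from the HTML that GET() receives; that last point is an inference from the network traffic, not something this snippet proves.

```r
library(rvest)

# Trimmed copy of the card markup from the question (class lists shortened)
html <- paste0(
  '<div class="l-grid__item l-grid__item--3/12">',
  '<div class="c-card">',
  '<img alt="/photo/66c88d1d7401a93215e0b225.jpg" ',
  'src="/photo/66c88d1d7401a93215e0b225.jpg">',
  '</div></div>'
)

# Parse the literal string instead of fetching the live page
card <- read_html(html)

# Same two-step XPath as in the question
alt_values <- card |>
  html_elements(xpath = "//div[starts-with(@class, 'l-grid__item')]") |>
  html_elements(xpath = ".//img") |>
  html_attr("alt")

print(alt_values)
```

If the same selectors return character(0) on the live page, the img elements are most likely not in the initial response at all, which is why the API route is used instead.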

The URL and payload are as follows:

library(httr2)
library(jsonlite)
library(tidyverse)

# API address
url <- "https://app-elastic-prod-eus2-001.azurewebsites.net/search"

# JSON payload  
payload_string <- "{\"index\": \"bioguideprofiles\", \"aggregations\":[{\"field\":\"jobPositions.congressAffiliation.congress.name\",\"subFields\":[\"jobPositions.congressAffiliation.congress.startDate\",\"jobPositions.congressAffiliation.congress.endDate\"]},{\"field\":\"jobPositions.congressAffiliation.partyAffiliation.party.name\"},{\"field\":\"jobPositions.job.name\"},{\"field\":\"jobPositions.congressAffiliation.represents.regionCode\"}],\"size\":12,\"from\":0,\"sort\":[{\"_score\":true},{\"field\":\"unaccentedFamilyName\",\"order\":\"asc\"},{\"field\":\"unaccentedGivenName\",\"order\":\"asc\"},{\"field\":\"unaccentedMiddleName\",\"order\":\"asc\"}],\"keyword\":\"\",\"filters\":{},\"matches\":[],\"searchType\":\"OR\",\"applicationName\":\"bioguide.house.gov\"}"

To make the from and size arguments easy to change with req_body_json_modify(), first convert the payload to an R list with fromJSON():

# Convert to R list
payload_list <- fromJSON(payload_string)

# Retrieve the first total_records records, request_size records per request
request_size <- 3L         # Max 100 per request
total_records <- 15L       # 12953 records available
from <- seq(1L, total_records, request_size) - 1L  # Sequence of start positions

# Generate base request
req <- request(url) |>
    req_method("POST") |>
    req_body_json(payload_list) 

# Generate list of requests (5 requests of 3 records each)
requests <- from |> 
   lapply(\(n) req |> req_body_json_modify(from = n, size = request_size))

# Execute requests
responses <- req_perform_sequential(requests, on_error = "return")

# Parse responses and extract image URL
results <- resps_data(
  responses,
  \(r) r |>
    resp_body_json(simplifyDataFrame = TRUE) |>
    pluck("filteredHits")  |>
    select(starts_with("unaccented"), any_of("image"))
  ) |>
  bind_rows() |>
  hoist("image", "contentUrl") |> 
  select(-image) |> 
  mutate(image_url = ifelse(is.na(contentUrl), NA, paste0("https://bioguide.congress.gov/photo/", basename(contentUrl))), .keep = "unused") |> 
  as_tibble()

Now, the results variable holds the extracted image URLs:

# A tibble: 15 × 4
   unaccentedFamilyName unaccentedGivenName unaccentedMiddleName image_url                                       
   <chr>                <chr>               <chr>                <chr>                                           
 1 Aandahl              Fred                George               https://bioguide.congress.gov/photo/66c88d1d740…
 2 Abbitt               Watkins             Moorman              https://bioguide.congress.gov/photo/ad79716f164…
 3 Abbot                Joel                NA                   NA                                              
 4 Abbott               Amos                NA                   NA                                              
 5 Abbott               Joseph              Carter               https://bioguide.congress.gov/photo/39253c461f2…
 6 Abbott               Joseph              NA                   https://bioguide.congress.gov/photo/43ba0fd5299…
 7 Abbott               Josiah              Gardner              https://bioguide.congress.gov/photo/470dc5df4ba…
 8 Abbott               Nehemiah            NA                   NA                                              
 9 Abdnor               James               NA                   https://bioguide.congress.gov/photo/a32ba2ea44f…
10 Abel                 Hazel               Hempel               https://bioguide.congress.gov/photo/07f3a896ce1…
11 Abele                Homer               E.                   https://bioguide.congress.gov/photo/a58aa67c32f…
12 Abercrombie          James               NA                   NA                                              
13 Abercrombie          John                William              https://bioguide.congress.gov/photo/76a90e5795f…
14 Abercrombie          Neil                NA                   https://bioguide.congress.gov/photo/66cbb14989f…
15 Abernethy            Charles             Laban                https://bioguide.congress.gov/photo/00ff9ca93d0…
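
The 15 rows above correspond to five requests of three records each. Below is a minimal sketch of how the start offsets could be computed for any total. page_offsets() is a hypothetical helper, not part of httr2, and the sketch assumes that the 12953 total and the limit of 100 records per request noted in the code comments still hold.

```r
# Hypothetical helper: zero-based start offsets for paging through
# total_records records, request_size records per request
page_offsets <- function(total_records, request_size) {
  seq(0L, total_records - 1L, by = request_size)
}

# Same values as seq(1L, total_records, request_size) - 1L in the answer
small_run <- page_offsets(15L, 3L)
print(small_run)

# Offsets for a full run at the assumed maximum of 100 records per request
full_run <- page_offsets(12953L, 100L)
print(length(full_run))  # number of requests needed
print(tail(full_run, 1)) # start position of the last request
```

Passing full_run in place of from, with request_size set to 100L, would page through the whole index. Whether the API tolerates that many sequential requests is not something the answer states, so a delay between requests may be a sensible precaution.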

Each response contains additional fields beyond the names and image; organizing and manipulating those is left to you.
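
Since the stated goal is to extract the images themselves, here is a rough sketch of one way to save the files once the image_url column exists. photo_destinations() and download_photos() are hypothetical helpers, the example URLs are placeholders rather than real API output, and the download function is only defined here, not run. It assumes the photo URLs can be fetched with a plain download.file() call, which the answer does not confirm.

```r
# Hypothetical helper: local file paths for the non-missing image URLs
photo_destinations <- function(urls, dir = "photos") {
  urls <- urls[!is.na(urls)]
  file.path(dir, basename(urls))
}

# Hypothetical helper: download each non-missing URL into dir
# (defined only; calling it needs network access)
download_photos <- function(urls, dir = "photos") {
  dir.create(dir, showWarnings = FALSE)
  urls <- urls[!is.na(urls)]
  dest <- photo_destinations(urls, dir)
  for (i in seq_along(urls)) {
    # mode = "wb" keeps binary image data intact on Windows
    download.file(urls[i], dest[i], mode = "wb")
  }
  invisible(dest)
}

# Placeholder URLs for illustration; real values come from results$image_url
example_urls <- c(
  "https://bioguide.congress.gov/photo/example-a.jpg",
  NA,
  "https://bioguide.congress.gov/photo/example-b.jpg"
)
dest_paths <- photo_destinations(example_urls)
print(dest_paths)
```

With the tibble from the answer, the call would be download_photos(results$image_url). Records without a photo are skipped because their image_url is NA.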
