Using rvest to extract picture data

Question

I've been struggling with this for a few weeks without success. My ultimate goal is to extract every image from the website provided (link:). As a first step, I'm trying to retrieve a single image path, which is stored in the alt attribute of an img element in the HTML.

The snippet of HTML code looks like this:

<div class="l-grid__item l-grid__item--3/12 l-grid__item--12/12@mobile--sm l-grid__item--4/12@desktop l-grid__item--6/12@tablet"><div tabindex="0" class="c-card u-flex u-flex--column u-height--100% u-cursor--pointer u-bxs--dark-lg:hover c-card--@print"><div class="u-height--100% u-width--100% u-p u-flex u-flex--centered u-mb--auto"><div aria-hidden="true" class="u-max-width--80% u-max-height--250px"><img alt="/photo/66c88d1d7401a93215e0b225.jpg" class="u-max-height--250px u-height--auto u-width--auto u-block" src="/photo/66c88d1d7401a93215e0b225.jpg"></div></div><div class="u-flex u-flex--column u-flex--no-shrink u-p u-bg--off-white u-fw--bold u-color--primary u-text--center u-bt--light-gray"><div class="u-cursor--pointer u-mb--xs">AANDAHL, Fred George</div><div class="u-fz--sm u-fw--semibold">1897 – 1966</div></div></div></div>

I have tried using the R code below, but I keep getting character(0):

library(httr)
library(rvest)

# Fetch the HTML content with a custom User-Agent
response <- GET("https://bioguide.congress.gov/search", 
                user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"))

# Parse the content
page <- read_html(content(response, as = "text", encoding = "UTF-8"))

# Navigate to the div with class starting with 'l-grid__item' and extract img alt attributes
img_alt_values <- page %>%
  html_nodes(xpath = "//div[starts-with(@class, 'l-grid__item')]") %>
  html_nodes(xpath = ".//img") %>
  html_attr("alt")

Does anyone have any suggestions on how to overcome this hurdle?

html css r rvest

Answer 1

Inspecting the network traffic shows that the page gets its data from an API: the search function sends a POST request with a JSON payload. That would explain the character(0) result, since the image cards are not part of the HTML that read_html() receives. With httr2 you can send those requests yourself and retrieve up to 100 records per request; the code below asks for only 3 per request to keep the example small.
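A minimal sketch of why the rvest attempt comes back empty, assuming the server's response is only an application shell that JavaScript fills in later. Both HTML strings are made-up stand-ins (the second is trimmed from the snippet in the question), not the site's actual response, and get_alt() is a hypothetical helper:

```r
library(rvest)

# Hypothetical stand-in for the HTML the server sends before any JavaScript runs
static_shell <- minimal_html('<div id="app"></div>')

# Hypothetical stand-in for the DOM after the cards have been rendered
rendered <- minimal_html(
  '<div class="l-grid__item l-grid__item--3/12">
     <img alt="/photo/66c88d1d7401a93215e0b225.jpg" src="/photo/66c88d1d7401a93215e0b225.jpg">
   </div>'
)

# Same XPath idea as in the question, collapsed into one expression
get_alt <- function(doc) {
  doc |>
    html_elements(xpath = "//div[starts-with(@class, 'l-grid__item')]//img") |>
    html_attr("alt")
}

shell_alt    <- get_alt(static_shell)  # nothing to match in the shell
rendered_alt <- get_alt(rendered)      # the alt attribute of the one img
```

Here shell_alt is an empty character vector, which matches the symptom in the question, while rendered_alt holds the photo path. The XPath itself is fine; it simply has nothing to match when the cards are missing from the document.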

The URL and payload are as follows:

library(httr2)
library(jsonlite)
library(tidyverse)

# API address
url <- "https://app-elastic-prod-eus2-001.azurewebsites.net/search"

# JSON payload  
payload_string <- "{\"index\": \"bioguideprofiles\", \"aggregations\":[{\"field\":\"jobPositions.congressAffiliation.congress.name\",\"subFields\":[\"jobPositions.congressAffiliation.congress.startDate\",\"jobPositions.congressAffiliation.congress.endDate\"]},{\"field\":\"jobPositions.congressAffiliation.partyAffiliation.party.name\"},{\"field\":\"jobPositions.job.name\"},{\"field\":\"jobPositions.congressAffiliation.represents.regionCode\"}],\"size\":12,\"from\":0,\"sort\":[{\"_score\":true},{\"field\":\"unaccentedFamilyName\",\"order\":\"asc\"},{\"field\":\"unaccentedGivenName\",\"order\":\"asc\"},{\"field\":\"unaccentedMiddleName\",\"order\":\"asc\"}],\"keyword\":\"\",\"filters\":{},\"matches\":[],\"searchType\":\"OR\",\"applicationName\":\"bioguide.house.gov\"}"

To make the from argument easy to change between requests, convert the payload to an R list first; req_body_json_modify() can then override individual fields of the request body:

# Convert to R list
payload_list <- fromJSON(payload_string)

# Fetch the first total_records records in chunks of request_size
request_size <- 3L         # Max 100 per request
total_records <- 15L       # 12953 records available
from <- seq(1L, total_records, request_size) - 1L  # Sequence of start positions

# Generate base request
req <- request(url) |>
    req_method("POST") |>
    req_body_json(payload_list) 

# Generate list of requests (5 requests of 3 records each)
requests <- from |> 
   lapply(\(n) req |> req_body_json_modify(from = n, size = request_size))

# Execute requests
responses <- req_perform_sequential(requests, on_error = "return")

# Parse responses and extract image URL
results <- resps_data(
  responses,
  \(r) r |>
    resp_body_json(simplifyDataFrame = TRUE) |>
    pluck("filteredHits")  |>
    select(starts_with("unaccented"), any_of("image"))
  ) |>
  bind_rows() |>
  hoist("image", "contentUrl") |> 
  select(-image) |> 
  mutate(image_url = ifelse(is.na(contentUrl), NA, paste0("https://bioguide.congress.gov/photo/", basename(contentUrl))), .keep = "unused") |> 
  as_tibble()

Now, the results variable holds the extracted image URLs:

# A tibble: 15 × 4
   unaccentedFamilyName unaccentedGivenName unaccentedMiddleName image_url                                       
   <chr>                <chr>               <chr>                <chr>                                           
 1 Aandahl              Fred                George               https://bioguide.congress.gov/photo/66c88d1d740…
 2 Abbitt               Watkins             Moorman              https://bioguide.congress.gov/photo/ad79716f164…
 3 Abbot                Joel                NA                   NA                                              
 4 Abbott               Amos                NA                   NA                                              
 5 Abbott               Joseph              Carter               https://bioguide.congress.gov/photo/39253c461f2…
 6 Abbott               Joseph              NA                   https://bioguide.congress.gov/photo/43ba0fd5299…
 7 Abbott               Josiah              Gardner              https://bioguide.congress.gov/photo/470dc5df4ba…
 8 Abbott               Nehemiah            NA                   NA                                              
 9 Abdnor               James               NA                   https://bioguide.congress.gov/photo/a32ba2ea44f…
10 Abel                 Hazel               Hempel               https://bioguide.congress.gov/photo/07f3a896ce1…
11 Abele                Homer               E.                   https://bioguide.congress.gov/photo/a58aa67c32f…
12 Abercrombie          James               NA                   NA                                              
13 Abercrombie          John                William              https://bioguide.congress.gov/photo/76a90e5795f…
14 Abercrombie          Neil                NA                   https://bioguide.congress.gov/photo/66cbb14989f…
15 Abernethy            Charles             Laban                https://bioguide.congress.gov/photo/00ff9ca93d0…
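
To go beyond the first 15 records, the same offset arithmetic scales up. The sketch below runs without any network access; page_offsets() is a hypothetical helper, and the 100-record cap and the total of 12953 are taken from the comments in the code above, so they may no longer match the live API:

```r
# Start offsets for paging through `total` records, `size` per request.
# The cap of 100 per request is the limit stated in the answer, not verified here.
page_offsets <- function(total, size) {
  stopifnot(total > 0, size > 0, size <= 100)
  seq(0L, total - 1L, by = size)
}

offsets_small <- page_offsets(15L, 3L)       # same values as `from` in the answer
offsets_all   <- page_offsets(12953L, 100L)  # one offset per request for every record

n_requests <- length(offsets_all)            # how many requests would be needed
last_size  <- 12953L - tail(offsets_all, 1)  # the final request is only partly filled
```

With these numbers, 130 requests would cover the whole index, and the last one would return 53 records. Passing offsets_all and a size of 100 into the lapply() call above would build the full list of requests.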

Each response contains additional fields beyond the ones selected here; extracting and tidying them is left to you.
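
Since the goal in the question is the images themselves, here is a rough sketch of one way to save them once the URLs are known. photo_url() mirrors the mutate() step above, download_photos() is a hypothetical helper, and the "photos" directory and the half-second pause are assumptions rather than anything the site documents:

```r
# Build the full photo URL from a relative path such as the img src in the question.
# The base URL is the one used in the answer's mutate() step.
photo_url <- function(content_url) {
  ifelse(
    is.na(content_url),
    NA_character_,
    paste0("https://bioguide.congress.gov/photo/", basename(content_url))
  )
}

# Hypothetical helper: download every non-missing URL into `dir`.
# Returns the local paths invisibly; mode = "wb" keeps binary files intact on Windows.
download_photos <- function(urls, dir = "photos") {
  urls <- urls[!is.na(urls)]
  dir.create(dir, showWarnings = FALSE, recursive = TRUE)
  dest <- file.path(dir, basename(urls))
  for (i in seq_along(urls)) {
    # try() lets the loop continue if a single download fails
    try(download.file(urls[i], dest[i], mode = "wb", quiet = TRUE))
    Sys.sleep(0.5)  # assumed politeness delay, not a documented requirement
  }
  invisible(dest)
}

example_urls <- photo_url(c("/photo/66c88d1d7401a93215e0b225.jpg", NA))
# download_photos(results$image_url)  # uncomment to actually fetch the files
```

The download call is left commented out so the block can be run without touching the network. Records without a photo stay NA and are skipped.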
