Exploring alphabetical web pages (rvest)

Question

Having exhausted every resource and solution I could find, I am asking for help with a web-scraping problem in R using rvest. I have tried to describe the problem in detail to avoid confusion.

The Challenge: My goal is to extract author names from a conference webpage where authors are listed alphabetically by last name. To do this, I need a for loop that calls follow_link() 25 times to navigate to each page and retrieve the author data.

The conference website:

I have made two attempts in R utilizing rvest, both met with obstacles.

Solution 1 (Alphabetical Page Navigation)

library(rvest)

lttrs <- LETTERS[seq( from = 1, to = 26 )] # create character vector
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list() #create list to store each page's author information

for(i in 1:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works up to a point and then fails. It navigates the lettered pages correctly until it reaches certain transitions, where it fetches the wrong page.

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home

Solution 2 (CSS Selector Page Identification)

Using a CSS selector on the webpage, the individual lettered pages are identified as "a:nth-child(1-26)". So I rewrote my loop to use this CSS identifier.

tempList <- list()
for(i in 2:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(',i,')',sep = '')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This method partially works. Again, it runs into problems at specific transitions, as shown below.

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Notably, this approach fails to capture B, C, and D and navigates to the wrong pages at that point. I would greatly appreciate any suggestions on how to adjust my code so that it cycles through all 26 alphabetical pages.

Thank you for your assistance!

css r web-scraping rvest

Answer 1

Welcome to the Stack Overflow community (and congratulations on asking a great first question).

You're also in luck: the robots.txt for that website has many entries, but none of them restrict what you're doing.
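As for why both loops in the question drift off course: when follow_link() is given a string, rvest documents it as selecting the first link whose text contains that string, so a single capital letter can match an unrelated link that appears earlier on the page. Similarly, a:nth-child(n) matches any link that is the nth child of its parent, anywhere in the document, not only the lettered links. The sketch below reproduces the text-matching effect offline on made-up markup and contrasts it with selecting on the href prefix. The HTML, the first_match() helper, and the example.org URL are assumptions for illustration, not the site's real structure.

```r
library(rvest)

# Hypothetical navigation markup (an assumption, not copied from the site)
nav <- read_html('<html><body>
  <a href="authora.html">Author Index</a>
  <a href="http://example.org/home">Meeting Home</a>
  <div class="letters">
    <a href="authora.html">A</a>
    <a href="authorb.html">B</a>
    <a href="authori.html">I</a>
    <a href="authorm.html">M</a>
  </div>
</body></html>')

links     <- html_nodes(nav, "a")
link_text <- html_text(links)
link_href <- html_attr(links, "href")

# Rough emulation of a text-based lookup: the first link whose text
# contains the letter wins, even if it is not a lettered link
first_match <- function(letter) {
  link_href[grepl(letter, link_text, fixed = TRUE)][1]
}

first_match("I")  # "authora.html", matched through "Author Index"
first_match("M")  # "http://example.org/home", matched through "Meeting Home"

# Selecting on the href prefix ignores link text entirely;
# unique() drops the repeated authora.html from the extra link
letter_pages <- nav %>%
  html_nodes("a[href^='author']") %>%
  html_attr("href") %>%
  unique()

letter_pages
```

If the real page behaves like this sketch, it would account for the jump back to authora.html at "I" and the jump to the home page at "M" in the first log. Because the logs show the pages following an authora.html to authorz.html pattern, another option would be to skip link-following and build the 26 URLs directly, for example with sprintf("author%s.html", letters), assuming the pattern holds for every letter.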

To extract all the links in the alphabetical pagination at the bottom of the page, use html_nodes(pg, "a[href^='author']"). The following code fetches paper links for all authors:


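library(rvest)
library(purrr)
library(dplyr) # progress_estimated() and data_frame(), both deprecated in newer releases

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")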

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  
  map_df(~{

    pb$tick()$print()

    Sys.sleep(5)

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_attr("href") %>% 
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
## # A tibble: 34,983 x 3
##    author               paper  paper_url                                                    
##    <chr>                <chr>  <chr>                                                        
##  1 Aadahl, Kristopher   296-5  https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
##  2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
##  3 Abbey, Alyssa        54-4   https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
##  4 Abbott, Dallas H.    341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
##  5 Abbott Jr., David M. 38-6   https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
##  6 Abbott, Grant        58-7   https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
##  7 Abbott, Jared        29-10  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
##  8 Abbott, Jared        317-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
##  9 Abbott, Kathryn A.   187-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D.       208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

If you need specific information from the individual paper pages, you can retrieve it by following each paper_url.

There is no need to wait ~3 minutes for the scrape to finish: the 'author_papers' data frame is available as an RDS file. Read it using:

readRDS(url("https://rud.is/dl/author-papers.rds"))

If you plan to scrape all 34,983 papers, please be respectful and use a crawl delay.
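A crawl delay can be kept in one place by wrapping the fetch step, in the same spirit as the Sys.sleep(5) call in the code above. This is only a minimal sketch: fetch_politely() and its arguments are hypothetical names, and the stand-in fetcher (toupper) is used so the example runs without touching the network.

```r
# Hypothetical helper: apply `fetch` to each URL, pausing `delay` seconds
# between consecutive requests
fetch_politely <- function(urls, fetch, delay = 5) {
  results <- vector("list", length(urls))
  for (i in seq_along(urls)) {
    if (i > 1) Sys.sleep(delay)         # no pause before the first request
    results[i] <- list(fetch(urls[[i]])) # list() keeps NULL results in place
  }
  names(results) <- urls
  results
}

# Offline demonstration with a stand-in fetcher and a short delay
demo <- fetch_politely(c("authora.html", "authorb.html"),
                       fetch = toupper, delay = 0.1)
demo[["authorb.html"]]
```

For a real crawl, fetch could be rvest's read_html and delay could be set to whatever the site's robots.txt asks for, if it states a value.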

UPDATE

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  
  map_df(~{

    pb$tick()$print()

    Sys.sleep(5)

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>% 
            html_text(trim = TRUE) %>% 
            paste0(collapse=" ") %>% 
            grepl("*", ., fixed=TRUE)
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
## # A tibble: 22,545 x 2
##    author               is_presenting
##    <chr>                <lgl>        
##  1 Aadahl, Kristopher   FALSE        
##  2 Aanderud, Zachary T. FALSE        
##  3 Abbey, Alyssa        TRUE         
##  4 Abbott, Dallas H.    FALSE        
##  5 Abbott Jr., David M. TRUE         
##  6 Abbott, Grant        FALSE        
##  7 Abbott, Jared        FALSE        
##  8 Abbott, Kathryn A.   FALSE        
##  9 Abbott, Lon D.       FALSE        
## 10 Abbott, Mark B.      FALSE        
## # ... with 22,535 more rows

You can also get the above information by using:

readRDS(url("https://rud.is/dl/author-presenter.rds"))
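The is_presenting column in the update depends on grepl("*", ., fixed=TRUE). With fixed = TRUE the pattern is treated as a literal string, so the asterisk is searched for as a character rather than interpreted as a regular-expression quantifier. Here is a small offline sketch of that behavior. The sample strings are invented, and reading the asterisk as a presenter marker is an inference from the column name, not something verified against the site.

```r
# Invented examples of collapsed "papers" text for three authors
papers_text <- c("296-5", "54-4*", "29-10 317-9")

# Literal search for an asterisk
is_presenting <- grepl("*", papers_text, fixed = TRUE)
is_presenting  # FALSE TRUE FALSE

# Equivalent regular-expression form, with the asterisk escaped
is_presenting_regex <- grepl("\\*", papers_text)
```

Both forms should give the same result. The fixed = TRUE version avoids having to escape the pattern.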
