Exploring alphabetical web pages (rvest)

Having exhausted all available resources and solutions, I have finally decided to ask for help with a web scraping problem in R using rvest. I have tried to describe the problem in detail to avoid confusion.

The Challenge: My goal is to extract author names from a conference webpage where authors are categorized alphabetically by their last name. To achieve this, I need to employ a for loop to execute follow_link() 25 times in order to navigate to each page and retrieve the relevant author data.

The conference website: https://gsa.confex.com/gsa/2016AM/webprogram/authora.html

I have made two attempts in R using rvest, and both ran into problems.

Solution 1 (Alphabetical Page Navigation)

library(rvest)

lttrs <- LETTERS[seq( from = 1, to = 26 )] # create character vector
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

tempList <- list() # create list to store each page's author information

for(i in 1:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This code works up to a point and then throws an error. It navigates through the lettered pages correctly until certain transitions, where it fetches the wrong page:

Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home

Solution 2 (CSS Selector Page Identification)

Using a CSS selector tool on the webpage, the individual lettered pages are identified as "a:nth-child(1-26)". I therefore rewrote my loop to use this CSS identifier.

tempList <- list()
for(i in 2:length(lttrs)){
  tempList[[i]] <- website %>%
    follow_link(css = paste('a:nth-child(',i,')',sep = '')) %>%
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}

This method works partially, but it again runs into problems at specific transitions, as shown below.

Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html

Notably, this approach misses pages B, C, and D and is sent to the wrong pages at that point. I would greatly appreciate any suggestions or guidance on how to adjust my existing code so that it cycles through all 26 alphabetical pages.

Thank you for your assistance!

Answer №1

Welcome to the Stack Overflow community (and congratulations on asking a great first question).

You're in luck: the robots.txt for that website has many entries, but none of them restrict what you are trying to do.

To extract all the links in the alphabetical pagination at the bottom of the page, use html_nodes(pg, "a[href^='author']"). The following code fetches paper links from all authors:


library(rvest)
library(tidyverse) # provides map_df(), progress_estimated() and data_frame()

pg <- read_html("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  
  map_df(~{

    pb$tick()$print()

    Sys.sleep(5)

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          paper = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_text(trim = TRUE),
          paper_url = html_nodes(.x, xpath="../div[@class='papers']/a") %>% 
            html_attr("href") %>% 
            sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .)
        )
      })
  }) -> author_papers

author_papers
## # A tibble: 34,983 x 3
##    author               paper  paper_url                                                    
##    <chr>                <chr>  <chr>                                                        
##  1 Aadahl, Kristopher   296-5  https://gsa.confex.com/gsa/2016AM/webprogram/Paper283542.html
##  2 Aanderud, Zachary T. 215-11 https://gsa.confex.com/gsa/2016AM/webprogram/Paper286442.html
##  3 Abbey, Alyssa        54-4   https://gsa.confex.com/gsa/2016AM/webprogram/Paper281801.html
##  4 Abbott, Dallas H.    341-34 https://gsa.confex.com/gsa/2016AM/webprogram/Paper287404.html
##  5 Abbott Jr., David M. 38-6   https://gsa.confex.com/gsa/2016AM/webprogram/Paper278060.html
##  6 Abbott, Grant        58-7   https://gsa.confex.com/gsa/2016AM/webprogram/Paper283414.html
##  7 Abbott, Jared        29-10  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286237.html
##  8 Abbott, Jared        317-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper282386.html
##  9 Abbott, Kathryn A.   187-9  https://gsa.confex.com/gsa/2016AM/webprogram/Paper286127.html
## 10 Abbott, Lon D.       208-16 https://gsa.confex.com/gsa/2016AM/webprogram/Paper280093.html
## # ... with 34,973 more rows

If you need specific information from individual paper pages, make sure to retrieve that as well.

There is no need to wait ~3 minutes for the scrape to finish: the author_papers data frame is available as an RDS file. Read it using:

readRDS(url("https://rud.is/dl/author-papers.rds"))

If you plan to scrape all 34,983 papers, remember to be respectful and use a crawl delay.
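The crawl delay mentioned above can be wrapped in a small helper so that every request waits first. This is only a sketch: polite_map() is a hypothetical name introduced here, not an rvest function, and the 5-second default simply mirrors the Sys.sleep(5) used in the code above.

```r
# Hypothetical helper (not part of rvest): pause before every request so the
# server is never hit in a tight loop.
polite_map <- function(urls, fetch, delay = 5) {
  lapply(urls, function(u) {
    Sys.sleep(delay) # crawl delay, in seconds
    fetch(u)
  })
}

# Usage with rvest (needs network access, so it is left commented out):
# pages <- polite_map(unique(author_papers$paper_url), rvest::read_html)
```

Passing the fetch function as an argument keeps the delay logic separate from the parsing logic, so the same helper works for any page type.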

UPDATE

html_nodes(pg, "a[href^='author']") %>% 
  html_attr("href") %>% 
  sprintf("https://gsa.confex.com/gsa/2016AM/webprogram/%s", .) %>% 
  { pb <<- progress_estimated(length(.)) ; . } %>%  
  map_df(~{

    pb$tick()$print()

    Sys.sleep(5)

    read_html(.x) %>% 
      html_nodes("div.item > div.author") %>% 
      map_df(~{
        data_frame(
          author = html_text(.x, trim = TRUE),
          is_presenting = html_nodes(.x, xpath="../div[@class='papers']") %>% 
            html_text(trim = TRUE) %>% 
            paste0(collapse=" ") %>% 
            grepl("*", ., fixed=TRUE)
        )
      })
  }) -> author_with_presenter_status

author_with_presenter_status
## # A tibble: 22,545 x 2
##    author               is_presenting
##    <chr>                <lgl>        
##  1 Aadahl, Kristopher   FALSE        
##  2 Aanderud, Zachary T. FALSE        
##  3 Abbey, Alyssa        TRUE         
##  4 Abbott, Dallas H.    FALSE        
##  5 Abbott Jr., David M. TRUE         
##  6 Abbott, Grant        FALSE        
##  7 Abbott, Jared        FALSE        
##  8 Abbott, Kathryn A.   FALSE        
##  9 Abbott, Lon D.       FALSE        
## 10 Abbott, Mark B.      FALSE        
## # ... with 22,535 more rows
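
The is_presenting column relies on grepl() with fixed = TRUE, which makes "*" match a literal asterisk instead of being parsed as a regular-expression quantifier. A minimal sketch on made-up strings: the sample values are illustrative and not taken from the site, and the sketch assumes, as the code above does, that an asterisk in the papers text marks a presenting author.

```r
# Illustrative strings only; not scraped from the conference site.
papers_text <- c("296-5", "54-4*", "29-10 317-9*")

# fixed = TRUE matches the literal character "*" rather than treating it as regex.
is_presenting <- grepl("*", papers_text, fixed = TRUE)
is_presenting
## [1] FALSE  TRUE  TRUE
```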

The author_with_presenter_status data frame is also available as an RDS file:

readRDS(url("https://rud.is/dl/author-presenter.rds"))
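
As an alternative to collecting the pagination links, the page URLs could be built directly, since the navigation log in the question shows pages named authora.html through authorz.html. A minimal sketch, assuming all 26 letter pages exist under the same base URL (base_url and author_urls are names introduced here for illustration):

```r
base_url <- "https://gsa.confex.com/gsa/2016AM/webprogram/"

# letters is R's built-in vector "a" to "z"; sprintf() is vectorised over it.
author_urls <- sprintf("%sauthor%s.html", base_url, letters)

head(author_urls, 2)
```

Each URL can then be passed to read_html() in the same way as the links extracted with html_attr("href") above, which avoids follow_link() and its dependence on link text or position.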
