After exhausting every resource and solution I could find, I have decided to ask for help with a web-scraping problem in R using rvest. I have tried to describe the problem in detail to avoid confusion.
The Challenge: I am trying to extract author names from a conference web page where authors are listed alphabetically by last name, one page per letter. To do this, I need a for loop that calls follow_link() 25 times, navigating to each letter page and retrieving the author data.
The conference website:
I have made two attempts in R with rvest, and both ran into problems.
Solution 1 (Alphabetical Page Navigation)
library(rvest)

lttrs <- LETTERS # character vector "A" through "Z"
website <- html_session("https://gsa.confex.com/gsa/2016AM/webprogram/authora.html")
tempList <- list() # list to store each page's author information
for (i in seq_along(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(lttrs[i]) %>% # use capital letters to call links to author pages
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
This code partly works: it runs up to a certain point and then stops with an error. It navigates correctly through most of the lettered pages, but at certain transitions it follows the wrong link and fetches the wrong page, as the output shows:
Navigating to authora.html
Navigating to authorb.html
Navigating to authorc.html
Navigating to authord.html
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authora.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to http://community.geosociety.org/gsa2016/home
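My working guess at what is happening here: as far as I understand the rvest documentation, follow_link() called with a string follows the first link on the page whose text contains that string (case sensitive), not the link whose text equals it. The sketch below imitates that rule in base R. The link texts and their order are hypothetical, made up only to illustrate the idea; I have not confirmed what the page's actual links are.

```r
# Hypothetical link texts and targets, in page order (assumed, not taken from the real page)
link_text <- c("Meeting Home", "Author Index", "A", "B", "I", "M")
link_href <- c("http://community.geosociety.org/gsa2016/home",
               "authora.html", "authora.html", "authorb.html",
               "authori.html", "authorm.html")

# Imitates "first link whose text contains the string" (case sensitive)
first_match <- function(letter) {
  hits <- grepl(letter, link_text, fixed = TRUE)
  link_href[which(hits)[1]]
}

first_match("B") # only the "B" link contains a capital B
first_match("I") # "Author Index" contains a capital I and comes first
first_match("M") # "Meeting Home" contains a capital M and comes first
```

If the real page has links like these ahead of the letter bar, that would explain why "I" lands on authora.html and "M" lands on the community home page in the output above.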
Solution 2 (CSS Selector Page Identification)
Using a CSS selector tool on the webpage, the individual lettered pages are identified as "a:nth-child(1-26)", so I rewrote my loop to use this CSS identifier.
tempList <- list()
for (i in 2:length(lttrs)) {
  tempList[[i]] <- website %>%
    follow_link(css = paste0("a:nth-child(", i, ")")) %>% # select the i-th child link
    html_nodes(xpath = '//*[@class = "author"]') %>%
    html_text()
}
This method also works only partially. Again, it goes wrong at specific transitions, as shown below.
Navigating to authora.html
Navigating to uploadlistall.html
Navigating to http://community.geosociety.org/gsa2016/home
Navigating to authore.html
Navigating to authorf.html
Navigating to authorg.html
Navigating to authorh.html
Navigating to authori.html
Navigating to authorj.html
Navigating to authork.html
Navigating to authorl.html
Navigating to authorm.html
Navigating to authorn.html
Navigating to authoro.html
Navigating to authorp.html
Navigating to authorq.html
Navigating to authorr.html
Navigating to authors.html
Navigating to authort.html
Navigating to authoru.html
Navigating to authorv.html
Navigating to authorw.html
Navigating to authorx.html
Navigating to authory.html
Navigating to authorz.html
Notably, this approach misses B, C, and D and navigates to the wrong pages at those steps instead. I would greatly appreciate any suggestions or guidance on how to adjust my existing code so that it cycles through all 26 alphabetical pages.
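One alternative I have been considering is to skip follow_link() entirely and build each page URL myself. This is only a minimal sketch: it assumes every letter page follows the author<letter>.html pattern seen in the navigation messages, under the same base URL as authora.html. The names base_url, page_urls and scrape_authors are my own placeholders, and I have not verified that all 26 pages exist.

```r
# Assumed base URL, taken from the authora.html address used above
base_url <- "https://gsa.confex.com/gsa/2016AM/webprogram/"

# Build authora.html ... authorz.html (assumes the pattern holds for every letter)
page_urls <- paste0(base_url, "author", letters, ".html")
names(page_urls) <- LETTERS

# Hypothetical helper: read one page and return the text of its "author" nodes.
# Calls are namespaced, so rvest must be installed but need not be attached.
scrape_authors <- function(url) {
  page <- rvest::read_html(url)
  nodes <- rvest::html_nodes(page, xpath = '//*[@class = "author"]')
  rvest::html_text(nodes)
}
```

Calling lapply(page_urls, scrape_authors) would then return a list with one element per letter, named "A" through "Z", without depending on how the links on each page are ordered or labeled.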
Thank you for your assistance!