Can Rvest be used to extract data from an HTML node while excluding child classes?

Question

I need assistance with extracting data from a specific URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine). The posts on this page contain replies that start with "Originally Posted by ...". I want to scrape all the content within the posts while excluding the initial "Originally Posted by" text. Here is an example:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Originally posted by C Heuwi 
      Hellou
 E    Hello guys
 F    Originally posted by A Hi, how are you ?
      I am doing good
 G    Whats going on ?

For the post by user D, the required text is "Hellou". The quoted text sits under the div.quote_container class (a child), while the reply text, such as "I am doing good" for user F, sits under its parent, blockquote.postcontent.restore.

Expected output:

User  df_text
 A    Hi, how are you ?
 B    This is beautiful!
 C    Heuwi
 D    Hellou
 E    Hello guys
 F    I am doing good
 G    Whats going on ?

I have tried several methods as shown below:

url<-"https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads<- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())

Additionally, I experimented with these alternatives:

threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())

or

threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)

Unfortunately, none of these approaches yielded the desired outcome. Any suggestions on how to successfully extract the post data while excluding the child class would be greatly appreciated. Thank you in advance!

html css r web-scraping rvest

Answer 1

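One way to tackle this is with the xml_remove() function from the xml2 package. Older versions of rvest attach xml2 automatically; since rvest 1.0.0, xml2 is only imported, so it is safer to load it explicitly.

library(rvest)
library(xml2)  # provides xml_remove(); not attached by rvest >= 1.0.0

# Fetch the page
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)

# Locate the parent nodes (one per post).
# :not(.quote_container) is unnecessary here: it would test the blockquote
# itself, not its children, so it never excludes anything.
threads <- review %>% html_nodes("blockquote.postcontent.restore")
# Identify the quote block inside each post, if there is one
toremove <- threads %>% html_node("div.bbcode_container")
# Remove those nodes from the parsed document (this modifies it in place)
xml_remove(toremove)

# Convert what remains of each post to text
threads %>% html_text(trim = TRUE)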
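
To try the removal approach without requesting the forum page, here is a minimal sketch on an inline HTML fragment. The markup is a hypothetical stand-in that only mimics the class names mentioned in the question; the real page may be structured differently. It uses html_nodes() rather than html_node() to collect the quote blocks, so only posts that actually contain a quote contribute a node.

```r
library(rvest)
library(xml2)  # xml_remove() lives here

# Hypothetical markup imitating two forum posts; not copied from the real page.
html <- '<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>'

page  <- read_html(html)
posts <- html_nodes(page, "blockquote.postcontent.restore")

# Collect every quote block nested in any post, then unlink them from the tree.
quotes <- html_nodes(posts, "div.bbcode_container")
xml_remove(quotes)

# Only the reply text is left in each post.
result <- html_text(posts, trim = TRUE)
print(result)
```

After the call to xml_remove(), `result` holds one string per post with the quoted block gone.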

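The documentation for xml_remove() warns: "Care needs to be taken when using xml_remove()". The call modifies the parsed document in place, so after it runs neither threads nor review contains the quoted blocks; call read_html() again if you need the original content.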
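
If modifying the parsed document is undesirable, one non-destructive alternative (not part of the original answer) is to select only the text nodes that do not sit inside a quote block. The sketch below assumes the same hypothetical markup and class names as the question describes; `contains(@class, ...)` is a plain substring match, which is adequate for this illustration but may need tightening on a real page.

```r
library(rvest)
library(xml2)

# Hypothetical markup imitating two forum posts; not copied from the real page.
html <- '<html><body>
  <blockquote class="postcontent restore">Heuwi</blockquote>
  <blockquote class="postcontent restore">
    <div class="bbcode_container">
      <div class="quote_container">Originally Posted by C Heuwi</div>
    </div>
    Hellou
  </blockquote>
</body></html>'

page  <- read_html(html)
posts <- html_nodes(page, "blockquote.postcontent.restore")

# For each post, keep text nodes that have no bbcode_container ancestor.
keep_xpath <- ".//text()[not(ancestor::div[contains(@class, 'bbcode_container')])]"
reply_text <- vapply(posts, function(post) {
  pieces <- xml_text(xml_find_all(post, keep_xpath))
  trimws(paste(pieces, collapse = " "))
}, character(1))

print(reply_text)

# The document itself is untouched: the quote block is still present.
n_quotes <- length(html_nodes(page, "div.bbcode_container"))
```

Because nothing is removed, the same `page` object can still be queried for the quoted text afterwards.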
