I need assistance with extracting data from a specific URL (https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine). The posts on this page contain replies that start with "Originally Posted by ...". I want to scrape all the content within the posts while excluding the initial "Originally Posted by" text. Here is an example:
User  df_text
A     Hi, how are you ?
B     This is beautiful!
C     Heuwi
D     Originally posted by C Heuwi
      Hellou
E     Hello guys
F     Originally posted by A Hi, how are you ?
      I am doing good
G     Whats going on ?
For the post by user D, the text I need is "Hellou", and for user F it is "I am doing good". The quoted "Originally posted by" text sits in the child element div.quote_container, while the reply itself sits directly under its parent, blockquote.postcontent.restore.
Expected output:
User  df_text
A     Hi, how are you ?
B     This is beautiful!
C     Heuwi
D     Hellou
E     Hello guys
F     I am doing good
G     Whats going on ?
I have tried several methods as shown below:
library(rvest)
library(xml2)  # provides xml_remove(), used in the last attempt below
url <- "https://forums.vwvortex.com/showthread.php?8829402-Atlas-V6-Oil-Change-Routine"
review <- read_html(url)
threads <- cbind(review %>% html_nodes("blockquote.postcontent.restore:not(.quote_container)") %>% html_text())
Additionally, I experimented with these alternatives:
threads <- cbind(review %>% html_nodes(xpath = '//div[@class="blockquote.postcontent.restore"]/node()[not(self::div)]') %>% html_text())
or
threads <- review %>% html_nodes(".content")
close_nodes <- threads %>% html_nodes(".quote_container")
chk <- xml_remove(close_nodes)
Unfortunately, none of these approaches yielded the desired outcome. Any suggestions on how to successfully extract the post data while excluding the child class would be greatly appreciated. Thank you in advance!
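One direction that may work is sketched below. It runs against a small inline HTML string rather than the live page; that markup is a simplified stand-in that assumes the nesting described above (div.quote_container inside blockquote.postcontent.restore) and is not copied from the forum. The idea is that xml_remove() changes the parsed document in place, so the text has to be read from the document again after the removal rather than from the value xml_remove() returns. The names html, page and posts are placeholders for illustration.

```r
library(rvest)
library(xml2)

# Simplified stand-in for the forum markup (an assumption, not the real page).
html <- '
<div class="content"><blockquote class="postcontent restore">Heuwi</blockquote></div>
<div class="content"><blockquote class="postcontent restore">
  <div class="bbcode_container"><div class="quote_container">Originally posted by C Heuwi</div></div>
  Hellou
</blockquote></div>'

page <- read_html(html)

# Drop every quoted block from the parsed document itself.
# xml_remove() works in place; its return value is not the cleaned document.
xml_remove(html_nodes(page, ".quote_container"))

# Re-select the posts from the modified document and read their text.
posts <- page %>%
  html_nodes("blockquote.postcontent.restore") %>%
  html_text() %>%
  trimws()

posts
```

If the real page nests the quote differently, the selector passed to xml_remove() would need adjusting. For comparison, the :not(.quote_container) selector in the first attempt only filters blockquote elements that themselves carry that class, so it does not strip descendants, and the XPath attempt looks for a div whose class attribute is literally "blockquote.postcontent.restore", which probably matches nothing.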