What is the best way to extract text from multiple "div class" (html) elements in R?

Question

My objective is to collect data from this html page in order to build a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables I am interested in is the price of the apartments. I have noticed that some apartments have the code div class="row_price" which includes the price (example A), while others do not have this code and therefore do not display a price (example B). I would like to extract this information so that observations without a price are marked as NA, thus preventing mixing up the database.

Example A

<div class="listing_column listing_row_price">
    <div class="row_price">
      $ 14,800
    </div>
<div class="row_info">Ayer&nbsp;19:53</div>

Example B

<div class="listing_column listing_row_price">

<div class="row_info">Ayer&nbsp;19:50</div>

I believe that by extracting the text from "listing_row_price" to the beginning of "row_info" in a character vector, I will be able to achieve my desired output, which is:

However, the results I have obtained so far include some entries full of NA.

I have tried using various commands, but have not yet obtained the desired outcome:

    html1<-read_html("file.html")
    title<-html_nodes(html1,"div")
    html1<-toString(title)
    pattern1<-'div class="row_price">([^<]*)<'
    title3<-unlist(str_extract_all(title,pattern1))
    title3<-title3[c(1:35)]
    pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3<-unlist(str_extract(title3,pattern2))
    title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
    title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))

I have also attempted to use the following pattern:

    pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<'

I believe this pattern matches the "listing_row_price" part, then the "row_price" part if it is present, then captures the digits, and finally matches the < that follows.
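A note on that pattern: square brackets define a character class, so `[<div class="row_price">]?` matches at most one character from that set rather than the whole tag. A non-capturing group, `(?:...)?`, makes the complete tag optional. The sketch below is only an illustration under stated assumptions: the two fragments are simplified stand-ins modelled on Example A and Example B, not the real page, and matching HTML with a regex remains fragile if the markup changes.

```r
# Two listing fragments modelled on Example A and Example B (whitespace simplified).
fragments <- c(
  '<div class="listing_column listing_row_price"><div class="row_price">$ 14,800</div><div class="row_info">Ayer 19:53</div>',
  '<div class="listing_column listing_row_price"><div class="row_info">Ayer 19:50</div>'
)

# (?:...)? makes the whole row_price tag optional; [...] would match a single character only.
pattern <- 'listing_row_price">\\s*(?:<div class="row_price">)?([^<]*)<'

# regexec() + regmatches() return the full match followed by each capture group, per input.
matches <- regmatches(fragments, regexec(pattern, fragments, perl = TRUE))
raw <- vapply(matches, function(m) m[2], character(1))

# Keep digits only; a listing without a price leaves an empty string, which becomes NA.
digits <- gsub("[^0-9]", "", raw)
prices <- ifelse(nzchar(digits), as.integer(digits), NA_integer_)
prices
```

Because every fragment produces exactly one entry, listings without a price stay in position as NA instead of being dropped.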

html css r regex rvest

Answer 1

There are several ways to do this, and the best one depends on how consistent the HTML structure is. One fairly simple approach that works for this page:

    library(rvest)

    page <- read_html('page.html')

    # select all elements with a class of "listing_row_price"
    listings <- html_nodes(page, css = '.listing_row_price')

    # for each listing, extract the text of the first child element if it has two children; otherwise, return NA
    prices <- sapply(listings, function(x) {
      ifelse(length(html_children(x)) == 2,
             html_text(html_children(x)[1]),
             NA)
    })

    # remove all non-numeric characters and convert the result to an integer
    prices <- as.integer(gsub('[^0-9]', '', prices))
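A possible variation on the same idea, offered as a sketch rather than as part of the original answer: calling `html_node()` (singular) on the set of listings returns one result per listing, with a missing node where no `.row_price` child exists, and `html_text()` turns those missing nodes into NA. This avoids relying on the number of children. The inline HTML below is an assumed stand-in modelled on Example A and Example B, not the real page; in recent rvest versions the same functions are also available as `html_elements()` and `html_element()`.

```r
library(rvest)

# Inline stand-in for the page, modelled on Example A and Example B.
page <- read_html('<div class="listing_column listing_row_price">
    <div class="row_price">$ 14,800</div>
    <div class="row_info">Ayer 19:53</div>
  </div>
  <div class="listing_column listing_row_price">
    <div class="row_info">Ayer 19:50</div>
  </div>')

# One node per listing.
listings <- html_nodes(page, css = ".listing_row_price")

# html_node() yields exactly one result per listing, missing where no .row_price exists.
price_nodes <- html_node(listings, css = ".row_price")

# html_text() returns NA for missing nodes, so prices stay aligned with listings.
prices <- as.integer(gsub("[^0-9]", "", html_text(price_nodes)))
prices
```

Since `prices` has the same length and order as `listings`, it can be bound to other per-listing columns without the rows getting mixed up.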
