What is the best way to extract text from multiple "div class" (html) elements in R?

My objective is to collect data from this html page in order to build a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing

One of the variables I am interested in is the apartment price. Some listings contain a div class="row_price" element that holds the price (example A), while others lack it and therefore display no price (example B). I would like to extract this information so that listings without a price are recorded as NA, which keeps the prices aligned with the other columns of the database.

Example A

    <div class="listing_column listing_row_price">
        <div class="row_price">
          $ 14,800
        </div>
    <div class="row_info">Ayer&nbsp;19:53</div>

Example B

    <div class="listing_column listing_row_price">

    <div class="row_info">Ayer&nbsp;19:50</div>

I believe that by extracting the text from "listing_row_price" to the beginning of "row_info" in a character vector, I will be able to achieve my desired output, which is:

...
10 4000
11 14800
12 NA
13 14000
14 8000
...

However, the results I have obtained so far omit the NA entries, so every price after a missing one shifts up a row:

...
10 4000
11 14800
12 14000
13 8000
14 8500
...

I have tried using various commands, but have not yet obtained the desired outcome:

    library(rvest)
    library(stringr)

    html1<-read_html("file.html")
    title<-html_nodes(html1,"div")
    html1<-toString(title)
    pattern1<-'div class="row_price">([^<]*)<'
    title3<-unlist(str_extract_all(title,pattern1))
    title3<-title3[c(1:35)]
    pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
    title3<-unlist(str_extract(title3,pattern2))
    title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
    title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))

I have also attempted to use the following pattern:

    pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<'

which I believe matches the "listing_row_price" part, then, if present, the "row_price" part, then captures the digits, and finally matches the < that follows.

Answer №1

There are several ways to approach this, and the best one depends on how consistent the HTML structure is. One fairly simple technique that works for this page:

    library(rvest)

    page <- read_html('page.html')

    # select all elements with a class of "listing_row_price"
    listings <- html_nodes(page, css = '.listing_row_price')

    # for each listing, extract the text of the first child element if it has two children; otherwise, return NA
    prices <- sapply(listings, function(x){ifelse(length(html_children(x)) == 2, 
                                                  html_text(html_children(x)[1]), 
                                                  NA)})
    # remove all non-numeric characters and convert it to an integer
    prices <- as.integer(gsub('[^0-9]', '', prices))
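
If counting children turns out to be fragile (for example, if some listings carry extra elements), a possible alternative is to look up the price node by class inside each listing. The sketch below assumes that `html_node()` applied to a node set returns one result per listing, with a missing node where nothing matches, and that `html_text()` turns a missing node into NA. The inline HTML is a made-up stand-in for the real page, modelled on examples A and B, with the closing tags assumed.

```r
library(rvest)

# Hypothetical stand-in for the real page: three listings, the second has no price
html <- '
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,800</div>
  <div class="row_info">Ayer 19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer 19:50</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_price">$ 14,000</div>
  <div class="row_info">Ayer 19:45</div>
</div>'

page <- read_html(html)

# one node per listing
listings <- html_nodes(page, ".listing_row_price")

# first ".row_price" inside each listing; a missing node where there is none
price_nodes <- html_node(listings, ".row_price")

# missing nodes become NA; strip everything but digits and convert
prices <- as.integer(gsub("[^0-9]", "", html_text(price_nodes)))

print(data.frame(listing = seq_along(prices), price = prices))
```

Because the lookup happens per listing, the result keeps one entry per listing, so a listing without a price stays in place as NA rather than being dropped.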
