My objective is to collect data from this HTML page in order to build a database: https://drive.google.com/folderview?id=0B0aGd85uKFDyOS1XTTc2QnNjRmc&usp=sharing
One of the variables I am interested in is the price of the apartments. I have noticed that some apartments have the element div class="row_price", which contains the price (example A), while others lack this element and therefore do not display a price (example B). I would like to extract this information so that observations without a price are marked as NA, so that the rows of the database do not get misaligned.
Example A
<div class="listing_column listing_row_price">
<div class="row_price">
$ 14,800
</div>
<div class="row_info">Ayer 19:53</div>
Example B
<div class="listing_column listing_row_price">
<div class="row_info">Ayer 19:50</div>
I believe that by extracting the text from "listing_row_price" to the beginning of "row_info" in a character vector, I will be able to achieve my desired output, which is:
...
10 4000
11 14800
12 NA
13 14000
14 8000
...
However, the results I have obtained so far drop the entries without a price instead of marking them as NA, so the remaining rows shift up:
...
10 4000
11 14800
12 14000
13 8000
14 8500
...
I have tried using various commands, but have not yet obtained the desired outcome:
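library(rvest)
library(stringr)

html1<-read_html("file.html")
title<-html_nodes(html1,"div")
html1<-toString(title)
# Keep only the row_price divs; listings without one produce no match at all
pattern1<-'div class="row_price">([^<]*)<'
title3<-unlist(str_extract_all(title,pattern1))
title3<-title3[c(1:35)]
# Isolate the price text that follows the leading whitespace
pattern2<-'>\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t([^<*]*)'
title3<-unlist(str_extract(title3,pattern2))
# Strip the whitespace prefix, the "$" and the thousands separators, then convert to numeric
title3<-gsub(">\n\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t\n\t\t\t\t\t $ ","",title3,fixed=TRUE)
title3<-as.data.frame(as.numeric(gsub(",","", title3,fixed=TRUE)))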
I have also attempted to use the following pattern:
pattern1<-'listing_row_price">([<div class="row_price">]?)([^<]*)<'
which I believe instructs the regex to match the "listing_row_price" part, then, if available, the "row_price" part, next capture the digits, and finally match the < that follows.
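For reference, here is a minimal sketch of the per-listing idea described above. It rests on assumptions: that every listing sits in its own div with class listing_row_price, that each of those divs is closed properly in the real page, and that prices are whole numbers. The inline HTML is only a stand-in for file.html built from examples A and B, and it uses html_node() (singular) instead of a regular expression, which may or may not suit the real page.

```r
library(rvest)

# Inline stand-in for file.html, built from examples A and B
# (the closing </div> of each listing is assumed).
html <- read_html('
<div class="listing_column listing_row_price">
  <div class="row_price">
    $ 14,800
  </div>
  <div class="row_info">Ayer 19:53</div>
</div>
<div class="listing_column listing_row_price">
  <div class="row_info">Ayer 19:50</div>
</div>')

# One node per listing, so every listing keeps its own row.
listings <- html_nodes(html, "div.listing_row_price")

# html_node() returns exactly one result per listing (a missing node when
# there is no row_price), and html_text() gives NA for the missing ones.
price_text <- html_text(html_node(listings, "div.row_price"))

# Drop everything except digits, then convert; NA stays NA.
price <- as.numeric(gsub("[^0-9]", "", price_text))

data.frame(price = price)
```

Because the price is looked up inside each listing rather than across the whole page, the result has one row per listing, and the listing from example B shows up as NA instead of disappearing.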