JavaScript web scraping with rvest

Question

I am trying to extract the daily forecast from FiveThirtyEight with the rvest package, and it has proven difficult. The chart appears to be drawn by JavaScript, so I can't tell where the data lives or what to look for. I have little experience with CSS and JavaScript, though I have spent the last few days reading up on both.

Inspecting the page element and its CSS selector shows that the chart sits in

<div id="polling-avg-chart">

so I tried:

library(rvest)
url <- 
  "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

url %>%
  read_html() %>%
  html_nodes("#polling-avg-chart")

However, the result is just an empty node:

{xml_nodeset (1)}

[1] <div id="polling-avg-chart"></div>\n
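
An empty node is what one would expect if the chart is built in the browser: read_html() parses only the HTML the server sends and does not run any JavaScript. The sketch below illustrates this with a made-up inline page rather than the real site, so the HTML string is an assumption for demonstration only.

```r
library(rvest)

# Hypothetical stand-in for the served page: the chart container is empty,
# and in a real browser a script would fill it in at runtime.
page_source <- '<html><body>
  <div id="polling-avg-chart"></div>
  <script>/* chart would be drawn into the div here */</script>
</body></html>'

chart <- read_html(page_source) %>%
  html_node("#polling-avg-chart")

# rvest never executes the script, so the div has no children and no text.
n_children <- length(html_children(chart))
chart_text <- html_text(chart)
```

Both n_children and chart_text come back empty here, which matches the output above.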

The individual poll results, shown as dots, sit inside
```
<g style="clip-path: url("#line-clippoll_avg");"> ... </g>
```
with many positions given as numbers. Converting cx and cy into percentages would apparently involve elements such as
```
<g class="flag-box" transform="translate(30, 161.44093322753096)">...</g>
```
but I cannot find the data behind the forecast line and the dots.

Hovering over the chart changes elements such as
```
<line class="hover-date-line hide-line">
```
and values such as
```
<path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path>
```
These may feed into the daily forecast line, but I can't work out where the information is stored or how it maps to figures like "49.1% Clinton vs. 26.6% Sanders".

The other resources I have found don't address this specific problem. How can I extract the forecast percentages into a data frame?

javascript html css r rvest

Answer №1

Here's another method to access the resource directly.

To do this, simply open Developer Tools in your browser (press F12 if you're using Chrome/Chromium), go to the "Network" tab, refresh the page (F5), and look for a well-formatted JSON file. Once you locate it, copy the link address by right-clicking on the resource and selecting "Copy link address."

https://i.sstatic.net/sHZeo.png

library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"

r <- GET(url)

The complete dataset is available there, including the weights which enable you to recompute the averages. The data used for plotting can be found in "model":

dat <- 
  jsonlite::fromJSON(content(r, as = "text")) %>% 
  map(purrr::pluck, "model") %>% 
  bind_rows(.id = "party") %>% 
  mutate_all(readr::parse_guess)

# # A tibble: 5,288 x 5
#    party candidate_name state forecastdate poll_avg
#    <chr> <chr>          <chr> <date>          <dbl>
#  1 D     Sanders        USA   2016-07-01       36.5
#  2 D     Clinton        USA   2016-07-01       55.4
#  3 D     Sanders        USA   2016-06-30       37.0
#  4 D     Clinton        USA   2016-06-30       54.6
#  5 D     Sanders        USA   2016-06-29       37.0
#  6 D     Clinton        USA   2016-06-29       54.9
#  7 D     Sanders        USA   2016-06-28       37.2
#  8 D     Clinton        USA   2016-06-28       54.4
#  9 D     Sanders        USA   2016-06-27       37.4
# 10 D     Clinton        USA   2016-06-27       53.9
# # ... with 5,278 more rows
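
To make the extraction step easier to follow, here is a minimal sketch of the same idea on a tiny inline JSON string, using only jsonlite and base R. The structure (one entry per party, each holding a "model" table) mirrors what the code above assumes; the Republican row is a placeholder value, not real data.

```r
library(jsonlite)

# Hypothetical miniature of the JSON: one entry per party,
# each containing a "model" table. The "R" row is a made-up placeholder.
json_text <- '{
  "D": {"model": [
    {"candidate_name": "Sanders", "forecastdate": "2016-07-01", "poll_avg": 36.5},
    {"candidate_name": "Clinton", "forecastdate": "2016-07-01", "poll_avg": 55.4}
  ]},
  "R": {"model": [
    {"candidate_name": "Trump", "forecastdate": "2016-07-01", "poll_avg": 40.0}
  ]}
}'

parsed <- fromJSON(json_text)

# Keep only the "model" table of each party.
models <- lapply(parsed, `[[`, "model")

# Stack the tables and record which party each row came from.
dat_small <- do.call(
  rbind,
  Map(function(df, party) cbind(party = party, df), models, names(models))
)
rownames(dat_small) <- NULL
dat_small$forecastdate <- as.Date(dat_small$forecastdate)
```

Running str(parsed) on the real response is a quick way to check that it has this shape before relying on it.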

To generate graphs:

dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)

https://i.sstatic.net/rccOG.png

If you prefer interactive graphs:

library(dygraphs)
library(htmltools)
library(xts)  # provides xts(), used below

# Build one dygraph per party; the shared group keeps them synchronized
foo <- dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  split(.$party) %>%
  map(~ {
    select(.x, forecastdate, candidate_name, poll_avg) %>%
      spread(candidate_name, poll_avg) %>%
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>%
      dyRangeSelector()
  })

browsable(tagList(foo))

https://i.sstatic.net/qsvmu.png

Answer №2

The chart is most likely drawn with d3.js or a library built on top of it. d3 works well for SVG-based data visualizations because it provides scales that map values (like 40%) to positions on the display (such as cx=100). Recovering the raw data from the rendered chart is tricky, though: you would need to know those scales, which are probably dynamic and change with factors like screen size.
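
To illustrate what such a scale does, here is a rough sketch of a linear scale and its inverse in plain R. The function names, the 0-100 domain and the 300-pixel range are all hypothetical; the real chart's scales are unknown.

```r
# Forward scale: data value -> pixel position, as a linear scale would do.
linear_scale <- function(x, domain, range) {
  range[1] + (x - domain[1]) / (domain[2] - domain[1]) * (range[2] - range[1])
}

# Inverse scale: pixel position -> data value.
invert_scale <- function(px, domain, range) {
  domain[1] + (px - range[1]) / (range[2] - range[1]) * (domain[2] - domain[1])
}

# Hypothetical axis: 0-100% drawn on a 300-pixel-tall chart, with 0% at the
# bottom (SVG y coordinates grow downward, hence the reversed range).
y_domain <- c(0, 100)
y_range  <- c(300, 0)

cy    <- linear_scale(40, y_domain, y_range)   # pixel position of 40%
value <- invert_scale(cy, y_domain, y_range)   # recovers 40
```

Inverting cy or cy-like attributes this way only works if the domain and range are known exactly, which is why reading the underlying data is usually easier.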

A simpler approach is to scrape the table below the chart. The table sits inside a div with the ID latest-polls and has the class t-polls.

In this snippet I use html_node with a CSS selector and html_table to convert the table into a data frame, then clean up the column names and convert the numeric columns to numeric types. There is more you could do, such as parsing the dates, but this should be a solid starting point.

library(tidyverse)
library(rvest)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

polls_df <- url %>%
  read_html() %>%
  html_node("#latest-polls table.t-polls") %>%
  html_table() %>%
  # Name the first five columns by hand; strip non-word characters from all names
  setNames(c("new", "date", "pollster", "sample_n", "sample_type", names(.)[6:10]) %>% str_remove_all("\\W")) %>%
  # Drop non-digit characters, then convert to numeric
  mutate_at(vars(sample_n, Clinton, Sanders, OMalley),
      function(x) str_remove_all(x, "\\D") %>% as.numeric())

head(polls_df)
#>   new           date                     pollster sample_n sample_type
#> 1   •     Jun. 10-13                 Selzer & Co.      486          LV
#> 2   •     Jun. 26-28                     Fox News      432          RV
#> 3   •     Jun. 18-20                       YouGov      390          LV
#> 4   •     Jun. 15-20              Morning Consult     1733          RV
#> 5   • Jun. 27-Jul. 1                Ipsos, online      142          LV
#> 6   •     Jun. 16-19 Opinion Research Corporation      435          RV
#>   weight      leader Clinton Sanders OMalley
#> 1   1.05  Clinton +2      45      43      NA
#> 2   0.91 Clinton +21      58      37      NA
#> 3   0.79 Clinton +13      55      42      NA
#> 4   0.79 Clinton +18      53      35      NA
#> 5   0.67 Clinton +41      70      29      NA
#> 6   0.66 Clinton +12      55      43      NA
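
The weight column is what makes it possible to recompute an average from the individual polls. As a rough sketch, a plain weighted mean over the six rows printed above looks like this. This is an assumption about how the weights might be used; FiveThirtyEight's actual averaging may involve further adjustments.

```r
# Values copied from the six rows shown in the output above.
polls <- data.frame(
  weight  = c(1.05, 0.91, 0.79, 0.79, 0.67, 0.66),
  Clinton = c(45, 58, 55, 53, 70, 55),
  Sanders = c(43, 37, 42, 35, 29, 43)
)

# Plain weighted means; not necessarily the site's own method.
clinton_avg <- weighted.mean(polls$Clinton, polls$weight)
sanders_avg <- weighted.mean(polls$Sanders, polls$weight)
```

With the full table in polls_df, the same two calls would apply to polls_df$Clinton and polls_df$Sanders, provided weight is numeric.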
