JavaScript web scraping with rvest

I am trying to extract the daily forecast from FiveThirtyEight with the rvest package, and it has proven challenging. The data of interest appear to be rendered by JavaScript, which makes it hard to know where to look or what to look for. My experience with CSS and JavaScript is limited, although I have spent the last few days reading up on both.

Inspecting the page elements and CSS selectors, I have found the following:

  • The target location is

    <div id="polling-avg-chart">
    , leading to attempts such as:

    library(rvest)
    url <- 
      "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"
    
    url %>% 
      read_html() %>%
      html_nodes("#polling-avg-chart")
    

    However, results have been unsatisfactory with output simply showing:

    {xml_nodeset (1)}

    [1] <div id="polling-avg-chart"></div>\n

  • The individual poll results displayed as dots are located within

    <g style="clip-path: url("#line-clippoll_avg");"> ... </g>
    , with numerous positions listed numerically. It is apparent that translating cx and cy into appropriate percentages will involve utilizing elements like
    <g class="flag-box" transform="translate(30, 161.44093322753096)">...</g>
    .

  • Unfortunately, the data underlying the forecast line and dots remains elusive.

  • Hovering over the chart reveals changes in entities such as
    <line class="hover-date-line hide-line">
    and values like
    <path class="link" d="M 0 171.40106812500002 C 15 171.40106812500002 15 170.94093803735575 30 170.94093803735575"></path>
    . These variations may contribute to generating the daily forecast line, yet discovering where this information is stored and linking it to data like "49.1% Clinton vs. 26.6% Sanders" remains enigmatic.
  • I have looked at other resources, but none seems to address this specific problem.

How can the forecast percentages be extracted and organized into a structured data frame?

Answer №1

Here's another method to access the resource directly.

To do this, simply open Developer Tools in your browser (press F12 if you're using Chrome/Chromium), go to the "Network" tab, refresh the page (F5), and look for a well-formatted JSON file. Once you locate it, copy the link address by right-clicking on the resource and selecting "Copy link address."

https://i.sstatic.net/sHZeo.png

library(httr)
library(tidyr)
library(purrr)
library(dplyr)
library(ggplot2)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/USA.json"

r <- GET(url)

The complete dataset is available there, including the weights which enable you to recompute the averages. The data used for plotting can be found in "model":

dat <- 
  jsonlite::fromJSON(content(r, as = "text")) %>% 
  map(purrr::pluck, "model") %>% 
  bind_rows(.id = "party") %>% 
  mutate_all(readr::parse_guess)

# # A tibble: 5,288 x 5
#    party candidate_name state forecastdate poll_avg
#    <chr> <chr>          <chr> <date>          <dbl>
#  1 D     Sanders        USA   2016-07-01       36.5
#  2 D     Clinton        USA   2016-07-01       55.4
#  3 D     Sanders        USA   2016-06-30       37.0
#  4 D     Clinton        USA   2016-06-30       54.6
#  5 D     Sanders        USA   2016-06-29       37.0
#  6 D     Clinton        USA   2016-06-29       54.9
#  7 D     Sanders        USA   2016-06-28       37.2
#  8 D     Clinton        USA   2016-06-28       54.4
#  9 D     Sanders        USA   2016-06-27       37.4
# 10 D     Clinton        USA   2016-06-27       53.9
# # ... with 5,278 more rows
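
The remark about recomputing the averages from the weights can be illustrated with a small sketch. It uses made-up poll values and hypothetical column names (`candidate`, `pct`, `weight`), not the actual structure of the JSON file, and it assumes a plain weighted mean, which may differ from the exact method FiveThirtyEight applies:

```r
# Toy poll-level data; the values and column names are illustrative only
# and are not taken from the JSON file.
polls <- data.frame(
  candidate = c("Clinton", "Clinton", "Clinton", "Sanders", "Sanders", "Sanders"),
  pct       = c(45, 58, 55, 43, 37, 42),
  weight    = c(1.05, 0.91, 0.79, 1.05, 0.91, 0.79)
)

# Weighted mean per candidate: sum(pct * weight) / sum(weight)
weighted_avg <- sapply(
  split(polls, polls$candidate),
  function(d) weighted.mean(d$pct, d$weight)
)

round(weighted_avg, 1)
```

Polls with a larger weight pull the average toward their value, so a heavily weighted poll matters more than a lightly weighted one.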

To generate graphs:

dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  ggplot(aes(forecastdate, poll_avg)) +
  geom_line(aes(col = candidate_name)) +
  facet_wrap(~party)

https://i.sstatic.net/rccOG.png

If you prefer interactive graphs:

library(dygraphs)
library(htmltools)
library(xts)  # provides xts(), used below to build the time series

foo <- dat %>%
  filter(candidate_name %in% c("Clinton", "Kasich", "Sanders", "Trump")) %>%
  split(.$party) %>%
  map(~ {
    select(.x, forecastdate, candidate_name, poll_avg) %>%
      spread(candidate_name, poll_avg) %>%
      {xts(.[-1], .[[1]])} %>%
      dygraph(group = "poll-model") %>%
      dyRangeSelector()
  })

browsable(tagList(foo))

https://i.sstatic.net/qsvmu.png

Answer №2

It's highly likely that the chart you're seeing is created with d3.js or a tool built on top of it. d3 is very effective for SVG-based data visualizations because it helps set up scales that map values (like 40%) to positions on the display (such as cx=100). However, recovering the raw data from the rendered chart is tricky: you would need to know those scales, which are probably dynamic and change with factors like screen size.
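
To make the scale idea concrete, here is a minimal sketch of inverting a linear scale in R. It assumes the chart uses a linear y scale and that the pixel positions of two gridlines with known values can be read from the SVG. The helper `make_inverse_scale` and the reference positions (0% at cy = 300, 100% at cy = 0) are hypothetical, not values taken from the page:

```r
# Build a function that inverts a linear scale from two reference points.
# pixel_ref: pixel positions of two known gridlines (hypothetical values)
# value_ref: the data values those gridlines represent
make_inverse_scale <- function(pixel_ref, value_ref) {
  stopifnot(length(pixel_ref) == 2, length(value_ref) == 2,
            pixel_ref[1] != pixel_ref[2])
  slope <- diff(value_ref) / diff(pixel_ref)
  function(pixel) value_ref[1] + slope * (pixel - pixel_ref[1])
}

# Assumed example: the 0% gridline sits at cy = 300, the 100% gridline at cy = 0
cy_to_pct <- make_inverse_scale(pixel_ref = c(300, 0), value_ref = c(0, 100))

cy_to_pct(c(150, 161.44))
```

Because SVG y coordinates grow downward, a larger cy corresponds to a smaller percentage, which is why the slope is negative here. The reference points would have to be read again whenever the chart is resized.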

A more straightforward approach is to scrape the poll table shown below the chart on the page. The table sits inside a div with the ID latest-polls and carries the class t-polls.

In this code snippet, I'm utilizing html_node with CSS selectors along with html_table to convert the table into a dataframe, refining column names, and converting numeric columns to proper numerical data types. There are additional steps one could take, such as formatting dates, but this should give you a solid starting point.

library(tidyverse)
library(rvest)

url <- "https://projects.fivethirtyeight.com/election-2016/national-primary-polls/democratic/"

polls_df <- url %>% 
  read_html() %>%
  html_node("#latest-polls table.t-polls") %>%
  html_table() %>%
  setNames(c("new", "date", "pollster", "sample_n", "sample_type", names(.)[6:10]) %>% str_remove_all("\\W")) %>%
  mutate_at(vars(sample_n, Clinton, Sanders, OMalley), 
      function(x) str_remove_all(x, "\\D") %>% as.numeric())

head(polls_df)
#>   new           date                     pollster sample_n sample_type
#> 1   •     Jun. 10-13                 Selzer & Co.      486          LV
#> 2   •     Jun. 26-28                     Fox News      432          RV
#> 3   •     Jun. 18-20                       YouGov      390          LV
#> 4   •     Jun. 15-20              Morning Consult     1733          RV
#> 5   • Jun. 27-Jul. 1                Ipsos, online      142          LV
#> 6   •     Jun. 16-19 Opinion Research Corporation      435          RV
#>   weight      leader Clinton Sanders OMalley
#> 1   1.05  Clinton +2      45      43      NA
#> 2   0.91 Clinton +21      58      37      NA
#> 3   0.79 Clinton +13      55      42      NA
#> 4   0.79 Clinton +18      53      35      NA
#> 5   0.67 Clinton +41      70      29      NA
#> 6   0.66 Clinton +12      55      43      NA
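
The date formatting mentioned as a possible next step could look roughly like the sketch below. It assumes the strings follow the two shapes visible in the output ("Jun. 10-13" and "Jun. 27-Jul. 1") and that every poll falls in 2016. The helper name `parse_poll_end_date` is made up for illustration, and it returns only the last day of each polling window:

```r
# Convert a polling window such as "Jun. 10-13" or "Jun. 27-Jul. 1"
# into the Date of its final day. The year is an assumption (default 2016).
parse_poll_end_date <- function(x, year = 2016) {
  # Leading month name of the range, e.g. "Jun"
  start_month <- sub("^([A-Za-z]+).*$", "\\1", x)
  # Text after the last hyphen: either "13" or "Jul. 1"
  end_part <- sub("^.*-", "", x)
  # If the end part carries no month of its own, reuse the start month
  has_month <- grepl("[A-Za-z]", end_part)
  end_month <- ifelse(has_month,
                      sub("^([A-Za-z]+).*$", "\\1", end_part),
                      start_month)
  end_day <- as.integer(gsub("\\D", "", end_part))
  # month.abb holds English abbreviations, so this does not depend on locale
  month_num <- match(substr(end_month, 1, 3), month.abb)
  as.Date(sprintf("%d-%02d-%02d", year, month_num, end_day),
          format = "%Y-%m-%d")
}

parse_poll_end_date(c("Jun. 10-13", "Jun. 27-Jul. 1"))
```

With the scraped table this could be applied as `mutate(polls_df, end_date = parse_poll_end_date(date))`. Polling windows that cross a year boundary would need extra handling.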
