Scraping company ratings from an Indeed job site using R

Question

Although I have experience with R, I am new to HTML and CSS. I have been researching various web scraping methods both online and on Stack Overflow in order to implement them using R. However, I am encountering difficulties when it comes to extracting company ratings from job listing pages. Instead of retrieving the expected rating of 4.0 from the example URL, I keep getting character(0).

Below is my approach:

library(rvest)
library(tidyverse)
library(xml2)

#example URL
url<- "https://www.indeed.com/viewjob?jk=a25a91736b1f7042&tk=1e3q54n49heai800&from=serp&vjs=3&advn=8876452989351355&adid=95236293&sjdu=TDSJNe66qIM3gcXFOG94m--bPylNW2vvO3WAHEKN7JhCAD1FQ-2FXD1gQyElsLNkg6gfXO2CD3rQYOYjO9iXITyFdYOp8tCECkHuDmf3Og8qdMmciGFIv2ahigETjLmuY8uXdLjnQTg4__yOXqHJkA"

page<- read_html(url)


page %>%
   rvest::html_nodes("span")  %>%
   rvest::html_nodes(xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "ratingsContent", " " ))]')%>%
   rvest::html_text()

#Output is 
#character(0)
#It should return 4.0 instead!

Can anyone provide guidance on how to achieve this, and also suggest a method for returning NA if the company rating is missing? Thank you!

html css r web-scraping rvest

Answer 1

It appears that the xpath you are using is incorrect. Upon examining the source document, it seems that the desired value can be found within the content attribute of meta tags with the itemprop attribute set to "ratingValue".

Here is a functional example based on your provided URL:

# read the page, select the rating meta tags, and pull their content attribute
read_html(url) %>%
  html_elements(xpath = "//meta[contains(@itemprop, 'ratingValue')]") %>%
  html_attr("content") %>%
  unique()
#> [1] "3.5"
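
As for returning NA when the rating is missing, one possible approach is to check the length of the result before using it. The sketch below is only an illustration: get_rating() is a hypothetical helper name (not part of rvest), and the two inline HTML strings are made-up stand-ins for real job pages so that the example runs without a network request.

```r
library(rvest)

# Hypothetical helper (not an rvest function): returns the first ratingValue
# found on a parsed page as a number, or NA if no such meta tag exists.
get_rating <- function(page) {
  rating <- page %>%
    html_elements(xpath = "//meta[contains(@itemprop, 'ratingValue')]") %>%
    html_attr("content") %>%
    unique()
  # html_attr() returns character(0) when nothing matched, so test the length
  if (length(rating) == 0) NA_real_ else as.numeric(rating[1])
}

# Made-up HTML snippets standing in for a page with and without a rating
with_rating <- read_html(
  '<html><body><meta itemprop="ratingValue" content="3.5"></body></html>'
)
without_rating <- read_html("<html><body><p>No rating here</p></body></html>")

get_rating(with_rating)
#> [1] 3.5
get_rating(without_rating)
#> [1] NA
```

With a real listing you would call get_rating(read_html(url)). Returning NA_real_ rather than character(0) keeps the result length at one, which should make it easier to collect ratings from many pages into a single vector or data frame column.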
