What R package is capable of analyzing HTML strings to identify words that are formatted as bold, italic, and other styles within the text?

I am facing a challenge with my dataset that contains a column of HTML content. Each entry in this column represents a paragraph of HTML code. Here is an example:

html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'

My task is to determine the styling of each HTML paragraph: bold, italic, underlined, etc. Some paragraphs mix styles, like the one above (all italic, with only the number 55 in bold), so I want to apply a rule: if more than 50% of the text within the HTML paragraph is bold, flag the paragraph as bold.

I am unsure where to begin with this task. It would be immensely helpful to know which R package can assist me in analyzing and extracting these styling details from the HTML content. Any guidance or solution provided using that package would be greatly appreciated! Thank you.

Answer №1

The rvest package can parse HTML and select specific elements, such as the b tag, as demonstrated in this guide:

library(rvest)
html <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
  ")

html %>% html_nodes("b")

{xml_nodeset (4)}
[1] <b>C-3PO</b>
[2] <b>R2-D2</b>
[3] <b>Yoda</b>
[4] <b>R4-P17</b>

Note that in rvest 0.3.6 and earlier the function is html_nodes(), as used above. rvest 1.0.0 renamed it to html_elements(); html_nodes() still works there but is superseded.

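If you wish to apply this to a data frame:

library(purrr)
# Assumes the HTML strings are stored in a column named `raw`.
# Returns a list with one node set of <b> elements per row.
purrr::map(df$raw, ~ read_html(.x) %>% html_nodes("b"))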
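The 50% rule from the question could be sketched along the following lines. This is only one possible approach, not something the answer above prescribes: it uses xml2 (the parser rvest is built on) to collect every text node, then measures what share of the characters sits inside a given tag. The helper name `style_share`, the column name `raw`, and the shortened sample paragraph are all assumptions made for illustration. It only looks at tags such as b, strong, i, em and u, so styling applied through CSS (for example `font-weight: bold` in a style attribute) would not be detected.

```r
library(xml2)

# Hypothetical helper: share of a fragment's characters that sit inside
# at least one of the given tags (e.g. <b> or <strong> for bold).
style_share <- function(html_string, tags) {
  doc <- read_html(html_string)
  # All text nodes in the parsed fragment
  all_text <- xml_find_all(doc, "//text()")
  # XPath predicate such as "ancestor::b or ancestor::strong"
  ancestor_test <- paste0("ancestor::", tags, collapse = " or ")
  styled_text <- xml_find_all(doc, sprintf("//text()[%s]", ancestor_test))
  total <- sum(nchar(xml_text(all_text)))
  if (total == 0) return(0)
  sum(nchar(xml_text(styled_text))) / total
}

# Shortened stand-in for the paragraph in the question (assumption)
html <- '<p><i>We had a net loss of $1.</i><i><b>55</b></i><i> million.</i></p>'

# Assumed data frame layout: one HTML paragraph per row in column `raw`
df <- data.frame(
  raw = c(html, "<p><b>Risk Factors</b></p>"),
  stringsAsFactors = FALSE
)

df$bold_share <- vapply(df$raw, style_share, numeric(1),
                        tags = c("b", "strong"), USE.NAMES = FALSE)
df$italic_share <- vapply(df$raw, style_share, numeric(1),
                          tags = c("i", "em"), USE.NAMES = FALSE)

# Flag a paragraph when more than half of its characters carry the style
df$is_bold <- df$bold_share > 0.5
df$is_italic <- df$italic_share > 0.5
```

Counting characters rather than words keeps fragments such as the split number "1." + "55" from distorting the share. If a word-based rule is preferred, the `nchar()` calls could be swapped for a word count.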
