I am facing a challenge with my dataset that contains a column of HTML content. Each entry in this column represents a paragraph of HTML code. Here is an example:
html <- '<p id="PARA339" style="TEXT-ALIGN: left; MARGIN: 0pt; LINE-HEIGHT: 1.25"><font style="FONT-SIZE: 10pt; FONT-FAMILY: Times New Roman, Times, serif"><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year ended December 31, 201</i><i>6</i><i> and have an accumulated deficit of $</i><i>61.5</i><i> million as of December 31, 201</i><i>6</i><i>. To achieve sustainable profitability, we must generate increased revenue.</i></font></p>'
My task is to determine the styling of each HTML paragraph: bold, italic, underlined, and so on. Some paragraphs mix styles, like the one above (all italic, with only the number 55 also in bold), so I want to apply a rule: if more than 50% of the text within the paragraph is bold, flag the paragraph as bold.
I am unsure where to begin with this task. It would be immensely helpful to know which R package can assist me in analyzing and extracting these styling details from the HTML content. Any guidance or solution provided using that package would be greatly appreciated! Thank you.
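One possible starting point for the 50% rule, as a rough sketch rather than a definitive solution: it assumes the xml2 package is available, and that styling is expressed through tags such as <b>, <strong>, <i>, <em> and <u>. The helper name style_share is made up for illustration, and the sample string is an abbreviated version of the paragraph above. Styling applied through CSS (for example font-weight: bold inside a style attribute) would not be detected by this approach.

```r
library(xml2)

# Abbreviated version of the example paragraph (single quotes outside,
# so the double quotes inside the HTML would not need escaping).
html <- '<p><font><i>We had a net loss of $1.</i><i><b>55</b></i><i> million for the year.</i></font></p>'

# Hypothetical helper: share of a paragraph's characters that sit inside
# at least one of the given tags (e.g. c("b", "strong") for bold).
style_share <- function(html, tags) {
  doc <- read_html(html)

  # Every text node inside the paragraph
  all_nodes <- xml_find_all(doc, "//p//text()")

  # Text nodes that have at least one ancestor among `tags`
  cond <- paste0("ancestor::", tags, collapse = " or ")
  styled_nodes <- xml_find_all(doc, sprintf("//p//text()[%s]", cond))

  total <- sum(nchar(xml_text(all_nodes)))
  if (total == 0) return(0)  # empty paragraph: nothing is styled

  sum(nchar(xml_text(styled_nodes))) / total
}

# Apply the "more than 50% of the text" rule per style
is_bold      <- style_share(html, c("b", "strong")) > 0.5
is_italic    <- style_share(html, c("i", "em")) > 0.5
is_underline <- style_share(html, "u") > 0.5
```

For a whole column, the same helper could be applied to each entry, for example with sapply(df$html_column, style_share, tags = c("b", "strong")), where df and html_column stand in for the actual data frame and column names.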