Stripping all HTML tags using Nokogiri

I've been searching everywhere, and all I could find was how to use CSS selectors with Nokogiri. What I really need is to remove all HTML tags and keep only the text.

For instance, this:

<html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
      <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
      <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
      <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
      <p>Here are some entertaining links:</p>
      <ul>
         <li><a href="http://youtube.com">YouTube</a></li>
         <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
         <li><a href="http://kathack.com/">Kathack</a></li>
         <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
      </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
<p>Addition</p>
</html> 
Extra content

should be output as:

Hello Webpage!

Click here to go to the search engine Google

Or you can click here to go to Microsoft Bing.

Don't want to learn Ruby? Then give Zed Shaw's Learn Python the Hard Way a try

Here are some entertaining links:

YouTube
Reddit
Kathack
New York Times
Thank you for reading my webpage!
Addition
Extra content

Is there a way to achieve this using Nokogiri? Also, are there any methods to scrape other code, such as JavaScript?

Answer №1

require 'nokogiri'

webContent = %q{
  <html>
   <head><title>My website</title></head>
   <body>
   <h1>Greetings from the Web!</h1>
   <div id="links">
     <p><a href="http://www.facebook.com">Visit Facebook</a> for social networking</p>
     <p>Alternatively, you can <a href="http://www.twitter.com">check out Twitter</a> for updates and news.</p>
     <p>Interested in learning a new language? Try <a href="http://duolingo.com/">Duolingo</a> for language lessons.</p>
   </div>

   <div id="entertainment">
    <p>Discover some fun links:</p>
    <ul>
     <li><a href="http://netflix.com">Netflix</a></li>
     <li><a data-category="music" href="http://spotify.com">Spotify</a></li>
     <li><a href="http://twitch.tv/">Twitch</a></li>
     <li><a data-category="games" href="http://steamcommunity.com">Steam Community</a></li>
     </ul>
   </div>

   <p>Thanks for exploring my site!</p>

   </body>
</html>
}

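# Parse as HTML (not XML) so that real-world, non-well-formed markup is handled.
parsedDoc = Nokogiri::HTML(webContent)
# #text already returns the text content without any tags, so no regex is needed.
puts parsedDoc.at('body').text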

Answer №2

Consider using Loofah, an HTML sanitization library built on top of Nokogiri.

With Loofah, the process looks like this:

require 'loofah'

# html is assumed to be a String holding the markup to clean
content = Loofah.fragment(html)
content.scrub!(:prune).text

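The :prune scrubber removes unsafe elements (for example, script) together with everything nested inside them, and text then returns the remaining text content. Any line breaks in that result come from whitespace already present in the markup; if you want each block element on its own line, Loofah also provides to_text, which puts newlines around block elements.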
