Parsing all HTML elements using Nokogiri

I've been searching everywhere, and all I could find was how to use CSS selectors with Nokogiri. What I really need is to strip out all HTML tags and keep only the text.

For instance, this:

<html>
   <head><title>My webpage</title></head>
   <body>
   <h1>Hello Webpage!</h1>
   <div id="references">
      <p><a href="http://www.google.com">Click here</a> to go to the search engine Google</p>
      <p>Or you can <a href="http://www.bing.com">click here to go</a> to Microsoft Bing.</p>
      <p>Don't want to learn Ruby? Then give <a href="http://learnpythonthehardway.org/">Zed Shaw's Learn Python the Hard Way</a> a try</p>
   </div>

   <div id="funstuff">
      <p>Here are some entertaining links:</p>
      <ul>
         <li><a href="http://youtube.com">YouTube</a></li>
         <li><a data-category="news" href="http://reddit.com">Reddit</a></li>
         <li><a href="http://kathack.com/">Kathack</a></li>
         <li><a data-category="news" href="http://www.nytimes.com">New York Times</a></li>
      </ul>
   </div>

   <p>Thank you for reading my webpage!</p>

   </body>
<p>Addition</p>
</html> 
Extra content

Should be output as:

Hello Webpage!

Click here to go to the search engine Google

Or you can click here to go to Microsoft Bing.

Don't want to learn Ruby? Then give Zed Shaw's Learn Python the Hard Way a try

Here are some entertaining links:

YouTube
Reddit
Kathack
New York Times
Thank you for reading my webpage!
Addition
Extra content

Is there a way to achieve this using Nokogiri? Also, is there a way to strip out other code, such as JavaScript?

Answer №1

require 'nokogiri'

webContent = %q{
  <html>
   <head><title>My website</title></head>
   <body>
   <h1>Greetings from the Web!</h1>
   <div id="links">
     <p><a href="http://www.facebook.com">Visit Facebook</a> for social networking</p>
     <p>Alternatively, you can <a href="http://www.twitter.com">check out Twitter</a> for updates and news.</p>
     <p>Interested in learning a new language? Try <a href="http://duolingo.com/">Duolingo</a> for language lessons.</p>
   </div>

   <div id="entertainment">
    <p>Discover some fun links:</p>
    <ul>
     <li><a href="http://netflix.com">Netflix</a></li>
     <li><a data-category="music" href="http://spotify.com">Spotify</a></li>
     <li><a href="http://twitch.tv/">Twitch</a></li>
     <li><a data-category="games" href="http://steamcommunity.com">Steam Community</a></li>
     </ul>
   </div>

   <p>Thanks for exploring my site!</p>

   </body>
</html>
}

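# Parse as HTML rather than XML so the lenient HTML parser handles real-world markup.
parsedDoc = Nokogiri::HTML(webContent)
pageContent = parsedDoc.search('body')
# text already returns the content with the tags removed, so no regex is needed.
puts pageContent.text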

Answer №2

Consider using Loofah, a library built on top of Nokogiri.

With Loofah, the process would involve:

require 'loofah'
content = Loofah.fragment(html)  # html is the HTML string to clean
content.scrub!(:prune).text

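The :prune scrubber removes unsafe elements (script tags, for example) together with everything inside them, and text then returns what remains as plain text. Note that text keeps only the whitespace already present in the source. If you need line breaks around block-level elements, Loofah's to_text adds them.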
