Retrieve the complete content of a web page, including all HTML and JavaScript elements

After spending several hours researching and experimenting, I find myself a bit confused about the topic at hand.

My issue: I am attempting to retrieve the complete HTML content (including dynamically generated JavaScript content) of a specific web page. Here's what I've already attempted:

  • I initially tried using Jsoup, but had to switch gears due to its inability to handle JavaScript content.
  • I experimented with HtmlUnit, but encountered numerous errors while loading the targeted webpage (CSS errors, runtimeError, EcmaError, etc.).
  • I resorted to using the basic Chrome function to save the entire webpage content, then utilized the Jsoup library to extract the specific information I needed. This workaround proved to be the only way to achieve my desired results.

My current question is: How can I replicate the functionality of the "save as" feature in a browser, or more broadly, how can I extract the full HTML content first and then utilize Jsoup to parse the static HTML content effectively?

Thank you in advance for your guidance and assistance!

Answer №1

I've finally achieved what I was aiming for. I'll do my best to explain it for those who may need assistance!


Alright! The process consists of two steps:

  • First, retrieve the final HTML content (including the content generated by JavaScript), as if you were browsing the webpage, and save it to a plain file such as file.html.
  • Next, use the Jsoup library to extract the desired content from the saved file, file.html.

1 - Obtain HTML content and save it

For this step, you'll need to download PhantomJS and use it to fetch the content. Here's the script that grabs the target page. Replace myTargetedPage.com with the URL of the page you want, and mySaveFile.html with the name of the file to save to. Save the script to a .js file and run it with the phantomjs command.

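var page = require('webpage').create();
var fs = require('fs');
page.open('http://myTargetedPage.com', function (status) {
    // Stop early if the page could not be loaded
    if (status !== 'success') {
        console.log('Unable to load the page');
        phantom.exit(1);
        return;
    }
    // page.content is the current DOM serialized as HTML, including changes made by scripts so far
    fs.write('mySaveFile.html', page.content, 'w');
    phantom.exit();
});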

As you can see, the saved file mirrors the content loaded in your browser.
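If you would rather launch this step from Java, so that both steps run from one program, here is a minimal sketch. It assumes the phantomjs executable is on your PATH and that the script above was saved as savePage.js; that file name and the PhantomRunner class are hypothetical names used only for illustration.

```java
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

class PhantomRunner {

    // Builds the command line "phantomjs <script>".
    static List<String> buildCommand(String scriptPath) {
        return Arrays.asList("phantomjs", scriptPath);
    }

    // Runs the script and returns the exit code reported by PhantomJS.
    static int run(String scriptPath) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(scriptPath))
                .inheritIO() // forward PhantomJS console output to this process
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) throws InterruptedException {
        try {
            // "savePage.js" is an assumed name for the script shown above
            int exitCode = run("savePage.js");
            if (exitCode == 0) {
                System.out.println("Page saved; the HTML file is ready for Jsoup.");
            } else {
                System.out.println("PhantomJS exited with code " + exitCode);
            }
        } catch (IOException e) {
            // Thrown when the phantomjs executable cannot be started
            System.out.println("Could not start PhantomJS: " + e.getMessage());
        }
    }
}
```

A non-zero exit code means the script did not finish normally, so it is worth checking it before moving on to the parsing step.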

2 - Extract the desired content

Now, we'll employ Java and the Jsoup library to retrieve the specific content. In my case, I wish to extract this section from the webpage:

/* HTML CONTENT */
<span class="my class" data="data1"></span>
/* HTML CONTENT */
<span class="my class" data="data2"></span>
/* HTML CONTENT */

To accomplish this, replace thePathToYourSavedFile.html in the following code with the path to the file you saved in step 1:

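import java.io.File;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class ExtractData {
    public static void main(String[] args) throws Exception {
        // Path to the file written by PhantomJS in step 1
        File input = new File("thePathToYourSavedFile.html");

        // Parse the local file; Jsoup.connect() only accepts http(s) URLs
        Document document = Jsoup.parse(input, "UTF-8");

        Elements spanList = document.select("span");

        for (Element span : spanList) {
            // Keep only the spans whose class attribute is exactly "my class"
            if (span.attr("class").equals("my class")) {
                String data = span.attr("data");
                System.out.println("data : " + data);
            }
        }
    }
}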

Enjoy!

Answer №2

If you're searching for a convenient plugin that provides the functionality you need, look no further. This plugin allows you to easily assess a page and its features. While it may not be compatible with all browsers, it is available for some. You can find it here:

Once installed, you'll notice a small gear icon on the toolbar located at the top right of the page. This is where you'll find all the essential functions of the plugin.
