Steps to obtain the precise source code of a webpage

Question

Is there a way to download the exact source code of a webpage? I have tried both the URL method and Jsoup, but neither gives me exactly what I see in the page's actual source. For example:

<input type="image"
       name="ctl00$dtlAlbums$ctl00$imbAlbumImage"    
       id="ctl00_dtlAlbums_ctl00_imbAlbumImage"
       title="Independence Day Celebr..."
       border="0"         
       onmouseover="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','0','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');"
       onmouseout="AlbumImageSlideShow('ctl00_dtlAlbums_ctl00_imbAlbumImage','ctl00_dtlAlbums_ctl00_hdThumbnails','1','Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG','Uploads/imagegallary/135/Thumbnails/');" 
       src="Uploads/imagegallary/135/Thumbnails/IMG_3206.JPG"     
       alt="Independence Day Celebr..." 
       style="height:79px;width:148px;border-width:0px;"
/>

Jsoup does not pick up the 'style' attribute of this tag. In addition, when I download the page with the URL method, the style attribute is replaced by border=""/>.

I have tried the following code:

URL url=new URL("http://www.apcob.org/");
InputStream is = url.openStream();  // throws an IOException
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
File fileDir = new File(contextpath+"\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
while ((line = br.readLine()) != null)
{
  fw.write("\n"+line);
}
InputStream in = new FileInputStream(fileDir);  // read back the same file that was just written
String baseUrl="http://www.apcob.org/";
Document doc=Jsoup.parse(in,"UTF-8",baseUrl);
System.out.println(doc);

Another method I attempted is:

Document doc = Jsoup.connect(url_of_currentpage).get();

I am trying to achieve this in Java for the website '' where this issue is happening.

javascript java html css jsoup

Answer №1

The difference is most likely caused by the user agent string. When you open the page in a browser, the browser sends a user agent string that identifies it, and some websites serve different pages depending on the browser (for example, to mobile devices).

Try sending the same user agent string as your browser and see whether that resolves the issue.
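
As a minimal sketch of that suggestion, assuming Java 11 or later and the built-in java.net.http client: the class name UserAgentRequest, the helper names, and the 10-second timeout are made up for illustration, and BROWSER_UA is only an example value that you would replace with the string your own browser sends.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

public class UserAgentRequest {
    // Example desktop Chrome user agent (an assumption); copy the real one from your browser.
    static final String BROWSER_UA =
            "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36";

    // Builds a GET request that identifies itself with the given user agent string.
    static HttpRequest buildRequest(String url, String userAgent) {
        return HttpRequest.newBuilder(URI.create(url))
                .header("User-Agent", userAgent)
                .timeout(Duration.ofSeconds(10)) // hypothetical timeout, adjust as needed
                .GET()
                .build();
    }

    // Sends the request and returns the response body as the server delivered it.
    static String fetch(String url, String userAgent) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();
        HttpResponse<String> response =
                client.send(buildRequest(url, userAgent), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

Calling UserAgentRequest.fetch("http://www.apcob.org/", UserAgentRequest.BROWSER_UA) would then return the HTML the server sends to a client presenting that user agent. Whether it matches the browser's view still depends on how the site behaves.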

Answer №2

The page is modified by JavaScript after it loads. Jsoup is an HTML parser and cannot execute JavaScript, so it only sees the markup as the server sent it.

If you want the source as it appears in Chrome, you can use one of these tools:

ui4j
Selenium
HtmlUnit

All three can parse and execute the JavaScript in the page.

Answer №3

Something like this should do the trick:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public static void main(String[] args) throws Exception {
    // Only use this if you are working with a proxy
    //System.setProperty("java.net.useSystemProxies", "true");

    URL url = new URL("http://www.apcob.org/");

    // Send a desktop Chrome user agent so the server responds as it would to that browser
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36");

    // try-with-resources closes the reader even if reading fails
    try (BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(connection.getInputStream()))) {
        String inputLine;
        while ((inputLine = bufferedReader.readLine()) != null) {
            System.out.println(inputLine);
        }
    }
}

Answer №4

Here is a useful function for fetching web pages. Use it to get the HTML as a String, then convert the String to a Document with Jsoup.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public static String fetchPage(String urlFullAddress) throws IOException {
//      String proxy = "10.3.100.207";
//      int port = 8080;
        URL url = new URL(urlFullAddress);
//      Proxy proxyConnect = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxy, port));
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();//proxyConnect);

        connection.addRequestProperty("User-Agent",
                "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10");
        connection.setReadTimeout(5000); // read timeout in milliseconds

        connection.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
        connection.addRequestProperty("Accept-Language", "en-US,en;q=0.5");
        connection.addRequestProperty("Accept-Encoding", "gzip");
        connection.addRequestProperty("connection", "keep-alive");
        System.setProperty("http.keepAlive", "true");

        // The server may honor Accept-Encoding, so decompress the body when it says it did
        InputStream body = connection.getInputStream();
        if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
            body = new GZIPInputStream(body);
        }

        // Assumes the page is UTF-8; StringBuilder avoids quadratic string concatenation
        StringBuilder page = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))) {
            String current;
            while ((current = in.readLine()) != null) {
                page.append(current).append('\n'); // keep the line breaks
            }
        }

        return page.toString();
}

If you encounter issues with the JSOUP Parser, consider using . It parses HTML as-is, without correcting errors.

A couple of other things I noticed: you forgot to close fw, and you should replace "UTF8" with "UTF-8". For extensive CSS parsing, try a CSS parser.
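
The two remarks about closing the writer and naming the charset can be sketched as follows. This is only an illustration: the class PageWriter and the method copyLines are hypothetical names, and it assumes the page text is meant to be stored as UTF-8.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class PageWriter {
    // Copies the text line by line into the target file, encoded as UTF-8.
    // try-with-resources closes both the reader and the writer, which also
    // flushes any buffered output, even if an exception is thrown midway.
    static void copyLines(Reader source, Path target) throws IOException {
        try (BufferedReader reader = new BufferedReader(source);
             BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.newLine();
            }
        }
    }
}
```

Using the StandardCharsets.UTF_8 constant instead of a charset name string avoids spelling mistakes, and an unclosed BufferedWriter is a common reason for a saved file that ends too early.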

Answer №5

When you retrieve a webpage over HTTP, the web server sends the output it generates, not the underlying file, so you cannot get the exact source code of a PHP file over HTTP. As far as I know, the only way to get it is through FTP.
