Steps to obtain the precise source code of a webpage

Is there a way to download the exact source code of a webpage? I have tried both the URL method and the Jsoup method, but neither returns exactly what I see in the actual page source. For example:

<input type="image"
       title="Independence Day Celebr..."
       alt="Independence Day Celebr..." 

Jsoup does not detect the 'style' attribute of this tag. Additionally, when I download the page using the URL method, the style attribute is replaced by border=""/>.

I have tried the following code:

URL url=new URL("");
InputStream is = url.openStream();  // throws an IOException
BufferedReader br = new BufferedReader(new InputStreamReader(is));
String line;
File fileDir = new File(contextpath+"\\extractedtxt.txt");
Writer fw = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(fileDir), "UTF8"));
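while ((line = br.readLine()) != null) {
    fw.write(line);  // the loop body is missing from the post; writing each line to the file is assumed
}
InputStream in = new FileInputStream(new File(contextpath+"extractedtxt.txt"));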
String baseUrl="";
Document doc=Jsoup.parse(in,"UTF-8",baseUrl);

Another method I attempted is:

Document doc = Jsoup.connect(url_of_currentpage).get();

I am trying to achieve this in Java for the website '' where this issue is happening.

Answer №1

The difference is most likely caused by a different user agent string. When you access the page through your browser, it sends a User-Agent header identifying the browser in use, and some websites serve different pages depending on that header (for example, a separate version for mobile devices).

Try matching your browser's user agent string to see if that resolves the issue.
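
As a rough illustration of why the header matters, here is a minimal sketch using only the JDK. The local test server, the class name UserAgentDemo, and the shortened user agent strings are all assumptions made up for this example; the server simply stands in for a site that varies its markup by browser.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class UserAgentDemo {

    // Hypothetical local server standing in for a site that varies its markup by browser.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            String ua = exchange.getRequestHeaders().getFirst("User-Agent");
            boolean mobile = ua != null && ua.contains("Mobile");
            String html = mobile ? "<p>mobile page</p>" : "<p>desktop page</p>";
            byte[] body = html.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Fetches a page while sending the given User-Agent header.
    static String fetch(String address, String userAgent) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) new URL(address).openConnection();
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            String address = "http://127.0.0.1:" + server.getAddress().getPort() + "/";
            // Same URL, two different user agents (both strings are shortened examples)
            System.out.println(fetch(address, "Mozilla/5.0 (Windows NT 6.1; WOW64) Chrome/46.0"));
            System.out.println(fetch(address, "Mozilla/5.0 (iPad) Mobile Safari"));
        } finally {
            server.stop(0);
        }
    }
}
```

The two requests go to the same address but come back with different markup, which is the effect described above. With Jsoup, the equivalent step would be calling userAgent(...) on the connection before get().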

Answer №2

The page you see in the browser has been altered by JavaScript code. Jsoup is an HTML parser and cannot execute JavaScript, so it only sees the markup as the server sent it.

If you want to view the source code as it appears in Chrome, you can use one of these tools:

All three tools can parse and execute the JavaScript code within the page.
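
To make the point about JavaScript concrete, here is a small sketch using only the JDK. The page content, the class name RawSourceDemo, and the helper names are hypothetical; the local server only stands in for a site whose script adds a style attribute after the page loads.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.InputStream;
import java.net.InetSocketAddress;
import java.net.URL;
import java.nio.charset.StandardCharsets;

class RawSourceDemo {

    // Hypothetical page: the style attribute only exists after the script has run in a browser.
    static final String PAGE =
            "<html><body>\n"
          + "<input type=\"image\" id=\"banner\">\n"
          + "<script>document.getElementById('banner').setAttribute('style', 'border:0');</script>\n"
          + "</body></html>\n";

    // Local server that returns PAGE unchanged for every request.
    static HttpServer startServer() throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("127.0.0.1", 0), 0);
        server.createContext("/", exchange -> {
            byte[] body = PAGE.getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            exchange.getResponseBody().write(body);
            exchange.close();
        });
        server.start();
        return server;
    }

    // Plain HTTP download: returns the bytes the server sent, with no script execution.
    static String fetchRaw(String address) throws IOException {
        try (InputStream in = new URL(address).openStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    // True if the <input> tag itself carries a style attribute in the given markup.
    static boolean inputHasStyle(String html) {
        int start = html.indexOf("<input");
        if (start < 0) {
            return false;
        }
        int end = html.indexOf('>', start);
        return end > start && html.substring(start, end).contains("style");
    }

    public static void main(String[] args) throws IOException {
        HttpServer server = startServer();
        try {
            String html = fetchRaw("http://127.0.0.1:" + server.getAddress().getPort() + "/");
            System.out.println("script present: " + html.contains("<script>"));
            System.out.println("input has style: " + inputHasStyle(html));
        } finally {
            server.stop(0);
        }
    }
}
```

The downloaded text still contains the script, but the input tag has no style attribute, because nothing has run the script. A browser's element inspector would show the attribute, which is one way the two views of the "source" can differ.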

Answer №3

Something like this should do the trick:

public static void main(String[] args) throws Exception {
    //Only use this if you are working with a proxy
    //System.setProperty("", "true");

    URL url = new URL("");

    HttpURLConnection connection = (HttpURLConnection) url.openConnection();
    connection.addRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/46.0.2490.86 Safari/537.36");
    BufferedReader bufferedReader = new BufferedReader(new InputStreamReader(connection.getInputStream()));

    String inputLine;
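    while ((inputLine = bufferedReader.readLine()) != null) {
        System.out.println(inputLine);  // assumed: the loop body is cut off in the post
    }
    bufferedReader.close();
}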

Answer №4

Here is a useful function for fetching webpages. Use it to get the HTML as a String, then convert the String to a Document with Jsoup (Jsoup.parse(html)).

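import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

public static String fetchPage(String urlFullAddress) throws IOException {
//  String proxy = "";
//  int port = 8080;
    URL url = new URL(urlFullAddress);
//  Proxy proxyConnect = new Proxy(Proxy.Type.HTTP, new InetSocketAddress(proxy, port));
    HttpURLConnection connection = (HttpURLConnection) url.openConnection();//proxyConnect);

    // The start of this call is cut off in the post; setting the User-Agent header is assumed
    connection.setRequestProperty("User-Agent",
            "Mozilla/5.0 (iPad; U; CPU OS 3_2 like Mac OS X; en-us) AppleWebKit/531.21.10 (KHTML, like Gecko) Version/4.0.4 Mobile/7B334b Safari/531.21.10");
    connection.setReadTimeout(5000); // give up if no data arrives within 5 seconds

    connection.addRequestProperty("Accept", "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8");
    connection.addRequestProperty("Accept-Language", "en-US,en;q=0.5");
    // Ask only for gzip: HttpURLConnection does not decompress the body on its own
    connection.addRequestProperty("Accept-Encoding", "gzip");
    connection.addRequestProperty("connection", "keep-alive");
    System.setProperty("http.keepAlive", "true");

    InputStream body = connection.getInputStream();
    if ("gzip".equalsIgnoreCase(connection.getContentEncoding())) {
        body = new GZIPInputStream(body); // unwrap a compressed response
    }

    StringBuilder page = new StringBuilder();
    // The charset is assumed to be UTF-8; the original relied on the platform default
    try (BufferedReader in = new BufferedReader(new InputStreamReader(body, StandardCharsets.UTF_8))) {
        String current;
        while ((current = in.readLine()) != null) {
            page.append(current).append('\n'); // keep line breaks so the source stays intact
        }
    }
    return page.toString();
}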

If you encounter issues with the JSOUP Parser, consider using . It parses HTML as-is, without correcting errors.

A couple of other things I noticed: you forgot to close fw, and you should replace "UTF8" with "UTF-8". For extensive CSS parsing, try a CSS parser.
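
The two points about fw and the charset name could be handled along these lines. This is only a sketch: the class name SavePageDemo, the helper names, and the temporary file are made up for the example, and it swaps the FileOutputStream chain from the question for Files.newBufferedWriter, which is one of several ways to do it.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class SavePageDemo {

    // Writes the text as UTF-8; try-with-resources flushes and closes the writer,
    // so nothing stays stuck in the buffer when the file is read back.
    static void save(Path file, String html) throws IOException {
        try (BufferedWriter fw = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            fw.write(html);
        }
    }

    // Reads the whole file back using the same charset.
    static String load(Path file) throws IOException {
        return new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("extractedtxt", ".txt"); // hypothetical location
        String html = "<input type=\"image\" style=\"border:0\" alt=\"C\u00e9l\u00e9bration\">\n";
        save(file, html);
        System.out.println(load(file).equals(html)); // prints "true"
        Files.delete(file);
    }
}
```

Using the StandardCharsets.UTF_8 constant avoids spelling the charset name as a string at all, and closing the writer before reading matters because a BufferedWriter may hold the last part of the page in memory until it is flushed.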

Answer №5

When you retrieve a webpage over HTTP, the web server sends the output it generates rather than the file behind it. You cannot get the exact source code of a PHP file over HTTP, because the server executes the file and returns only the result. As far as I know, the only way to get the file itself is to use FTP.

