Parsing HTML files with Perl to extract XML data

Question

Can XML output be generated from a webpage using Web::Scraper in Perl? Here is an excerpt of the HTML served at the URL:

    <table class="reference">
      <tr>
        <th width="23%" align="left">Property</th>
        <th width="71%" align="left">Description</th>
        <th style="text-align:center;">DOM</th>
      </tr>
      <tr>
      <!-- more HTML code -->

The Perl code snippet used to scrape this data is as follows:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use URI;
    use Web::Scraper;

    my $urlToScrape = "http://www.w3schools.com/jsref/dom_obj_node.asp";

    my $rennersdata = scraper {
        process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
        # more processing code
    };

    my $res = $teamsdata->scrape(URI->new($urlToScrape));

    # more code-snippet

The current output obtained is structured like this:

    <PropertyList>
    <Property>
    <Name>attributes</Name>
    // more output
    <ReturnValue>
    Returns a collection of a node's attributes
    </ReturnValue>
    ....

Desired output format:

    <PropertyList>
      <Property>
        <Name>attributes</Name>
        <ReturnValue>Returns a collection of a node's attributes</ReturnValue>
        <DOMVersion>1</DOMVersion>
      </Property>
    </PropertyList>

Please suggest how the for loops can be combined to achieve the desired output structure.

Thank you!

html css xml perl

Answer №1

To print the three related values together, move all of the output into the first for loop and use the loop variable $i as a shared index. $i runs from 0 to the last index of $res->{renners}, so on each iteration the same position in all three arrays gives the name, return value, and DOM version of one property. This assumes the three arrays have the same length and are in the same order.

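    # Print the root element once so the output has a single <PropertyList>.
    print "<PropertyList>\n";
    for my $i (0 .. $#{$res->{renners}}) {
        # The same index $i selects the matching entry in each array.
        print <<"XML";
      <Property>
        <Name>$res->{renners}[$i]</Name>
        <ReturnValue>$res->{landrenner}[$i]</ReturnValue>
        <DOMVersion>$res->{dom}[$i]</DOMVersion>
      </Property>
    XML
    }
    print "</PropertyList>\n";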

To improve readability, I replaced the individual print statements with a here-doc. I also fixed a bug by changing the line

    my $res = $teamsdata->scrape(URI->new($urlToScrape));

to

    my $res = $rennersdata->scrape(URI->new($urlToScrape));

because $teamsdata is never declared. The scraper is stored in $rennersdata, and under use strict the original line does not compile.
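One thing the loop does not handle is XML escaping: scraped text that contains &, < or > would produce malformed output. The following is a minimal sketch of the same index-based approach with escaping added. The sample data, xml_escape, and properties_to_xml are hypothetical names introduced for illustration, and the hash only imitates the shape of $res. It also assumes Perl 5.10 or later for the // operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample data shaped like the $res hash returned by the scraper;
# the real values would come from $rennersdata->scrape(...).
my $res = {
    renners    => [ 'attributes', 'baseURI' ],
    landrenner => [ "Returns a collection of a node's attributes",
                    'Returns the absolute base URI of a node' ],
    dom        => [ 1, 3 ],
};

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the document as a string: one <PropertyList> root,
# one <Property> per index shared by the three arrays.
sub properties_to_xml {
    my ($data) = @_;
    my $xml = "<PropertyList>\n";
    for my $i (0 .. $#{ $data->{renners} }) {
        # Missing entries become empty strings instead of triggering warnings.
        my ($name, $ret, $dom) =
            map { xml_escape($data->{$_}[$i] // '') } qw(renners landrenner dom);
        $xml .= "  <Property>\n"
              . "    <Name>$name</Name>\n"
              . "    <ReturnValue>$ret</ReturnValue>\n"
              . "    <DOMVersion>$dom</DOMVersion>\n"
              . "  </Property>\n";
    }
    return $xml . "</PropertyList>\n";
}

print properties_to_xml($res);
```

Building the string in a function rather than printing directly makes it easy to write the result to a file or check it before output.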

Answer №2

While this may not be the exact solution you had in mind, take a look at HTML::Element. Its as_XML method serializes an element and its children as XML, which makes it easy to convert parsed HTML into XML. Hope this helps.
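As a rough sketch of that suggestion, assuming the HTML-Tree distribution is installed: HTML::TreeBuilder parses the markup into HTML::Element nodes, look_down finds the table, and as_XML serializes it. The HTML fragment below is a hypothetical stand-in for the fetched page.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # from HTML-Tree; its nodes are HTML::Element objects

# Hypothetical HTML fragment, loosely modelled on the table in the question.
my $html = <<'HTML';
<table class="reference">
  <tr><th>Property</th><th>Description</th><th>DOM</th></tr>
  <tr><td>attributes</td><td>Returns a collection of a node's attributes</td><td>1</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# In scalar context, look_down returns the first matching element (or undef).
my $table = $tree->look_down(_tag => 'table', class => 'reference');

# Serialize the table and everything below it as XML.
my $xml = $table ? $table->as_XML : '';
print $xml;

$tree->delete;   # release the parse tree explicitly
```

Note that as_XML mirrors the HTML structure (table, tr, td), so it does not by itself produce the custom PropertyList format asked for in the question.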

Answer №3

Give this a try: reorder the print statements so that the PropertyList wrapper is printed once, outside the loop, and the fields of each property are printed together inside it:

    print "<PropertyList>\n";
    for my $i (0 .. $#{$res->{renners}}) {
        print "<Property>\n";
        # Use the same index $i for all three arrays so that each property
        # gets only its own return value and DOM version.
        print "<Name>$res->{renners}[$i]</Name>\n";
        print "<ReturnValue>$res->{landrenner}[$i]</ReturnValue>\n";
        print "<DOMVersion>$res->{dom}[$i]</DOMVersion>\n";
        print "</Property>\n";
    }
    print "</PropertyList>\n";
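Another option, offered only as a sketch, is to avoid parallel arrays altogether by scraping one record per table row with a nested scraper, so the three values of a property arrive together. The HTML below is a hypothetical stand-in for the page, and the keys rows and cells are names chosen for this example. It assumes each data row has its cells in the order name, description, DOM version.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Web::Scraper;

# Hypothetical HTML standing in for the fetched page.
my $html = <<'HTML';
<table class="reference">
  <tr><th>Property</th><th>Description</th><th>DOM</th></tr>
  <tr><td><a href="#">attributes</a></td><td>Returns a collection of a node's attributes</td><td>1</td></tr>
  <tr><td><a href="#">baseURI</a></td><td>Returns the absolute base URI of a node</td><td>3</td></tr>
</table>
HTML

# One record per <tr>; each record collects the text of its <td> cells in order.
my $rows = scraper {
    process 'table.reference tr', 'rows[]' => scraper {
        process 'td', 'cells[]' => 'TEXT';
    };
};

# Passing a scalar reference makes scrape() treat the argument as HTML content.
my $res = $rows->scrape(\$html);

my $xml = "<PropertyList>\n";
for my $row (@{ $res->{rows} || [] }) {
    # The header row contains only <th> elements, so it has no cells; skip it.
    my $cells = $row->{cells} or next;
    my ($name, $ret, $dom) = @$cells;
    $xml .= "  <Property>\n"
          . "    <Name>$name</Name>\n"
          . "    <ReturnValue>$ret</ReturnValue>\n"
          . "    <DOMVersion>$dom</DOMVersion>\n"
          . "  </Property>\n";
}
$xml .= "</PropertyList>\n";

print $xml;
```

Because each row is scraped as a unit, a missing cell affects only that property instead of shifting every later value out of alignment. The values are not XML-escaped here, so real page text containing & or < would need escaping first.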
