Parsing HTML files with Perl to extract XML data

Can XML output be generated from a web page using Web::Scraper in Perl? Here is a sample of the HTML served at the URL:

    <table class="reference">
    <tr>
    <th width="23%" align="left">Property</th>
    <th width="71%" align="left">Description</th>
    <th style="text-align:center;">DOM</th>
    </tr>
    <tr>
    <!-- more HTML code -->

The Perl code snippet used to scrape this data is as follows:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use URI;
    use Web::Scraper;

    my $urlToScrape = "http://www.w3schools.com/jsref/dom_obj_node.asp";

    my $rennersdata = scraper {
        process "table.reference > tr > td > a", 'renners[]' => 'TEXT';
        # more processing code
    };

    my $res = $teamsdata->scrape(URI->new($urlToScrape));

    # more code-snippet

The current output obtained is structured like this:

<PropertyList>
<Property>
<Name>attributes</Name>
// more output
<ReturnValue>
Returns a collection of a node's attributes
</ReturnValue>
....

Desired output format:

<PropertyList>
<Property>
<Name>attributes</Name>
<ReturnValue>Returns a collection of a node's attributes</ReturnValue>
<DOMVersion>1</DOMVersion>
</Property> 
</PropertyList>

Please suggest how the for loops can be combined to achieve the desired output structure.

Thank you!

Answer №1

To line up the corresponding items from the three keys in $res, move all of the output into the first for loop. $i is the current index, running from 0 to the last index of the renners array, so on each iteration the same index retrieves the matching value from each of the three arrays. This assumes the three arrays have the same length and are in the same order.

print "<PropertyList>\n";
for my $i (0 .. $#{$res->{renners}}) {
  # One <Property> per index; the same $i selects the matching entry in each array.
  print <<"XML";
  <Property>
    <Name>$res->{renners}[$i]</Name>
    <ReturnValue>$res->{landrenner}[$i]</ReturnValue>
    <DOMVersion>$res->{dom}[$i]</DOMVersion>
  </Property>
XML
}
print "</PropertyList>\n";

To improve readability, I switched the print statements to a here-document. I also fixed a bug by changing the line

my $res = $teamsdata->scrape(URI->new($urlToScrape));

to

my $res = $rennersdata->scrape(URI->new($urlToScrape));

since $teamsdata is not declared anywhere in the posted code. Under use strict, that is a compile-time error.
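One caveat when interpolating scraped text straight into XML: characters such as & and < in the text would make the output malformed. Below is a minimal sketch of the same index-based loop with escaping added. The helper names xml_escape and property_list_xml and the sample data are made up for illustration; only the keys renners, landrenner and dom come from the code above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Escape the characters that are special in XML text content.
sub xml_escape {
    my ($text) = @_;
    $text = '' unless defined $text;
    $text =~ s/&/&amp;/g;   # must run first so later entities are not double-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    return $text;
}

# Build the whole document from three parallel arrays (hypothetical helper).
sub property_list_xml {
    my ($res) = @_;
    my $xml = "<PropertyList>\n";
    for my $i (0 .. $#{ $res->{renners} }) {
        # The same index picks the matching entry from each array.
        my ($name, $value, $dom) =
            map { xml_escape($res->{$_}[$i]) } qw(renners landrenner dom);
        $xml .= "  <Property>\n"
              . "    <Name>$name</Name>\n"
              . "    <ReturnValue>$value</ReturnValue>\n"
              . "    <DOMVersion>$dom</DOMVersion>\n"
              . "  </Property>\n";
    }
    return $xml . "</PropertyList>\n";
}

# Made-up sample data standing in for the scraper result.
my $sample = {
    renners    => ['attributes', 'example'],
    landrenner => ["Returns a collection of a node's attributes", 'a < b & c'],
    dom        => [1, 2],
};

print property_list_xml($sample);
```

Returning a string instead of printing inside the loop makes it easy to write the result to a file or check it before output. For anything beyond simple text nodes, a dedicated XML writer module would be a safer choice than hand-built strings.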

Answer №2

While this may not be the exact solution you had in mind, I recommend checking out HTML::Element. It offers an as_XML method that serializes a parsed HTML element, together with its children, as XML. Hope this helps.
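A rough sketch of that approach, assuming HTML::TreeBuilder is installed from CPAN (it builds a tree of HTML::Element nodes). The HTML fragment is made up to resemble the table in the question. Note that as_XML keeps the original element names (table, tr, td), so it does not by itself produce the custom PropertyList structure.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;   # CPAN module; its nodes are HTML::Element objects

# Small made-up fragment modelled on the table in the question.
my $html = <<'HTML';
<table class="reference">
<tr><th>Property</th><th>Description</th><th>DOM</th></tr>
<tr><td><a href="#">attributes</a></td><td>Returns a collection of attributes</td><td>1</td></tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);

# Find the first <table class="reference"> element in the parsed tree.
my $table = $tree->look_down(_tag => 'table', class => 'reference');

# as_XML serializes the element and everything below it.
my $xml = $table ? $table->as_XML : '';
print $xml;

$tree->delete;   # release the tree explicitly
```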

Answer №3

Give this a try: print the outer PropertyList tags once, outside the loop, and use the same index for all three arrays so that each Property gets only its own values:

print "<PropertyList>\n";
for my $i (0 .. $#{$res->{renners}}) {
    # The three arrays are parallel, so one index selects the matching values.
    print "<Property>\n";
    print "<Name>$res->{renners}[$i]</Name>\n";
    print "<ReturnValue>$res->{landrenner}[$i]</ReturnValue>\n";
    print "<DOMVersion>$res->{dom}[$i]</DOMVersion>\n";
    print "</Property>\n";
}
print "</PropertyList>\n";
