I'm relatively new to coding, so please bear with me if this seems like a silly question. I'm currently working on developing a general-purpose scraper to extract product data using the "schema.org/Product" HTML microdata.
During testing I ran into a problem: on this specific page, the product name was incorrectly set to "Electronics" because itemprop elements from the Breadcrumbs schema conflict with those of the Product.
To start, I declared a variable to verify if the page contains an element with Product schema microdata.
var productMicrodata = document.querySelector('[itemscope][itemtype="https://schema.org/Product"], [itemscope][itemtype="http://schema.org/Product"]');
Next, my intention was to select all elements with the itemprop attribute, such as:
productMicrodata.querySelectorAll('[itemprop]');
The problem arises when I attempt to filter out elements with different ancestor schemas, such as Breadcrumbs and ListItem, which are still being included.
I assumed that by using the following logic:
productMicrodata.querySelectorAll(':not([itemscope]) [itemprop]');
I could avoid matching child elements with ancestor schemas (e.g. breadcrumbs), but to no avail.
I feel like I must be overlooking something obvious, but any guidance on how to correctly select elements with only one ancestor possessing the itemtype="http://schema.org/Product" attribute would be greatly appreciated.
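For context on why the :not([itemscope]) [itemprop] attempt still matches breadcrumb elements: the descendant combinator is satisfied by any ancestor that lacks itemscope, such as a plain wrapper div, and querySelectorAll evaluates the selector against the whole document, so ancestors outside productMicrodata count as well. A minimal sketch of one alternative is to walk up from each candidate and keep it only when the nearest enclosing itemscope is the Product element. The helper names nearestScope and ownProperties are hypothetical, and the sketch ignores itemref.

```javascript
// Hypothetical helper: find the nearest ancestor that starts a microdata item.
// The walk starts at the parent, so an element that is itself an itemscope
// (e.g. itemprop="offers") still counts as a property of the enclosing item.
function nearestScope(el) {
  let node = el.parentElement;
  while (node && !node.hasAttribute('itemscope')) {
    node = node.parentElement;
  }
  return node; // null when no ancestor has itemscope
}

// Hypothetical helper: itemprop elements whose nearest enclosing item is `root`.
function ownProperties(root) {
  return Array.from(root.querySelectorAll('[itemprop]'))
    .filter((el) => nearestScope(el) === root);
}
```

Calling ownProperties(productMicrodata) should then return the Product's own properties and leave out anything nested inside a BreadcrumbList or ListItem scope.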
EDIT: To clarify the location of the elements I wish to exclude, here's a visual representation of the DOM for the example page provided. My goal is to skip elements that sit inside any other ancestor with an itemtype attribute.
EDIT 2: Corrected the use of "parent" to "ancestor". Apologies for the oversight, as I'm still learning :|
EDIT 4/SOLUTION: I've identified a JavaScript-based solution utilizing the Element.closest() method. Here's an example:
// Collect every element on the page that carries an itemprop attribute.
const itemPropElements = document.querySelectorAll('[itemprop]');
const productTypes = ['http://schema.org/Product', 'https://schema.org/Product'];
const itemProp = {};
for (const el of itemPropElements) {
  // closest() starts at the element itself and returns null when nothing matches,
  // so guard against null before reading the attribute.
  const scope = el.closest('[itemtype]');
  if (scope && productTypes.includes(scope.getAttribute('itemtype'))) {
    itemProp[el.getAttribute('itemprop')] = el.textContent;
  }
}
console.log(itemProp);
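One caveat with reading textContent for every property: in microdata, some elements carry their value in an attribute instead of their text, for example a meta tag uses content, links use href, and images use src. Below is a rough sketch of a value reader along those lines. The names VALUE_ATTRIBUTE and itemPropValue are hypothetical, and the sketch simplifies in three ways: it returns raw attribute strings (URLs are not resolved to absolute form), it falls back to text content whenever the attribute is missing, and it trims whitespace.

```javascript
// Attribute that holds the microdata value for specific tags (assumed mapping,
// based on the HTML microdata rules); tags not listed use their text content.
const VALUE_ATTRIBUTE = {
  META: 'content',
  A: 'href', AREA: 'href', LINK: 'href',
  AUDIO: 'src', EMBED: 'src', IFRAME: 'src', IMG: 'src',
  SOURCE: 'src', TRACK: 'src', VIDEO: 'src',
  OBJECT: 'data',
  DATA: 'value', METER: 'value',
  TIME: 'datetime'
};

// Hypothetical helper: read the value of one itemprop element.
function itemPropValue(el) {
  const attr = VALUE_ATTRIBUTE[el.tagName.toUpperCase()];
  if (attr && el.hasAttribute(attr)) {
    return el.getAttribute(attr);
  }
  // Simplification: fall back to trimmed text when the attribute is absent.
  return el.textContent.trim();
}
```

In the loop above, itemPropValue(el) could stand in for el.textContent, so that something like a price given in a meta tag's content attribute is picked up rather than stored as an empty string.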