Extract information from a poorly structured PDF table

I am trying to extract data from a poorly structured PDF document (the URL is in the code below). To split the text into meaningful records, I need the positions of the lines and borders of the table.

import re
import urllib2

import scraperwiki

url = "http://www.cmc.gv.ao/sites/main/pt/Lists/CMC%20%20PublicaesFicheiros/Attachments/89/Lista%20de%20Institui%C3%A7%C3%B5es%20Registadas%20(actualizado%2004.07.16).pdf"

u = urllib2.urlopen(url)
xml = scraperwiki.pdftoxml(u.read())  # convert the PDF to XML

Inspecting the XML shows that it does not indicate how the table lines segment the data. A typical line looks like this:

<text top="678" left="493" width="103" height="12" font="6">Besa Património </text>

Viewing the document with the browser's element inspector, the HTML provides slightly more detail, but it still lacks the positions of the table lines.

I have already spent a lot of time on this, so even experimental solutions would be very helpful. The main question is: how can I obtain the positions of these table lines?

Answer №1

Here are the steps to extract table borders:

  • First, decompress the PDF file and go through its objects. You can try pdfrw or a similar tool for this.
  • Look for lines and rectangles in the page content. Table lines are often drawn as thin rectangles: four values (x, y, width, height) followed by the re operator, like:

270.17 749.85 182.81 20.67 re

or

270.17 414.16 182.81 20.76 re
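As a rough sketch of this step, the snippet below pulls every rectangle out of a content stream with a regular expression. It assumes the stream is FlateDecode-compressed (so zlib can inflate it) and that the four operands are plain decimal numbers separated by whitespace. `raw_stream` is a hypothetical stand-in built from the two sample rectangles above; in a real document you would read those bytes from the page's content object with a PDF library.

```python
import re
import zlib

# Hypothetical stand-in for one FlateDecode-compressed page content stream;
# in a real PDF these bytes come from the page's /Contents object.
raw_stream = zlib.compress(
    b"q\n270.17 749.85 182.81 20.67 re\nS\n"
    b"270.17 414.16 182.81 20.76 re\nS\nQ\n"
)

# Four numbers followed by the `re` operator: x, y, width, height.
NUMBER = r"(-?\d*\.?\d+)"
RECT_RE = re.compile(r"\s+".join([NUMBER] * 4) + r"\s+re\b")


def find_rectangles(compressed):
    """Return an (x, y, width, height) tuple for every `re` operator."""
    content = zlib.decompress(compressed).decode("latin-1")
    return [tuple(float(v) for v in m.groups())
            for m in RECT_RE.finditer(content)]


rectangles = find_rectangles(raw_stream)
for rect in rectangles:
    print(rect)
```

A regular expression is enough for an experiment, but it ignores any coordinate transformations (the cm operator) applied earlier in the stream, so the values may need adjusting for some files.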

If you successfully decompress the PDF, you can create a simple parser or use regular expressions to:

  • Gather all rectangles
  • Categorize rectangles based on similar X and Y coordinates
  • Determine X and Y border coordinates
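  • Match text snippets with specific column or row boundaries (remember that PDF places the origin at the bottom-left of the page with Y increasing upward, whereas the top values in the XML are measured from the top of the page; refer to the PDF specification)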
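The steps above can be sketched roughly as follows. This is only an illustration under several assumptions: the `rectangles` list is made-up sample data, `PAGE_HEIGHT` is assumed to be an A4 page in points, the 1.0 tolerance is arbitrary, and the helper names (`cluster`, `table_borders`, `locate`) are hypothetical. The XML coordinates may also use a different scale than PDF points, so compare the page height reported in the XML with the PDF page size and rescale if needed.

```python
from bisect import bisect_right

# Hypothetical sample input: (x, y, width, height) rectangles taken from
# `re` operators, in PDF coordinates (origin at the bottom-left of the page).
rectangles = [
    (100.0, 700.0, 150.0, 20.0),
    (250.0, 700.0, 150.0, 20.0),
    (100.0, 680.0, 150.0, 20.0),
    (250.5, 680.0, 149.5, 20.0),
]
PAGE_HEIGHT = 842.0  # assumed page height (A4 in points)


def cluster(values, tolerance=1.0):
    """Merge values lying within `tolerance` of a border into that border."""
    borders = []
    for value in sorted(values):
        if not borders or value - borders[-1] > tolerance:
            borders.append(value)
    return borders


def table_borders(rects, page_height):
    """Return column borders (left to right) and row borders (top down)."""
    xs, tops = [], []
    for x, y, w, h in rects:
        xs += [x, x + w]
        # Flip Y so it is measured from the top edge, like the XML `top`.
        tops += [page_height - y, page_height - (y + h)]
    return cluster(xs), cluster(tops)


def locate(left, top, col_borders, row_borders):
    """Return (row, column) of the cell containing a point, or None."""
    col = bisect_right(col_borders, left) - 1
    row = bisect_right(row_borders, top) - 1
    if 0 <= col < len(col_borders) - 1 and 0 <= row < len(row_borders) - 1:
        return row, col
    return None


col_borders, row_borders = table_borders(rectangles, PAGE_HEIGHT)
# A text snippet with left="120" top="150" (already in the same scale).
print(locate(120.0, 150.0, col_borders, row_borders))
```

Each `<text>` element can then be assigned to a cell from its left and top attributes, and snippets that share a cell can be joined into one field.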

This is similar to how utilities like ByteScout PDF Multitool (which unfortunately runs only on Windows) operate.
