Extract information from a poorly structured PDF chart

Question

I am trying to extract data from a poorly structured table in a PDF document (the URL is in the code below). To split the content into meaningful records, I need the positions of the lines and borders of the table.

import scraperwiki, urllib2

url = "http://www.cmc.gv.ao/sites/main/pt/Lists/CMC%20%20PublicaesFicheiros/Attachments/89/Lista%20de%20Institui%C3%A7%C3%B5es%20Registadas%20(actualizado%2004.07.16).pdf"

u = urllib2.urlopen(url)                # download the PDF
xml = scraperwiki.pdftoxml(u.read())    # convert the PDF to XML

The XML contains only positioned text, so it does not show how the table lines segment the data. A typical line looks like this:

<text top="678" left="493" width="103" height="12" font="6">Besa Património </text>

The HTML shown in the browser's element inspector gives slightly more detail, but it also lacks the positions of the table lines.

This has cost me a lot of time, so any experimental solution would be very helpful. The main question is how to obtain the positions of these table lines.

html css regex python-2.7 pdf

Answer 1

Here are the steps to extract table borders:

First, decompress the PDF's content streams and go through its objects. You can try pdfrw or a similar tool for this.
Then look for lines and rectangles in the page content. Table lines are often drawn as thin rectangles: four values (x, y, width, height) followed by the re operator, like:

270.17 749.85 182.81 20.67 re

or

270.17 414.16 182.81 20.76 re
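
As a rough sketch of that search, the snippet below inflates every Flate-compressed stream it can find in the raw file and collects the operands of each re operator. It uses only the standard library instead of pdfrw. The names find_rectangles, STREAM_RE and RECT_RE are made up for this illustration, and the regex-based stream detection is a shortcut that assumes plain FlateDecode streams (no object streams, encryption or other filters).

```python
import re
import zlib

# A PDF number: optional sign, digits, optional fraction.
NUM = br"[-+]?(?:\d+\.?\d*|\.\d+)"

# Four numeric operands followed by the `re` (rectangle) operator.
RECT_RE = re.compile(
    br"(" + NUM + br")\s+(" + NUM + br")\s+(" + NUM + br")\s+(" + NUM + br")\s+re\b"
)

# Raw stream bodies; assumes "endstream" does not occur inside the data.
STREAM_RE = re.compile(br"stream\r?\n(.*?)endstream", re.S)


def find_rectangles(pdf_bytes):
    """Return (x, y, width, height) tuples for every `re` operator found."""
    rects = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the newline that precedes "endstream"
            content = zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not a Flate stream (image, font, other filter)
        for m in RECT_RE.finditer(content):
            rects.append(tuple(float(v.decode("ascii")) for v in m.groups()))
    return rects
```

With the code from the question, the bytes returned by u.read() could be passed straight to find_rectangles. Expect some false positives from non-page streams, so it is worth checking the result against a page you can see.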

If you successfully decompress the PDF, you can create a simple parser or use regular expressions to:

Gather all rectangles
Categorize rectangles based on similar X and Y coordinates
Determine X and Y border coordinates
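Match text snippets with specific column or row boundaries (remember that the Y axis is inverted: by default PDF measures Y upward from the bottom of the page, while the top attribute in the XML is measured downward from the top of the page; refer to the PDF specification)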
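
The four steps above could look roughly like this. It is only a sketch: cluster, grid_from_rects and cell_of are hypothetical helper names, the tolerance of 2 units is a guess, and it assumes the text positions have already been converted to the same units and orientation as the rectangles (for example pdf_y = page_height - top, after undoing any scaling applied by the XML conversion).

```python
import bisect


def cluster(values, tol=2.0):
    """Merge coordinates that lie within `tol` of their neighbour."""
    groups = []
    for v in sorted(values):
        if groups and v - groups[-1][-1] <= tol:
            groups[-1].append(v)
        else:
            groups.append([v])
    # one representative (the mean) per group, in ascending order
    return [sum(g) / len(g) for g in groups]


def grid_from_rects(rects, tol=2.0):
    """Derive sorted X and Y border coordinates from (x, y, w, h) tuples."""
    xs, ys = [], []
    for x, y, w, h in rects:
        xs.extend([x, x + w])
        ys.extend([y, y + h])
    return cluster(xs, tol), cluster(ys, tol)


def cell_of(x, y, x_borders, y_borders):
    """Return (row, col) of the cell containing the point, or None if outside."""
    col = bisect.bisect_right(x_borders, x) - 1
    row = bisect.bisect_right(y_borders, y) - 1
    if 0 <= col < len(x_borders) - 1 and 0 <= row < len(y_borders) - 1:
        return row, col
    return None
```

Rows are counted from the bottom of the page here because the borders are in PDF coordinates. Grouping the text snippets by the (row, col) returned from cell_of then gives one record per table row.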

This is similar to how utilities like ByteScout PDF Multitool (which unfortunately runs only on Windows) operate.
