I am trying to extract data from a poorly structured PDF document (the URL is in the code below). To split its content into meaningful data records, I need the positions of the lines and borders of the table.
import scraperwiki, urllib2, re

url = "http://www.cmc.gv.ao/sites/main/pt/Lists/CMC%20%20PublicaesFicheiros/Attachments/89/Lista%20de%20Institui%C3%A7%C3%B5es%20Registadas%20(actualizado%2004.07.16).pdf"
u = urllib2.urlopen(url)  # download the PDF
xml = scraperwiki.pdftoxml(u.read())  # interpret the PDF as XML
Looking at the XML, nothing in it shows how the table lines divide the data. A typical line looks like this:
<text top="678" left="493" width="103" height="12" font="6">Besa Património </text>
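To make concrete what the XML does expose, here is a minimal sketch (Python 3, standard library only) that reads the `top`, `left`, `width` and `height` attributes of each `<text>` element and groups the elements that share a `top` value. The names `SAMPLE_XML`, `parse_text_boxes` and `group_by_row` are hypothetical, and only the first `<text>` line of the sample comes from the real document; the other two are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Trimmed, hypothetical fragment in the shape shown above.
# Only the "Besa Património" line is real; the rest is made up.
SAMPLE_XML = """<page number="1">
<text top="678" left="493" width="103" height="12" font="6">Besa Património </text>
<text top="678" left="120" width="40" height="12" font="6">Example A </text>
<text top="695" left="493" width="60" height="12" font="6">Example B </text>
</page>"""


def parse_text_boxes(xml_string):
    """Return one dict per <text> element with its position and content."""
    root = ET.fromstring(xml_string)
    boxes = []
    for el in root.iter("text"):
        boxes.append({
            "top": int(el.get("top")),
            "left": int(el.get("left")),
            "width": int(el.get("width")),
            "height": int(el.get("height")),
            # itertext() also collects text nested in child tags such as <b>
            "text": "".join(el.itertext()).strip(),
        })
    return boxes


def group_by_row(boxes):
    """Group boxes that share the same `top`, each row sorted left to right."""
    rows = defaultdict(list)
    for box in boxes:
        rows[box["top"]].append(box)
    return {top: sorted(items, key=lambda b: b["left"])
            for top, items in sorted(rows.items())}


boxes = parse_text_boxes(SAMPLE_XML)
rows = group_by_row(boxes)
```

Grouping on an exact `top` value is an assumption; text on the same visual row may differ by a pixel or two, in which case a small tolerance would be needed.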
The browser's element inspector shows slightly more detail in the HTML, but it still lacks the information I need: the positions of the table lines.
I have already spent a lot of time on this, so even experimental solutions would be very helpful. The main question is: how can I obtain the positions of these table lines?
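For reference, one experimental workaround that does not recover the real table lines at all is to estimate column boundaries from the `left` values of the text itself. The sketch below (Python 3, standard library only) clusters nearby `left` values and assigns each text box to a column. The function names, the `tolerance` of 5 units and the sample values are all assumptions for illustration, and the approach would likely fail for cells whose text is centred or right-aligned.

```python
import bisect


def cluster_positions(values, tolerance=5):
    """Merge sorted positions that lie within `tolerance` of the previous one.

    Returns the integer mean of each cluster, in ascending order.
    """
    clusters = []
    for v in sorted(values):
        if clusters and v - clusters[-1][-1] <= tolerance:
            clusters[-1].append(v)
        else:
            clusters.append([v])
    return [sum(c) // len(c) for c in clusters]


def column_index(left, column_starts, tolerance=5):
    """Return the index of the estimated column that a `left` value falls in."""
    # Adding the tolerance lets a box that starts slightly before the
    # cluster mean still land in that column.
    return max(bisect.bisect_right(column_starts, left + tolerance) - 1, 0)


# Hypothetical `left` values, as they might be read from <text> elements.
lefts = [493, 120, 495, 122, 300]
column_starts = cluster_positions(lefts)
columns = [column_index(x, column_starts) for x in lefts]
```

This only approximates the table layout from the text positions; it says nothing about where the drawn lines and borders actually are.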