Re: Reading & Extracting Data From PDF Files

30 Apr 2017


Oops. Hit Send too soon...
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf
The main answers are 7 years old and likely out of date, but the links
on the right lead to more interesting answers.
My resume was written in OpenOffice, iirc:
http://tedroche.com/homepage/~tedroche/TedRocheResume.html
I ran this through (Linux) pdftotext:
http://tedroche.com/homepage/~tedroche/TedRocheResume.txt
Not bad!
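For anyone who wants to script that step, here is a minimal sketch in Python, assuming the pdftotext command-line tool is installed and on the PATH. The function names and the default output path are made up for illustration; only the pdftotext command and its -layout option come from the tool itself.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdftotext_command(pdf_path, txt_path, keep_layout=True):
    """Return the argument list for a pdftotext run (does not execute it)."""
    cmd = ["pdftotext"]
    if keep_layout:
        # -layout asks pdftotext to keep the physical page layout,
        # which tends to help with columnar reports.
        cmd.append("-layout")
    cmd += [str(pdf_path), str(txt_path)]
    return cmd


def pdf_to_text(pdf_path, txt_path=None, keep_layout=True):
    """Run pdftotext on pdf_path and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH")
    pdf_path = Path(pdf_path)
    # Hypothetical default: write the .txt next to the source PDF.
    txt_path = Path(txt_path) if txt_path else pdf_path.with_suffix(".txt")
    subprocess.run(
        build_pdftotext_command(pdf_path, txt_path, keep_layout), check=True
    )
    return txt_path.read_text(encoding="utf-8", errors="replace")
```

Something like pdf_to_text("TedRocheResume.pdf") would then return the text as one string. How usable the result is still depends on how the PDF was generated.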
Columnar data out of Excel or similar into DBFs is a lot harder. I'll
be interested to hear what others might suggest.
On Sat, Apr 29, 2017 at 4:38 PM, Ted Roche tedroche@gmail.com wrote:
...
In my limited experience, PDF is trouble. PDF is essentially a printer
output file, in a PostScript-derived format, encapsulated as a document.
There's lots of info about fonts and geometry, and there are letters
placed in specific places, not necessarily in the order you think,
depending on the application that generated the print image, and the
printer drivers used. It doesn't help that there are lots of different
PDF standards ("I love standards, that's why I have so many!") and
extensions to do things like provide accessibility or to slim it down
for web and visual presentation (vs. high-resolution for print pre-press).
Recently, I was trying to copy one of my articles out of a PDF, and I
found that the two snaked columns that appeared to you and me meant
nothing to the PDF. Highlighting the text got line one of column one,
then line one of column two, all the way down. Pretty frustrating.
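A rough sketch of what a tool has to do to undo that interleaving, in Python. It assumes some extractor has already handed you text fragments as (x, y, text) tuples with y growing down the page, and that you know the x position where the right-hand column starts. The tuple layout, the function name and the split value are all hypothetical.

```python
def reorder_snaked_columns(fragments, column_split_x):
    """Return fragment texts in reading order for a two-column page.

    fragments: list of (x, y, text) tuples; y is assumed to grow
    downward (PDF's native coordinates grow upward, so flip if needed).
    column_split_x: x position separating the left and right columns.
    """
    left = [f for f in fragments if f[0] < column_split_x]
    right = [f for f in fragments if f[0] >= column_split_x]
    # Read the whole left column top to bottom, then the right column.
    ordered = sorted(left, key=lambda f: f[1]) + sorted(right, key=lambda f: f[1])
    return [text for _, _, text in ordered]


# Fragments in the interleaved order that highlighting produced.
page = [
    (50, 100, "col1 line1"),
    (300, 100, "col2 line1"),
    (50, 120, "col1 line2"),
    (300, 120, "col2 line2"),
]
reading_order = reorder_snaked_columns(page, column_split_x=200)
```

Real pages are messier (headers spanning both columns, uneven gutters), so treat this as an illustration of the problem rather than a general fix.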
There are some smart applications out there. Monarch was advertised years ago.
Here's an SO question, with some possibilities:
On Fri, Apr 28, 2017 at 11:19 AM, Paul H. Tarver paul@tpcqpc.com wrote:
...
Original Thread: Getting count of rows in a text file -- best approach?
A couple of times I've heard people mention reading in PDF files using
FileToStr and I want to know more about reading and extracting data from PDF
files. I do a lot of data conversion and interface work with lots of file
formats, but I've not been very successful at importing and extracting data
from PDF reports. Obviously a scanned image saved as a PDF would have to be
OCR'd first, but is there a reliable way to extract data from PDF reports
and if so, how? I'm sure I don't know all the ins and outs of the PDF
format, but when I try, I seem to get a strange mix of formatting details
and data combined in a random way.
Am I being thick here or is there really a way that I can get any PDF file
from any client and then successfully extract the data elements from that
format?
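One likely reason a raw read (FileToStr or equivalent) shows that mix: page content in a PDF usually sits in compressed streams between "stream" and "endstream" markers, and even once inflated it is a list of drawing operators (BT, Tj, ET and so on) rather than plain text. The Python sketch below only illustrates that. It assumes the streams use the common Flate (zlib) filter, skips anything else, and the function name is made up.

```python
import re
import zlib

# Capture the bytes between each "stream" / "endstream" pair.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes):
    """Return the zlib-inflated contents of every stream that inflates."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the line break left before
            # "endstream"; zlib.error means it was not Flate data.
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            continue
    return results


# Tiny hand-built stand-in for a PDF object, not a complete file.
sample = (
    b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(b"BT (Hello) Tj ET")
    + b"\nendstream\nendobj\n"
)
inflated = inflate_streams(sample)
```

Here inflated holds the operator text for showing "Hello". Real files add other filters, font encodings and object streams on top, which is why a dedicated extraction tool is the more practical route.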
I'm prepared to be thought of as stupid but be gentle! :)
Paul H. Tarver
Tarver Program Consultants, Inc.
Email: paul@tpcqpc.com
-----Original Message-----
From: Brant E. Layton [mailto:dcci@futureone.com]
Sent: Wednesday, April 26, 2017 3:17 PM
To: profoxtech@leafe.com
Subject: RE: Getting count of rows in a text file -- best approach?
My experience was moving PDF files in and out of SQL Server tables - found an
abrupt truncation at the 16,777,184 mark...
Brant Layton
480.964.1316
On 4/26/2017 12:57 PM, profoxtech-request@leafe.com wrote:
...
RE: Getting count of rows in a text file -- best approach?
