Original Thread: Getting count of rows in a text file -- best approach?
A couple of times I've heard people mention reading in PDF files using FileToStr and I want to know more about reading and extracting data from PDF files. I do a lot of data conversion and interface work with lots of file formats, but I've not been very successful at importing and extracting data from PDF reports. Obviously a scanned image saved as a PDF would have to be OCR'd first, but is there a reliable way to extract data from PDF reports and if so, how? I'm sure I don't know all the ins and outs of the PDF format, but when I try, I seem to get a strange mix of formatting details and data combined in a random way.
Am I being thick here or is there really a way that I can get any PDF file from any client and then successfully extract the data elements from that format?
I'm prepared to be thought of as stupid but be gentle! :)
Paul H. Tarver Tarver Program Consultants, Inc. Email: paul@tpcqpc.com
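For what it's worth, one likely reason a raw FileToStr read looks like a random mix of formatting and data: a PDF is a set of objects, and the page content usually lives in stream objects that are often Flate (zlib) compressed, with the visible text appearing as operands of text-showing operators such as Tj. Here is a minimal Python sketch of that idea. The sample bytes are a hand-built fragment, not a real file, and the helper names are hypothetical. Real PDFs also use TJ arrays, hex strings, custom font encodings and other filters, so treat this as an illustration rather than a working extractor.

```python
import re
import zlib


def inflate_streams(pdf_bytes):
    """Return the decoded bytes of every stream that inflates cleanly."""
    decoded = []
    # Stream data sits between the 'stream' and 'endstream' keywords.
    for match in re.finditer(rb"stream\n(.*?)\nendstream", pdf_bytes, re.DOTALL):
        try:
            decoded.append(zlib.decompress(match.group(1)))
        except zlib.error:
            # Not Flate-compressed, or it uses a filter this sketch ignores.
            continue
    return decoded


def shown_text(content):
    """Pull the string operands of simple '(...) Tj' operators."""
    found = re.findall(rb"\((.*?)\)\s*Tj", content)
    return [item.decode("latin-1") for item in found]


# Hypothetical page content: font selection, positioning, then text.
page = b"BT /F1 12 Tf 72 720 Td (Invoice 1001) Tj 0 -14 Td (Total 42.50) Tj ET"
sample = (
    b"4 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
    + zlib.compress(page)
    + b"\nendstream\nendobj\n"
)

for content in inflate_streams(sample):
    print(shown_text(content))
```

Even in this toy case the text is interleaved with font and position operators (Tf, Td), which matches the "formatting details and data combined" effect described above.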
-----Original Message----- From: Brant E. Layton [mailto:dcci@futureone.com] Sent: Wednesday, April 26, 2017 3:17 PM To: profoxtech@leafe.com Subject: RE: Getting count of rows in a text file -- best approach?
My experience was moving PDF files in and out of SQL Server tables - found an abrupt truncation at the 16,777,184 mark...
Brant Layton, 480.964.1316
On 4/26/2017 12:57 PM, profoxtech-request@leafe.com wrote:
RE: Getting count of rows in a text file -- best approach?
I too would be keen to know of a "reliable" way to get text etc. out of a PDF file. It stands to reason that since the thing appears formatted, it can't be that hard. To date my experience has been poor at best, so I'm with you, Paul. I have had some limited luck with a couple of the PDF-to-text offerings out there. Another that I found produced pretty good results is the utility that comes with Beyond Compare. I can't think of the name of it ATM, but it hides in one of the BC directories.
If there was a good reliable solution I'd be happy to part with a good sum for it as it would be of great use.
A reasonably priced tool to convert RTF would be handy as well.
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Paul H. Tarver Sent: Saturday, 29 April 2017 1:19 AM To: profoxtech@leafe.com Subject: Reading & Extracting Data From PDF Files
I use PDFPenPro on a Mac. This tool converts PDFs to Word (DOC/RTF), Excel, and PowerPoint formats. It's not perfect, but it works much better than re-keying content. There are also a bunch of cloud sites that will convert PDFs to various formats ... some for free and others on a subscription basis. I also think there are some Linux command-line tools that will extract text from PDF files. If you get stuck googling these concepts, ping back to the list and I will dig up my notes on these other options. I'm not on my computer at the moment or would do so now.
Darren,
I already feel better knowing I'm not the only one thinking about this and that it is not a stupid question! :)
Paul H. Tarver Tarver Program Consultants, Inc. Email: paul@tpcqpc.com
-----Original Message----- From: Darren [mailto:foxdev@ozemail.com.au] Sent: Friday, April 28, 2017 10:41 AM To: profoxtech@leafe.com Subject: RE: Reading & Extracting Data From PDF Files
I have used with success Balabolka Text Extract Utility.
http://www.cross-plus-a.com/btext.htm
Gianni
On Fri, 28 Apr 2017 10:19:20 -0500, "Paul H. Tarver" paul@tpcqpc.com wrote:
In my limited experience, PDF is trouble. PDF is essentially a printer output file, based on PostScript, encapsulated as a document. There's lots of info about fonts and geometry, and there are letters placed in specific places, not necessarily in the order you think, depending on the application that generated the print image and the printer drivers used. It doesn't help that there are lots of different PDF standards ("I love standards, that's why I have so many!") and extensions to do things like provide accessibility or to slim it down for web and visual presentation (vs. high-resolution for print pre-press).
Recently, I was trying to copy one of my articles out of a PDF, and I found that the two snaked columns that appeared to you and me meant nothing to the PDF. Highlighting the text got line one of column one, then line one of column two, all the way down. Pretty frustrating.
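To make the snaked-column problem concrete, here is a small Python sketch. It assumes you already have layout-preserved text lines (the kind a converter run in a "keep layout" mode might give you) and that you know the character position of the gutter between the columns. The function name, the sample lines and the gutter value are all hypothetical.

```python
def split_columns(lines, gutter):
    """Read the left column top to bottom, then the right column."""
    left = [line[:gutter].strip() for line in lines]
    right = [line[gutter:].strip() for line in lines]
    # Drop blanks so a short column does not leave empty entries.
    return [text for text in left + right if text]


# What a naive line-by-line copy sees: both columns on every line.
page = [
    "First column, line 1".ljust(25) + "Second column, line 1",
    "First column, line 2".ljust(25) + "Second column, line 2",
    "First column, line 3",
]

for text in split_columns(page, gutter=25):
    print(text)
```

The hard part in practice is finding the gutter, since it moves from page to page and from one generating application to the next.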
There are some smart applications out there. Monarch was advertised years ago
Here's an SO question, with some possibilities:
On Fri, Apr 28, 2017 at 11:19 AM, Paul H. Tarver paul@tpcqpc.com wrote:
It's all that font data and positional information that cause the grief. I have spent a fair bit of time on PDF's with varying degrees of success. Comes down to how they were created in the first place. I have always found that I can consistently retrieve data from PDFs created by any given source but the rules applied to that PDF do not transport well to other PDF's.
There are a lot of PDFTOTEXT type tools out there. Those that I have explored mostly do a "pretty good job" of extracting text but none is perfect and all require varying degrees of manipulation post conversion to extract the data.
None have proven consistent in extracting the text and retaining layout (insofar as layout can be retained between different fonts).
All that said, I still think it should be possible to take the positional data and font metrics and resolve them so that text is extracted and layout retained. Not a simple exercise, but achievable for an entity focused on the task (I would have thought).
In the end though, if you can avoid them, that is the best approach.
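As a rough illustration of resolving positional data back into a layout, here is a Python sketch. It assumes some other tool has already given you text fragments as (x, y, text) tuples in page units, with y growing upward as in PDF page space. The fragment data, the fixed character width and the line height are hypothetical stand-ins for real font metrics.

```python
from collections import defaultdict


def rebuild_lines(fragments, line_height=2.0, char_width=6.0):
    """Group fragments into lines by y, then place them by x."""
    rows = defaultdict(list)
    for x, y, text in fragments:
        # Snap y to a bucket so fragments on one baseline group together.
        rows[round(y / line_height)].append((x, text))

    lines = []
    # Larger y is higher on the page, so emit those rows first.
    for key in sorted(rows, reverse=True):
        line = ""
        for x, text in sorted(rows[key]):
            column = int(x / char_width)
            # Pad out to the target column; keep at least one space
            # between neighbours that would otherwise collide.
            pad = max(column - len(line), 1 if line else 0)
            line += " " * pad + text
        lines.append(line)
    return lines


# Fragments in the order a file might store them, not reading order.
fragments = [
    (300, 700, "42.50"),
    (72, 700, "Widget"),
    (72, 714, "Item"),
    (300, 714, "Price"),
]

for line in rebuild_lines(fragments):
    print(line)
```

A single fixed character width is the big simplification here. With proportional fonts you would need the real glyph widths, which is where the font metrics come in.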
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Ted Roche Sent: Sunday, 30 April 2017 6:39 AM To: profoxtech@leafe.com Subject: Re: Reading & Extracting Data From PDF Files
Darren,
I've generally been successful in avoiding PDFs and getting the same information in a more "standardized" format such as text, CSV, Excel, etc. However, I've always wondered if my preference was based on my unwillingness to devote the necessary time required to understand PDF formats, or just a case of taking the path of least resistance. After all, why spend 100 hours figuring out a PDF version of a report when you can spend 5 hours building an extract from a text version of the same report?
In the thread related to counting the number of lines in a file, there were at least two references made to getting the number of lines in a PDF file, which made me think perhaps someone had figured out how to work with a PDF in the same sort of way we would work with an XML file. Generally when I work with XML, I import all the lines, use the tags as triggers and flags, then strip the tags and markup, leaving all the good stuff behind in tables I created in the parsing process. It was my hope someone had magically figured out how to do something similar with PDFs, but clearly I'm not the only one avoiding PDFs whenever possible.
I think it is enough for me to know that it might be possible given enough time and enough money to extract data from a PDF file but I'll wait for the day when I have NO OTHER OPTION on a project to justify spending the time.
Thanks!
Paul H. Tarver Tarver Program Consultants, Inc.
-----Original Message----- From: Darren [mailto:foxdev@ozemail.com.au] Sent: Saturday, April 29, 2017 4:37 PM To: profoxtech@leafe.com Subject: RE: Reading & Extracting Data From PDF Files
There are libraries you can buy that open up the PDF environment, letting you identify a section of a document and pull the data out of it.
It's been years since I needed to do that, but it was all very straightforward. I was dealing with 25-100 page documents and needed to pull data in matrix, spreadsheet, and form layouts out of them for loading into a patient care data warehouse environment.
I also used it to pull billing data from medical invoice files when a B2B partner didn't have EDI.
On Mon, May 1, 2017 at 9:56 AM, Paul H. Tarver paul@tpcqpc.com wrote:
Oops. Hit Send too soon...
https://stackoverflow.com/questions/3650957/how-to-extract-text-from-a-pdf
Main answers are 7 years old, likely irrelevant, but the links on the right lead to more interesting answers.
My resume was written in OpenOffice, iirc: http://tedroche.com/homepage/~tedroche/TedRocheResume.html
I ran this through (Linux) pdftotext:
http://tedroche.com/homepage/~tedroche/TedRocheResume.txt
Not bad!
Columnar data out of Excel or similar into DBFs is a lot harder. I'll be interested to hear what others might suggest.
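For anyone who wants to script the pdftotext step, here is a minimal Python sketch. It assumes the pdftotext command-line tool is installed and on the PATH. The -layout option asks it to keep the physical page layout, which tends to help with columnar reports, and "-" as the output name sends the text to stdout. The function names and the file name are hypothetical.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path, keep_layout=True):
    """Build the pdftotext argument list, writing text to stdout ('-')."""
    cmd = ["pdftotext"]
    if keep_layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    return cmd + [pdf_path, "-"]


def extract_text(pdf_path, keep_layout=True):
    """Run pdftotext and return its output, or None if it is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        pdftotext_command(pdf_path, keep_layout),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


# Hypothetical file name, shown only to illustrate the command line.
print(" ".join(pdftotext_command("TedRocheResume.pdf")))
```

From VFP the same command line could be shelled out and the resulting text file read back in, subject to the usual string size limits.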
On Sat, Apr 29, 2017 at 4:38 PM, Ted Roche tedroche@gmail.com wrote:
I use the command line version of PDF2TXT, Version 3.2 from http://www.verypdf.com/app/pdf-to-txt-converter/index.html to extract text, then parse the text in VFP. Their tech support is very responsive (sometimes same day).
It does not work if the PDF generator used embedded fonts.
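The "extract text, then parse the text" step can be sketched as follows. This is Python rather than VFP, and the report layout, field names and regular expression are all hypothetical. The idea is simply to keep the lines that match the detail-row pattern and ignore headers, blank lines and totals.

```python
import re
from decimal import Decimal

# Detail rows: invoice number, customer name, then a right-hand amount.
DETAIL = re.compile(
    r"^(?P<invoice>\d+)\s+(?P<customer>.+?)\s{2,}(?P<amount>[\d,]+\.\d{2})\s*$"
)


def parse_report(text):
    """Return one dict per detail line found in the extracted text."""
    records = []
    for line in text.splitlines():
        match = DETAIL.match(line)
        if match is None:
            continue  # page header, blank line, total line, etc.
        records.append({
            "invoice": int(match.group("invoice")),
            "customer": match.group("customer"),
            "amount": Decimal(match.group("amount").replace(",", "")),
        })
    return records


# Hypothetical converter output for a one-page invoice register.
sample = """\
ACME SUPPLY          Invoice Register          Page 1

1001  Smith Hardware          1,250.00
1002  Jones & Sons               75.50
                     Total    1,325.50
"""

for record in parse_report(sample):
    print(record)
```

A pattern like this only holds for one report from one source, which fits what others in the thread have found: rules built for one PDF generator rarely carry over to another.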
Dennis Schuette
Customized Business Services, LLC
49 NW 130 Avenue, Great Bend, KS 67530
(928) 580-6352
Primary: dennis@cbsds.com
Alternate: Schuette.dennis@gmail.com
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Paul H. Tarver Sent: Friday, April 28, 2017 10:19 AM To: profoxtech@leafe.com Subject: Reading & Extracting Data From PDF Files