I work in a Windows network environment. We're interested in purchasing or building a system that does document validation on our documents across our servers. We need a system that goes through our servers and file systems, logging directories of documents, hashing each file, and storing that information. If a hash has changed from the previously stored value, it should trigger a document validation process. We want to make sure that each file is a valid instance of the format it claims to be. Examples: Is the file MyFile.PDF a valid PDF file? Is the file MyFile.xlsx a valid Excel file? Common file types would be:

- PDF
- TIFF
- DOC and DOCX
- XLS and XLSX
- JPG
- GIF
- PNG
- ... etc.

There are some PDF validation tools such as XPDF, PDFInfo, and ImageMagick, but is there something more general? Is there a system we could purchase that would do what I've stated above? I can build the system if need be, but purchasing is preferable. Failing that, a command-line validation routine would be very helpful. Thanks in advance for any advice.

Kevin
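The catalog-and-hash pass described above can be sketched quite compactly. A minimal illustration in Python with a SQLite catalog; the function names and schema are my own, not from any product:

```python
import hashlib
import os
import sqlite3

def file_hash(path, algo="sha256", chunk=1 << 20):
    """Hash a file in fixed-size chunks so large files don't exhaust memory."""
    h = hashlib.new(algo)
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk)
            if not block:
                break
            h.update(block)
    return h.hexdigest()

def scan(root, db_path):
    """Walk `root`, hash every file, and return the paths that are new or
    whose hash changed since the previous scan recorded in the catalog.
    Those are the files to feed to the validation step."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS catalog (path TEXT PRIMARY KEY, hash TEXT)")
    changed = []
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            digest = file_hash(path)
            row = con.execute("SELECT hash FROM catalog WHERE path = ?",
                              (path,)).fetchone()
            if row is None or row[0] != digest:
                changed.append(path)  # new or modified -> trigger validation here
                con.execute("INSERT OR REPLACE INTO catalog (path, hash) VALUES (?, ?)",
                            (path, digest))
    con.commit()
    con.close()
    return changed
```

A second run over an unchanged tree returns an empty list, so only modified documents ever reach the (expensive) validation routine.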
This message (including any attachments) is intended only for the use of the individual or entity to which it is addressed and may contain information that is non-public, proprietary, privileged, confidential, and exempt from disclosure under applicable law, or may constitute attorney work product. If you are not the intended recipient, you are hereby notified that any use, dissemination, distribution, or copying of this communication is strictly prohibited. If you have received this communication in error, notify us immediately by telephone and (i) destroy this message if a facsimile or (ii) delete this message immediately if this is an electronic communication.
Thank you.
We want to make sure that the file is a valid format of the file that it says that it is. Examples: Is the file MyFile.PDF a valid PDF file? Is the file MyFile.xlsx a valid Excel file?
I don't think such a global tool or methodology exists, probably for a good reason.
Taking even PDF or XLSX, how do you even establish whether the format is valid? If Acrobat or Excel can be automated programmatically to open a file, it's almost certainly valid, so you could do that check for each file type. That's probably the most you could hope to achieve. You could also check the file header for the byte signature of each file type, but does a database of those signatures exist for every type you might have to deal with?
Going further, would you really want to re-engineer all the code those applications use to open a file and populate the required internal memory structures, and figure out what internal checks they do before deciding "yep, that looks OK"?
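The header-signature check mentioned above is easy to sketch. The signatures below are well-known published magic numbers, but the table is deliberately tiny; building it out for every type is exactly the open question raised here:

```python
import os

# A handful of well-known file signatures ("magic numbers").
# This is NOT an exhaustive database -- extend it for the types you handle.
MAGIC = {
    ".pdf":  [b"%PDF-"],
    ".png":  [b"\x89PNG\r\n\x1a\n"],
    ".gif":  [b"GIF87a", b"GIF89a"],
    ".jpg":  [b"\xff\xd8\xff"],
    ".jpeg": [b"\xff\xd8\xff"],
    ".tif":  [b"II*\x00", b"MM\x00*"],   # little- and big-endian TIFF
    ".tiff": [b"II*\x00", b"MM\x00*"],
    ".doc":  [b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"],  # OLE compound file
    ".xls":  [b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1"],  # (same container)
    ".docx": [b"PK\x03\x04"],            # ZIP container
    ".xlsx": [b"PK\x03\x04"],            # (same container)
}

def extension_matches_signature(path):
    """True/False when the leading bytes match the extension's known
    signature(s); None when the extension isn't in the table."""
    ext = os.path.splitext(path)[1].lower()
    sigs = MAGIC.get(ext)
    if sigs is None:
        return None  # unknown extension: can't say either way
    with open(path, "rb") as f:
        head = f.read(16)
    return any(head.startswith(s) for s in sigs)
```

Note the limitation the thread comes back to later: a .docx and a .xlsx share the same ZIP signature, so a magic-byte check alone can't tell Word from Excel, let alone prove internal integrity.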
For file validity, how about scanning files with your AV? That should identify fake .pdf files as well as fake .docx files.
Now a file inventory for file servers is a lot more to deal with.
That's a great question!
Obviously, since the post's subject didn't include "[NF]" you've already found your solution -- FoxPro! *wink*
I've done some document management systems in VFP, and the recursion, cataloging, and checksums are easy, relatively speaking. But the validation is an interesting twist, and a much more difficult problem.
Triggering the checking is also an interesting feature. Doing a bulk rescan would be slow and intensive, though you could tune it to not consume excessive resources, at a cost of slower checking.
Windows File Systems have some advanced features in the newer servers that would let you hook into a file system event (adding a new file or saving over an old one) to trigger your validation routine. If WinFS had ever been released, (https://en.wikipedia.org/wiki/WinFS) that would have been perfect, but alas, it was another empty vaporware promise of "The Old Microsoft." However some of "Longhorn" did end up in DotNet, like:
https://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.changed...
A simpler solution might be a "Document Management System" but implementing one of these is a tough challenge in technology, politics, and technical support.
"Validity" is a bit nebulous. How are you defining that?
I mean, there are Word95 documents I can't open in Word2007, but can in LibreOffice. And is a Word document with a macro virus valid? How many versions and variations to support? How to handle password-encrypted or restricted files?
VFP would be a great tool for doing the validation, where you can use low-level file functions to read headers and calculate checksums, but complex structured documents, like MS's Compound OLE Documents and MS's ZIP-packaged XML DOCX documents, get a lot trickier. There's typically a "magic" signature at the beginning of most files that will tell you its type, but whether all the contents have integrity is a lot tougher to determine. I suspect each format would need to be reviewed to determine whether it has internal consistency checks that would tell you of corruption or truncation.
Sounds like an interesting project, though. Will be interested to hear if you find a suitable package, or DIY it.
The thought was to keep an MD5 of each file (or similar), and if that changes then trigger the actual validation. First run would be intense, but most files don't change much. Perhaps ever.
It's funny you mention LibreOffice, because a suggestion I received was to use the command-line tool 'soffice.exe', which is part of LibreOffice, to check the office documents. Basically, if soffice can turn a file into a PDF (to be deleted afterward), the file would be considered 'valid'.
Regarding images, ImageMagick's "identify" tool will produce the metadata from image files. It's also in the running to enter the test phase of this project. http://www.imagemagick.org/script/identify.php
In the case of a Word document with a macro virus, hopefully (fingers crossed!) the malware scan would find it as soon as it was saved. If we're using LibreOffice, we'd hopefully have the option to disable macros when (test) converting it to PDF.
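The soffice conversion test above is easy to wrap in a script. A sketch in Python, assuming soffice (soffice.exe on Windows) is on the PATH; the function names are illustrative. One wrinkle worth coding for: soffice can exit 0 without actually producing output, so the sketch also checks that the PDF appeared:

```python
import os
import subprocess
import tempfile

def soffice_command(path, outdir):
    """Build the headless LibreOffice conversion command line."""
    return ["soffice", "--headless", "--norestore",
            "--convert-to", "pdf", "--outdir", outdir, path]

def looks_like_valid_office_doc(path, timeout=120):
    """Heuristic validity test: the document counts as 'valid' if
    LibreOffice can convert it to PDF. The throwaway PDF is discarded
    along with the temporary directory."""
    with tempfile.TemporaryDirectory() as outdir:
        try:
            subprocess.run(soffice_command(path, outdir),
                           capture_output=True, timeout=timeout, check=True)
        except (OSError, subprocess.CalledProcessError, subprocess.TimeoutExpired):
            return False
        base = os.path.splitext(os.path.basename(path))[0]
        return os.path.exists(os.path.join(outdir, base + ".pdf"))
```

The timeout matters in a bulk scan: a single pathological file shouldn't hang the whole validation run.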
This definitely would be interesting. I hope I get the green light to work on it.
"Most useful complex projects begin their lives as useful simple projects."
-Kevin
On Wed, Aug 17, 2016 at 4:55 PM, Kevin J Cully kjcully@cherokeega.com wrote:
It's funny you mention LibreOffice,...
There are no coincidences ;)
In regards to images, the ImageMagick tool of "identify" would produce the meta data from image files. Also in the running to enter the test phase of this project. http://www.imagemagick.org/script/identify.php
Good thought.
In the case of a Word document with a macro virus, hopefully (fingers crossed!) the malware scan would find it as soon as it was saved. If we're using LibreOffice, we'd hopefully have the option to disable macros when (test) converting it to PDF.
"Most useful complex projects begin their lives as useful simple projects."
Yep, there are a couple of very useful modules you can build here, and plug in as needed.
Might be a silly question, but if you found a file which wasn't valid, what are you going to do?
Why wouldn't a file be valid?
Not a silly question.
Some background: We have to archive documents for various lengths of time; for a murder case, it's basically forever. Recently, a file from 2007 turned out not to be valid when we went searching for it. Not a problem in this case, because they still had the hard copy of the file, so they re-scanned it. I wasn't involved with the research on this file, but let's just imagine that the file existed but was filled with garbage characters. (Worst-case scenario.)
The way our offsite long-term backup system works is that it keeps versions of a file that have changed in the last 30 days. After that, it starts removing older versions until the most current file is 30 days old, and it keeps that one forever. BUT, what if that current file is corrupted and the older versions were valid? It'd be better to know earlier rather than later.
Why wouldn't a file be valid?

- Zero bytes long
- Malformed header
- Garbage-filled file
- Truncated file
- Other file format saved with the wrong extension
- ... more
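A cheap triage pass can catch several of these failure modes before any heavier per-format validator runs. A sketch in Python: the %PDF- header and %%EOF trailer are genuine PDF conventions, but the function itself is just illustrative:

```python
import os

def quick_triage(path):
    """Cheap first-pass checks: zero-byte files, and (for PDFs) a
    malformed header or a missing end-of-file trailer, which often
    indicates truncation. Returns a list of suspected problems."""
    size = os.path.getsize(path)
    if size == 0:
        return ["zero bytes long"]
    problems = []
    with open(path, "rb") as f:
        head = f.read(8)
        f.seek(max(0, size - 1024))   # trailer should be near the end
        tail = f.read()
    if path.lower().endswith(".pdf"):
        if not head.startswith(b"%PDF-"):
            problems.append("malformed header")
        if b"%%EOF" not in tail:
            problems.append("possibly truncated (no %%EOF trailer)")
    return problems
```

A garbage-filled file saved as .pdf fails both checks at once; a zero-byte file is caught before any format logic runs at all.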
Thanks, everyone, for the feedback!
You bring up two important aspects of this system you hadn't mentioned before: backups and evidence.
If you have evidence from a 2007 murder that was used in a 2009 trial but was somehow "changed" in 2012, it would really be good if you could account for that change. If you are responsible for producing "point-in-time" backups ("Where is the document shown to the jury in 2009?"), you'd likely want more of a version control system, where you could diff changes rather than store only "the last valid document."
The chances are that the 2012 changes are perfectly innocent -- saving a new print configuration, for example -- but storing diffs of these documents is inexpensive and efficient.
Regarding FileSystemWatcher in .Net, I don't know how good an idea it is to monitor a whole drive like that.
I would just do a foreach loop (from hell, that is), writing the data to a set of tables. On a second pass, you update only where necessary, because the date has changed.

Doing MD5 compares just seems harder on big files to me.
On Wed, Aug 17, 2016 at 5:08 PM, Alan Bourke alanpbourke@fastmail.fm wrote:
Regarding FileSystemWatcher in .Net, I don't know how good an idea it is to monitor a whole drive like that.
Agree. Just a f'rinstance.
If this is a fairly secure / stable environment, you could likely do a fast scan just checking directory info, then checksumming files with changed size/datetime.
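That two-pass idea might look something like this in Python (a sketch; the `seen` dict stands in for whatever table the real catalog lives in). The cheap stat pass skips the vast majority of files, and only files whose directory info moved get re-checksummed:

```python
import hashlib
import os

def fast_scan(root, seen):
    """One pass over `root`: compare size/mtime against the `seen`
    snapshot and re-checksum only files whose directory info changed.
    Updates `seen` in place; returns {path: new_digest} for changed files,
    i.e. the candidates for re-validation."""
    changed = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            stamp = (st.st_size, st.st_mtime_ns)
            if seen.get(path) != stamp:
                with open(path, "rb") as f:
                    digest = hashlib.sha256(f.read()).hexdigest()
                seen[path] = stamp
                changed[path] = digest
    return changed
```

One caveat in a "fairly secure / stable" environment: size/mtime can be forged or preserved by tooling, so silent corruption that leaves both untouched would slip past the fast pass. A periodic full checksum sweep closes that gap.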