The thought was to keep an MD5 hash (or similar) of each file, and if that changes, trigger the actual validation. The first run would be intense, but most files don't change much. Perhaps ever.
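Roughly what I'm picturing, sketched in Python just to show the shape of it (the catalog file name and the folder tree are made-up placeholders, not anything decided yet):

    import hashlib
    import json
    import os

    CATALOG = "catalog.json"          # hypothetical store for the hashes
    ROOT = r"\\server\share\docs"     # hypothetical tree to scan

    def md5_of(path, chunk=1024 * 1024):
        # Hash in chunks so large files don't have to fit in memory.
        h = hashlib.md5()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                h.update(block)
        return h.hexdigest()

    def changed_files():
        # Return the files whose MD5 differs from the catalog (or are new);
        # these are the only ones that need the expensive validation pass.
        try:
            with open(CATALOG) as f:
                catalog = json.load(f)
        except FileNotFoundError:
            catalog = {}              # first run: everything counts as changed
        changed = []
        for folder, _dirs, files in os.walk(ROOT):
            for name in files:
                path = os.path.join(folder, name)
                digest = md5_of(path)
                if catalog.get(path) != digest:
                    catalog[path] = digest
                    changed.append(path)
        with open(CATALOG, "w") as f:
            json.dump(catalog, f, indent=2)
        return changed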
It's funny you mention LibreOffice, because a suggestion I received was to use the command-line tool 'soffice.exe', which is part of LibreOffice, to check the office documents. Basically, if soffice can turn it into a PDF (to be deleted afterward), then the file would be considered a 'valid' file.
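Something along these lines is what was suggested, sketched in Python (the soffice path and the timeout are my assumptions; adjust to wherever LibreOffice lives on the server):

    import os
    import subprocess
    import tempfile

    def converts_to_pdf(path,
                        soffice=r"C:\Program Files\LibreOffice\program\soffice.exe"):
        # Treat the document as 'valid' if LibreOffice can render it to a PDF.
        # The PDF lands in a temp folder and is thrown away afterward.
        with tempfile.TemporaryDirectory() as outdir:
            try:
                subprocess.run(
                    [soffice, "--headless", "--convert-to", "pdf",
                     "--outdir", outdir, path],
                    capture_output=True, timeout=120)
            except subprocess.TimeoutExpired:
                return False
            # soffice doesn't always signal failure through its exit code,
            # so check whether a PDF actually appeared.
            return any(f.lower().endswith(".pdf") for f in os.listdir(outdir))

From what I've seen, soffice can balk if another LibreOffice instance is already open under the same user profile, so that's something to test before trusting the result.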
As for images, ImageMagick's "identify" tool will produce the metadata from image files, so it's also in the running to enter the test phase of this project. http://www.imagemagick.org/script/identify.php
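In Python terms, the check could be as thin as shelling out to identify and trusting its exit code (newer ImageMagick installs spell it "magick identify"):

    import subprocess

    def image_looks_ok(path):
        # identify exits nonzero when it can't parse the file, which is
        # good enough for a first-pass validity check.
        result = subprocess.run(["identify", path],
                                capture_output=True, text=True, timeout=60)
        return result.returncode == 0

    def image_metadata(path):
        # The verbose dump is handy if the metadata should go into the catalog.
        result = subprocess.run(["identify", "-verbose", path],
                                capture_output=True, text=True, timeout=60)
        return result.stdout if result.returncode == 0 else None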
In the case of a Word document with a macro virus, hopefully (fingers crossed!) the malware scan would find it as soon as it was saved. If we're using LibreOffice, we'd hopefully have the option to disable macros when (test) converting it to PDF.
This definitely would be interesting. I hope I get the green light to work on it.
"Most useful complex projects begin their lives as useful simple projects."
-Kevin
-----Original Message-----
From: ProFox [mailto:profox-bounces@leafe.com] On Behalf Of Ted Roche
Sent: Wednesday, August 17, 2016 2:29 PM
To: profox@leafe.com
Subject: Re: Common File Document Validation
That's a great question!
Obviously, since the post's subject didn't include "[NF]" you've already found your solution -- FoxPro! *wink*
I've done some document management systems in VFP, and the recursion, cataloging, and checksums are easy, relatively speaking. But the validation is an interesting twist, and a much more difficult problem.
Triggering the check is also an interesting feature. Doing a bulk rescan would be slow and intensive, though you could tune it not to consume excessive resources, at the cost of slower checking.
Windows file systems have some advanced features in the newer servers that let you hook into a file system event (adding a new file or saving over an old one) to trigger your validation routine. If WinFS had ever been released (https://en.wikipedia.org/wiki/WinFS), that would have been perfect, but alas, it was another empty vaporware promise of "The Old Microsoft." However, some of "Longhorn" did end up in .NET, like:
https://msdn.microsoft.com/en-us/library/system.io.filesystemwatcher.changed...
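If you end up doing the watcher outside .NET, the third-party Python "watchdog" package exposes roughly the same events; a rough sketch of the idea, assuming watchdog is installed and with a made-up share path:

    import time
    from watchdog.events import FileSystemEventHandler
    from watchdog.observers import Observer

    class ValidateOnChange(FileSystemEventHandler):
        # Fires when files appear or get saved over; queue them for validation.
        def on_created(self, event):
            if not event.is_directory:
                print("new file, queue for validation:", event.src_path)

        def on_modified(self, event):
            if not event.is_directory:
                print("file changed, queue for validation:", event.src_path)

    observer = Observer()
    observer.schedule(ValidateOnChange(), r"\\server\share\docs", recursive=True)
    observer.start()
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()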
A simpler solution might be a "Document Management System" but implementing one of these is a tough challenge in technology, politics, and technical support.
"Validity" is a bit nebulous. How are you defining that?
I mean, there are Word95 documents I can't open in Word2007, but can in LibreOffice. And is a Word document with a macro virus valid? How many versions and variations to support? How to handle password-encrypted or restricted files?
VFP would be a great tool for doing the validation, where you can use low-level file functions to read headers and calculate checksums, but complex structured documents, like MS's Compound OLE Documents and the ZIP-encoded XML of DocX, get a lot trickier. There's typically a "magic" signature at the beginning of most files that will tell you its type, but whether all the contents have integrity is a lot tougher to determine. I suspect each format would need to be reviewed to determine whether there are internal consistency checks that would tell you of corruption or truncation.
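For instance, sniffing the claimed type from the first few bytes is easy in any language; here's a Python sketch with a handful of well-known signatures (and it tells you nothing about whether the rest of the file is intact):

    # A few well-known "magic number" signatures (first bytes of the file).
    SIGNATURES = {
        b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1": "OLE compound document (.doc/.xls/.ppt)",
        b"PK\x03\x04": "ZIP container (.docx/.xlsx/.pptx, plain .zip)",
        b"%PDF": "PDF",
        b"\xff\xd8\xff": "JPEG",
        b"\x89PNG\r\n\x1a\n": "PNG",
        b"GIF87a": "GIF",
        b"GIF89a": "GIF",
    }

    def sniff_type(path):
        # Read the first few bytes and match against known signatures.
        # A match only says what the file claims to be, not that it's whole.
        with open(path, "rb") as f:
            header = f.read(16)
        for magic, label in SIGNATURES.items():
            if header.startswith(magic):
                return label
        return "unknown"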
Sounds like an interesting project, though. Will be interested to hear if you find a suitable package, or DIY it.
-- Ted Roche Ted Roche & Associates, LLC http://www.tedroche.com