Ok, this may be a dumb question, but is there a reliable and easy way to detect and determine the file encoding on simple text files?
I have a client sending me files with UTF-16 Little Endian encoding. I have some code in place that tries to determine whether a file is Unicode based on the first two or four bytes once the file is loaded into memory, and then converts it using STRCONV. It works, but I'm concerned it's a bit of a hack and that maybe there is a better way.
Any thoughts?
Paul
AFAIK there is no way to determine the exact encoding of a plain text file with certainty. You can use a "best effort" algorithm to try to identify it, but even Notepad++ sometimes fails to show the correct encoding.
That's why XML, HTML and some other markup languages use declarations like [encoding="utf-8"] or [charset="utf-8"]: the encoding must be stated explicitly so the contents aren't misinterpreted.
In the same way, when delivering text files to someone, the encoding should be explicitly defined and agreed between the parties so the contents aren't misread.
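A "best effort" check of that kind can be sketched in Python (purely illustrative — the candidate list and helper name are made up, not an existing library API): try a short list of candidate encodings and report the first one that decodes cleanly.

```python
# Best-effort encoding guess: try candidate encodings in order and
# return the first one that decodes without error. Note that
# "latin-1" accepts any byte sequence at all, so it acts as a
# last-resort fallback rather than a real detection.
CANDIDATES = ["utf-8-sig", "utf-16", "latin-1"]

def guess_encoding(data: bytes) -> str:
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"
```

This shows exactly why it's only "best effort": many byte sequences decode cleanly under several encodings, so the order of the candidates ends up deciding the answer.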
UTF-16 is a little strange to me and I've never dealt with it; isn't it used for double-byte characters, like Chinese?
One idea that comes to mind: you can ask for a header indicating the encoding (like XML does), or ask for a predefined string (always the same, like "Test header - áàä", with some special characters) that you can compare against your own copy. If the source string doesn't match your string encoded as UTF-16, you can assume it's UTF-8, or re-check by comparing against the same string encoded as UTF-8.
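That predefined-string idea could look something like this (a Python sketch, assuming the sender always writes an agreed marker — the marker text and function name here are made up for illustration):

```python
# Agreed-upon text that the sender always puts first (hypothetical value).
MARKER = "Test header - áàä"

def detect_by_marker(data: bytes) -> str:
    # The client's UTF-16 LE files start with the BOM bytes FF FE,
    # so prepend it before comparing; UTF-8 is compared as-is.
    utf16_prefix = b"\xff\xfe" + MARKER.encode("utf-16-le")
    if data.startswith(utf16_prefix):
        return "utf-16-le"
    if data.startswith(MARKER.encode("utf-8")):
        return "utf-8"
    return "unknown"
```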
Regards.-
2018-08-01 20:00 GMT+02:00 Paul H. Tarver paul@tpcqpc.com:
This link has some ideas, although they're Python- and/or Linux-leaning solutions: https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding
Malcolm
Currently I'm checking the first two bytes; if they are 255 and 254 respectively, I run STRCONV(textdata, 6) on the file contents and resave the result to a temp file. That seems to do the trick, and I can then import the tab-delimited data from the temp file with no problem.
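For comparison outside VFP, that same check — look for the UTF-16 LE byte order mark, which is exactly the bytes 255 and 254 (0xFF 0xFE) — might look like this in Python (a rough sketch of the idea, not a drop-in replacement for the STRCONV() step; the function names are made up):

```python
import codecs

def is_utf16_le(path: str) -> bool:
    # codecs.BOM_UTF16_LE is b"\xff\xfe", i.e. bytes 255 and 254.
    with open(path, "rb") as f:
        return f.read(2) == codecs.BOM_UTF16_LE

def resave_as_ansi(src_path: str, out_path: str) -> None:
    # Rough analogue of running STRCONV(textdata, 6) and resaving:
    # decode the UTF-16 text (the BOM tells the decoder the byte
    # order) and rewrite it as 8-bit Windows-1252 text.
    with open(src_path, "rb") as f:
        text = f.read().decode("utf-16")
    with open(out_path, "w", encoding="cp1252", errors="replace") as f:
        f.write(text)
```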
I've only run into this a few times before and my method has worked pretty well so far, but I thought I would run it by the group.
Thanks!
Paul
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Fernando D. Bozzo Sent: Wednesday, August 01, 2018 1:32 PM To: profoxtech@leafe.com Subject: Re: Determining Text File Encoding
On Wed, 1 Aug 2018, at 7:00 PM, Paul H. Tarver wrote:
Ok, this may be a dumb question, but is there a reliable and easy way to detect and determine the file encoding on simple text files?
I use the code at the link below, which seems to work OK.
Thanks Alan! I'll give this a try.
BTW, I like to note the original source in my comments, so do I get to credit you for this code?
Paul
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Alan Bourke Sent: Thursday, August 02, 2018 4:25 AM To: profoxtech@leafe.com Subject: Re: Determining Text File Encoding
To be perfectly honest I can't remember. I may have taken it from somewhere and formatted it to the way I do things.