Ok, this may be a dumb question, but is there a reliable and easy way to detect and determine the file encoding on simple text files?
I have a client sending me files with UTF-16 Little Endian encoding. I have some code in place that tries to determine whether a file is Unicode based on the first two or four bytes once the file is loaded into memory, and then converts it using STRCONV. It works, but I'm concerned it's a bit of a hack and that maybe there is a better way.
Any thoughts?
Paul
AFAIK there is no way to determine the exact encoding of a plain text file with certainty. You can use a "best effort" algorithm to try to identify it, but even Notepad++ sometimes fails to show the correct encoding.
That's why XML, HTML and some other markup languages use declarations like [encoding="utf-8"] or [charset="utf-8"]: the encoding must be stated explicitly so the contents aren't misinterpreted.
In the same way, when delivering text files to someone, the encoding should be explicitly defined and agreed between the parties so the contents aren't misread.
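A "best effort" check of that kind can be sketched in Python (purely illustrative — the candidate list and helper name are made up, not an existing library API): try a short list of candidate encodings and report the first one that decodes cleanly.

```python
# Best-effort encoding guess: try candidate encodings in order and
# return the first one that decodes without error. Note that
# "latin-1" accepts any byte sequence at all, so it acts as a
# last-resort fallback rather than a real detection.
CANDIDATES = ["utf-8-sig", "utf-16", "latin-1"]

def guess_encoding(data: bytes) -> str:
    for enc in CANDIDATES:
        try:
            data.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return "unknown"
```

This shows exactly why it's only "best effort": many byte sequences decode cleanly under several encodings, so the order of the candidates ends up deciding the answer.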
UTF-16 is a little strange to me and I've never dealt with it; isn't it used for double-byte characters, like Chinese?
One idea that comes to mind: you can ask for a header indicating the encoding (like XML does), or ask for a predefined string (always the same, like "Test header - áàä", with some special characters) that you can compare against your own copy. If the source string doesn't match your string encoded as UTF-16, you can assume it's UTF-8, or re-check by comparing against the same string encoded as UTF-8.
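That predefined-string idea could look something like this (a Python sketch, assuming the sender always writes an agreed marker — the marker text and function name here are made up for illustration):

```python
# Agreed-upon text that the sender always puts first (hypothetical value).
MARKER = "Test header - áàä"

def detect_by_marker(data: bytes) -> str:
    # The client's UTF-16 LE files start with the BOM bytes FF FE,
    # so prepend it before comparing; UTF-8 is compared as-is.
    utf16_prefix = b"\xff\xfe" + MARKER.encode("utf-16-le")
    if data.startswith(utf16_prefix):
        return "utf-16-le"
    if data.startswith(MARKER.encode("utf-8")):
        return "utf-8"
    return "unknown"
```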
Regards.-
2018-08-01 20:00 GMT+02:00 Paul H. Tarver paul@tpcqpc.com:
This link has some ideas, although they're Python- and/or Linux-leaning solutions: https://superuser.com/questions/301552/how-to-auto-detect-text-file-encoding
Malcolm
Currently I'm checking the first two bytes; if they are 255 and 254 respectively, I run STRCONV(textdata, 6) on the file contents and resave the result to a temp file. That seems to do the trick, and I can then import the tab-delimited data from the temp file with no problem.
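For comparison outside VFP, that same check — look for the UTF-16 LE byte order mark, which is exactly the bytes 255 and 254 (0xFF 0xFE) — might look like this in Python (a rough sketch of the idea, not a drop-in replacement for the STRCONV() step; the function names are made up):

```python
import codecs

def is_utf16_le(path: str) -> bool:
    # codecs.BOM_UTF16_LE is b"\xff\xfe", i.e. bytes 255 and 254.
    with open(path, "rb") as f:
        return f.read(2) == codecs.BOM_UTF16_LE

def resave_as_ansi(src_path: str, out_path: str) -> None:
    # Rough analogue of running STRCONV(textdata, 6) and resaving:
    # decode the UTF-16 text (the BOM tells the decoder the byte
    # order) and rewrite it as 8-bit Windows-1252 text.
    with open(src_path, "rb") as f:
        text = f.read().decode("utf-16")
    with open(out_path, "w", encoding="cp1252", errors="replace") as f:
        f.write(text)
```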
I've only run into this a few times before and my method has worked pretty well so far, but I thought I would run it by the group.
Thanks!
Paul
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Fernando D. Bozzo Sent: Wednesday, August 01, 2018 1:32 PM To: profoxtech@leafe.com Subject: Re: Determining Text File Encoding
On Wed, 1 Aug 2018, at 7:00 PM, Paul H. Tarver wrote:
Ok, this may be a dumb question, but is there a reliable and easy way to detect and determine the file encoding on simple text files?
I use the code at the link below, which seems to work OK.
Thanks Alan! I'll give this a try.
BTW, I like to note the original source in my comments, so do I get to credit you for this code?
Paul
-----Original Message----- From: ProfoxTech [mailto:profoxtech-bounces@leafe.com] On Behalf Of Alan Bourke Sent: Thursday, August 02, 2018 4:25 AM To: profoxtech@leafe.com Subject: Re: Determining Text File Encoding
To be perfectly honest I can't remember. I may have taken it from somewhere and formatted it to the way I do things.