IonicWind Software

IWBasic => Console Corner => Topic started by: billhsln on August 20, 2020, 09:59:52 AM

Title: Reading a 65001 UTF-8 File
Post by: billhsln on August 20, 2020, 09:59:52 AM
I have been trying to read a Unicode UTF-8 (65001) file.  What comes out is blank with a length of 0.

I am just trying to figure out what I need to do to read the information.  I have the input field defined as IWSTRING linein[100].

Just don't understand what the problem is.

I have uploaded the Test program and the data file I am trying to read.

Thanks,
Bill

Title: Re: Reading a 65001 UTF-8 File
Post by: jalih on August 20, 2020, 10:40:34 AM
Windows support for UTF-8 is poor, and I guess you need to use functions like "MultiByteToWideChar" and "WideCharToMultiByte" to work with UTF-8. Also note that each conversion requires two calls to the routine: the first to get the required buffer size, and the second to do the actual conversion.
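As a rough illustration of the two-call pattern mentioned above, here is a minimal Python sketch. It does not call the Win32 API; it only mimics what the sizing call and the converting call would each produce for a UTF-8 record. The function names and the sample record are hypothetical, and the count here excludes any terminating null.

```python
def required_wide_chars(utf8_bytes: bytes) -> int:
    """Mimic the sizing call: how many UTF-16 code units the result needs."""
    text = utf8_bytes.decode("utf-8")
    return len(text.encode("utf-16-le")) // 2


def convert_to_wide(utf8_bytes: bytes) -> bytes:
    """Mimic the converting call: return the UTF-16LE buffer contents."""
    return utf8_bytes.decode("utf-8").encode("utf-16-le")


# Hypothetical record: "Tromsø" stored as UTF-8 (the ø takes two bytes).
record = b"Troms\xc3\xb8"
size = required_wide_chars(record)   # first call: ask for the size
wide = convert_to_wide(record)       # second call: do the conversion
print(len(record), size, len(wide))  # 7 6 12
```

The point is that the byte count of the UTF-8 input (7) differs from the character count of the result (6), which is why the size has to be queried before the output buffer is allocated.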
Title: Re: Reading a 65001 UTF-8 File
Post by: Egil on August 21, 2020, 07:32:54 AM
Hi Bill,

When I used GETSTARTPATH instead of the full path for the Cities file, I got the result in the pic below.


Egil
Title: Re: Reading a 65001 UTF-8 File
Post by: Brian on August 21, 2020, 11:40:43 AM
Bill, try this code

Brian
Title: Re: Reading a 65001 UTF-8 File
Post by: billhsln on August 21, 2020, 12:36:45 PM
Egil, I tried GETSTARTPATH, but it still came up with 0's.

Brian, your code worked.  Funny that I don't even need IWSTRING; STRING actually showed the right values (except for the first record, which has the BOM on it).  I will need to add logic to remove the BOM from the very first record (it is the first 3 bytes: EF BB BF).

Thanks,
Bill
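The BOM removal described above could look something like this. It is a minimal Python sketch, not IWBasic and not taken from the attached program; the function name and the sample record are hypothetical.

```python
import codecs


def strip_utf8_bom(raw: bytes) -> bytes:
    """Return raw without a leading UTF-8 byte order mark, if one is present."""
    if raw.startswith(codecs.BOM_UTF8):  # the three bytes EF BB BF
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Hypothetical first record of the file, as raw bytes read from disk.
first_record = b"\xef\xbb\xbfOslo"
print(strip_utf8_bom(first_record))  # b'Oslo'
```

Only the first record of the file needs this check; later records pass through unchanged.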
Title: Re: Reading a 65001 UTF-8 File
Post by: jalih on August 22, 2020, 06:57:16 AM
Quote from: billhsln on August 21, 2020, 12:36:45 PM
Brian, your code worked.  Funny that I don't even need to do the IWSTRING, STRING actually showed the right values (except for the first, it has the BOM on it).  Will need to add logic for the very first record to remove the BOM (it is the first 3 characters).

It works only for characters in the basic one-byte ASCII range. Remember that a UTF-8 character can take from 1 to 4 bytes.
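To make the 1-to-4-byte point concrete, here is a small Python sketch. The sample characters are arbitrary, and treating each byte as a Windows-1252 character is an assumption about what a plain one-byte-per-character string would show.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def read_as_single_bytes(text: str) -> str:
    """Show what a byte-per-character string sees (assumes code page 1252)."""
    return text.encode("utf-8").decode("cp1252")


# One sample each for the 1-, 2-, 3- and 4-byte cases.
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):04X} takes {utf8_length(ch)} byte(s)")

print(read_as_single_bytes("Oslo"))    # ASCII survives: Oslo
print(read_as_single_bytes("\u00e9"))  # e-acute shows up as two characters
```

Plain ASCII text comes through untouched, which is why the test file appeared to read correctly, while anything outside that range turns into two or more stray characters.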
Title: Re: Reading a 65001 UTF-8 File
Post by: Brian on August 22, 2020, 07:08:52 AM
Jalih,

You are right - I just had another look. Notepad will open the file as UTF-8, but if you then save it as ANSI, the special characters are lost. Have to have another think about it!

Brian
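The loss Brian describes can be reproduced in a few lines of Python. This is only a sketch: it assumes "ANSI" means code page 1252 and substitutes "?" for anything that does not fit, whereas Notepad's exact substitution may differ. The function name and the sample city are hypothetical.

```python
def save_as_ansi(text: str, codepage: str = "cp1252") -> bytes:
    """Encode text to a single-byte code page, replacing what does not fit."""
    return text.encode(codepage, errors="replace")


city = "\u0141\u00f3d\u017a"           # a name with two letters outside cp1252
ansi_bytes = save_as_ansi(city)
print(ansi_bytes)                      # b'?\xf3d?'
print(ansi_bytes.decode("cp1252") == city)  # False
```

Once the characters have been replaced, they cannot be recovered from the ANSI copy, so the conversion has to start from the original UTF-8 file.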