Reading a 65001 UTF-8 File

Started by billhsln, August 20, 2020, 09:59:52 AM


billhsln

I have been trying to read a Unicode UTF-8 (65001) file.  What comes out is blank with a length of 0.

I am just trying to figure out what I need to do to read the information.  The input field is defined as IWSTRING linein[100].

Just don't understand what the problem is.

I have uploaded the Test program and the data file I am trying to read.

Thanks,
Bill

When all else fails, get a bigger hammer.

jalih

Windows support for UTF-8 is poor, and I guess you need to use functions like "MultiByteToWideChar" and "WideCharToMultiByte" to work with UTF-8. Also note that each conversion requires two calls to the routine: the first to get the required buffer size, the second to do the actual conversion.
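
To make the two-call pattern concrete, here is a rough sketch in Python (not IWBasic) that calls MultiByteToWideChar through ctypes. The helper name utf8_to_text is made up for illustration, and the fallback branch for non-Windows systems is my own addition so the snippet runs anywhere; treat it as one possible shape, not a definitive implementation.

```python
import ctypes
import sys

CP_UTF8 = 65001  # Windows code page number for UTF-8


def utf8_to_text(data: bytes) -> str:
    """Convert UTF-8 bytes to text (hypothetical helper, for illustration)."""
    if not data:
        # MultiByteToWideChar rejects a zero-length input, so handle it here.
        return ""
    if sys.platform != "win32":
        # Assumption: outside Windows, fall back to Python's own decoder.
        return data.decode("utf-8")

    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)

    # Call 1: no output buffer, so the return value is the number of
    # wide characters the converted text will need.
    needed = kernel32.MultiByteToWideChar(CP_UTF8, 0, data, len(data), None, 0)
    if needed == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    # Call 2: same input, this time with a buffer of the reported size.
    buf = ctypes.create_unicode_buffer(needed)
    written = kernel32.MultiByteToWideChar(CP_UTF8, 0, data, len(data), buf, needed)
    if written == 0:
        raise ctypes.WinError(ctypes.get_last_error())

    return buf[:written]
```

Going the other way with WideCharToMultiByte follows the same size-then-convert sequence.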

Egil

Hi Bill,

When using GETSTARTPATH instead of the full path for the Cities file, I got the result in the pic below.


Egil
Support Amateur Radio  -  Have a ham  for dinner!

Brian

Bill, try this code

Brian

billhsln

August 21, 2020, 12:36:45 PM #4 Last Edit: August 21, 2020, 12:38:23 PM by billhsln
Egil, tried doing the GETSTARTPATH, still came up with 0's.

Brian, your code worked.  Funny that I don't even need IWSTRING; STRING actually showed the right values (except for the first record, which has the BOM on it).  I will need to add logic to remove the BOM from the very first record (it is the first 3 characters, the bytes EF BB BF).
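
As a rough illustration of that BOM check, here is a small sketch in Python rather than IWBasic. The helper name strip_bom and the sample city names are made up for the example; the only fixed fact is that the UTF-8 BOM is the three bytes EF BB BF.

```python
import codecs


def strip_bom(record: bytes) -> bytes:
    """Return the record without a leading UTF-8 BOM, if it has one."""
    # codecs.BOM_UTF8 is the byte sequence EF BB BF.
    if record.startswith(codecs.BOM_UTF8):
        return record[len(codecs.BOM_UTF8):]
    return record


# Hypothetical sample records: only the first line of a file carries the BOM.
first_record = codecs.BOM_UTF8 + b"Houston"
later_record = b"Dallas"

print(strip_bom(first_record))
print(strip_bom(later_record))
```

The check only needs to run on the first record, since a BOM can appear only at the very start of the file.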

Thanks,
Bill

jalih

Quote from: billhsln on August 21, 2020, 12:36:45 PM
Brian, your code worked.  Funny that I don't even need to do the IWSTRING, STRING actually showed the right values (except for the first, it has the BOM on it).  Will need to add logic for the very first record to remove the BOM (it is the first 3 characters).

It works only for the basic one-byte ASCII range of characters. Remember that a UTF-8 character can take from 1 to 4 bytes.
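
A quick way to see this, sketched in Python rather than IWBasic: the sample characters below are my own choice, and code page 1252 is only an assumption about what an "ANSI" view means on a typical Western Windows setup.

```python
# One sample character for each UTF-8 length (chosen for illustration).
samples = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the 16-bit range
]

# Number of bytes each character occupies once encoded as UTF-8.
byte_lengths = [len(ch.encode("utf-8")) for ch in samples]

for ch, n in zip(samples, byte_lengths):
    hex_bytes = " ".join(f"{b:02X}" for b in ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} takes {n} byte(s): {hex_bytes}")

# Reading the two bytes of U+00E9 one byte per character (assumed here to be
# code page 1252) yields two unrelated characters instead of one.
garbled = "\u00e9".encode("utf-8").decode("cp1252")
print(ascii(garbled))
```

Only the first sample survives a byte-per-character read unchanged, which is why a file containing nothing but ASCII looks correct in a plain STRING.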

Brian

Jalih,

You are right - I just had another look. Notepad will open the file as UTF-8, but if you then save it as ANSI, the special characters are lost. Have to have another think about it!

Brian