IStrings and Array Memory Management

Started by Jim Scott, January 03, 2007, 06:35:24 PM

Jim Scott

The question is this.  In EBasic, do I need to check the length of strings before assigning them to a variable in order not to "step" on the next variable in memory?

The following code:

TYPE phonerecord
   DEF Name[12]:ISTRING
   DEF Age:INT
   DEF Phone[17]:ISTRING
ENDTYPE

DEF Rec[10]:phonerecord

Rec[5].Name = "Joe Smith"
Rec[5].Age = 35
Rec[5].Phone = "555-555-1212 for    this makes the phone number too long"
Rec[6].Name = "Hanoi Hannah"
Print Rec[5].Name
Print Rec[5].Age
Print Rec[5].Phone
Locate(1,40): Print "Length of Rec[5].Name = ", Len(Rec[5].Name)
Locate(2,40): Print "Length of Rec[5].Age = ", Len(Rec[5].Age)
Locate(3,40): Print "Length of Rec[5].Phone = ", Len(Rec[5].Phone)
print:print


produces this output:
Joe Smith                              Length of Rec[5].Name = 9
35                                     Length of Rec[5].Age = 4
555-555-1212 for    Hanoi Hannah       Length of Rec[5].Phone = 32


I expected Rec[5].Phone to be clipped at length = 17.
Jim Scott

Ionic Wind Support Team

Yes, you need to check the lengths of assignments.  Emergence BASIC is a true compiler, meaning it translates your source into assembly language, and then into machine code.  The compiler can't know what the contents of your variables will be at runtime.  Consider this statement:

str = str2 + str3

The compiler has no idea what the contents of str2 and str3 will be when you run the program.  It simply creates the assembly code needed to call the string copying function.
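
What such an unchecked copy does to the records in the question can be sketched in C, which uses the same NUL-terminated buffers. This is only a sketch, not EBasic output: the `struct phonerecord` layout, the flat `mem` array standing in for `Rec[5]` and the records after it, and the name `overflow_demo` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* C mirror of the phonerecord TYPE. The field sizes come from the post;
   that EBasic lays the record out the same way is an assumption. */
struct phonerecord {
    char name[12];
    int  age;
    char phone[17];
};

/* Replays the two string assignments inside one flat byte array, so the
   overrun stays inside memory we own, and returns the length a string
   function would then report for the first record's phone field. */
size_t overflow_demo(void)
{
    static const char too_long[] =
        "555-555-1212 for    this makes the phone number too long";
    static unsigned char mem[4 * sizeof(struct phonerecord)];
    char *phone5, *name6;

    /* The demo buffer must be able to absorb the whole overrun. */
    assert(offsetof(struct phonerecord, phone) + sizeof too_long <= sizeof mem);

    memset(mem, 0, sizeof mem);
    phone5 = (char *)mem + offsetof(struct phonerecord, phone);
    name6  = (char *)mem + sizeof(struct phonerecord)
                         + offsetof(struct phonerecord, name);

    /* Unchecked copy: runs past phone[17] and into the next record. */
    strcpy(phone5, too_long);
    /* The next record's name then overwrites part of that text. */
    strcpy(name6, "Hanoi Hannah");

    return strlen(phone5);
}
```

With a 4-byte `int` and 4-byte alignment, `phone` starts at offset 16 and the record occupies 36 bytes, so 20 characters of the phone text survive before "Hanoi Hannah" (12 characters) lands on top of the rest. That gives 20 + 12 = 32, which is consistent with the length reported in the question.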

Paul.
Ionic Wind Support Team

Jim Scott

January 03, 2007, 08:09:04 PM #2 Last Edit: January 03, 2007, 09:02:44 PM by Jim Scott
Since we have:
Def SomeString[Size] as IString

Why can't the compiler look at all future assignments and simply truncate the Istring to "Size"?

So the following;

Def SomeString[9] as IString
SomeString = "This is a longer than 9 Char string"
Print SomeString


Would produce output;
This is a

Would there be any shortcomings to having the compiler do this?  Maybe I'm not seeing the bigger picture here.
Sorry if this is something you've covered before.  And if so, where would I look to get up to speed?
Jim Scott

Ionic Wind Support Team

Because Emergence strings are identical to strings used in C and the Windows API.  They are ASCII NULL terminated.  The allocated length of the string is not stored anywhere.  Again that is the difference between a true compiler and most interpreters.  The advantage is raw speed and leaving you in control.  The disadvantage is you have to actually think about the size of your string buffers. 

An ISTRING is not a type.  It is a keyword that allows changing the default dimensioned size of a string from 255 to n, where n is from 1 to the available memory.  Since Emergence uses a linker your code might have looked like this:

extern SomeString as STRING
SomeString = "This is a longer than 9 Char string"

Here SomeString is located in a different source file and consequently a different binary object file.  The linker and compiler just know it is an address in memory, not what the original dimensioned length is, since the file that contained the actual dimension was compiled long ago.  A string by definition is just an address, and the type dictates how the assignment is performed.

Or consider a function that takes a string as a parameter.

GLOBAL SUB CoolSub(a as STRING)
a = "Hello" + a
PRINT a
ENDSUB

'a' is the address of the string, not the entire string itself.  In other words, the string is passed by reference.  That subroutine could be in a DLL where the string is passed in by an external program.  How is the compiler going to know the dimensioned size of the string when it is an entity from an external source?  The answer is it simply can't.  That is why many API functions and functions in DLLs take a 'size' parameter: the maximum size of the string passed.
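
A rough C counterpart of CoolSub with such a size parameter might look like the sketch below. The name `prepend_hello`, the `cap` parameter, and the return convention are assumptions for illustration, not anything EBasic provides.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical C counterpart of CoolSub: builds "Hello" + a in place.
   cap is the total size of the caller's buffer, terminator included.
   Returns 0 on success, or -1 (buffer untouched) if the result would not fit. */
int prepend_hello(char *a, size_t cap)
{
    static const char prefix[] = "Hello";
    size_t plen = sizeof prefix - 1;   /* 5, without the terminator */
    size_t alen;

    assert(a != NULL);
    alen = strlen(a);                  /* assumes a is already NUL terminated */

    if (plen + alen + 1 > cap)
        return -1;

    memmove(a + plen, a, alen + 1);    /* shift the old text and its NUL right */
    memcpy(a, prefix, plen);           /* drop the prefix in front */
    return 0;
}
```

The callee never learns the buffer size from the pointer; it only knows what the caller passes in `cap`, so a wrong `cap` still leads to an overwrite.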

Some languages, like Visual BASIC, use a hybrid string where the dimensioned length is stored as the first 2 or 4 bytes of the allocated amount, to allow for overwrite checking.  The two biggest problems are speed and incompatibility with the rest of the world.  You can't just 'pass' a VB string to a Windows function without some sort of conversion to an ASCII NULL terminated string.  C++ uses a class, and not a native type, for string handling.  The conversion back and forth to the real string is done by class methods and if you are interested in that sort of thing then perhaps Aurora would be of interest to you.

Paul.
Ionic Wind Support Team

Jim Scott

January 04, 2007, 03:19:59 AM #4 Last Edit: January 04, 2007, 05:15:11 PM by Jim Scott
Hello Paul.  Thanks for your thoughts.  I get the feeling that you must have struggled with this and many other choices as you were writing the compiler.  Yes, I may be interested in Aurora in due time.  While I'm not a big fan of BASIC-like languages in general, I picked up a nasty allergy-like aversion to C-like code while at Sun Microsystems back in the 90's.

Thanks for your time and your work.

(Edited for PC police)
Jim Scott

Kale

Quote from: Jim Scott on January 04, 2007, 03:19:59 AM
I picked up a nasty allergy to  C code while at Sun Microsystems back in the 90's.

Aurora is not C. ;)

Mike Stefanik

Quote from: Paul Turley on January 03, 2007, 10:36:16 PM
Some languages, like Visual BASIC, use a hybrid string where the dimensioned length is stored as the first 2 or 4 bytes of the allocated amount, to allow for overwrite checking.  The two biggest problems are speed and incompatibility with the rest of the world.  You can't just 'pass' a VB string to a Windows function without some sort of conversion to an ASCII NULL terminated string.

Visual Basic uses BSTRs, which are null-terminated Unicode strings that are prepended with the length. They are backwards compatible with Unicode strings, and can typically be used interchangeably (with the exception that BSTRs allow for embedded nulls, which is something the caller may not expect). The conversion that you're talking about is necessary if you're calling the ANSI version of a function, but then that conversion would be needed with any Unicode string.

The disadvantage of BSTRs is primarily in size (each character is 16 bits) and the additional overhead of calling the COM string functions. The advantages are, of course, the ability to know the length of the string without having to calculate it by walking the string looking for nulls, and the ability for the string to contain embedded nulls. The only real incompatibility with the rest of the world is if the rest of the world doesn't know what Unicode is and automatically presumes that every character is 8 bits wide. A pretty bad assumption to make these days. You can freely pass a BSTR to a function that accepts a wchar_t * and it'll work.
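
The length-prefixed layout described here can be sketched in portable C. This is not the COM API: `lp_alloc`, `lp_len`, and `lp_free` are hypothetical names, and the sketch uses 8-bit characters for brevity where a real BSTR holds 16-bit ones.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Allocates [4-byte length][len characters][NUL] and returns a pointer to
   the characters, so the result can also be read as an ordinary C string. */
char *lp_alloc(const char *src, uint32_t len)
{
    unsigned char *block;
    char *data;

    assert(src != NULL);
    block = malloc(sizeof(uint32_t) + (size_t)len + 1);
    if (block == NULL)
        return NULL;

    memcpy(block, &len, sizeof len);           /* length header */
    data = (char *)(block + sizeof(uint32_t));
    memcpy(data, src, len);                    /* may contain embedded NULs */
    data[len] = '\0';                          /* terminator for C callers */
    return data;
}

/* Reads the stored length from just before the characters: no scan needed. */
uint32_t lp_len(const char *data)
{
    uint32_t len;
    memcpy(&len, (const unsigned char *)data - sizeof len, sizeof len);
    return len;
}

/* Releases the whole block, header included. */
void lp_free(char *data)
{
    if (data != NULL)
        free((unsigned char *)data - sizeof(uint32_t));
}
```

A string such as "ab", NUL, "cd" keeps its stored length of 5, while a function that only walks to the first NUL sees 2 characters. That is the embedded-null caveat mentioned above.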

If you plan on implementing full Unicode support for EBasic on Windows, using BSTRs as the native string type would actually be a very good idea.
Mike Stefanik
www.catalyst.com
Catalyst Development Corporation

Ionic Wind Support Team

We already have Unicode support and the WSTRING type in Emergence.
Ionic Wind Support Team

erosolmi

January 04, 2007, 04:41:23 PM #8 Last Edit: January 04, 2007, 04:45:14 PM by erosolmi
I can understand the inner part but Jim was referring to type element assignment and not to strings in general.

I think the compiler has all the info it needs to do this kind of checking, so that assigning data to type elements does not produce bad results.
Because EBasic is a parser/translator, it can add some checking code (trimming or padding the string before the assignment takes place) while it produces the ASM code.
If a type element is defined as an ISTRING of a specific length, the compiler should take care of it. This would not cost much in execution speed even if done thousands of times. It would also avoid a GPF when a string assignment runs past the structure length.

At the end of the day, a compiled language is something that should help the programmer in his/her job. If you say it is the programmer's responsibility, you just move the problem somewhere else. With many different ISTRING elements in a TYPE, it would be a pain to always check the length.

In my opinion, the current situation is not a feature but something to think about from a compiler point of view.
A string element in a type should be space padded.
An ANSI string in a type should be null padded when the string length is less than the element length, or truncated and null terminated when the string length is more than the element length.
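
The pad-or-truncate rule proposed here could look roughly like this in C. It is a sketch of the suggestion, not of anything the EBasic compiler emits; `assign_field` and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src into a fixed field of cap bytes. Longer input is cut so that a
   terminator always fits; shorter input is padded with NULs to the end of
   the field. Returns the number of characters actually stored. */
size_t assign_field(char *dst, size_t cap, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && cap > 0);

    n = strlen(src);
    if (n > cap - 1)
        n = cap - 1;              /* truncate, leaving room for the NUL */

    memcpy(dst, src, n);
    memset(dst + n, 0, cap - n);  /* terminate and null pad the remainder */
    return n;
}
```

With a 17-byte field this keeps 16 characters plus the terminator. Whether an EBasic `Phone[17]` is meant to hold 16 or 17 visible characters is not stated in the thread.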

Usually BSTR cannot be inserted as elements in a type. In reality they are 4 bytes dword pointer to a buffer where first 4 bytes contains real string len, followed by the string itself.
In a type, a string is always stored as a sequence n of bytes, where n is the declared len in TYPE structure. So declared string element len should be used not only to compute total structure len or the starting position of next element but also for padding string during assignment.

Regards
Eros

Mike Stefanik

By "full support", I meant having it fully integrated within the language itself. Right now, both Aurora and EBasic implement it in a fashion that's similar to C. Different types (STRING vs. WSTRING, etc.) and so forth. Ideally, the character set that is being used would be something that is completely transparent to the developer. Unicode support would be as simple as flipping a single switch in the compiler, and the STRING type becomes Unicode. API declares would automatically switch to their Unicode version unless explicitly specified otherwise, and so on.
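
For comparison, the nearest thing C offers to such a switch is a build-time macro, in the spirit of the Windows TCHAR convention. The sketch below uses made-up names (`USE_WIDE_STRINGS`, `tchar`, `T`, `tlen`); none of them come from EBasic or Aurora.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical build switch: define USE_WIDE_STRINGS to make every
   string in the program wide without touching the source below it. */
#ifdef USE_WIDE_STRINGS
typedef wchar_t tchar;
#define T(lit)  L##lit      /* turns "text" into L"text" */
#define tlen    wcslen
#else
typedef char tchar;
#define T(lit)  lit
#define tlen    strlen
#endif

/* Code written against tchar, T and tlen compiles unchanged either way. */
size_t greeting_length(void)
{
    const tchar *greeting = T("hello world");
    return tlen(greeting);
}
```

This still leaves the developer writing `T(...)` around literals, which is short of the fully transparent switch described above.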
Mike Stefanik
www.catalyst.com
Catalyst Development Corporation

Mike Stefanik

Quote from: erosolmi on January 04, 2007, 04:41:23 PM
Usually BSTR cannot be inserted as elements in a type. In reality they are 4 bytes dword pointer to a buffer where first 4 bytes contains real string len, followed by the string itself.

The issue you're talking about is persisting the data or using data structures with fixed-length strings to store character data. But if you've already accounted for Unicode, using a BSTR is no more difficult than a Unicode string. Of course, what you can't do is just write the structure as an arbitrary block of data to disk. Generally speaking, that's a bad thing to do anyway; programs should be written to implement proper serialization. Just dumping raw structures to a file is sloppy programming which makes the data inherently non-portable and that much more difficult to modify or extend the structure.
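
One way to serialize field by field instead of dumping the raw structure is sketched below in C. The wire format (4-byte little-endian lengths, no terminators) and the names `put_u32`, `put_str`, and `serialize_record` are assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Writes v as 4 little-endian bytes, independent of the host byte order. */
static size_t put_u32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v & 0xFF);
    out[1] = (unsigned char)((v >> 8) & 0xFF);
    out[2] = (unsigned char)((v >> 16) & 0xFF);
    out[3] = (unsigned char)((v >> 24) & 0xFF);
    return 4;
}

/* Writes a 4-byte length followed by the characters, without the NUL. */
static size_t put_str(unsigned char *out, const char *s)
{
    size_t n = strlen(s);
    put_u32(out, (uint32_t)n);
    memcpy(out + 4, s, n);
    return 4 + n;
}

/* Serializes one phone record into out. Returns the number of bytes
   written, or 0 if the record does not fit in cap bytes. */
size_t serialize_record(unsigned char *out, size_t cap,
                        const char *name, int32_t age, const char *phone)
{
    size_t need, off = 0;

    assert(out != NULL && name != NULL && phone != NULL);

    need = 4 + strlen(name) + 4 + 4 + strlen(phone);
    if (need > cap)
        return 0;

    off += put_str(out + off, name);
    off += put_u32(out + off, (uint32_t)age);
    off += put_str(out + off, phone);
    return off;
}
```

Because each field carries its own length, the in-memory field sizes can change later without invalidating files that were already written.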
Mike Stefanik
www.catalyst.com
Catalyst Development Corporation

erosolmi

Yes, exactly Mike. That was something I wanted to see in other languages too.
If an element string is defined with a proper length, then it should be managed like a standard buffer.
If no len, it should be managed as a pointer to a dynamic string.
I will think about it in order to implement in thinBasic data structures ... Not so difficult to implement and the effort to deallocate dynamic strings inside structure element should not be so big.

If the compiled code would take care also of allocating/deallocating dynamic strings inside structures, it would be fantastic.
Some nice complex data structures easy to maintain could be possible with little effort.


Ionic Wind Support Team

Quote
I can understand the inner part but Jim was referring to type element assignment and not to strings in general.

Erosolmi,
Yes he was.  Read his post again, it matters not whether the string is stored on the stack, globally or part of structure.

An ISTRING is not a type, it is a directive.  There is no difference to the compiler between an ISTRING and a STRING type as they are one and the same.  One has a default length of 255 characters and the other a variable length.  If you were to use the TYPEOF command it would return @TYPESTRING regardless of the size.

The compiler cannot know the contents of variables at compile time.  The variables don't exist at that stage as the compiler is just creating assembly code.  The allocated length of the string is not stored anywhere during the execution of the binary.  It is just an address after all.

I am not changing the way intrinsic strings work in Emergence now or in the future ;)  It should be sufficient to know how it works and that you as the programmer have to account for overwrites.  The only place the compiler would have enough information is during the assignment of a constant string (enclosed in quotes) to a variable that was defined within the same scope.  If I were to do that then users would expect that functionality at all times, which as I have already shown is not possible in the context of the built in string type.
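
The one checkable case named above, a quoted constant assigned to a variable declared in the same scope, has a C11 analogue that may help show why it does not generalize. `ASSIGN_LITERAL` and `literal_demo` are hypothetical names for this sketch.

```c
#include <assert.h>   /* provides static_assert in C11 */
#include <string.h>

/* Copies a string literal into a char array declared in the current scope
   and refuses to compile if the literal, with its NUL, is larger than the
   array. Both sizes are compile-time constants only in this one case. */
#define ASSIGN_LITERAL(dst, lit)                                        \
    do {                                                                \
        static_assert(sizeof(lit) <= sizeof(dst),                       \
                      "string literal does not fit in destination");    \
        memcpy((dst), (lit), sizeof(lit));                              \
    } while (0)

const char *literal_demo(void)
{
    /* 10 bytes: nine characters plus the terminator. */
    static char some_string[10];

    ASSIGN_LITERAL(some_string, "This is a");
    return some_string;
}
```

As soon as `dst` is a pointer parameter or an `extern` of unknown size, `sizeof(dst)` no longer describes the buffer and the check loses its meaning, which matches the limitation described above.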

Aurora will have a string class, once we have finished operator overloading, which should appease your need for overly complicated strings that are nice and slow just like C#  ::)

Paul.



Ionic Wind Support Team

erosolmi

Paul,

The compiler is compiling the ASM code that the EBasic parser generates. So if, during ASM generation, the EBasic parser (which is, or could be, aware of the type element lengths, I suppose) adds some string length checking code for string assignments, the compiler will compile it, and at run time the buffer will be filled properly.

In any case I will stop here arguing about this. I can accept your view of the problem and your explanations.

Regards
Eros

Ionic Wind Support Team

If I understand your sentence correctly, which was a bit hard to read, you want the compiler to only do this in the case of a structure (UDT) since the size of each element is known due to the TYPE/ENDTYPE definition.   Then yes that would be possible but would introduce other subtle problems.  Consider the use of a typecasted pointer.

int length
length = 5+1

type mytype
pointer mystring
endtype

mytype var
var.mystring = new(char,length)

*<string>mystring = "hello world"

The compiler would miss that case since it would just do a straight string copy, since the length of the string is not known at compile time.  The 'length' variable contents wouldn't be known until the executable is run.  Then you would be asking me to interpret the source before it is compiled.   I have bigger and better things to do with my time ;)

Also consider the case of a few Windows structures that use an empty string (zero dimensioned) as the last element of the UDT.  It has a length of zero until a pointer to the UDT is retrieved from an API function. 

So again I have to say nothing is broken here, just a misunderstanding.  Emergence strings work identically to character buffers in C as that was how I designed them.  Fast, simple, and no overhead.  They are not going to change in implementation, nor is there anything wrong with the way they are used.  If you want to suggest adding a new, separate, string type, as Mike has done, then that is fine.  Suggestions are always welcome.

Paul.

Ionic Wind Support Team

erosolmi

Yes, my English can be hard to understand, I'm Italian. But I can understand your English while I'm not sure if you can understand my Italian.

Your example is not on target because in your TYPE MyType structure there is no dimensioned string, just a pointer to a dynamic buffer allocated (and cast) at run-time.

Last, I'm not asking you anything, I'm just suggesting, making considerations, arguing about a specific argument.
Suggestions can be accepted or not and you already made your decision.

That's all.

Jim Scott

Hey Paul - Sorry about the time sink my question has become for you.  It seems to be mostly healthy debate though and I've learned a lot.  I understand that fundamental choices in design can be gut wrenching branches in the process.  I'm happy with your choice for handling strings, keeping them uncomplicated and efficient.  Having my programs check string lengths is really not a problem, that's what functions are handy for.
Jim Scott