long text - large strings

Started by czoran, March 13, 2008, 04:59:02 PM


czoran

Hello,

I have a pretty unusual task to solve and I'm thinking about an efficient solution, so maybe somebody will have an idea.

I have several hundred text files to process. The file sizes differ a lot: some files are a few hundred kilobytes, others are up to 2.5 gigabytes.
They are log and trace files, and the problem is that they can contain very long lines of text, because the software which generates them is old, bad, buggy, etc.
There are no line breaks (no CRLF, LF or CR) in them, so I must parse those lines and split them properly. Sometimes a file is completely correct, and sometimes it contains a few of these monsters.
Lines can be longer than 100,000 characters, and they are not a constant size; you never know how long the next line will be. So I find it difficult to process them in EBasic, as I am trying to load each line into a string and process it.

I have been using istrings, but I have to dimension them before use, right? Which size should I choose? I keep upsizing the istrings, and it happens that the program crashes, or it misses the remainder of a line if the line is longer than the istring, and then I get a faulty broken line in the output. If I declare ISTRING[100000] it will work until there is a line longer than that.

Can an istring resize itself? Autosize? The documentation says that strings can be up to available memory, but they must be a fixed size, otherwise we would not need ISTRING[size], right?

Should I allocate memory and deal with pointers? How can I be sure that it will keep working on the customers' PCs in the future, without crashes? And can I allocate more memory than the PC actually has? (My PC has 2 GB, the customer PCs have 256 MB.)

I thought about processing these files in binary mode, but I think there should be a way to do it line by line. And again I would have to load the data into large strings.

Does anybody know how other languages/compilers deal with large strings when they say "string size: unlimited"? I had success in another language simply by declaring a variable as a string, without any worry, but I would like to solve it in EB.
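For what it's worth, languages that advertise "unlimited" strings generally just keep the text in a heap buffer that is reallocated bigger as the string grows. A minimal C sketch of that idea (not EBasic; the starting capacity and the doubling policy are my own choices):

#include <stdlib.h>
#include <string.h>

/* Minimal growable string: capacity doubles whenever it runs out. */
typedef struct { char *data; size_t len, cap; } dynstr;

int dynstr_append(dynstr *s, const char *chunk, size_t n)
{
    if (s->len + n + 1 > s->cap) {               /* +1 for the NUL terminator */
        size_t cap = s->cap ? s->cap : 64;
        while (cap < s->len + n + 1) cap *= 2;   /* double until it fits */
        char *p = realloc(s->data, cap);
        if (!p) return 0;                        /* out of memory */
        s->data = p;
        s->cap  = cap;
    }
    memcpy(s->data + s->len, chunk, n);
    s->len += n;
    s->data[s->len] = '\0';
    return 1;
}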

Does EBasic swap to disk when there is not enough memory? What about the performance in that case? I have not been able to make it swap so far. Maybe I should replace my RAM with 64 MB? :))


Thanks.

LarryMc

March 13, 2008, 05:45:09 PM #1 Last Edit: March 13, 2008, 05:48:34 PM by Larry McCaughn
What I would do is:

1. dynamically create an istring (#1) at some size like 2000 (which would take up 2000x255+1 of memory)
2. create a file buffer (char array) of something like 32K
3. read a 32K block of the file being processed
4. read a character at a time from the buffer, looking for the "end of line" marker
5. if not EOL, then add the character to istring #1
6. if EOL is found before reaching 2000, then go process; clear #1 and delete #2 if it exists
7. if EOL is not found before reaching 2000, then create a dynamic istring (#2) at twice the size of #1
8. copy #1 to #2; clear #1 and go back to step 4

9. if step 6 is reached, append #1 to #2, then go process
10. after returning from processing, clear #1 and delete #2
11. if step 7 was reached, then create #3 at twice the size of #2
12. copy #2 to #3; add #1 to #3
13. clear #1; delete #2; recreate #2 to the size of #3; copy #3 to #2; delete #3 and go back to step 4


If I didn't leave out a step, you should be able to process any size file with any size line (a rough sketch of the idea is below).
Your only limitation is available memory.
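A rough C sketch of that grow-the-line-buffer loop, just to make the steps concrete (not EBasic; the 32K block size, the doubling growth, and the process_line() routine are assumptions standing in for the steps above):

#include <stdio.h>
#include <stdlib.h>

void process_line(const char *line, size_t len);    /* your parsing routine */

int split_lines(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 0;

    char block[32 * 1024];                  /* step 2: fixed file buffer     */
    size_t cap = 2000, len = 0;             /* step 1: initial line buffer   */
    char *line = malloc(cap);
    if (!line) { fclose(f); return 0; }

    size_t got;
    while ((got = fread(block, 1, sizeof block, f)) > 0) {
        for (size_t i = 0; i < got; i++) {              /* step 4: scan bytes */
            char c = block[i];
            if (c == '\n' || c == '\r') {               /* step 6: EOL found  */
                if (len > 0) process_line(line, len);
                len = 0;                                /* clear and reuse    */
                continue;
            }
            if (len + 1 >= cap) {                       /* steps 7-13: grow   */
                char *p = realloc(line, cap * 2);
                if (!p) { free(line); fclose(f); return 0; }
                line = p;
                cap *= 2;
            }
            line[len++] = c;                            /* step 5: append     */
        }
    }
    if (len > 0) process_line(line, len);   /* last line without terminator  */

    free(line);
    fclose(f);
    return 1;
}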

If available memory were a major problem, I would write the lines to a temporary file instead of dynamic arrays.

That's the way I would attempt it.
I'm sure someone out there knows a slicker way.

Larry

LarryMc
Larry McCaughn :)
Author of IWB+, Custom Button Designer library, Custom Chart Designer library, Snippet Manager, IWGrid control library, LM_Image control library

czoran

Interesting solution. I was thinking of something similar, but not in such depth. Very good.

What worries me is the assumption in steps 7 and 11. What if that is still not enough? After several doublings I am at the end of physical memory.
And if I write/append to files, I will have to process those too. Maybe I need to include some recursion.
Tricky. Thinking forward.

Thank you Larry.

LarryMc

You could choose not to double each time, but just add an additional 5K each time. That keeps it from increasing at an exponential rate.

Also, instead of having #2 and #3 existing at the same time, you could write #2 to a temp file; delete #2; then recreate #2 at its previous size plus x amount.
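In C terms, that fixed-step growth is just a realloc with a constant increment rather than a doubling; a tiny sketch (the 5K step comes from the post, the helper itself is made up):

#include <stdlib.h>

/* Grow a buffer by a fixed 5K step instead of doubling. */
static char *grow_by_step(char *buf, size_t *cap)
{
    size_t new_cap = *cap + 5 * 1024;
    char *p = realloc(buf, new_cap);
    if (!p) return buf;          /* out of memory: keep the old buffer */
    *cap = new_cap;
    return p;
}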

Let's say you wind up with a line 3 MB long.

It still has to be parsed by your "processing routine".

The nature of the resulting line that has to be processed is the most critical issue in my opinion.

There has to be some "structure" to the information in a line, and you should be able to parse that without having the whole 3 MB available.

And if these "files" are coming from places where the "rules" aren't being followed, I don't see how you can write a parser when there are no rules.

Larry

barry

Do you have any way to know when you've reached what should be the end of the line in cases where lines are merged? If you do, you might want to consider something that reads a byte at a time (buffered, of course), watches for whatever it is that ends the line, and writes a new copy of the same file with all the lines properly terminated. Then you can process them in a more rational way. This can be a separate program or a pre-processor in your main program.
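A minimal C sketch of that kind of pre-processor pass (not EBasic; treating '[' as the start of a new logical line is only a guess at czoran's bracketed timestamps):

#include <stdio.h>

/* Placeholder rule: a '[' that begins a bracketed timestamp starts a new
 * logical line.  Adjust to whatever really marks the end of a line.      */
static int starts_new_line(int c) { return c == '['; }

int rewrite_with_line_breaks(const char *in_path, const char *out_path)
{
    FILE *in  = fopen(in_path,  "rb");
    FILE *out = fopen(out_path, "wb");
    if (!in || !out) { if (in) fclose(in); if (out) fclose(out); return 0; }

    int c, at_line_start = 1;
    while ((c = fgetc(in)) != EOF) {        /* stdio buffers these reads     */
        if (starts_new_line(c) && !at_line_start)
            fputc('\n', out);               /* terminate the previous line   */
        fputc(c, out);
        at_line_start = (c == '\n' || c == '\r');
    }

    fclose(in);
    fclose(out);
    return 1;
}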

Barry




czoran

Hi,

I started working on a solution that is a combination of your ideas. I will do the big files in two passes: first splitting them into smaller chunks of 5,000-30,000 characters, and then parsing those chunks again.
When I find or assemble a complete line, I write it to a separate file by appending. If the line is longer and no markers are found, then I append the whole chunk to the new file.
The markers are actually timestamps within square brackets, but it seems they are not always present in a line because of the buggy logging part of the other software. I have to recognize somehow where a long line should be split and broken into several new lines, using the last found timestamp. It seems that when several events happen at the same time, the logger outputs everything in one big line. Since whole sentences are in there, it is another problem to recognize what is a new event.
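A small C helper along those lines, purely as a guess at the format, could check that a '[' really begins a bracketed timestamp before splitting on it:

#include <ctype.h>
#include <string.h>

/* Heuristic: treat "[<digit>...]" as a timestamp marker.  The exact format
 * of the timestamps is unknown, so this only checks for a '[' immediately
 * followed by a digit, with a closing ']' within the next 32 characters.  */
static int looks_like_timestamp(const char *p, size_t remaining)
{
    if (remaining < 3 || p[0] != '[' || !isdigit((unsigned char)p[1]))
        return 0;
    return memchr(p, ']', remaining < 32 ? remaining : 32) != NULL;
}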

I am keeping a maximum of 3 chunks in memory during processing. A lot of experimenting with the sizes will be necessary here to make it faster.

Funny, but interesting. Maybe I will abandon all of this if the customer decides to replace the logging part of his app with my module.