
splitting huge file

Started by philippe.tx, November 11, 2010, 03:21:27 PM


philippe.tx

hi all,

i want to download a file from the internet and then parse it. its size is over 100 KB.
it's one single (very long) string of characters.
its structure is :
{data #1.........}{data#2....}{data#3...}.........{data#n-2...........}{data#n-1.......}{data#n..........}

i can save it to my HD but how can i parse it ? or split it? or do something with it ?
what i would like to do is to split the data blocks.
if the file was made of several records, or if istrings were not limited in size, it would not be a problem.

thanks for your ideas

LarryMc

If I had to do it I would read, say, 4k chunks into memory,
then I'd use Fletchies Split routine to find each separator and write the resulting data to a text file or a dbase file, depending on what I was going to do with the data

Does the data stream contain embedded LF/CRs or NULL characters?

That's the direction I would start based upon the little you have described.

LarryMc
LarryMc
Larry McCaughn :)
Author of IWB+, Custom Button Designer library, Custom Chart Designer library, Snippet Manager, IWGrid control library, LM_Image control library

philippe.tx

November 11, 2010, 11:40:16 PM #2 Last Edit: November 11, 2010, 11:50:42 PM by philippe.tx
larry,
the file is like a 200 KB long istring. there are no CRLF or NULL characters in it.
i planned to use the "}" as a separator, as it ends each data block.
something like :

def input[300000], output[300000]:istring
def myfile: file
' read the whole file into one string
openfile ( myfile,"the_file_i_want_to_parse","R")
read (myfile,input)
' copy character by character, adding CRLF after each closing brace
for i=1 to len(input)
  a$=mid$(input,i,1)
  output=output+a$
  if a$="}"
    output=output+chr$(13)+chr$(10)
  endif
next i
closefile(myfile)
' write the result, one data block per line
openfile (myfile,"the_file_is_now_splitted","W")
writefile ( myfile,output)
closefile myfile
and then read "the_file_is_now_splitted" record by record, and then parse each record.

unfortunately istrings are limited to 65536 characters.

i've tried to write "the_file_i_want_to_parse" into memory, but didn't succeed, as
readfile(myfile, input)
writemem (mymem,input)
will only write the first 65536 characters.

openfile ( myfile,"the_file_i_want_to_parse","R")
writemem (mymem, myfile)
doesn't work either.

when you say : "If I had to do it I would read, say, 4k chunks into memory,"
how would you read 4k chunks into memory ?

ckoehn

I'm not familiar with CBasic (I use EBasic), but it would seem you could use this logic.


Open Source file
Open CurrentTarget file
While not End Of Source file
    read Source byte  '1 char
    If byte="}" then
        write byte and CRLF to CurrentTarget
        Close CurrentTarget
        Open NextTarget
    else
        Write byte to CurrentTarget
    endif
EndWhile
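
For anyone who wants to try the logic outside of BASIC, here is a minimal sketch of the outline above in Python (not CBasic or EBasic). It makes one change, which is an assumption on my part: instead of opening a new target file per block, it writes every block to a single output followed by CRLF, which is the "one record per line" result asked for earlier in the thread. The function name split_bytewise and the sample data are made up for illustration.

```python
import io

def split_bytewise(src, dst, sep=b"}"):
    """Copy src to dst one byte at a time, adding CRLF after each separator."""
    count = 0
    while True:
        byte = src.read(1)          # one byte per read, as in the outline
        if not byte:                # an empty result means end of file
            break
        dst.write(byte)
        if byte == sep:
            dst.write(b"\r\n")      # end the record after the closing brace
            count += 1
    return count

# In-memory streams stand in for the source and target files.
src = io.BytesIO(b"{a=1}{b=2}{c=3}")
dst = io.BytesIO()
n = split_bytewise(src, dst)
```

With real files, src and dst would be opened in binary mode ("rb" and "wb"). Because only one byte is held at a time, the string length limit never comes into play.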

philippe.tx

def in,out:file
def byte:char
openconsole
Openfile (in,getstartpath+"toutes_donnees.cba","R")
openfile(out,getstartpath+"blabla.cba","W")
for i=1 to len(in)
    read (in,byte)
    print byte
    write (out,byte)
next i
closefile in
closefile out
end


works perfectly 8)
i have to dig into that char thing (never used it before)
you made my day.
thanks a lot

LarryMc

You asked about the 4K chunks I mentioned.

I didn't spend a bunch of time trying to get the processing code exactly correct but you can figure that out.
4K is purely arbitrary; you can make it bigger.

Because of disk read/write access times, I think it will be faster the bigger you make the chunks.
Reading/writing 1 char at a time is the slowest possible way to do it.

Anyway, hope this helps or at least gives you some ideas.

LarryMc

DECLARE IMPORT, _ReadFile ALIAS ReadFile(hFile AS INT,lpBuffer AS POINTER,nNumberOfBytesToRead AS INT,lpNumberOfBytesRead AS POINTER,lpOverlapped AS OVERLAPPED),INT

file fin,fout
int bytesread
istring path[260]=""
char InBuffer[4096]=""
char OutBuffer[8000]=""
char overflow[1000]=""
'fout also has to be opened for writing before the loop
if openfile(fin,path,"R") = 0
    do
        InBuffer=""
        _ReadFile(fin,&InBuffer,4096,&bytesread,NULL)
        if bytesread
/*
do the mid thing to see if you have a piece
if a piece of data
    if overflow<>"" then add overflow to outbuffer and clear overflow
    then add it to outbuffer+CRLF
if no end of data but not at end of inbuffer - means we broke in the middle of data
    set overflow to that leftover
write out outbuffer
set outbuffer=""
*/
        endif
    until bytesread=0
    closefile fin
    closefile fout
endif
LarryMc
Larry McCaughn :)
Author of IWB+, Custom Button Designer library, Custom Chart Designer library, Snippet Manager, IWGrid control library, LM_Image control library
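
The processing step left as a comment in the code above can be sketched like this. It is in Python rather than EBasic, so treat it as an illustration of the overflow idea only, not as a drop-in replacement. The function name split_chunked, the sample data and the tiny chunk size in the example are all made up for the demonstration; the default of 4096 matches the 4K figure from the post.

```python
import io

def split_chunked(src, dst, chunk_size=4096, sep=b"}"):
    """Read src in fixed-size chunks and write each complete block plus CRLF."""
    overflow = b""                      # partial block carried over between chunks
    count = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:                   # nothing read: end of file
            break
        pieces = (overflow + chunk).split(sep)
        overflow = pieces.pop()         # whatever follows the last separator
        for piece in pieces:
            dst.write(piece + sep + b"\r\n")
            count += 1
    if overflow:
        dst.write(overflow)             # trailing data with no closing separator
    return count

# A chunk size of 5 forces blocks to straddle chunk boundaries.
src = io.BytesIO(b"{data 1}{data 2 is longer}{3}")
dst = io.BytesIO()
n = split_chunked(src, dst, chunk_size=5)
```

The point of the overflow variable is that a chunk boundary almost never falls exactly on a "}", so the unfinished tail of one chunk has to be glued to the front of the next before looking for separators again.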

philippe.tx

helps a lot, and gives a lot of ideas.
thanks larry