It feels like I've looked everywhere, and I still don't know how to accomplish this.
I am loading a webpage whose content has to be evaluated first (JavaScript and so on), so just looking at the HTML source doesn't help. I need to look at the rendered page itself and extract the information I need from there. Does anyone know of any shortcuts for this? The layout of the page may even differ from time to time, so I'll also have to check the contents and decide which parts are useful before processing the input.
Cheers.
PHP is the way to go for web forms.
I use it for data collection and database activities.
Oh, maybe I didn't explain it as well as I should've.
I'm not doing a server-side application. I'm doing something that needs a bit of "browser automation". I need to be able to let the application "collect" items from viewable things on the page. The problem is when JavaScript is used to dynamically load certain parts. That's why I need something that doesn't just read the source, but enables my application to see what the user is seeing. The Browser Control as a base is brilliant (it can be kept hidden if needed), but I can't find enough information about intercepting/interpreting the information shown in the browser control window.
Thanks.
I have no idea if this will help your situation or not, but you can load a "webpage" into a form, then "post" the form and read everything that's there.
But being able to "see" what's on a page when it can be in varying formats isn't going to be easy, if it's possible at all.
People can display things on a web page using plain HTML, DHTML, XML transforms, in frames, in iframes, with Java applets, and with or without elaborate CSS files.
I wish you luck, but......
Larry
Quote from: Peter on November 27, 2008, 05:04:04 AM
Oh, maybe I didn't explain it as well as I should've.
I'm not doing a server-side application. I'm doing something that needs a bit of "browser automation". I need to be able to let the application "collect" items from viewable things on the page. The problem is when JavaScript is used to dynamically load certain parts. That's why I need something that doesn't just read the source, but enables my application to see what the user is seeing. The Browser Control as a base is brilliant (it can be kept hidden if needed), but I can't find enough information about intercepting/interpreting the information shown in the browser control window.
Thanks.
Peter,
I'm not sure I fully understand your request, but here goes.
I guess that your prog needs to download a webpage and examine the contents of the HTML code, and then give the user a choice from several options, according to the nature of the HTML code.
If I'm roughly on target, then I think you might need to write what is sometimes called a 'web-spider'.
If so, I can help a bit, since that is exactly what I am currently engaged in writing.
I have written (using a programming language other than EBasic) a large (2000 lines of code) specialist spider to extract and process data from the web.
The spider saves the data to a large text file in .csv format.
It is aimed at a particular website, so is written with that website's layout in mind.
It won't work with any other website, but...
I'm now re-writing the complete spider in EBasic, so that I can use its built-in database capability.
I'm not suggesting that you need all that, but maybe the routines I use to download, examine and extract data, which are more general in nature, might be of assistance?
JohnP
JohnP:
Not exactly. The HTML is easy to download, but the issue is when the webpage uses odd JavaScript to show certain text (and even auto-updating text) in certain places, such as time, currency, etc. in my case. What I need is to be able to "see" the page the way a normal browsing user would see it. That is, I need to "remote control" and "snoop" inside the browser window, and I know that it is possible. I just don't know how. :)
Edit:
I will also need to be able to simulate text input, button/image/hyperlink clicks, etc. That's hard to do with only GET and POST, since many pages generate the code to be "executed" on the fly using JavaScript when an object is clicked.
Peter, I've created a mini example for you, just to show a bit about browser internals.
It opens the Google homepage, puts a string into the search box and clicks the search button. After the search results are ready, a 'Support Forums' anchor gets clicked and a message box shows who is online.
It also extends the browser library with the missing DOCUMENT_COMPLETE event. I have used the first unused ID for this event.
$include "windowssdk.inc"
$include "mshtml.inc"
$include "exdisp.inc"
WINDOW g_win
UINT g_ondoccompleted ' callback pointer
setid "IDDOCCOMPLETE",0x8011 ' first unused id
declare BROWSERCB(WINDOW win, IWebBrowser2 browser)
InstallDocumentCompleteSupport()
OPENWINDOW g_win,0,0,600,400,0x10CF0000,0,"",&win_handler
if (AttachBrowser(g_win) = 0)
    IWebBrowser2 g_browser
    if (BrowserFromWindow(g_win, &g_browser))
        ' call OnGoogleOpened after the navigation completes
        'update callback
        g_ondoccompleted = &OnGoogleOpened
        ' and navigate
        g_browser->Navigate(L"http://www.google.com/webhp?sourceid=navclient&ie=UTF-8",0,0,0,0)
        ' release browser
        g_browser->Release()
        waituntil g_win.hwnd = 0
    endif
endif
sub OnGoogleOpened(WINDOW win, IWebBrowser2 browser)
    IHTMLDocument2 document
    if (BrowserGetDocument(browser, &document))
        ' 1. put "aurora compiler" into search box
        ' 2. change submit button text
        ' 3. submit
        IHTMLInputElement textfind
        if (DocumentGetElementById(document, L"q", _IID_IHTMLInputElement, &textfind))
            textfind->put_value(L"aurora compiler")
            textfind->Release()
            IHTMLInputElement searchbutton
            if (DocumentGetElementById(document, L"btnG", _IID_IHTMLInputElement, &searchbutton))
                ElementFocus(searchbutton)
                searchbutton->put_value(L"Click Me Click Me Click Me")
                if (MessageBox(win, "search now ?", "", MB_YESNO) = @IDYES)
                    'update callback
                    g_ondoccompleted = &OnGoogleResults
                    ElementClick(searchbutton)
                endif
                searchbutton->Release()
            endif
        endif
        document->Release()
    endif
    return
endsub
sub OnGoogleResults(WINDOW win, IWebBrowser2 browser)
    int index
    ' find and click 'Support Forums'
    IHTMLElementCollection all
    if (BrowserGetAll(browser, &all)) /*browser.document.all*/
        ' number of elements
        int count=0
        all->get_length(&count)
        ' for each element
        for index=0 to count-1
            IHTMLAnchorElement anchor
            if (CollectionGetItem(all, index, _IID_IHTMLAnchorElement, &anchor))
                ' anchor has text and href attributes (not only)
                BSTR bstrText=0
                'if (ElementGetInnerText(anchor, &bstrText))
                if (ElementGetAttribute(anchor, L"innerText", &bstrText))
                    ' bstrText is a wstring
                    if (wcsicmp(bstrText, L"Support Forums") = 0)
                        ' if you are interested with this anchor ...
                        ' ... click it
                        if (MessageBox(win, "click on 'Support Forums' anchor ?", "Support Forums", MB_YESNO) = @IDYES)
                            'if (MessageBox(win, "in new window ?", "Support Forums", MB_YESNO) = @IDNO)
                            anchor->put_target(NULL) ' but in same window
                            'update callback
                            g_ondoccompleted = &OnSupportForumsOpened
                            'else
                            ' anchor->put_target(L"_blank")
                            'endif
                            ElementClick(anchor)
                            ' break the FOR
                            index = count
                        endif
                    endif
                    SysFreeString(bstrText)
                endif
                anchor->Release()
            endif
        next index
        all->Release()
    endif
    return
endsub
sub OnSupportForumsOpened(WINDOW win, IWebBrowser2 browser)
    ' this will be a surprise
    int index
    IHTMLElementCollection all
    if (BrowserGetAll(browser, &all))
        ' number of elements
        int count=0
        all->get_length(&count)
        ' for each element
        for index=0 to count-1
            IHTMLAnchorElement anchor
            if (CollectionGetItem(all, index, _IID_IHTMLAnchorElement, &anchor))
                ' anchor has text and href
                BSTR bstrHref=0
                if (ElementGetAttribute(anchor, L"href", &bstrHref))
                    ' bstrHref is a wstring
                    if (wcsstr(bstrHref, L"?action=who"))
                        BSTR bstrText=0
                        'if (ElementGetInnerText(anchor, &bstrText))
                        if (ElementGetAttribute(anchor, L"innerText", &bstrText))
                            MessageBoxW(win.hwnd, bstrText, L"Surprise", 0)
                            SysFreeString(bstrText)
                            index = count
                        endif
                    endif
                    SysFreeString(bstrHref)
                endif
                anchor->Release()
            endif
        next index
        all->Release()
    endif
    return
endsub
sub win_handler
    IWebBrowser2 browser
    UINT function
    SELECT @MESSAGE
        CASE @IDCREATE
            CENTERWINDOW *<WINDOW>@HITWINDOW
        CASE @IDDOCCOMPLETE
            if (g_ondoccompleted and BrowserFromWindow(*<WINDOW>@HITWINDOW, &browser))
                function = g_ondoccompleted
                g_ondoccompleted = 0
                !<BROWSERCB>function(*<WINDOW>@HITWINDOW, browser)
                browser->Release()
            endif
        CASE @IDCLOSEWINDOW
            CLOSEWINDOW *<WINDOW>@HITWINDOW
    ENDSELECT
    RETURN
ENDSUB
'================================================================== html util
sub BrowserFromWindow(WINDOW w, pointer ppBrowser),BOOL
    ' This will return IWebBrowser2 object.
    ' Call Release() method when finished with it.
    pointer p = GetProp(w.hwnd, "BROWSER")
    BOOL success = FALSE
    *<int>ppBrowser = 0
    if (p)
        IUnknown unk = *<comref>p
        if (unk <> 0)
            success = (unk->QueryInterface(_IID_IWebBrowser2, ppBrowser) = 0)
        endif
    endif
    return success
endsub
sub BrowserGetDocument(IWebBrowser2 browser, pointer ppv),BOOL
    ' This will return IHTMLDocument2
    BOOL success = FALSE
    IDispatch disp = 0
    if ((browser->get_Document(&disp) = 0) and (disp <> 0))
        success = (disp->QueryInterface(_IID_IHTMLDocument2, ppv) = 0)
        disp->Release()
    endif
    return success
endsub
sub DocumentGetElementById(IHTMLDocument2 document, LPWSTR id, pointer refiid, pointer ppv),BOOL
    BOOL success = FALSE
    if (refiid = 0) then refiid = _IID_IHTMLElement
    IHTMLDocument3 doc = 0
    if (document->QueryInterface(_IID_IHTMLDocument3, &doc) = 0)
        IHTMLElement element = 0
        if ((doc->getElementById(id, &element) = 0) and (element <> 0))
            success = (element->QueryInterface(refiid, ppv) = 0)
            element->Release()
        endif
        doc->Release()
    endif
    return success
endsub
sub CollectionGetItem(IHTMLElementCollection all, int index, pointer refiid, pointer ppv),BOOL
    BOOL success = FALSE
    VARIANT vName
    VARIANT vIndex
    vName.vt = VT_I4
    vIndex.vt = VT_EMPTY
    vName.intVal = index
    IDispatch pDisp = 0
    if ((all->item(vName, vIndex, &pDisp) = 0) and (pDisp <> 0))
        success = (pDisp->QueryInterface(refiid, ppv) = 0)
        pDisp->Release()
    endif
    return success
endsub
sub BrowserGetAll(IWebBrowser2 browser, pointer ppAall),BOOL
    ' this will return IHTMLElementCollection
    BOOL success = FALSE
    IHTMLDocument2 document
    if (BrowserGetDocument(browser, &document))
        IHTMLElementCollection all = 0
        if ((document->get_all(ppAall) = 0) and *<int>ppAall)
            success = TRUE
        endif
        document->Release()
    endif
    return success
endsub
sub ElementClick(IDispatch object)
    IHTMLElement element = 0
    if (object->QueryInterface(_IID_IHTMLElement, &element) = 0)
        element->click()
        element->Release()
    endif
endsub
sub ElementFocus(IDispatch object)
    IHTMLElement2 element = 0
    if (object->QueryInterface(_IID_IHTMLElement2, &element) = 0)
        element->focus()
        element->Release()
    endif
endsub
/*
sub ElementGetInnerText(IDispatch object, pointer ppv),BOOL
    ' this will return BSTR
    ' you get same result calling ElementGetAttribute(element, L"innerText", &bstrText)
    BOOL success = FALSE
    IHTMLElement element=0
    if (object->QueryInterface(_IID_IHTMLElement, &element) = 0)
        if ((element->get_innerText(ppv) = 0) and *<int>ppv)
            success = TRUE
        endif
        element->Release()
    endif
    return success
endsub*/
sub ElementGetAttribute(IDispatch object, LPWSTR attribute, pointer ppv),BOOL
    ' this will return BSTR
    BOOL success = FALSE
    IHTMLElement element=0
    if (object->QueryInterface(_IID_IHTMLElement, &element) = 0)
        VARIANT v
        if (element->getAttribute(attribute, 0, &v) = 0)
            if ((v.vt <> VT_UNKNOWN) and (v.vt <> VT_DISPATCH))
                VariantChangeType(&v, &v, VARIANT_ALPHABOOL, VT_BSTR)
            endif
            if (v.vt = VT_BSTR)
                *<BSTR>ppv = v.bstrVal
                v.vt = VT_EMPTY
                success = TRUE
            endif
            VariantClear(&v)
        endif
        element->Release()
    endif
    return success
endsub
'================================================
' this is a trick to receive OnDocumentComplete.
' the _DocumentComplete function is empty, and is 16 bytes long. Takes two parameters
declare extern _DocumentComplete()
declare _dcpath()
_asm
    jmp _skip
_dcpath: ; replacement for _DocumentComplete
    push dword _doccompl
    ret
    align 4
_doccompl:
    add esp,12 ; eat 2 parameters and return address
    mov eax,[esp+4] ; WebEvents*
    mov edx,[esp+24] ; pDispParams
    mov edx,[edx] ; VARIANT[]
    push dword [edx+8] ; VARIANT *URL
    push dword [edx+16+8]; IDispatch *pDisp
    push dword [eax+12] ; hwnd
    call OnDocumentComplete
    ret 0x24 ; return from WebEvents::Invoke
_skip:
_endasm
sub InstallDocumentCompleteSupport()
    ' overwrite the _DocumentComplete function
    WriteProcessMemory(GetCurrentProcess(),&_DocumentComplete,&_dcpath,6,0)
    return
endsub
sub OnDocumentComplete(HWND hwnd, IDispatch pDisp, VARIANT URL)
    ' pDisp - Pointer to the IDispatch interface of the window or frame in which the document has loaded.
    ' This IDispatch interface can be queried for the IWebBrowser2 interface.
    _SendMessage(hwnd, @IDDOCCOMPLETE,0,0)
    return
endsub
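For anyone who wants the control flow without the COM details: win_handler above keeps a single pending callback in g_ondoccompleted, clears it before calling it, and each step arms the next one before it clicks. Below is a rough sketch of that one-shot chain in plain Python (not EBasic). The names Driver, on_home, on_results and run_demo are made up for illustration, and the page loads are simulated by a simple loop, so treat it as a picture of the idea rather than working browser code.

```python
from typing import Callable, List, Optional

# A step receives the driver and the URL of the page that just finished loading.
Step = Callable[["Driver", str], None]


class Driver:
    """Stand-in for the window handler: holds at most one pending callback."""

    def __init__(self) -> None:
        self.pending: Optional[Step] = None  # plays the role of g_ondoccompleted
        self.log: List[str] = []

    def on_document_complete(self, url: str) -> None:
        # Same order as CASE @IDDOCCOMPLETE: take the callback, clear the
        # slot, then call it, so the callback is free to arm the next step.
        step = self.pending
        if step is None:
            return  # nobody is waiting for this page load
        self.pending = None
        step(self, url)


def on_home(driver: Driver, url: str) -> None:
    driver.log.append(f"home:{url}")
    driver.pending = on_results  # arm the next step before "clicking"


def on_results(driver: Driver, url: str) -> None:
    driver.log.append(f"results:{url}")
    # No further step is armed, so later page loads are ignored.


def run_demo() -> List[str]:
    driver = Driver()
    driver.pending = on_home
    for url in ["/home", "/search", "/unrelated"]:  # simulated page loads
        driver.on_document_complete(url)
    return driver.log


if __name__ == "__main__":
    print(run_demo())
```

Clearing the slot before the call is what makes each callback fire once only, so a stray page load (or a frame finishing late) does not re-run an earlier step.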
sapero: You amaze me every time you post. Quite an advanced method, including inline asm etc. just to accomplish this, but it sure is a push in the right direction. Thanks!
I have updated the code - added OnGoogleResults function where the code is searching for an anchor with given text. If found, the anchor will be clicked and... (top secret).
I can hardly even speak. The implementation is way more advanced than anything that I could come up with, and that's just the internals on how to find controls etc. This could be a priceless library if worked on a little.
The trick with the OnDocumentComplete-function is so far out of my league that I won't even start discussing it.
You're a god.
The next example shows how to attach your dispatch class to a DHTML event (onreadystatechange), how to set the src attribute of an image element, and how to append text to an element.
It requires the headers update from 6th December, which fixes the setAttribute method!
The DHTMLDispatch class does not use a reference counter (AddRef, Release); it is not required here.
$include "windowssdk.inc"
$include "mshtml.inc"
$include "exdisp.inc"
' required SDK pak from 6th december 2008, or newer
class DHTMLDispatch
    declare DHTMLDispatch()
    declare virtual QueryInterface(REFIID riid, pointer ppvObject),HRESULT
    declare virtual AddRef(),ULONG
    declare virtual Release(),ULONG
    declare virtual GetTypeInfoCount(pointer pctinfo),HRESULT
    declare virtual GetTypeInfo(UINT iTInfo, LCID lcid, ITypeInfo ppTInfo),HRESULT
    declare virtual GetIDsOfNames(REFIID riid, LPOLESTR rgszNames[], UINT cNames, LCID lcid, pointer rgDispId),HRESULT
    declare virtual Invoke(DISPID dispIdMember, REFIID riid, LCID lcid, USHORT wFlags, DISPPARAMS pDispParams, VARIANT pVarResult, EXCEPINFO pExcepInfo, UINT puArgErr byref)
    uint m_callback /* OnEvent(DHTMLDispatch disp) */
    ' user data here
    IHTMLElement m_image
    IHTMLElement m_text
endclass
WINDOW g_win
DHTMLDispatch g_dLogo
OPENWINDOW g_win,0,0,600,400,0x10CF0000,0,"DHTML events", &win_handler
if (AttachBrowser(g_win) = 0)
    BROWSECMD g_win, @BROWSELOAD, "<html><head><base href='http://ionicwind.com/forums/'></head><body><img id=smfLogo border=1><br><br><div id=myInfo style='border: 1px solid'></div></body></html>"
    IWebBrowser2 g_browser
    if (BrowserFromWindow(g_win, &g_browser))
        ' InitImageHref will set g_dLogo.m_image to document.smfLogo, g_dLogo.m_text to myInfo,
        ' image.onreadystatechange to OnImageReadyStateChanged, image.src to SMF logo url.
        InitImageHref(g_browser, L"smfLogo", L"Themes/babylon/images/smflogo.gif", g_dLogo)
        g_browser->Release()
        AppendTextLine(g_dLogo, L"entering message loop")
        waituntil g_win.hwnd = 0
        ' cleanup
        if (g_dLogo.m_image <> 0) then g_dLogo.m_image->Release()
        if (g_dLogo.m_text <> 0) then g_dLogo.m_text->Release()
    endif
endif
end
sub InitImageHref(IWebBrowser2 browser, LPWSTR wszImgId, LPWSTR wszImgSrc, DHTMLDispatch dStateChangeDisp)
    VARIANT v
    IHTMLDocument2 document
    if (BrowserGetDocument(browser, &document))
        DocumentGetElementById(document, L"myInfo", _IID_IHTMLElement, &dStateChangeDisp.m_text)
        IHTMLElement image
        if (DocumentGetElementById(document, wszImgId, _IID_IHTMLElement, &image))
            dStateChangeDisp.m_callback = &OnImageReadyStateChanged
            dStateChangeDisp.m_image = image
            AppendTextLine(dStateChangeDisp, L"initializing onreadystatechange")
            v.vt = VT_DISPATCH
            v.pDispVal = &dStateChangeDisp
            ElementSetAttributeEx(image, L"onreadystatechange", v)
            AppendTextLine(dStateChangeDisp, L"initializing image.src with " + *<WSTRING>wszImgSrc)
            v.vt = VT_BSTR
            v.bstrVal = SysAllocString(wszImgSrc)
            ElementSetAttributeEx(image, L"src", v)
            VariantClear(&v)
            'image->Release() !! keep a reference in g_dLogo.m_image
        endif
        document->Release()
    endif
    AppendTextLine(dStateChangeDisp, L"returning from InitImageHref function")
    return
endsub
' called when ready-state of image changes
sub OnImageReadyStateChanged(DHTMLDispatch disp)
    VARIANT state
    if (ElementGetAttributeEx(disp.m_image, L"readyState", state))
        AppendTextLine(disp, L"ready state: " + state.*<WSTRING>bstrVal)
        if (wcsicmp(state.bstrVal, L"complete") = 0)
            disp.m_image->Release()
            disp.m_image = 0
        endif
        VariantClear(&state)
    endif
    return
endsub
sub AppendTextLine(DHTMLDispatch disp, wstring wszText)
    VARIANT text
    if (disp.m_text <> 0)
        if (ElementGetAttributeEx(disp.m_text, L"innerHTML", text))
            BSTR newString = SysAllocString(text.*<WSTRING>bstrVal + wszText + L"<br>")
            SysFreeString(text.bstrVal)
            text.bstrVal = newString
            ElementSetAttributeEx(disp.m_text, L"innerHTML", text)
            SysFreeString(text.bstrVal)
        endif
    endif
    return
endsub
sub win_handler
    SELECT @MESSAGE
        CASE @IDCREATE
            CENTERWINDOW *<WINDOW>@HITWINDOW
        CASE @IDCLOSEWINDOW
            CLOSEWINDOW *<WINDOW>@HITWINDOW
    ENDSELECT
    RETURN
ENDSUB
' general dispatch class
sub DHTMLDispatch::DHTMLDispatch()
    m_callback = 0
    m_image = 0
    m_text = 0
    return
endsub
sub DHTMLDispatch::QueryInterface(REFIID riid, pointer ppvObject),HRESULT
    if (IsEqualGUID(riid, _IID_IUnknown) or IsEqualGUID(riid, _IID_IDispatch))
        *<pointer>ppvObject = this
        AddRef()
        return 0
    endif
    return E_NOINTERFACE
endsub
sub DHTMLDispatch::AddRef(),ULONG
    return 1
endsub
sub DHTMLDispatch::Release(),ULONG
    return 1
endsub
sub DHTMLDispatch::GetTypeInfoCount(pointer pctinfo),HRESULT
    *<int>pctinfo = 0
    return 0
endsub
sub DHTMLDispatch::GetTypeInfo(UINT iTInfo, LCID lcid, ITypeInfo ppTInfo),HRESULT
    return E_NOINTERFACE
endsub
sub DHTMLDispatch::GetIDsOfNames(REFIID riid, LPOLESTR rgszNames[], UINT cNames, LCID lcid, pointer rgDispId),HRESULT
    return E_FAIL
endsub
sub DHTMLDispatch::Invoke(DISPID dispIdMember, REFIID riid, LCID lcid, USHORT wFlags, DISPPARAMS pDispParams, VARIANT pVarResult, EXCEPINFO pExcepInfo, UINT puArgErr byref)
    if (m_callback)
        declare CB1(DHTMLDispatch d)
        !<CB1>m_callback(*<DHTMLDispatch>this)
    endif
    return
endsub
' html helpers
sub ElementGetAttributeEx(IDispatch object, LPWSTR attribute, VARIANT ppv),BOOL
    ' this will return BSTR
    BOOL success = FALSE
    IHTMLElement element=0
    if (object->QueryInterface(_IID_IHTMLElement, &element) = 0)
        success = (element->getAttribute(attribute, 0, &ppv) = 0)
        element->Release()
    endif
    return success
endsub
sub ElementSetAttributeEx(IDispatch object, LPWSTR name, VARIANT v)
    IHTMLElement element
    if (object->QueryInterface(_IID_IHTMLElement, &element) = 0)
        element->setAttribute(name, v, 0)
        element->Release()
    endif
    VariantClear(&v)
    return
endsub
sub BrowserFromWindow(WINDOW w, pointer ppBrowser),BOOL
    ' This will return IWebBrowser2 object.
    ' Call Release() method when finished with it.
    pointer p = GetProp(w.hwnd, "BROWSER")
    BOOL success = FALSE
    *<int>ppBrowser = 0
    if (p)
        IUnknown unk = *<comref>p
        if (unk <> 0)
            success = (unk->QueryInterface(_IID_IWebBrowser2, ppBrowser) = 0)
        endif
    endif
    return success
endsub
sub BrowserGetDocument(IWebBrowser2 browser, pointer ppv),BOOL
    ' This will return IHTMLDocument2
    BOOL success = FALSE
    IDispatch disp = 0
    if ((browser->get_Document(&disp) = 0) and (disp <> 0))
        success = (disp->QueryInterface(_IID_IHTMLDocument2, ppv) = 0)
        disp->Release()
    endif
    return success
endsub
sub DocumentGetElementById(IHTMLDocument2 document, LPWSTR id, pointer refiid, pointer ppv),BOOL
    BOOL success = FALSE
    IHTMLDocument3 doc = 0
    if (document->QueryInterface(_IID_IHTMLDocument3, &doc) = 0)
        IHTMLElement element = 0
        if ((doc->getElementById(id, &element) = 0) and (element <> 0))
            success = (element->QueryInterface(refiid, ppv) = 0)
            element->Release()
        endif
        doc->Release()
    endif
    return success
endsub
Attached is a project extending the above example. It has an additional input box and a button. If you click the button, an image will be downloaded from the URL typed in the input box.
The url can be relative to http://www.ionicwind.com/forums/ (see <base href>) or it can be a fully qualified url.