Jeremy
Joined: 02/04/2010
Posts: 4

I've seen variations of this question asked before, but nothing quite like what I need to do, so I thought I would start a new post.

Basically, I need to extract an article headline and the article text from a web page.

I can create a job that will log in to the site and navigate to the correct page, but I'm not sure how to tell WinAutomation which text on the page to extract while ignoring the rest.

Since the text length will always be different, I don't think a simple copy-and-paste action will work with the recorder. The article will always be in the same section of the page, though.

Hope that makes sense.

I just wanted to know if there is a reasonably simple way to go about something like this. I think I know a way to cobble something together, but it would be extremely complicated, and as a result the job could easily "break".

Thanks!

D.M.Altizer
Joined: 01/12/2010
Posts: 204
Re: Extract Specific Text From Web Page

Hello Jeremy!

This is a very tricky question. You are absolutely right that there is no direct solution posted in the forum, and the reason is that there is no universal solution. It depends on the specific website and the specific part of the page that you want to retrieve. Most of the web mining jobs that I build for my customers use the "Download from Web" action and some regular expressions. In practice, this means those jobs run in the background and are much faster than any UI-based job.
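To illustrate the general "download the page, then apply regular expressions" idea outside of WinAutomation, here is a minimal sketch in Python. It is not a WinAutomation job, and the markup it looks for (an `<h1 class="headline">` element and a `<div id="article-body">` element) is purely hypothetical; the real patterns would have to be written against the actual page source. The names `fetch_page`, `extract_article`, and `SAMPLE` are made up for this example.

```python
import html
import re
from urllib.request import urlopen


def fetch_page(url):
    """Download the raw HTML of a page in the background (no browser UI)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


# ASSUMPTION: the headline sits in <h1 class="headline"> and the article in
# <div id="article-body">. Both are invented for this sketch.
HEADLINE_RE = re.compile(
    r'<h1[^>]*class="headline"[^>]*>(.*?)</h1>', re.DOTALL | re.IGNORECASE
)
# The lazy match stops at the first </div>, so this only works if the
# article container has no nested <div> elements.
BODY_RE = re.compile(
    r'<div[^>]*id="article-body"[^>]*>(.*?)</div>', re.DOTALL | re.IGNORECASE
)
TAG_RE = re.compile(r"<[^>]+>")


def clean(fragment):
    """Strip tags, decode HTML entities, and collapse whitespace."""
    text = TAG_RE.sub(" ", fragment)
    text = html.unescape(text)
    return " ".join(text.split())


def extract_article(page_html):
    """Return (headline, body) as plain text, or None if either is missing."""
    headline = HEADLINE_RE.search(page_html)
    body = BODY_RE.search(page_html)
    if headline is None or body is None:
        return None
    return clean(headline.group(1)), clean(body.group(1))


# A small stand-in page so the sketch can be tried without a network call.
SAMPLE = """<html><body>
<div id="nav">Home | News</div>
<h1 class="headline">Local Team <em>Wins</em></h1>
<div id="article-body">
<p>First paragraph.</p>
<p>Second &amp; final paragraph.</p>
</div>
<div id="footer">Copyright</div>
</body></html>"""

print(extract_article(SAMPLE))
```

Because the patterns key on the surrounding markup rather than on the text itself, the length of the article does not matter. The trade-off is that the patterns break if the site changes its markup, which is why there is no universal solution.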

If you can provide us with the source code of the web pages that you are navigating through, someone will probably put together a sample job for you.

__________________

==Dedicated Automation Solutions==


Jeremy
Joined: 02/04/2010
Posts: 4
Re: Extract Specific Text From Web Page

Thanks for the insight and help.

I actually came across another wrinkle. The page with the data to extract is AJAX-based, which makes it a lot trickier, since the text of the article doesn't actually show up in the page source.
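One thing that may be worth checking: on an AJAX page the article text usually arrives through a separate background request, and the response to that request can sometimes be downloaded and parsed directly. The Python sketch below assumes, purely for illustration, that the response is JSON with `title` and `content` fields. The field names, the function name `extract_from_payload`, and the sample payload are all hypothetical; the real request and its format would have to be found by watching the page's network traffic.

```python
import json


def extract_from_payload(payload_text, headline_key="title", body_key="content"):
    """Pull the headline and article text out of a JSON response body.

    ASSUMPTION: the background request returns a JSON object with the
    headline and article under the given keys. Both key names are
    hypothetical defaults, not taken from any real site.
    """
    data = json.loads(payload_text)
    headline = data.get(headline_key)
    body = data.get(body_key)
    if headline is None or body is None:
        return None
    return headline.strip(), body.strip()


# Invented payload standing in for what the background request might return.
SAMPLE_PAYLOAD = '{"id": 17, "title": " Local Team Wins ", "content": "First paragraph.\\nSecond paragraph."}'

print(extract_from_payload(SAMPLE_PAYLOAD))
```

If the background request returns an HTML fragment instead of JSON, the same regular-expression approach used on ordinary page source would apply to that fragment.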

I'm thinking I might just use a data extraction program to do the job and then have WinAutomation work around that program to tie up the loose ends. Otherwise, I think it could get really messy relying on WinAutomation alone.