Regular Expressions Thread
If I wanted to pull the names and prices of products costing more than $X from an amazon.com search results page, what "text to find" would I use? Would that be a regex? I see Jag and Thomas using all sorts of complex-looking stuff when parsing text... is there a library where I can see what it all means?
Here is the code from the page.
<span class="tiny"><span style="white-space:no-wrap;"><span class="asinReviewsSummary" name="B0000A1ZMU">
<a href="http://www.amazon.com/Cuisinart-DFP-3-Handy-3-Cup-Processor/product-reviews/B0000A1ZMU/ref=pd_ts_k_1_cm_cr_acr_img?ie=UTF8&showViewpoints=1"><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/ratings/stars-4-5._V25749327_.gif" width="55" alt="4.5 out of 5 stars" align="absbottom" height="12" border="0" /></a> </span>(<a href="http://www.amazon.com/Cuisinart-DFP-3-Handy-3-Cup-Processor/product-reviews/B0000A1ZMU/ref=pd_ts_k_1_cm_cr_acr_txt?ie=UTF8&showViewpoints=1">81 customer reviews</a>)</span></span>
<span class="tiny"> | <a href="http://www.amazon.com/gp/forum/cd/forum.html/ref=cm_cd_pd_ts_fp?ie=UTF8&cdForum=FxFEB9PBNUVWC7&asin=B0000A1ZMU">3 customer discussions</a></span><br />
In Stock<br />
<table width="100%" border="0" cellspacing="0" cellpadding="0" class="priceBox">
<p class="priceBlock"><strong>List Price: </strong> <span class="listprice">$110.00</span> </p><p class="priceBlock"><strong>Price: </strong> <span class="price"><b>$59.99</b></span> </p><p class="priceBlock"><strong>You Save: </strong> <span class="price">$50.01 (45%)</span></p> <p class="priceBlock"><a href="http://www.amazon.com/gp/offer-listing/B0000A1ZMU/ref=pd_ts_k_1?ie=UTF8&condition=all">11 used & new</a> from <span class="price">$56.99</span>
As far as I am concerned, the best tool for regular expressions is Expresso. This small, free tool is as powerful as it gets:
– It contains a regular expression tutorial that will help you get started
– It has a small regular expression library allowing you to choose from common regular expressions
– It includes a Regex Analyzer that "translates" the regular expression into plain English
– It allows you to insert specific parts of the regular expression by clicking a button (no typing at all)
– It also lets you test the results that a specific regular expression will produce when applied to a specific text.
Note that in all the time I have been using it to test regexes before inserting them into WinAutomation, the results produced by Expresso have always matched the results generated by WinAutomation’s "Parse Text", "Replace Text" and "Split Text" actions.
Thanks (again) Codex
This looks much easier than I imagined. I am going to play around with this and see if I can’t get a decent tune out of it.
After a quick read of the 30 min help file, I’m guessing the best way to retrieve the price from the code above is to look for the text between HTML tags, in this example: <span class="listprice">$110.00</span>
Am I close?
All the best
Yep! Specifically in Expresso, go to "Design Mode" and, under the "Groups" tab at the bottom of the page, use the 2 most useful groups in regular expressions: "Match prefix but exclude it" (?<=...) and "Match suffix but exclude it" (?=...).
In your example the prefix is this: <span class="listprice">
and the suffix is this: </span>
So, your regular expression should look like this:
If you copy and paste this inside Expresso you will see the "translation" in the regex analyzer to the right.
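The prefix/suffix idea above can be sketched in Python. This is only an illustration, not the expression from the post: Expresso and WinAutomation use the .NET regex engine, and the middle part of the pattern ([^<]+) is an assumption, since only the prefix and suffix are described in the thread.

```python
import re

# Fragment copied from the page source quoted earlier in the thread.
html = ('<p class="priceBlock"><strong>List Price: </strong> '
        '<span class="listprice">$110.00</span> </p>')

# (?<=...) is the "match prefix but exclude it" group (lookbehind).
# (?=...)  is the "match suffix but exclude it" group (lookahead).
# [^<]+ is an assumed middle part: one or more characters that are not "<".
pattern = re.compile(r'(?<=<span\sclass="listprice">)[^<]+(?=</span>)')

list_prices = pattern.findall(html)
print(list_prices)
```

Only the text between the two tags is returned; the tags themselves are checked but not included in the match.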
Just a note (now that I’m reviewing this thread again): regular expressions are case sensitive by default. This is something that caused me a lot of headaches back when I was a regex novice: the regular expression seemed absolutely perfect but it didn’t want to work since the text in the original document had a capital letter at the beginning…
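A small Python sketch of the case-sensitivity pitfall, assuming the same hypothetical pattern shape as the prefix/suffix approach (the upper-case tags in the sample text are made up for the demonstration). Python's re.IGNORECASE flag is used here; the inline (?i) option at the start of a pattern is another way to ask for the same behaviour.

```python
import re

# Made-up sample: same markup as the thread's example, but with upper-case tags.
text = '<SPAN class="listprice">$110.00</SPAN>'
pattern_text = r'(?<=<span\sclass="listprice">)[^<]+(?=</span>)'

# Default matching is case sensitive, so "<span" does not match "<SPAN".
case_sensitive = re.findall(pattern_text, text)

# With the ignore-case flag the same pattern finds the price.
case_insensitive = re.findall(pattern_text, text, flags=re.IGNORECASE)

print(case_sensitive)
print(case_insensitive)
```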
Some other tips that you may find useful (especially when a regular expression seems like it should work, but it actually doesn’t):
1) Use \s instead of a space character (as you can see in my example above, I wrote (?<=<span\sclass instead of (?<=<span class).
2) The period (.) wildcard matches the carriage return character (in regex \r) but not the line feed character (in regex \n). So, if you want to capture multiple lines you should use something like this: (.|\n)*. This basically means: match any character or the line feed character (since "any character" does not include line feed).
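Tip 2 can be checked with a short Python sketch (Python's dot behaves the same way here: it matches everything except the line feed). The sample text and the <b> tags are made up for illustration. The re.DOTALL flag shown last is an alternative to the (.|\n)* workaround; the .NET engine has a comparable option (RegexOptions.Singleline), though whether a given WinAutomation action exposes it is not something this thread confirms.

```python
import re

# Made-up sample with a line feed inside the tag.
text = "<b>first line\nsecond line</b>"

# The dot stops at the line feed, so the suffix is never reached.
dot_only = re.search(r"(?<=<b>).*(?=</b>)", text)

# (.|\n)* means "any character, or a line feed", so the match can span lines.
alternation = re.search(r"(?<=<b>)(.|\n)*(?=</b>)", text)

# Same result by letting the dot match line feeds too.
dotall = re.search(r"(?<=<b>).*(?=</b>)", text, flags=re.DOTALL)

print(dot_only)
print(repr(alternation.group(0)))
print(repr(dotall.group(0)))
```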
And some background on the \r and \n characters (I found them very confusing when I started out with regex):
Text files saved on Unix use the line feed character (\n) to separate one line from another.
Windows text files, on the other hand, use 2 "invisible" (or non-printable) characters to separate the lines: carriage return (\r) and line feed (\n), in this order: \r\n. If you are using Expresso, in your results the carriage return character is represented with [CR] and the line feed character with (surprise surprise!) [LF].
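To make the difference concrete, here is a minimal Python sketch with made-up sample strings. The pattern \r?\n ("an optional carriage return followed by a line feed") is one common way to handle both conventions; it is offered as a suggestion, not something taken from the thread.

```python
import re

windows_text = "first\r\nsecond\r\nthird"   # lines separated by CR + LF
unix_text = "first\nsecond\nthird"          # lines separated by LF only

# \r?\n matches a line break in either convention.
line_break = re.compile(r"\r?\n")
windows_lines = line_break.split(windows_text)
unix_lines = line_break.split(unix_text)

# The dot matches \r but not \n, so on Windows text a plain .+
# leaves a trailing carriage return on every line except the last.
dot_matches = re.findall(r".+", windows_text)

print(windows_lines)
print(unix_lines)
print(dot_matches)
```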
@admin: Can you please make this thread sticky to use it as a regular expression reference? People can ask questions about regular expressions in here (even if they are not related to WinAutomation) because sometimes there is an error that may be too obvious but you just can’t see it… It’s a good
@baz: Can you please change the name of the thread to something like "Regular Expression (regex) Related Questions" so that everyone can understand why this is sticky and post their regex-related questions in here?
Your solution is much better than the one I was trying to use... which didn’t work. However, the time I spent getting to know this program shows me that it’s very powerful in conjunction with WA.
One thing I can’t figure out is the logic that allows me to pull 2 fields from the same product record. Let’s say I want the name and the price: how would I go about pulling both fields into a spreadsheet?
All the best
PS - Thread title changed
Baz, I usually use 2 different parse text actions in order to retrieve 2 different values. This way I can store them in variables with descriptive names (e.g. %Name% and %Price%). Also, keep in mind that most of the values that you are grabbing from a web page will have different HTML tags for different values. It may be a different class in the span tag, or a different id, or something else. But all you have to do in order to create your brand new regular expression is replace the "prefix" and the "suffix" in Expresso with the tags that come before and after the new value that you want to grab.
If you want me to explain how I’m doing this in more detail, let me know.
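The "one expression per value" idea could look roughly like this in Python. Everything about the markup is hypothetical: the "productName" class is invented for the sketch, only the price span resembles the page source quoted earlier, and the 50.0 threshold stands in for the "$X" from the original question.

```python
import re

# Hypothetical markup: the "productName" class is invented for this sketch.
html = (
    '<span class="productName">Food Processor</span> '
    '<span class="price"><b>$59.99</b></span>\n'
    '<span class="productName">Hand Blender</span> '
    '<span class="price"><b>$24.50</b></span>\n'
)

# One expression per value, each with its own prefix and suffix.
name_pattern = re.compile(r'(?<=<span\sclass="productName">)[^<]+(?=</span>)')
price_pattern = re.compile(r'(?<=<span\sclass="price"><b>\$)[\d.,]+(?=</b>)')

names = name_pattern.findall(html)
prices = [float(p.replace(",", "")) for p in price_pattern.findall(html)]

# Pair the two lists by position and keep products above the threshold.
minimum = 50.0
rows = [(name, price) for name, price in zip(names, prices) if price > minimum]

print(rows)
```

Pairing by position only works when both expressions return the same number of results in the same order, so it is worth comparing len(names) and len(prices) before trusting the rows.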
That makes a lot of sense Codex….and should be so easy to do…will let you know how I get on. This is exciting!
All the best
I’m following this thread closely as I’m trying to create a simple program that will:
1. Search Google for a keyword.
2. Pull the top 10 websites listed in the results.
3. Put those websites into a list in a txt or Excel file.
Pretty simple and straightforward. But I’m having problems with getting the right tags and such. I thought I found it, but it doesn’t work.
I’m busy now, but soon I’ll upload a sample job for you guys to look through.
OK…I had a go and was semi successful. I successfully changed the suffixes and prefixes and pulled the data into a spreadsheet. Hooray! However…
The data is kinda jumbled up and the prices don’t match the product names. I’m never going to figure out why this is the case on my own. Have attached the job for examination.
@Rob – Have attached a job you will like
All the best
Hello Baz, I think I’ve found the problem. I have tested the first regular expression with Expresso and it seems to produce 28 results instead of 25. After a closer look I noticed that there are 3 products that contain a "Price After Rebate:" which has the exact same tag as the other, normal price. So I have added a few more things for the regular expression to look for.
To make a long story short, replace the regular expression in action no.2 with the following regular expression:
and your job should work OK.
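The expression from this post is not preserved, so the following Python sketch only illustrates the kind of fix described: making the prefix more specific so that "Price After Rebate:" entries no longer match. The rebate markup is a guess based on the description ("the exact same tag"), not copied from the page.

```python
import re

# The first block mirrors the page source quoted earlier; the rebate block is hypothetical.
html = (
    '<p class="priceBlock"><strong>Price: </strong> '
    '<span class="price"><b>$59.99</b></span> </p>'
    '<p class="priceBlock"><strong>Price After Rebate: </strong> '
    '<span class="price"><b>$49.99</b></span> </p>'
)

# Loose prefix: matches every price span, including the rebate price.
loose = re.findall(r'(?<=<span\sclass="price"><b>)[^<]+(?=</b>)', html)

# Stricter prefix: also requires the "Price:" label right before the span.
strict = re.findall(
    r'(?<=<strong>Price:\s</strong>\s<span\sclass="price"><b>)[^<]+(?=</b>)',
    html,
)

print(loose)
print(strict)
```

With the stricter prefix there is one price per product again, so the prices line up with the product names.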
Awesome, thanks baz…
Now the question, I used Expresso, but why is the "protocol" and everything after that needed? I see the need for the suffix…
I was able to look at that code and understand precisely what you’d done! Progress indeed.
All the best
I’m still curious about the rest of the regular expression…why is it needed?