Regular Expressions Thread

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Tue 11/17/2009 - 8:58

Hi

If I wanted to pull the names and prices of products costing more than $X from an amazon.com search results page, what "text to find" pattern would I use? Would that be a regex? I see Jag and Thomas using all sorts of complex-looking stuff when parsing text... is there a library where I can see what it all means?

Here is the code from the page.

 Cuisinart

</span></div>
    
    <span class="tiny"><span style="white-space:no-wrap;"><span class="asinReviewsSummary" name="B0000A1ZMU"> 
              <a href="http://www.amazon.com/Cuisinart-DFP-3-Handy-3-Cup-Processor/product-reviews/B0000A1ZMU/ref=pd_ts_k_1_cm_cr_acr_img?ie=UTF8&showViewpoints=1"><img src="http://g-ecx.images-amazon.com/images/G/01/x-locale/common/customer-reviews/ratings/stars-4-5._V25749327_.gif" width="55" alt="4.5 out of 5 stars" align="absbottom" height="12" border="0" /></a>&nbsp;</span>(<a href="http://www.amazon.com/Cuisinart-DFP-3-Handy-3-Cup-Processor/product-reviews/B0000A1ZMU/ref=pd_ts_k_1_cm_cr_acr_txt?ie=UTF8&showViewpoints=1">81 customer reviews</a>)</span></span>
    <span class="tiny"> | <a href="http://www.amazon.com/gp/forum/cd/forum.html/ref=cm_cd_pd_ts_fp?ie=UTF8&cdForum=FxFEB9PBNUVWC7&asin=B0000A1ZMU">3 customer discussions</a></span><br />
    
    In Stock<br />
    

    <table width="100%"  border="0" cellspacing="0" cellpadding="0" class="priceBox">
      <tr>
        <td width="45%">
        








<p class="priceBlock"><strong>List Price: </strong> <span class="listprice">$110.00</span> </p><p class="priceBlock"><strong>Price: </strong> <span class="price"><b>$59.99</b></span> </p><p class="priceBlock"><strong>You Save: </strong> <span class="price">$50.01 (45%)</span></p> <p class="priceBlock"><a href="http://www.amazon.com/gp/offer-listing/B0000A1ZMU/ref=pd_ts_k_1?ie=UTF8&condition=all">11 used &amp; new</a> from <span class="price">$56.99</span>



</p>

#1

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Thu 11/19/2009 - 18:07

As far as I am concerned, the best tool when it comes to regular expressions is called: Expresso. This little, free tool is as powerful as it gets:

- It contains a regular expression tutorial that will help you get started
- It has a small regular expression library allowing you to choose from common regular expressions
- It includes a Regex Analyzer that "translates" the regular expression into plain English
- It allows you to insert specific parts of the regular expression by clicking a button (no typing at all)
- It also lets you test the results that a specific regular expression will produce when applied to a specific text.

Note that in all the time I have been using it to test regular expressions before inserting them into WinAutomation, the results produced by Expresso have always matched the results generated by WinAutomation's "Parse Text", "Replace Text" and "Split Text" actions.

#2

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Fri 11/20/2009 - 10:47

Thanks (again) Codex

This looks much easier than I imagined. I am going to play around with this and see if I can't get a decent tune out of it.

After a quick read of the 30-minute help file, I'm guessing the best way to retrieve the price from the code above is to look for the text between the HTML tags, in this example: <span class="listprice">$110.00</span>

Am I close?

All the best

 

Barry

#3

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Fri 11/20/2009 - 11:19

Yep! Specifically in Expresso, go to "Design Mode" and, under the "Groups" tab at the bottom of the page, use the 2 most useful groups in regular expressions: "Match prefix but exclude it" (?<=) and "Match suffix but exclude it" (?=).

In your example the prefix is this: <span class="listprice">
and the suffix is this: </span>

So, your regular expression should look like this:
(?<=<span\sclass="listprice">).*?(?=</span>)

If you copy and paste this inside Expresso you will see the "translation" in the regex analyzer to the right.
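
For readers who want to try the prefix/suffix idea outside Expresso, here is a minimal sketch in Python. It is only an illustration, not the WinAutomation action itself: Python's re module supports the same (?<=...) and (?=...) groups, with the restriction that a lookbehind must have a fixed width (this one does). The HTML fragment is copied from the first post; the variable names are made up for the example.

```python
import re

# Fragment of the Amazon page quoted in the first post.
html = ('<p class="priceBlock"><strong>List Price: </strong> '
        '<span class="listprice">$110.00</span> </p>')

# The prefix and the suffix must be present, but they are excluded from the match.
pattern = r'(?<=<span\sclass="listprice">).*?(?=</span>)'

list_prices = re.findall(pattern, html)
print(list_prices)
```

Because the pattern has no capturing groups, findall returns the text between the two tags only.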

#4

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Sun 11/22/2009 - 1:37

Just a note (now that I'm reviewing this thread again): regular expressions are case sensitive by default. This is something that caused me a lot of headaches back when I was a regex novice: the regular expression seemed absolutely perfect, but it didn't work because the text in the original document had a capital letter at the beginning...

Some other tips that you may find useful (especially when a regular expression seems like it should work, but it actually doesn't):
1) Use \s instead of a literal space character. As you can see in my example above, I wrote (?<=<span\sclass instead of (?<=<span class.
2) The period (.) wildcard matches the carriage return character (\r in regex) but not the line feed character (\n in regex). So, if you want to capture multiple lines, you should use something like (.|\n)*. This basically means: match any character or the line feed character (since "any character" does not include the line feed).

And some background on the \r and \n characters (I found them very confusing when I started out with regex):
Text files saved on Unix use the line feed character (\n) to separate one line from another.
Windows text files, on the other hand, use 2 "invisible" (non-printable) characters to separate the lines: carriage return (\r) and line feed (\n), in this order: \r\n. If you are using Expresso, the carriage return character is represented in your results as [CR] and the line feed character as (surprise surprise!) [LF].
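
The behaviour described above can be checked with a small sketch in Python, whose period wildcard treats \r and \n the same way by default. The sample text is invented for the illustration. Python's re.DOTALL flag (in .NET the comparable option is RegexOptions.Singleline) makes the period match the line feed as well; whether a given WinAutomation action exposes such an option is not something this thread confirms.

```python
import re

# A two-line text with Windows line endings (\r\n).
windows_text = "first line\r\nsecond line"

# By default "." matches the carriage return but stops at the line feed.
default_match = re.match(r".*", windows_text).group(0)

# "(.|\n)*" crosses line boundaries, as suggested above.
alternation_match = re.match(r"(.|\n)*", windows_text).group(0)

# With re.DOTALL the "." itself also matches the line feed.
dotall_match = re.match(r".*", windows_text, re.DOTALL).group(0)

print(repr(default_match))
print(repr(alternation_match))
print(repr(dotall_match))
```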

2 suggestions:
@admin: Can you please make this thread sticky to use it as a regular expression reference? People can ask questions about regular expressions in here (even if they are not related to WinAutomation) because sometimes there is an error that may be too obvious but you just can't see it... It's a good
@baz: Can you please change the name of the thread to something like "Regular Expression (regex) Related Questions" so that everyone can understand why this is sticky and post his regex related questions in here?

Cheers!

#5

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Sun 11/22/2009 - 21:13

Thanks Codex

Your solution is much better than the one I was trying to use... which didn't work. However, the time I have spent getting to know this program shows me that it's very powerful in conjunction with WA.

One thing I can't figure out is the logic that would let me pull 2 fields from the same product record. Let's say I want the name and the price: how would I go about pulling both fields into a spreadsheet?

All the best

Baz

 

PS -Thread title changed

#6

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Tue 11/24/2009 - 17:11

Baz, I usually use 2 different Parse Text actions in order to retrieve 2 different values. This way I can store them in variables with descriptive names (e.g. %Name% and %Price%). Also, keep in mind that most of the values that you grab from a web page will be wrapped in different HTML tags. It may be a different class in the span tag, a different id, or something else. But all you have to do in order to create your brand new regular expression is replace the "prefix" and the "suffix" in Expresso with the tags that come before and after the new value that you want to grab.

If you want me to explain how I'm doing this in more detail, let me know.

#7

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Tue 11/24/2009 - 18:03

That makes a lot of sense Codex....and should be so easy to do...will let you know how I get on. This is exciting!

 

All the best

 

Barry

#8

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Tue 11/24/2009 - 20:29

I'm following this thread closely as I'm trying to create a simple program that will:

 

1. Search google for a keyword.

2. Pull the top 10 websites that it comes up as listed

3. Put those websites inside a list txt or excel file.

 

Pretty simple and straightforward. But I'm having problems getting the right tags and such. I thought I had found them, but it doesn't work.

 

I'm busy now, but soon I'll upload a sample job for you guys to look through.

 

Thanks,

Rob

#9

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Tue 11/24/2009 - 21:16

OK...I had a go and was semi successful. I successfully changed the suffixes and prefixes and pulled the data into a spreadsheet. Hooray! However...

 

The data is kinda jumbled up and the prices don't match the product names. I'm never going to figure out why this is the case on my own. Have attached the job for examination.

 

@Rob - Have attached a job you will like :)

 

All the best

 

Baz

Amazon Parse.waj Extract Google URLs.waj

#10

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Tue 11/24/2009 - 23:21

Hello Baz, I think I've found the problem. I tested the first regular expression with Expresso and it produces 28 results instead of 25. After a closer look I noticed that there are 3 products that contain a "Price After Rebate:" entry, which has exactly the same tag as the other, normal price. So I have added a few more things for the regular expression to look for.

To make a long story short, replace the regular expression in action no.2 with the following regular expression:

(?<=Price:\s</strong>\s<span\sclass="price"><b>).*?(?=</b></span>)

and your job should work OK.
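
As a rough sketch of how the stricter prefix behaves, and of the original "more than $X" question, the same expression can be run in Python against the price block from the first post. The function name prices_above and the threshold values are invented for this illustration; in WinAutomation the comparison would be done with the job's own actions.

```python
import re

# Price block from the first post: list price, actual price and savings.
html = (
    '<p class="priceBlock"><strong>List Price: </strong> '
    '<span class="listprice">$110.00</span> </p>'
    '<p class="priceBlock"><strong>Price: </strong> '
    '<span class="price"><b>$59.99</b></span> </p>'
    '<p class="priceBlock"><strong>You Save: </strong> '
    '<span class="price">$50.01 (45%)</span></p>'
)

# Only a price that follows "Price: </strong> <span class="price"><b>" is matched.
pattern = r'(?<=Price:\s</strong>\s<span\sclass="price"><b>).*?(?=</b></span>)'

def prices_above(page, threshold):
    """Return the main prices found in page that are greater than threshold."""
    prices = [float(p.lstrip("$").replace(",", "")) for p in re.findall(pattern, page)]
    return [p for p in prices if p > threshold]

print(prices_above(html, 50))
```

The "You Save" amount shares the span class but has no <b> tag, and the list price uses a different class, so neither of them is picked up.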

Cheers!

#11

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Wed 11/25/2009 - 2:40

Awesome, thanks baz...

 

Now the question, I used Expresso, but why is the "protocol" and everything after that needed? I see the need for the suffix...

 

Thanks,

Rob

#12

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Wed 11/25/2009 - 18:31

Thanks Codex

I was able to look at that code and understand precisely what you'd done! Progress indeed.

 

All the best

 

Barry

#13

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Tue 12/1/2009 - 0:47

Hey,

 

I'm still curious about the rest of the regular expression...why is it needed?

 

Thanks!

Rob

#14

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Tue 12/1/2009 - 12:38

Copy and paste the regular expression that you are talking about, and underline the part that you are having trouble understanding. This way I will know which part I need to explain.

Cheers!

#15

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Tue 12/1/2009 - 18:50

Ok,  here it is:

 

(?<=class=r><a\shref=")(?<Protocol>\w+):\/\/(?<Domain>[\w@][\w.:@]+)\/?[\w\.?=\%&=\-@/$,]*(?="\s)

 

All of that I don't understand...lol

 

I get why I need the first part - I don't get the second.


Also, what does VV mean? When I look at it in expresso, it just shows up as V's.

 

Thanks,

Rob

#16

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Tue 12/1/2009 - 18:51

Well, the underlining didn't work.


Basically, everything from protocol onward I don't get.

Rob

#17

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Mon 12/7/2009 - 15:03

Alloha! I promised that I will explain the regular expression so here we go:

First of all, don't be scared by the "Protocol" and "Domain" words: the regular expression does not look for these words in the text that it parses. Think of them as "tags" that help you understand what the specific part of the regular expression does. I will break the regex in pieces and explain each one of them. I will use the following HTML code as an example:
http://winautomation.com/kb/send-keys-use-modifiers

(?<=class=r><a\shref=")
This is pretty simple. It means: start capturing from the text that follows the text class=r><a href="

(?<Protocol>\w+)
This is very simple too. After the href=" part, capture one or more alphanumeric characters.
The <Protocol> is just a tag, so that the characters that will be captured (i.e. http) will be grouped and the group will have a specific name (i.e. Protocol).
The part of the HTML code that will be matched here is: http

:\/\/
This is very simple too. Capture the : character and 2 slash characters (//). They are not 2 V characters but a backslash character, a slash character, another backslash character and another slash character. The backslash character is used to point out that we are talking about the literal slash character.
The part of the HTML code that will be matched here is: ://

(?<Domain>[\w@][\w.:@]+)
The "Domain" is just a tag. This part of the regex will capture a first character that is alphanumeric (\w) or @, followed by one or more characters that are each
alphanumeric (\w), the period character (.), : or @.
The part of the HTML code that will be matched here is: winautomation.com

\/?
The slash character after the domain (if there is one).
The part of the HTML code that will be matched here is: winautomation.com/kb (the slash character before kb)

[\w\.?=\%&=\-@/$,]*
Zero or more of the following characters:
alphanumeric, period, ?, =, %, &, -, @, /, $ and ,. These cover the characters that typically appear in the path and query string of a URL.
The part of the HTML code that will be matched here is: kb/send-keys-use-modifiers
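
To see the pieces working together, here is a small sketch in Python. Two things are assumptions: the sample anchor tag is invented around the URL used in this explanation (the original HTML sample is not shown in the thread), and Python writes named groups as (?P<Name>...) where .NET and Expresso accept (?<Name>...). The slashes do not need a backslash in Python either.

```python
import re

# Hypothetical Google-style result line built around the example URL.
html = '<h3 class=r><a href="http://winautomation.com/kb/send-keys-use-modifiers" class=l>'

url_pattern = re.compile(
    r'(?<=class=r><a\shref=")'      # prefix, matched but excluded
    r'(?P<Protocol>\w+)'            # e.g. http
    r'://'                          # literal colon and two slashes
    r'(?P<Domain>[\w@][\w.:@]+)'    # e.g. winautomation.com
    r'/?'                           # optional slash after the domain
    r'[\w.?=%&\-@/$,]*'             # rest of the path and query string
    r'(?="\s)'                      # suffix, matched but excluded
)

match = url_pattern.search(html)
print(match.group(0))
print(match.group("Protocol"), match.group("Domain"))
```

The names Protocol and Domain never appear in the parsed text; they only label the captured groups so they can be read back by name.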

#18

ccmusicman

  • Joined: Nov 11, 2009
  • Posts: 31

Mon 12/7/2009 - 23:12

ah ha, that all makes sense!

 

Thanks!

#19

D.M.Altizer

  • Joined: Jan 12, 2010
  • Posts: 204

Tue 1/26/2010 - 10:41

Since this is the sticky thread for regular expressions, I decided to share here some of the regular expressions that have helped me a lot over the years. If you are working with PHP or you are administering a Linux server, regular expressions inevitably become your best buddy. So, here we go:

Get Urls that start with http://, https://, ftp://, file://
\b(https?|ftp|file)://\S+

Extract Host from a specific URL (e.g. extract www.example.com from http://www.example.com/something/something-else/index.php)
(?<=\A[a-z][a-z0-9+\-.]*://)([a-z0-9\-._~%!$&'()*+,;=]+@)?([a-z0-9\-._~%]+|\[[a-z0-9\-._~%!$&'()*+,;=:]+\])

Replace <b> and </b> with <strong> and </strong>
Use the "Replace Text" action. In the "Text to Find" field enter:
<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>
In the "Replace With" field enter:
<$1strong$2>
Do not forget to check the "Use Regular Expressions for find and replace" option.

Find all tags in an HTML page
<[^>]*>
This is the "quick and dirty" solution. There are a lot of variations to retrieve even more tags but for the time being there is no reason for me to get deeper into this. You can of course use this in the "Replace Text" action and enter %""% in the "Replace With" field to remove all html tags.
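
Three of the expressions above can be tried in Python with the sketch below; the sample strings are invented. Note two differences from the "Replace Text" action: Python writes the replacement groups as \1 and \2 instead of $1 and $2, and the host-extraction expression is left out because Python's re module rejects its variable-length lookbehind.

```python
import re

# 1) URLs that start with http://, https://, ftp:// or file://
text = 'Docs at http://example.com/a and ftp://example.org/file.txt today.'
urls = [m.group(0) for m in re.finditer(r'\b(https?|ftp|file)://\S+', text)]

# 2) Replace <b> and </b> with <strong> and </strong>, keeping any attributes.
html = '<p><b>Bold</b> and <b class="x">styled</b><br/></p>'
strong_html = re.sub(
    r'''<(/?)b\b((?:[^>"']|"[^"]*"|'[^']*')*)>''',
    r'<\1strong\2>',
    html,
)

# 3) "Quick and dirty" removal of every tag.
plain_text = re.sub(r'<[^>]*>', '', html)

print(urls)
print(strong_html)
print(plain_text)
```

The \b after the letter b is what keeps tags such as <br/> from being rewritten.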

Three tips for the specific thread:
1) Stay tuned for more regular expressions.
2) Subscribe to this thread to be notified automatically about new regular expressions that I will add.
3) Don't be afraid to ask for specific regular expressions that you may need.

#20

jvandread

  • Joined: Jan 28, 2010
  • Posts: 12

Wed 2/10/2010 - 12:33

Hi,

I hope someone can help me get a specific piece of text using a regular expression.

What I'm trying to do is extract specific text from a .txt file.

The content of the txt file:

 

Subject: This is a test ++
Message: Campaign posters went up and jingles blared at election rallies and motorcades all over the country yesterday as the country’s
richest politician and the son of its democracy icon fired the opening salvos in the tight race to succeed.

 

I only need to get "This is a test" into one variable and "Campaign posters..." into another variable.

I tried using the regex /p{Subject}\:\s(.*)\s\+\+/ to get "This is a test",

but I always get a blank result.

Thanks.

#21

D.M.Altizer

  • Joined: Jan 12, 2010
  • Posts: 204

Wed 2/10/2010 - 12:53

Try:

(?<=Subject\:\s)(\w|\s)*

to get the subject. 

Also, what are the conditions for the part of the message that you want to retrieve? Do you always want to retrieve the first 2 words?

#22

jvandread

  • Joined: Jan 28, 2010
  • Posts: 12

Thu 2/11/2010 - 9:13

Hi D.M.,

 

Thank you very much. The regex you gave me is working, and I can get the subject and the message and put them in separate variables.

What I did was use two Parse Text actions to get the subject and the message into separate variables.

 

I use the code you gave me for the subject (?<=Subject\:\s)(\w|\s)*

 

And for the message:  (?<=Message\:\s)(\w|\s|\W)*
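
A short Python sketch of the two expressions follows, using a shortened version of the text file above. One substitution is made and should be read as an assumption: [\s\S]* is used for the message because it matches the same thing as (\w|\s|\W)* (any character at all, including line breaks) without the alternation. The subject match keeps the space before "++", so it is trimmed afterwards.

```python
import re

# Shortened version of the text file from the question.
text = (
    "Subject: This is a test ++\n"
    "Message: Campaign posters went up and jingles blared at election rallies.\n"
    "A second line of the message."
)

# Word and whitespace characters after "Subject: "; the match stops at the first "+".
subject = re.search(r'(?<=Subject:\s)(?:\w|\s)*', text).group(0).strip()

# Everything after "Message: ", line breaks included.
message = re.search(r'(?<=Message:\s)[\s\S]*', text).group(0).strip()

print(subject)
print(message)
```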

 

Thanks a lot.

 

#23

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Sun 6/13/2010 - 7:44

Guys

Been trying this for hours... and I can't figure it out. Have attached a job. Basically, I've been scraping HTML no problem, but can't do this silly thing. Trying to take http://www.google.com/whatever and get rid of the http://

Job attached...have tried almost every permutation in this thread to get this to work. Any help is appreciated.

All the best

 

Barry

parse test.waj

#24

baz

  • Joined: Nov 12, 2009
  • Posts: 47

Sun 6/13/2010 - 16:40

Got it...replace text http:// with %""%. D'oh!

 

#25

D.M.Altizer

  • Joined: Jan 12, 2010
  • Posts: 204

Mon 6/14/2010 - 9:53

 Yep %""% is the solution. All inputs in WinAutomation automatically have their whitespace trimmed from the start and the end. I am guessing that this is a choice by design, since allowing spaces at the end of a string would result in a lot of head scratching if it was mistakenly placed there (for example, it is very difficult to spot an extra space and realize that this is causing your conditional to fail).

#26

BlackIce

  • Joined: Jul 10, 2010
  • Posts: 2

Sat 7/10/2010 - 11:28

Hello,

Is there a way to make regex expressions case insensitive ?

#27

D.M.Altizer

  • Joined: Jan 12, 2010
  • Posts: 204

Sat 7/10/2010 - 12:24

(?i:your_regular_expression)
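
The same scoped modifier also exists in Python (version 3.6 or later), so it can be tried with the sketch below; the sample tag is invented. Python additionally offers the re.IGNORECASE flag, which applies to the whole expression.

```python
import re

# Upper-case tag that a lower-case pattern would normally miss.
text = '<SPAN class="price">$59.99</SPAN>'

# Case-insensitive only inside the (?i:...) group.
scoped = re.search(r'(?i:<span\sclass="price">)', text)

# Case-insensitive for the whole pattern via a flag.
flagged = re.search(r'<span\sclass="price">', text, re.IGNORECASE)

# Default behaviour: case sensitive, so nothing is found.
sensitive = re.search(r'<span\sclass="price">', text)

print(scoped is not None, flagged is not None, sensitive is not None)
```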

#28

BlackIce

  • Joined: Jul 10, 2010
  • Posts: 2

Sat 7/10/2010 - 13:18

Thank you

#29

D.M.Altizer

  • Joined: Jan 12, 2010
  • Posts: 204

Sat 7/10/2010 - 13:50

cheers :D

#30

peter-

  • Joined: Nov 20, 2009
  • Posts: 73

Fri 8/13/2010 - 20:05

Hi

I have been trying, with very little success, to make a regex that pulls the digits along with the plus or minus symbol from this code. It would be great if someone could help me out.

<DIV>What is 27 + 3 ?<BR><INPUT size=50 name=security_answer>
<DIV><LABEL>what is 8+6=<BR><INPUT size=50 name=security_answer>
<DIV>3+4<BR><INPUT size=50 name=security_answer>
<DIV>Intrebare: 4+3=?<BR><INPUT size=50 name=security_answer>
<DIV><LABEL>what is 8+6=<BR><INPUT size=50 name=security_answer>
<DIV>What is 5 + 1?<BR><INPUT size=50 name=security_answer>
<DIV>10-4=<BR><INPUT size=50 name=security_answer>

Thanks

 

#31

Samantha

  • Joined: Apr 23, 2010
  • Posts: 2734

Mon 8/16/2010 - 11:25

Hello Peter,

You can try out this regular expression:

\d{1,2}\s?(\+|\-)\s?\d{1,2}

It will give you all the numbers plus the symbol between them. (If you need WA to make the calculation, make sure you trim the spaces.) :)

Samantha
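
For illustration, here is how the same idea could be taken one step further in Python to compute the answer. This is a sketch under assumptions: extra capturing groups are added around the two numbers so that they can be read separately, and the function name solve is made up. Three of the sample lines from the question are used.

```python
import re

# Three of the sample lines from the question.
lines = [
    '<DIV>What is 27 + 3 ?<BR><INPUT size=50 name=security_answer>',
    '<DIV><LABEL>what is 8+6=<BR><INPUT size=50 name=security_answer>',
    '<DIV>10-4=<BR><INPUT size=50 name=security_answer>',
]

# Same pattern as above, with groups around the operands and the operator.
question = re.compile(r'(\d{1,2})\s?([+-])\s?(\d{1,2})')

def solve(html_line):
    """Return the result of the first 'a + b' or 'a - b' found, or None."""
    m = question.search(html_line)
    if m is None:
        return None
    left, op, right = int(m.group(1)), m.group(2), int(m.group(3))
    return left + right if op == '+' else left - right

answers = [solve(line) for line in lines]
print(answers)
```

The 50 in size=50 is not matched because it is not followed by a plus or minus sign.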

#32

peter-

  • Joined: Nov 20, 2009
  • Posts: 73

Mon 8/16/2010 - 12:28

That’s fantastic Samantha thank you
 

I had almost given up on the regex idea and was thinking of pulling the data between the two tags <DIV> and <INPUT size=50 name=security_answer>, then making an a-z character list, adding the opening and closing brackets etc. to it, and looping through them to remove the unwanted data. Your expression saves all of those actions.

thanks again for that

#33

David

  • Joined: Sep 5, 2010
  • Posts: 37

Sun 9/5/2010 - 23:26

Hello and good evening!

Let's revive this useful thread with a new challenge I've been facing for the ... past 48 hours or so, and I am still unsatisfied with the result. Here's a simplified version of what the REGEX should be doing (apart from other things which are clear):

After the initial captured http:// and some link blurb, there must NOT be another string http, neither before the rss or feed, nor behind, within the same link.

And here's the very basic version which (based on all docu I've found) SHOULD be doing the magic: (?<=href=")http://.*?(?:(?!http))?(rss|feed).*?(?:(?!http))?(?=")

But it doesn't. It still captures a second http within some links: You can run it in Expresso and you'll see that it still captures things like this: <a href="http://www.netvibes.com/subscribe.php?type=rss&url=http://www.ideamarketers.com/rss.xml" />

The part (?:(?!http))? is basically supposed to do the trick, because [^http] would exclude h or t or p, but not what I need: exclude http - the exact string.

Although for many it may look difficult, at least 3 of you will know exactly what I mean! ;-)  Naturally, this does not just relate to web addresses. In fact, the right answer everyone could use for an awful lot of things. Imagine how often we want to exclude a string within a string! Weirdly, all "cheat sheets" I've found discuss the rarest things, and [^] in detail, but leave out this one for everyday-use!

Looking forward to seeing your response! :-)

David

PS: Shame, I saw 2009 posts for which I would have had precise answers now... I arrived too late.

#34

Samantha

  • Joined: Apr 23, 2010
  • Posts: 2734

Mon 9/6/2010 - 11:19

Hello David,

Have I got it right that you only want the first URL alone? How about this regex:

((?<=href=")http://.*?(?:(?!http))?(rss|feed)){1}

This should give you the first one alone, no matter how many URLs are found on that line. And this regex:

((?<=href="http://.*?(?:(?!http))?(rss|feed).*?)http://.*(?<=([^"\s/>(.*)+\r?\n])))

will give you the second URL alone. Give it a try and let me know :)

Samantha

#35

David

  • Joined: Sep 5, 2010
  • Posts: 37

Mon 9/6/2010 - 13:15

Hello Samantha :-)

I now realize that I wasn't precise with my challenge definition! What I meant to say is: If there is any http link with another link inside, do not catch it at all! Reason being, I found that such feed links are either no feeds at all, or useless one-entry feeds of a single article...!

Hence, although your second regex is interesting :)  it doesn't help me here. What I need is:

"http", anything but no http again, "rss" or "feed", anything but no http again. Or in other words: Exclude such double-whammy links, but catch all other rss/feed.

 

#36

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Mon 9/6/2010 - 17:12

Hm... What about grabbing all the URLs, saving them in a list (e.g. AllURLs), looping through the list and parsing the current item for http://? If there is exactly 1 match, add it to the "Final List". Otherwise, move on to the next URL.

Let me know if you need an example on how to do that.
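
The filtering loop described above might look like this in Python. It is only a sketch of the logic, not a WinAutomation job: the list names mirror the ones mentioned in the post, and the two example.* URLs are invented.

```python
import re

# All candidate URLs, e.g. the result of a first parsing pass.
all_urls = [
    'http://example.com/feed/',
    'http://www.netvibes.com/subscribe.php?type=rss&url=http://www.ideamarketers.com/rss.xml',
    'http://example.org/rss.xml',
]

# Keep a URL only if "http://" occurs exactly once in it.
final_list = [url for url in all_urls if len(re.findall(r'http://', url)) == 1]

print(final_list)
```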

#37

David

  • Joined: Sep 5, 2010
  • Posts: 37

Mon 9/6/2010 - 18:59

Thanks Codex, but no, that would be like breaking a butterfly on a wheel. A) because it's one simple line in regex, and B) because, then, I have already a more efficient solution with WA. But I'm sure that there's an easier/quicker way with regex alone. Also, as mentioned, the regex "not a string in a string" would be usable in so many other ways too...! :-)

 

#38

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Wed 9/8/2010 - 12:42

There is nothing that can be translated as: 

"Not a string in a string"

in regular expressions. However, you can use something like this:

(?<=href=")[^"?&]*(?=")

#39

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Thu 9/9/2010 - 13:24

For an explanation regarding the reason why the "not a string in a string" cannot be translated in regular expressions, there is a very interesting chapter in the book Mastering Regular Expressions that explains how a regex is processed by a regex engine. Most engines do not even allow you to use a regex in the lookbehind because they cannot apply a regular expression backwards. 

However, the engine that is used by WinAutomation (.NET) allows pretty much everything, including conditionals, named groups and named balancing groups.

#40

David

  • Joined: Sep 5, 2010
  • Posts: 37

Fri 9/10/2010 - 21:56

Hi Codex,

Thanks for that!

>There is nothing that can be translated as: "Not a string in a string" in regular expressions. However, you can use something like this: (?<=href=")[^"?&]*(?=")

No, I can't use that (tried similar versions already), it wouldn't catch what I want and wouldn't exclude what I don't want. ;-)  - It does no more than excluding the triplet "?& in ANY string between href=" and ".

>there is a very interesting chapter in the book Mastering Regular Expressions

Very impressive, what you are reading there! ;-)  - If I had an easier life I would now go and buy and read it. But can't.

Nonetheless, I continue to be pretty sure that something like this DOES exist. It MUST. :-)

It's the most obvious necessity for processing text. (obvious to me at least)

I won't give up so easily ;-)

Samantha's 2nd version + some tweaking might hopefully give me sth to deal with...
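
For what it's worth, one idiom that is often used for "anything, as long as it does not contain a given string" is to put the negative lookahead in front of every single character that is consumed: (?:(?!http)[^"])*. In the earlier attempt, (?:(?!http))? checks a single position and is optional, so it never rejects anything. The Python sketch below applies the idiom to the netvibes link from this discussion; the second link is invented, and the whole thing is a sketch rather than a tested WinAutomation expression.

```python
import re

feed_link = re.compile(
    r'(?<=href=")'            # prefix, excluded from the match
    r'http://'
    r'(?:(?!http)[^"])*?'     # any characters, but never the start of "http"
    r'(?:rss|feed)'
    r'(?:(?!http)[^"])*'      # same restriction after rss/feed
    r'(?=")'                  # suffix, excluded from the match
)

html = (
    '<a href="http://www.netvibes.com/subscribe.php?type=rss'
    '&url=http://www.ideamarketers.com/rss.xml" />\n'
    '<a href="http://example.com/blog/feed/" />\n'
)

links = feed_link.findall(html)
print(links)
```

One side effect to be aware of: a link whose host or path merely contains the letters "http" again (for example a host starting with httpd) would also be rejected.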

 

#41

David

  • Joined: Sep 5, 2010
  • Posts: 37

Fri 9/10/2010 - 22:00

Oh, I actually have a new "quiz" :-)

A regex for counting the frequency of words with a length > x characters in an entire text, not just a line. Is there one? Does anyone know?

---

PS: uuuh, these captchas here are sometimes really hard to decipher!

#42

codex

  • Joined: Nov 12, 2009
  • Posts: 161

Fri 9/10/2010 - 23:41

Regular expressions are very powerful for grabbing a specific portion of text, but they are not a programming language. For example, you cannot loop or count in a regular expression (you can actually do some very primitive counting with balancing groups, and only if you use the .NET or JGsoft regex engine, but that will not help you in this case).

Regular expressions should be used in conjunction with something, if you need to do extreme stuff. In this case, your regular expression should look like this:

(?<=(?:\s|\G|\A))\w{5,}(?=(?:\s|\Z|\.|\?|\!))

for words with 5 letters or more. This way, you get a number of matches. Then you use a regex like:

\w+

to get all the words. Count the matches of each regex and divide the first count by the second. This way you will have the share of long words as a percentage.
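
The "regex plus something else" approach can be sketched in Python as follows. One change is an assumption made for this illustration: Python does not support \G or lookbehinds with alternatives of different widths, so the simpler \b\w{5,}\b is swapped in for the first expression. The sample sentence is invented, and the per-word frequency is counted with the standard collections.Counter class.

```python
import re
from collections import Counter

text = ('The Best CV (Curriculum Vitae) Layout. '
        'The layout of a curriculum vitae matters.')

# Words with 5 or more characters, and all words.
long_words = re.findall(r'\b\w{5,}\b', text)
all_words = re.findall(r'\w+', text)

# Share of long words among all words.
share_of_long_words = len(long_words) / len(all_words)

# How often each long word occurs, ignoring case.
frequency = Counter(word.lower() for word in long_words)

print(share_of_long_words)
print(frequency.most_common(3))
```

The counting itself happens outside the regular expression; the regex only finds the words.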

#43

David

  • Joined: Sep 5, 2010
  • Posts: 37

Sat 9/11/2010 - 14:48

(?<=(?:\s|\G|\A))\w{5,}(?=(?:\s|\Z|\.|\?|\!)) - This is awesome codex! I have currently 6 "rules" I want to implement in regex, and although your regex doesn't look like it, it somehow fulfills 3(!) of them:

- Without having a ' in there, your regex magically excludes 'x (x for anything), this was my "rule" #2, to eliminate words like don't, doctor's, you're, etc.

- It excludes all punctuation marks - this I see in your regex, and it was my "rule" #3!

- It picks up words > x characters, which was my "rule" #6!

So, very good! Only, it surprisingly does NOT pick up long words when they are in (), eg curriculum and vitae in "The Best CV (Curriculum Vitae) Layout.". Nonetheless, very good!

Reading the regex masterbook you recommended earlier really DOES seem to help! ;-)

 

#44

David

  • Joined: Sep 5, 2010
  • Posts: 37

Sat 9/11/2010 - 15:49

Just wondering, do you also know what to add to the regex if the last letter of any word must not be an "s" (as in plural)?

(I am not sure but it seems that most english words that end in an "s" are plurals, so I would forget about the rest ;-)

#45

David

  • Joined: Sep 5, 2010
  • Posts: 37

Sun 9/12/2010 - 0:22

Codex, the last one (only that) I have just found out myself:

(?<=(\s|\>|\"|\(|\[|\'|\-|“))\w{4,}(?=(\s|\<|\.|\,|\;|\!|\?|\'|\"|\)|\]|\-|”|s))

This captures ALL words >= 4 characters, no matter in what kind of weird text they are hidden. (Hope I haven't forgotten any of those limiting characters ;-)

The earlier "quiz" is still open

#46

lokesh

  • Joined: Aug 5, 2010
  • Posts: 12

Tue 10/5/2010 - 13:55

Leaving the quiz aside... I have something that may help all the regex newbies here.

I constantly have to work with regex to extract a portion of text, so I made this job to take care of the initial work.

Example

Lets say you want to extract urls from Google results. You see something like this:

<h3 class="r"><a href="http://www.winautomation.com/" class=l onmousedown="return clk(this.href,'','','','1','lkk5xtECOZNg9d-HPbl3UQ','0CBsQFjAA')">

Now, to begin making a regex for extracting that URL, you have to start by stripping out all the whitespace characters and replacing them with "\s". Then, you need to take care of the escape characters. The final result is something like this:

<h3\sclass="r"><a\shref="http://www\.winautomation\.com/"\sclass=l\sonmousedown="return\sclk\(this\.href,'','','','1','lkk5xtECOZNg9d-HPbl3UQ','0CBsQFjAA'\)">

What the attached job does is create the above version of regex automatically. It does so by simple replace commands. Also, it assumes that the regex is placed on clipboard and replaces the clipboard contents with the above version. I have a keyboard trigger attached to this job which comes real handy.
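
The attached job is not reproduced here, but the transformation it is described as doing can be approximated in a few lines of Python. This is a sketch with a made-up function name, to_literal_regex: it escapes the special characters with the standard re.escape function and turns every whitespace character into \s. Which characters re.escape chooses to escape varies slightly between Python versions, so the output may not be character-for-character identical to the example above.

```python
import re

def to_literal_regex(text):
    """Escape regex metacharacters and turn each whitespace character into \\s."""
    pieces = re.split(r'\s', text)
    return r'\s'.join(re.escape(piece) for piece in pieces)

sample = '<h3 class="r"><a href="http://www.winautomation.com/" class=l>'
pattern = to_literal_regex(sample)

print(pattern)
# The generated pattern matches the text it was built from.
print(re.fullmatch(pattern, sample) is not None)
```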

Lokesh

PS: I'm still a regex newbie, so pardon any ignorance on my part.

Prepare Regex.waj

#47

David

  • Joined: Sep 5, 2010
  • Posts: 37

Sun 10/10/2010 - 21:00

Okay, but why not create the final URL? All that other garbage won't let you access the webpage of winautomation - which would be a shame ;-)

So, use: (?<=http://).*?(?=")

and off you go to the Winautomation website! :-)

(Would still love to see a solution to my "quiz"/difficult task...)

#48

masterwaldo

  • Joined: Dec 15, 2009
  • Posts: 24

Tue 12/7/2010 - 9:52

I've been looking on google for the past few hours to solve for my problem but I'm not able to find the solution.

 

I want to make a scraper that grabs articles from an article directory. The problem I'm having is how to parse the content section once I have downloaded it. Usually I use this: (?<=<tag>).*(?=</tag>) to parse, but in this case it doesn't work.

 

So, this is the particular area that I'm interested in:

 

<div class="article_cnt">
              <div class="KonaBody">
<p>The obvious choice of any bike rider is to avoid flats. Although this might be impossible for any wheel system that uses tubes, prevention is the next best option. To help prevent flat tires, proper maintenance is key:</p>
<ol><li>Keep your tires inflated to the proper pressure as indicated on the side of the tire.</li>
<li>Replace tires at the first sign of worn tread or deteriorating sidewalls. </li>
<li>Replace tubes that have already been patched more than a dozen times. </li>
<li>Inspect tire tread for objects stuck in the tread that may cause a puncture. </li>
</ol><p>If you're willing to spend a little extra money to prevent punctures, consider investing in Kevlar-reinforced tires. The composite fibers that make up Kevlar are strong enough to resist punctures that would normally occur from contact with sharp objects. Kevlar tires typically run about $15 to $20 more than regular tires.</p>
<p>If Kevlar tires aren't in your budget, try tire liners, which are made of strong, lightweight fibers and line the inside of the tire to provide extra protection to the tube. Other options are thorn-resistant tubes and tubes with flat sealant that fills small holes from the inside without the rider even knowing he's had a puncture.</p>
<p>Even if you take all the preventive steps mentioned in this section, you're still likely to get an occasional flat. Once, on a trip from Ireland to Italy with a heavily loaded bike, Dennis went three months without a single flat. A few months later, on a trip in the United States, he was pulling his hair out on the side of the road after five flats in one week. Go figure!</p>
<p>If you keep a patch kit, tire levers, pump, and spare tube with you while you bike and you practice the steps described in this chapter, the chances are good that you'll be able to fix a flat and be back on your bike quicker than your partner can finish off a PowerBar. This will give you the confidence to take long, worry-free bike rides and save you from the embarrassment of having to ask another biker to show you how to change a flat or from walking your bike home.</p> </div>
</div>
    
  I'm interested in anything inside the KonaBody tag. What is a suitable regex for it?

 

source.txt

#49

Samantha

  • Joined: Apr 23, 2010
  • Posts: 2734

Tue 12/7/2010 - 11:16

Hello! I think that the regex that would do the trick for you is this one: (?<=<div\sclass="KonaBody">)(.|\n)*?(?=</div>) :) Samantha
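
A prefix/suffix pattern of the kind used throughout this thread can be sketched in Python for the KonaBody block. The HTML below is a shortened stand-in for the article above. The re.DOTALL flag lets the period match line breaks, which is why the usual (?<=<tag>).*(?=</tag>) pattern fails on multi-line content without it; (.|\n)*? achieves the same thing without a flag.

```python
import re

# Shortened stand-in for the downloaded article page.
html = (
    '<div class="article_cnt">\n'
    '  <div class="KonaBody">\n'
    '<p>The obvious choice of any bike rider is to avoid flats.</p>\n'
    '<ol><li>Keep your tires inflated.</li>\n'
    '</ol> </div>\n'
    '</div>\n'
)

# Lazy match between the opening KonaBody tag and the first closing </div>.
body = re.search(r'(?<=<div class="KonaBody">).*?(?=</div>)', html, re.DOTALL)
article = body.group(0).strip() if body else ''

print(article)
```

The lazy match stops at the first </div>, so it would cut the text short if the article body itself contained nested div elements.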

#50

Copyright 2014 - Softomotive Ltd