Facing new challenges everyday

Using regular expressions in a Groovy script to retrieve data from HTML pages

March 31st, 2008

I have been working with regular expressions in Java. Regular expressions are very useful for retrieving data based on a document's structure. In this example I extract cell phone brands and models from an HTML document with a particular structure. Take a look at the HTML below:

<html>
  <head>
     <title>Regular expressions sample data</title>
  </head>
  <body>
      <font style="font-size: 8pt;" face="Verdana"><br>
      </font><font face="Verdana" size="1"><b>Audiovox:<br>
      </b></font>
      <a href="http://cellular.com/audiovox9500.shtml">
      <font size="1">CDM 9500</font></a><font size="1">
      |
      </font>
      <a href="http://cellular.com/thera.shtml">
      <font size="1">PDA - (Thera)</font></a><font size="1">
      |
      <a href="http://cellular.com/audiovoxppc6600.shtml">
      PDA - PPC6600 (Harrier)</a>.<br>
      <br>
      </font>
      <font style="font-weight: 700;" face="Verdana" size="1">Cyberbank:<br>
      </font>
      <a href="http://cellular.com/cyberbankpoz.shtml">
      <font size="1">CB 0870 BR (PoZ)</font></a><font size="1">
      |
      </font>
      <a href="http://cellular.com/cyberbank_cb880.shtml">
      <font size="1">CB 0880 BR (Triton)</font></a><font size="1">
      |
      <a href="http://cellular.com/cyberbank_x315.shtml">
      CP X315 BR (PoZ EVDO)</a>.<br>
      <br>
      </font><b><font face="Verdana" size="1">Compaq:<br>
      </font></b><font size="1">
      <a href="http://cellular.com/compaq_ipac3700.shtml">
      IPAC-3700</a>.<br>
      <br>
  </body>
</html>


Now look at the Groovy code used to retrieve the model and brand information:

def inputFile = new File(args[0])
def outputFile = new File("devices.txt")
def brandName = null

println "Processing file: ${inputFile.name}"

inputFile.eachLine { line ->

   // Brand name wrapped in <b>...:<br>
   def brandMatch = (line =~ /<b>([a-zA-Z\s]*):<br>/)
   if (brandMatch)
      brandName = brandMatch[0][1]

   // Brand name directly inside a <font ... size="1"> tag
   brandMatch =
     (line =~ /<font\s*.*\s*size="1">([a-zA-Z\s]*):<br>/)
   if (brandMatch)
      brandName = brandMatch[0][1]

   // Model name running from the start of the line to </a>
   def modelMatch =
     (line =~ /^([a-zA-Z_0-9\-()\s]*)<\/a>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"

   // Model name inside an <a href="..."> tag on a single line
   modelMatch =
     (line =~ /<a\s*href=".*">([a-zA-Z_0-9\-()\s]*)<\/a>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"

   // Model name inside a <font ... size="1"> tag
   modelMatch =
     (line =~ /<font\s*.*\s*size="1">([a-zA-Z_0-9\-()\s]*)<\/font>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"
}

In the example above, the text between a <b> or <font> tag and the following ":<br>" is extracted and used as the brand name. The model name takes a slightly different approach: it is either everything between the start of the line and a </a> tag, or the text enclosed by an <a> or <font size="1"> tag. The =~ operator evaluates a regex against each line and returns a matcher, stored in brandMatch or modelMatch, which can be indexed like a two-dimensional array: the first index selects the match and the second selects the capture group, so [0][1] is the first group of the first match. Finally, each brand and model pair is appended to devices.txt, one "brand model" pair per line.
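To make the matcher indexing concrete, here is a minimal sketch that applies one of the model patterns to a single line copied from the sample HTML. The variable names line and matcher are only illustrative, and the asserts simply document what each index is expected to hold.

```groovy
// One line taken from the sample HTML above
def line = '<font size="1">CDM 9500</font></a><font size="1">'

// =~ creates a java.util.regex.Matcher for the line
def matcher = (line =~ /<font\s*.*\s*size="1">([a-zA-Z_0-9\-()\s]*)<\/font>/)
assert matcher instanceof java.util.regex.Matcher

// Used as a condition, the matcher is true when the pattern is found
if (matcher) {
    // First index: which match; second index: which group (0 = whole match)
    assert matcher[0][0] == '<font size="1">CDM 9500</font>'
    assert matcher[0][1] == 'CDM 9500'
    println matcher[0][1]
}
```

Index 0 of the inner list is the whole matched text, which is why the script always reads index 1 to get the captured name.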

Groovy supports all the regex constructs available in Java's java.util.regex.Pattern class.
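As a small illustration, the sketch below uses a few standard Pattern constructs from Groovy: a reluctant quantifier, a named group (Java 7 or later) and the case-insensitive flag. The single-line anchor string and the group name model are assumptions made up for this example; they are not part of the script above.

```groovy
import java.util.regex.Pattern

// Hypothetical single-line variant of an anchor from the sample HTML
def anchor = '<a href="http://cellular.com/thera.shtml">PDA - (Thera)</a>'

// Reluctant quantifier (.*?) and a named group (?<model>...)
def m = (anchor =~ /<a\s+href="(.*?)">(?<model>[^<]*)<\/a>/)
assert m.find()
assert m.group(1) == 'http://cellular.com/thera.shtml'
assert m.group('model') == 'PDA - (Thera)'

// The ~ operator compiles a slashy string into a java.util.regex.Pattern;
// (?i) makes the match case-insensitive
def brTag = ~/(?i)<BR>/
assert brTag instanceof Pattern
assert ('Compaq:<br>' =~ brTag).find()
```

A reluctant quantifier stops at the first closing quote, which avoids the over-matching that a greedy .* can cause when a line holds more than one tag.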
