Using regular expressions in a Groovy script to retrieve data from HTML pages

Posted by – March 31, 2008

I have been working with regular expressions in Java. Regular expressions are very useful for retrieving data based on a document's structure. In this example I extract cell phone brands and models from an HTML document with a particular structure. Take a look at the HTML code below:

<html>
  <head>
     <title>Regular expressions sample data</title>
  </head>
  <body>
      <font style="font-size: 8pt;" face="Verdana"><br>
      </font><font face="Verdana" size="1"><b>Audiovox:<br>
      </b></font>
      <a href="http://cellular.com/audiovox9500.shtml">
      <font size="1">CDM 9500</font></a><font size="1">
      |
      </font>
      <a href="http://cellular.com/thera.shtml">
      <font size="1">PDA - (Thera)</font></a><font size="1">
      |
      <a href="http://cellular.com/audiovoxppc6600.shtml">
      PDA - PPC6600 (Harrier)</a>.<br>
      <br>
      </font>
      <font style="font-weight: 700;" face="Verdana" size="1">Cyberbank:<br>
      </font>
      <a href="http://cellular.com/cyberbankpoz.shtml">
      <font size="1">CB 0870 BR (PoZ)</font></a><font size="1">
      |
      </font>
      <a href="http://cellular.com/cyberbank_cb880.shtml">
      <font size="1">CB 0880 BR (Triton)</font></a><font size="1">
      |
      <a href="http://cellular.com/cyberbank_x315.shtml">
      CP X315 BR (PoZ EVDO)</a>.<br>
      <br>
      </font><b><font face="Verdana" size="1">Compaq:<br>
      </font></b><font size="1">
      <a href="http://cellular.com/compaq_ipac3700.shtml">
      IPAC-3700</a>.<br>
      <br>
  </body>
</html>


Now look at the Groovy code used to retrieve the brand and model information:

def inputFile = new File(args[0])
def outputFile = new File("devices.txt")
def brandName = null

println "Processing file: ${inputFile.name}"

inputFile.eachLine { line ->

   // Brand name inside a <b> tag, e.g. <b>Audiovox:<br>
   def brandMatch = (line =~ /<b>([a-zA-Z\s]*):<br>/)
   if (brandMatch)
      brandName = brandMatch[0][1]

   // Brand name directly inside a <font ... size="1"> tag
   brandMatch =
     (line =~ /<font\s*.*\s*size="1">([a-zA-Z\s]*):<br>/)
   if (brandMatch)
      brandName = brandMatch[0][1]

   // Model name at the start of a line, closed by </a>
   def modelMatch =
     (line =~ /^([a-zA-Z_0-9\-()\s]*)<\/a>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"

   // Model name on the same line as its <a href="..."> tag
   modelMatch =
     (line =~ /<a\s*href=".*">([a-zA-Z_0-9\-()\s]*)<\/a>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"

   // Model name wrapped in <font size="1">...</font>
   modelMatch =
     (line =~ /<font\s*.*\s*size="1">([a-zA-Z_0-9\-()\s]*)<\/font>/)
   if (modelMatch)
       outputFile << "${brandName} ${modelMatch[0][1].trim()}\n"
}

In the example above, the brand name is the text between a <b> or <font> tag and the ":<br>" that follows it. The model name uses a slightly different approach: we capture the text between the start of the line (or an opening <a> or <font> tag) and the closing </a> or </font> tag. The =~ operator evaluates a regex against each line and returns a java.util.regex.Matcher, stored in brandMatch or modelMatch. Groovy lets you index a matcher like a two-dimensional array, so brandMatch[0][1] is the first capture group of the first match; just read these values and you will get the brand and model names. Finally, each model found is appended to devices.txt as one line containing a "brand model" pair.
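
As a minimal sketch of the matcher indexing described above, the snippet below runs one model pattern against a single hard-coded line. The `line` value is a made-up sample modeled on the HTML above, not something read from a file.

```groovy
// Hypothetical one-line sample, modeled on the HTML above
def line = '<font size="1">CDM 9500</font></a><font size="1">'

// =~ creates a java.util.regex.Matcher for this line
def modelMatch = (line =~ /<font\s*.*\s*size="1">([a-zA-Z_0-9\-()\s]*)<\/font>/)

// A matcher is "true" in Groovy when the pattern is found in the input
def whole = modelMatch ? modelMatch[0][0] : null   // entire matched text
def model = modelMatch ? modelMatch[0][1] : null   // first capture group

println "model: ${model}"   // prints "model: CDM 9500"
```

The first index selects the match (0 is the first one found on the line) and the second selects the group, where 0 is the whole match and 1 is the first pair of parentheses.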

Groovy supports all regex constructs available in Java's java.util.regex.Pattern class.
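
To illustrate, here is a small sketch that uses a few standard Pattern constructs from Groovy: the `(?i)` case-insensitive flag, a reluctant `.*?` quantifier, and an explicit `Matcher`. The sample strings are assumptions made up for this illustration; they only resemble the HTML above.

```groovy
import java.util.regex.Pattern

// ~ compiles a slashy string into a java.util.regex.Pattern
def anchor = ~/(?i)<a\s+href="(.*?)">/
assert anchor instanceof Pattern

// Hypothetical line with an upper-case tag; (?i) makes the match case-insensitive
def line = '<A HREF="http://cellular.com/thera.shtml">'
def m = anchor.matcher(line)

// (.*?) is reluctant, so it stops at the first closing quote
def url = m.find() ? m.group(1) : null
println url

// ==~ is true only when the whole string matches the pattern
def isBrandLine = ('Audiovox:<br>' ==~ /[a-zA-Z\s]+:<br>/)
println isBrandLine
```

The `~` and `==~` operators are Groovy shorthand; the patterns themselves are compiled and matched by the same Java classes you would use from plain Java code.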
