Extraction

In regular expressions, extraction refers to the storage of strings matched by one part of the regular expression with the purpose of using them elsewhere in the expression. This is very useful for parsing and for general text processing.

An extraction group is delimited by parentheses. For each group, the part of the string that matches inside the parentheses goes into a particular position within an array of matched groups. In PBL, extraction can be done with the match function, which returns the array of substrings for each group.

For example, suppose that you have a string with the current time in hh:mm:ss format. You can build a basic regular expression for matching times in that format, such as /\d\d:\d\d:\d\d/. However, you may want the value of just one element, such as the hour. To obtain it, group each element with parentheses: /(\d\d):(\d\d):(\d\d)/. The following example shows how to display hours, minutes, and seconds using the index numbers of the array:
time as String
matches as String[]
input "Enter a time (hh:mm:ss):" time
	
matches = time.match('/(\d\d):(\d\d):(\d\d)/')

if matches is not null then
    display "Hours: " + matches[1] + "\n" +
            "Minutes: " + matches[2] + "\n" +
            "Seconds: " + matches[3]
else
    display "Invalid time!"
end
Note: When a regular expression is matched against a string, the whole part of the string that matches is stored in position 0 (zero) of the array.

For the previous example, if you enter "12:40:23", the array will contain the following:

Position Value
0 12:40:23
1 12
2 40
3 23

Positions are assigned to each group from left to right.
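
As a cross-check outside PBL, the same pattern can be tried in Python. This is only a sketch: it uses Python's re module rather than the PBL match function, and the name split_time is hypothetical. It shows the whole match in position 0 followed by the three groups in left-to-right order.

```python
import re

# Same pattern as the PBL example: three two-digit groups separated by colons.
TIME_PATTERN = re.compile(r"(\d\d):(\d\d):(\d\d)")

def split_time(text):
    """Return [whole match, hours, minutes, seconds], or None if nothing matches."""
    m = TIME_PATTERN.search(text)
    if m is None:
        return None
    # group(0) is the whole match; groups 1 to 3 are numbered left to right.
    return [m.group(i) for i in range(4)]

print(split_time("12:40:23"))  # ['12:40:23', '12', '40', '23']
```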

Extraction Example

The following is a real-world example of extraction. Suppose that you need to interpret a text file whose lines have the following format:

property = value

The file can also have comment lines, which begin with the pound sign (#). A sample of the file follows:

# Configuration parameters 
adminEmail=admin@yoursite.com
serverHost=server.yoursite.com
serverPort=12345

# some preferences
soundEnabled=false
fontSize=12

# colors
background = white
foreground = blue
It would be useful to load the file into an associative array, for simple access to each property. For example, to get the value of the serverPort property defined in the file, you would use:
port = properties["serverPort"]

First, you need to define the regular expression to interpret a valid line in the file. As mentioned before, lines can be in property = value format or they may start with a pound (#) sign. In the latter case, the line must be ignored.

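The assignment lines can be matched with /\w+=\w+/. This looks for a word (\w+), an equals sign (=), and another word (\w+). Because \w matches only letters, digits, and the underscore, a value that contains other characters, such as admin@yoursite.com, does not match this expression.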

The following allows optional white space around the equals sign:
/\w+\s?=\s?\w+/
Now you need to group the left side word (before the equals sign) and the right side word (after the equals sign) so that you can extract the values:
/(\w+)\s?=\s?(\w+)/
One more detail is required. Let's force the regular expression to match the whole string. You achieve this by adding the ^ and $ anchors:
/^(\w+)\s?=\s?(\w+)$/
The following code fragment tests the expression:
input "Enter a line:" line
m = line.match('/^(\w+)\s?=\s?(\w+)$/')
if m is not null then
    display "Property: " + m[1] + "\nValue: "
            + m[2]
else
    display "ERROR, invalid line!"
end
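
The anchored expression can also be exercised against lines from the sample file. The following is a minimal sketch in Python, not PBL; the name parse_assignment is hypothetical. It assumes the PBL engine treats \w, \s?, ^ and $ the same way Python's re module does for these inputs.

```python
import re

# Anchored assignment pattern: word, optional single space, '=', optional single space, word.
ASSIGNMENT = re.compile(r"^(\w+)\s?=\s?(\w+)$")

def parse_assignment(line):
    """Return (property, value) for a valid assignment line, otherwise None."""
    m = ASSIGNMENT.match(line)
    return (m.group(1), m.group(2)) if m else None

for line in ["serverPort=12345", "background = white",
             "adminEmail=admin@yoursite.com", "# colors"]:
    print(repr(line), "->", parse_assignment(line))
```

In this sketch the adminEmail line is rejected, because the value contains characters that \w does not match. A broader value pattern would be needed to accept such lines.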
A comment is easy to match by using the following regular expression (remember comment lines begin with the pound sign # in the sample text file):
/^#.*/
The expression /^#.*/ matches a line that begins with # followed by any number of characters. Use an alternation (|) so that comment lines also match, and then test the expression again:
input "Enter a line:" line
	
m = line.match('/(^#.*$)|(^(\w+)\s?=\s?(\w+)$)/')
	
if m is not null then
    if m[1] = "" then
        display "Property: " + m[3] + "\nValue: "
                + m[4]
    else
        display "Comment line found: " + m[0]
    end
else
    display "ERROR, invalid line!"
end
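
Group positions in an alternation follow the order of the opening parentheses, which is why the property and value sit in positions 3 and 4 once the assignment alternative is wrapped in a group of its own. The sketch below is Python rather than PBL, and groups_for is a hypothetical name. Note one difference that is specific to Python: a group that did not take part in the match is reported as None, whereas the PBL code above compares against an empty string.

```python
import re

# Group 1: comment line. Group 2: whole assignment. Groups 3 and 4: property and value.
LINE = re.compile(r"(^#.*$)|(^(\w+)\s?=\s?(\w+)$)")

def groups_for(line):
    """Return the tuple of groups 1 to 4, or None if the line matches neither alternative."""
    m = LINE.match(line)
    return None if m is None else m.groups()

print(groups_for("fontSize=12"))  # (None, 'fontSize=12', 'fontSize', '12')
print(groups_for("# colors"))     # ('# colors', None, None, None)
```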
Now that you have tested the regular expression, you can remove the display statements and write the code that builds the associative array. Instead of reading the lines from an input, we read them from a file:
for each line in TextFile("/tmp/test.txt").lines
    m = line.match('/(^#.*$)|(^(\w+)\s?=\s?(\w+)$)/')
    if m is not null then
        // if m is not a comment
        if m[1] = "" then
            props[m[3]] = m[4]
        end
    else
        // erroneous line - ignore it
    end
end
	
display props
Replace /tmp/test.txt with a valid file name and location before testing the code.
Note: The TextFile component contains a built-in function for creating an associative array from a properties file. This example only shows how to use regular expressions on a real problem. If the file is compatible with the Java properties file format, TextFile.loadPropertiesFrom is the easiest solution.
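
For comparison, the same loop can be sketched in Python, with a dictionary standing in for the associative array. This is an illustration under stated assumptions, not the PBL implementation: load_properties is a hypothetical name, and the lines are passed in as a list instead of being read through TextFile.

```python
import re

# Group 1: comment line. Groups 3 and 4: property name and value.
LINE = re.compile(r"(^#.*$)|(^(\w+)\s?=\s?(\w+)$)")

def load_properties(lines):
    """Build a dict from assignment lines, skipping comments and invalid lines."""
    props = {}
    for raw in lines:
        line = raw.rstrip("\n")
        m = LINE.match(line)
        # Keep the line only if it matched and the comment group did not take part.
        if m is not None and m.group(1) is None:
            props[m.group(3)] = m.group(4)
    return props

sample = ["# some preferences", "soundEnabled=false", "fontSize=12",
          "", "# colors", "background = white", "foreground = blue"]
print(load_properties(sample))
```

To read from a file instead, the list could be replaced by an open file object, since iterating over it yields one line at a time.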

The following examples show regular expression solutions to common problems.

Example 1

Obtain the path and the filename from a fully qualified UNIX path such as /usr/utilities/reader/readme.txt. This requires two extractions, as follows:
/(.*)\/([^\/]*)$/

Position [1] will contain the path (/usr/utilities/reader), and position [2] will contain the name of the file (readme.txt).
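
A quick way to check this expression is the following Python sketch (split_path is a hypothetical name, and Python's re module is assumed to agree with the PBL engine here). The first group is greedy, so it runs up to the last slash and keeps the leading slash of the path.

```python
import re

# Greedy (.*) consumes everything up to the last '/'; the second group is the rest.
PATH = re.compile(r"(.*)/([^/]*)$")

def split_path(full_name):
    """Return (directory, file name), or None if the string contains no slash."""
    m = PATH.match(full_name)
    return (m.group(1), m.group(2)) if m else None

print(split_path("/usr/utilities/reader/readme.txt"))
# ('/usr/utilities/reader', 'readme.txt')
```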

Example 2

Obtain the user ID and the host name from an e-mail address such as support@bea.com. This requires two extractions, as follows:
/([\w\.]+)@([\w\.]+)/

Position [1] will contain the user ID (support), and position [2] will contain the host name (bea.com).
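
The expression can be tried with the Python sketch below; split_email is a hypothetical name. Because the expression is not anchored, it also finds an address embedded in a longer string. It is a simple extraction and does not validate e-mail addresses.

```python
import re

# One or more word characters or dots on each side of the '@'.
EMAIL = re.compile(r"([\w\.]+)@([\w\.]+)")

def split_email(text):
    """Return (user ID, host name) for the first address found, or None."""
    m = EMAIL.search(text)
    return (m.group(1), m.group(2)) if m else None

print(split_email("support@bea.com"))  # ('support', 'bea.com')
```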

Example 3

Extract the protocol, host name, port number, and resource from a URL such as http://www.bea.com:80/index.html:
/(\w+):\/\/([^:\/]+)(:(\d+))?(\/.*)?/
The following values will be obtained:
Position Value
1 http
2 www.bea.com
3 :80
4 80
5 /index.html
Note that a nested extraction was used to obtain the port number both with and without the colon.
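
The nested group and the two optional parts can be observed with the following Python sketch; split_url is a hypothetical name. In Python, an optional group that did not take part in the match is returned as None. How PBL represents such a position is not covered here.

```python
import re

# Groups: 1 protocol, 2 host, 3 ':port' (optional), 4 port digits (nested), 5 resource (optional).
URL = re.compile(r"(\w+)://([^:/]+)(:(\d+))?(/.*)?")

def split_url(url):
    """Return the five groups as a tuple, or None if the URL does not match."""
    m = URL.match(url)
    return None if m is None else m.groups()

print(split_url("http://www.bea.com:80/index.html"))
# ('http', 'www.bea.com', ':80', '80', '/index.html')
print(split_url("http://www.bea.com/index.html"))
# ('http', 'www.bea.com', None, None, '/index.html')
```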