Turn in the assignment both as a hardcopy in class and as a zipped file, using this link to cardea. Please submit all your code as runnable .java files.
The purpose of this assignment is to give you some practice writing and using regular expressions.
In processing real texts, we often need to extract elements of data that match specific patterns. In this assignment, we explore this general requirement by looking at the specific problem of extracting dates from a collection of texts.
In lab you practiced writing regular expressions. In this assignment, you'll modify some Java code that processes the output of regular expressions in an interesting way.
Java version 1.4 added built-in support for regular expressions (the java.util.regex package); before this version, you had to use an external package to do regex's in Java.
The basic idea is that you first define a regular expression (expressed as a String), then use this as input to a regular expression compiler that creates an object of type Pattern. (I suspect the compiler produces an FSA.) The Pattern object is then given a String to be matched as input; as output it produces an object of type Matcher. This object is then used to see if a match to the regex was found in the input string; it also has a method called group() that is used to pull out various pieces of the match from the input string.
For example, if you were trying to pull out years from input text, and you had a line of input such as "The 1997 crop produced lovely Merlot.", the Matcher could be used to pull out the substring "1997".
In more detail, my code works as follows. First, define the regex as a string. Here is a simple pattern for recognizing recent years:
String year = "(19|20|')([0-9]{2})";
Now, create a Pattern object that has a compiled version of this regex:
Pattern compiledRegex = Pattern.compile(year, Pattern.CASE_INSENSITIVE);
Note that I declared this to be case insensitive, although this isn't
strictly needed for year.
Next I test to see if an input string contains the pattern:
String str = "The 1997 crop produced lovely Merlot.";
Matcher matcher = compiledRegex.matcher(str);
boolean found = matcher.find();
Finally, I use the group() method to pull out the year. Each item
that appears in parentheses in the regex can be pulled out using the
group() method; the parenthesized groups are numbered from left to right. Group 0 always
refers to the entire match.
In this case, assume I only want the last two digits -- if the year is 1997, I only want to print out 97, which is group number 2.
if (found)
    System.out.println("The last two digits of the year are: " +
        matcher.group(2));
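As a minimal sketch, the steps above can be put together into one runnable class. The class name YearDemo and the helper lastTwoDigits are made up for illustration; they are not taken from RegexDates.java.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class YearDemo {
    // Same year pattern as in the walkthrough:
    // group 1 is the prefix (19, 20 or '), group 2 is the last two digits.
    static final Pattern YEAR =
        Pattern.compile("(19|20|')([0-9]{2})", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: last two digits of the first year found, or null if none.
    static String lastTwoDigits(String text) {
        Matcher matcher = YEAR.matcher(text);
        return matcher.find() ? matcher.group(2) : null;
    }

    public static void main(String[] args) {
        String str = "The 1997 crop produced lovely Merlot.";
        Matcher matcher = YEAR.matcher(str);
        if (matcher.find()) {
            System.out.println("Whole match (group 0): " + matcher.group(0));
            System.out.println("Prefix (group 1): " + matcher.group(1));
            System.out.println("Last two digits (group 2): " + matcher.group(2));
        }
    }
}
```

Note that find() only locates the first match; calling it again in a loop moves on to later matches in the same string.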
My code, supplied below, shows how to do this in a somewhat more
general way. Also, note that in that code I write:
String year = "(?:19|20|')([0-9]{2})";
The ?: means don't make this group a capturing group; thus using
this notation I'd
never be able to extract the first two digits of the year. This
is more efficient if I don't need to use the group. For more information, see
The Java Developers Almanac 1.4, which contains an excellent description of the regex facility. It is all useful, but see especially the section on using groups.
These slides also explain things well.
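To see the difference between the two year patterns, here is a small sketch that runs both on the same input. The class and method names are invented for this example; only the two regex strings come from the text above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NonCapturingDemo {
    static final String CAPTURING = "(19|20|')([0-9]{2})";
    static final String NON_CAPTURING = "(?:19|20|')([0-9]{2})";

    // Hypothetical helper: text of group 1 for the first match, or null if no match.
    static String firstGroup(String regex, String text) {
        Matcher m = Pattern.compile(regex).matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Number of capturing groups in the pattern (group 0 is not counted).
    static int groupCount(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        String str = "The 1997 crop produced lovely Merlot.";
        System.out.println(groupCount(CAPTURING) + " groups, group 1 = "
            + firstGroup(CAPTURING, str));
        System.out.println(groupCount(NON_CAPTURING) + " group, group 1 = "
            + firstGroup(NON_CAPTURING, str));
    }
}
```

With the non-capturing prefix, the two digits move from group 2 to group 1, so any code that reads the groups has to use the matching number.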
In this assignment you should write Java code that takes as input a file of conference announcements, uses regular expressions to recognize the various date formats, and then outputs the dates in a standard format: DD-MMM-YYYY
For example, 15th January 1991 would be represented as
15-JAN-1991
If any of the extracted dates are date ranges, then they should be specified as a pair of dates separated by a vertical bar, as in the following example:
15-JAN-1991|17-JAN-1991
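One possible way to build these output strings is sketched below. The class and method names are hypothetical, and the sketch assumes that single-digit days should be zero-padded to two digits, which the DD in the format suggests but the text does not state outright.

```java
import java.util.Locale;

class DateFormatDemo {
    // Formats one date as DD-MMM-YYYY.
    // Assumption: monthAbbrev is already a three-letter upper-case code such as "JAN",
    // and days below 10 are padded with a leading zero.
    static String format(int day, String monthAbbrev, int year) {
        return String.format(Locale.ROOT, "%02d-%s-%04d", day, monthAbbrev, year);
    }

    // Joins the two ends of a date range with a vertical bar.
    static String formatRange(String start, String end) {
        return start + "|" + end;
    }

    public static void main(String[] args) {
        String start = format(15, "JAN", 1991);
        String end = format(17, "JAN", 1991);
        System.out.println(start);
        System.out.println(formatRange(start, end));
    }
}
```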
So the results would look something like this:
(Note that you need to trim down and convert January to JAN, February to FEB and so on. There are at least two elegant ways to do this.)
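Two possible ways to do the month conversion are sketched here; they may or may not be the ones I had in mind, and all names in the sketch are made up for illustration. The first simply truncates and upper-cases, the second looks the name up in a table.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

class MonthAbbrevDemo {
    // Option 1: keep the first three letters and upper-case them.
    // Assumes the input has at least three letters (a month regex would guarantee that).
    static String bySubstring(String monthName) {
        return monthName.substring(0, 3).toUpperCase(Locale.ROOT);
    }

    // Option 2: look the full name up in a table; unknown words give null.
    static final Map<String, String> TABLE = new HashMap<>();
    static {
        String[] names = {"january", "february", "march", "april", "may", "june",
            "july", "august", "september", "october", "november", "december"};
        String[] abbrevs = {"JAN", "FEB", "MAR", "APR", "MAY", "JUN",
            "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"};
        for (int i = 0; i < names.length; i++) {
            TABLE.put(names[i], abbrevs[i]);
        }
    }

    static String byTable(String monthName) {
        return TABLE.get(monthName.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        System.out.println(bySubstring("January"));
        System.out.println(byTable("February"));
    }
}
```

The substring version is shorter, while the table version rejects words that are not month names and could be extended with other spellings.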
To get you started, I have made a simple class file called RegexDates.java that shows you how to use the results of matching regex's in Java. Feel free to do it yourself without looking at my code if you like. Otherwise, use the supplied file. You mainly need to add more regular expression patterns and code to use the output of the patterns.
You'll need to write regular expressions to recognize dates that appear in various formats. For example, you should have regex's that can recognize all of the following:
One strategy is to create a set of regular expressions, rather than one huge one, to make it easier to cover the different date formats. I've put together an example in the code file supplied above.
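The set-of-patterns strategy might look roughly like the sketch below. It is not the contents of RegexDates.java: the two date formats, the class name, and the helper methods are assumptions chosen only to illustrate the idea, and real announcements will need more patterns.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MultiPatternDemo {
    static final String MONTH = "(January|February|March|April|May|June|July"
        + "|August|September|October|November|December)";

    // e.g. "15th January 1991": groups are day, month, year.
    static final Pattern DAY_MONTH_YEAR = Pattern.compile(
        "\\b([0-9]{1,2})(?:st|nd|rd|th)?\\s+" + MONTH + "\\s+([0-9]{4})\\b",
        Pattern.CASE_INSENSITIVE);

    // e.g. "January 15, 1991": groups are month, day, year.
    static final Pattern MONTH_DAY_YEAR = Pattern.compile(
        "\\b" + MONTH + "\\s+([0-9]{1,2})(?:st|nd|rd|th)?,?\\s+([0-9]{4})\\b",
        Pattern.CASE_INSENSITIVE);

    // Builds DD-MMM-YYYY from the captured pieces (day zero-padded by assumption).
    static String normalize(String day, String month, String year) {
        return String.format(Locale.ROOT, "%02d-%s-%s", Integer.parseInt(day),
            month.substring(0, 3).toUpperCase(Locale.ROOT), year);
    }

    // Runs each pattern over the text in turn.
    // Results are grouped by pattern, not sorted by position in the text.
    static List<String> extract(String text) {
        List<String> dates = new ArrayList<>();
        Matcher m = DAY_MONTH_YEAR.matcher(text);
        while (m.find()) {
            dates.add(normalize(m.group(1), m.group(2), m.group(3)));
        }
        m = MONTH_DAY_YEAR.matcher(text);
        while (m.find()) {
            dates.add(normalize(m.group(2), m.group(1), m.group(3)));
        }
        return dates;
    }

    public static void main(String[] args) {
        String text = "Submit by 3rd March 2003; the meeting opens on April 21, 2003.";
        for (String date : extract(text)) {
            System.out.println(date);
        }
    }
}
```

Keeping one small pattern per format means each regex stays readable, and the group numbers for day, month, and year can differ from pattern to pattern without getting tangled.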
The regex's in my code miss many of the formats shown above. In addition, they let errors slip by; for example they will allow Feb 30, 1999, even though February never has more than 29 days (and had only 28 in 1999). However, in this assignment you will be extracting dates from text, not checking for their accuracy.
It is a good idea to test out your regex's before you code them using the regex applet from lab.
(This assignment is a modified version of one created by Robert Dale)
MAH -- last modified 11/12/02