grep Filter Example
This example describes a filter that provides functionality similar to the grep command. The grep filter uses a regular expression to look for character matches in the input. Matches are reported to the error handler.
If you use this filter in a pipeline that runs in Arbortext Editor, matches are listed in the Event log window with links to the location within the document where the match occurred.
The filter has the following configuration parameters:
- regexp: the regular expression to match (required).
- numContextChars: the number of characters reported, including the matched string, upon a successful match.
- crossNonCharEventBoundary: whether a match can span non-character events such as start and end elements. For example, when matches can span non-character events, the regular expression “testcase” results in a match in the following situation:
<element1>test</element1>
<element1>case</element1>
- maxMatchSize: the maximum number of characters that the filter tries to match against the regular expression.
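The effect of spanning non-character events can be sketched without the publishing framework. The following is a minimal illustration, not the filter itself: it uses java.util.regex as a stand-in for gnu.regexp, and the class name SpanningSketch, the BOUNDARY marker, and the findsMatch method are hypothetical names introduced only for this example. It also assumes that no whitespace characters are delivered between the two elements.

```java
import java.util.regex.Pattern;

public class SpanningSketch {
    /** Hypothetical marker that stands in for a startElement or endElement event. */
    static final String BOUNDARY = "\u0000";

    /**
     * Feeds a sequence of events (text chunks or BOUNDARY markers) into a
     * buffer and reports whether the regular expression ever matches.
     */
    static boolean findsMatch(String regex, boolean crossBoundary, String... events) {
        Pattern pattern = Pattern.compile(regex);
        StringBuilder buffer = new StringBuilder();
        for (String event : events) {
            if (BOUNDARY.equals(event)) {
                // When matches cannot span element events, start over.
                if (!crossBoundary) {
                    buffer.setLength(0);
                }
                continue;
            }
            buffer.append(event);
            if (pattern.matcher(buffer).find()) {
                return true;
            }
        }
        return false;
    }
}
```

With the events BOUNDARY, "test", BOUNDARY, BOUNDARY, "case", BOUNDARY, this sketch finds “testcase” only when crossBoundary is true, because otherwise the buffer is cleared between "test" and "case".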
This filter implements the characters, startElement and endElement methods in the ContentHandler interface. All SAX events are repeated since grep functionality does not alter content.
To implement this filter, you subclass the DefaultSAXFilter class and override the characters, startElement, endElement, and initFilter methods. The publishing framework calls the initFilter method to initialize filters.
/*
* grep.java Version 1.0
*
* Created 15-Aug-02
*
* 1000 Victor’s Way, Ann Arbor, MI, 48108, U.S.A.
*/
package com.arbortext.epic.compose.examples;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.Locator;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import gnu.regexp.RE;
import gnu.regexp.REMatch;
import com.arbortext.epic.saxfilter.DefaultSAXFilter;
import com.arbortext.epic.saxfilter.utils.Logger;
/**
* grep
*
* The grep filter performs an operation similar to the grep
* command on characters that are present in the SAX event stream.
* When a match is found, the filter creates a SAXParseException
* and sends it to the ErrorHandler resource. The SAXParseException
* contains the ’matched’ string, along with some context as the
* message. A locator object is also set in the constructor of the
* exception.
*
* The filter can be configured with these parameters:
* - regexp is the regular expression that is used to perform
* the match.
* - numContextChars is the number of characters that are reported
* (includes the matched string) upon a successful match.
* - crossNonCharEventBoundary indicates if the match crosses
* non-character events such as start and end element
* events.
* - maxMatchSize is the maximum number of characters that the
* filter tries to match against the regular expression.
* NOTE: This filter does not affect the SAX events that
* pass through. Therefore, all methods that override the
* base class methods call the respective super class methods.
*/
public class grep extends DefaultSAXFilter
{
//Parameters
public static final String PARAM_REGEXP = "regexp";
public static final String PARAM_NUM_CONTEXT_CHARS =
"numContextChars";
public static final String
PARAM_CROSS_NONCHAR_EVENT_BOUNDARIES =
"crossNonCharEventBoundary";
public static final String PARAM_MAX_MATCH_SIZE =
"maxMatchSize";
//constants used for default value for the parameters
public static final int DEFAULT_NUM_CONTEXT_CHARS = 100;
public static final int DEFAULT_MAX_MATCH_SIZE = 1000;
//instance variables
private boolean crossNonCharEventBoundary;
private int numContextChars;
private int maxMatchSize;
//regular expression object used to perform the actual match
private RE reObj;
private StringBuffer currString;
private Logger logger;
/**
* Default constructor.
*/
public grep()
{
resetParameters();
}
/**
* Helper function that resets the parameter values to the
* default values.
*/
private void resetParameters() {
crossNonCharEventBoundary = false;
numContextChars = DEFAULT_NUM_CONTEXT_CHARS;
maxMatchSize = DEFAULT_MAX_MATCH_SIZE;
}
/**
* initFilter
*
* The initFilter method sets up the parameters.
*
* @param parameters is a map containing the parameters.
*
* @throws Exception if the regular expression string is
* not present or is empty.
*/
public void initFilter(Map parameters) throws Exception
{
//The super class, DefaultSAXFilter, manages the output
//handler objects. Therefore, this class needs to call
//the super class’ initFilter methods to set up the output
//handlers.
super.initFilter(parameters);
logger = Logger.getLogger(this.getClass());
//Uses the utility logger class
resetParameters();
Object paramValue;
//Get the regular expression string.
String regExpString = (String)parameters.get(PARAM_REGEXP);
if (regExpString == null || regExpString.equals(""))
throw new Exception("Grep filter requires a "
+ "valid regular expression.");
//How many context chars?
paramValue = parameters.get(PARAM_NUM_CONTEXT_CHARS);
if (paramValue != null) {
try {
numContextChars =
Integer.parseInt((String)paramValue);
}
catch (NumberFormatException nfe) {
logger.warn("Illegal value for "
+ PARAM_NUM_CONTEXT_CHARS);
numContextChars = DEFAULT_NUM_CONTEXT_CHARS;
}
}
//Can the match cross non char event boundaries?
if ("true".equalsIgnoreCase((String)parameters.get
(PARAM_CROSS_NONCHAR_EVENT_BOUNDARIES))) {
crossNonCharEventBoundary = true;
}
paramValue = parameters.get(PARAM_MAX_MATCH_SIZE);
if (paramValue != null) {
try {
maxMatchSize =
Integer.parseInt((String)paramValue);
}
catch (NumberFormatException nfe) {
logger.warn("Illegal value for "
+ PARAM_MAX_MATCH_SIZE);
maxMatchSize = DEFAULT_MAX_MATCH_SIZE;
}
}
reObj = new RE(regExpString);
currString = new StringBuffer(maxMatchSize);
} //initFilter()
/**
* endElement resets the current string if the
* match can’t span non-character events.
*
*/
public void endElement(String NameSpaceURI, String lName,
String qName) throws SAXException
{
super.endElement(NameSpaceURI, lName, qName);
//If the match can’t cross non-character event
//boundaries, then the match string should be reset.
if (!crossNonCharEventBoundary)
currString.setLength(0);
} //endElement()
/**
* startElement resets the current string if the match
* can’t span non-character events.
*/
public void startElement(String NameSpaceURI, String lName,
String qName, Attributes atts) throws SAXException
{
super.startElement(NameSpaceURI, lName, qName, atts);
if (!crossNonCharEventBoundary)
currString.setLength(0);
} // startElement()
/**
* characters - This method tests the current string
* against the regular expression. The method does
* the following:
* - Maintains the current string. The maximum size
* of the string is controlled by the "maxMatchSize"
* parameter.
* - Tests the current string against the regular
* expression.
* - If a match is found, a SAXParseException is created
* with the match string.
*
* The filter gets the document Locator using the
* getDocumentLocator() method. The document locator
* must have been set for the locator to be useful.
* The publishing framework typically sets the document
* locator.
*
* IMPLEMENTATION NOTE: If generateEpicDirectives is set to
* true for the epicGenerator filter, then the locator is set
* for all filters in the pipeline that follow the epicGenerator.
*/
public void characters(char[] ch, int start, int length)
throws SAXException
{
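//Pass the event through unchanged.
super.characters(ch, start, length);
//Append only the range of the array that belongs to this event.
currString.append(ch, start, length);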
if (currString.length() > maxMatchSize) {
currString.delete(0, currString.length() - maxMatchSize);
}
//See if the current string matches the regular expression.
//For now ignore multiple matches.
REMatch match = reObj.getMatch(currString);
if (match != null) {
int matchStartIndex = match.getStartIndex();
int matchEndIndex =
Math.min(currString.length(),
matchStartIndex + numContextChars);
SAXParseException ex = new SAXParseException
(currString.substring(matchStartIndex,
matchEndIndex),
getDocumentLocator());
getErrorHandlerResource().warning(ex);
//Reset the current string to avoid duplicate matches
//in subsequent character method calls.
currString.setLength(0);
}
} // characters(...)
}
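The buffering done in the characters method can be tried in isolation. The following is a minimal sketch under stated assumptions, not the filter's implementation: it uses java.util.regex in place of gnu.regexp, returns the reported text instead of sending a SAXParseException to an error handler, and the class name WindowSketch and the method name feed are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WindowSketch {
    private final Pattern pattern;
    private final int maxMatchSize;
    private final int numContextChars;
    private final StringBuilder window = new StringBuilder();

    public WindowSketch(String regex, int maxMatchSize, int numContextChars) {
        this.pattern = Pattern.compile(regex);
        this.maxMatchSize = maxMatchSize;
        this.numContextChars = numContextChars;
    }

    /**
     * Handles one characters event. Returns the text that would be
     * reported, or null when there is no match.
     */
    public String feed(char[] ch, int start, int length) {
        // Append only the slice that belongs to this event.
        window.append(ch, start, length);
        // Keep at most maxMatchSize trailing characters.
        if (window.length() > maxMatchSize) {
            window.delete(0, window.length() - maxMatchSize);
        }
        Matcher matcher = pattern.matcher(window);
        if (!matcher.find()) {
            return null;
        }
        int from = matcher.start();
        int to = Math.min(window.length(), from + numContextChars);
        String reported = window.substring(from, to);
        // Reset so that later events do not report the same match again.
        window.setLength(0);
        return reported;
    }
}
```

The sketch appends with the start and length arguments because a SAX parser can hand over a larger array in which only that range belongs to the current event.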
The filter does not affect the SAX events that pass through the filter. This is because the super class methods are called in all the methods that the filter overrides.
super.initFilter(parameters);
...
super.characters(ch, start, length);
...
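The same pass-through pattern can be tried with the standard SAX helper classes in the JDK. The following is a sketch only: it subclasses org.xml.sax.helpers.XMLFilterImpl rather than DefaultSAXFilter, whose API is not reproduced here, and the names PassThroughSketch, Recorder, and roundTrip are hypothetical.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

public class PassThroughSketch extends XMLFilterImpl {
    private final StringBuilder seen = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Forward the event unchanged to the next handler first.
        super.characters(ch, start, length);
        // Then inspect the same characters without modifying them.
        seen.append(ch, start, length);
    }

    public String seenText() {
        return seen.toString();
    }

    /** Downstream handler that records the characters it receives. */
    public static class Recorder extends DefaultHandler {
        final StringBuilder received = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            received.append(ch, start, length);
        }
    }

    /** Sends text through the filter and returns what arrives downstream. */
    public static String roundTrip(String text) {
        PassThroughSketch filter = new PassThroughSketch();
        Recorder recorder = new Recorder();
        filter.setContentHandler(recorder);
        try {
            char[] data = text.toCharArray();
            filter.characters(data, 0, data.length);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return recorder.received.toString();
    }
}
```

Because the overriding method calls the super class method, the downstream handler receives exactly the characters that entered the filter.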
You can put this grep filter between an epicGenerator and an xslTransformer with no data loss. The filter uses the com.arbortext.epic.saxfilter.utils.Logger class to log warnings. We recommend using this class to report messages that are not related to SAX events.
logger = Logger.getLogger(this.getClass());
...
logger.warn("Illegal value for " + PARAM_MAX_MATCH_SIZE);
Because the publishing framework reuses the filter, the initFilter method resets the parameters on every call. Otherwise, a value set for an optional parameter (for example, PARAM_MAX_MATCH_SIZE) in one call would remain in effect in a later call that omits that parameter.
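The reason for resetting can be seen in a reduced sketch that handles a single optional parameter. The class name ReusableParamsSketch and the getMaxMatchSize accessor are hypothetical, and the sketch does not depend on the publishing framework.

```java
import java.util.Map;

public class ReusableParamsSketch {
    static final int DEFAULT_MAX_MATCH_SIZE = 1000;

    private int maxMatchSize = DEFAULT_MAX_MATCH_SIZE;

    /** Called once per use of the filter, possibly many times per instance. */
    public void initFilter(Map<String, String> parameters) {
        // Reset first so that a value from an earlier call cannot leak through.
        maxMatchSize = DEFAULT_MAX_MATCH_SIZE;
        String value = parameters.get("maxMatchSize");
        if (value != null) {
            try {
                maxMatchSize = Integer.parseInt(value);
            } catch (NumberFormatException nfe) {
                // Fall back to the default on an illegal value.
                maxMatchSize = DEFAULT_MAX_MATCH_SIZE;
            }
        }
    }

    public int getMaxMatchSize() {
        return maxMatchSize;
    }
}
```

If the reset at the top of initFilter were removed, a second call without maxMatchSize would keep the value from the first call.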