Thursday, July 22, 2010

4 strategies for building type system agnostic UIMA components

One of the promises of UIMA is to allow a community of developers to build components for a common platform so that best-of-breed text analysis strategies can be realized by mixing and matching components built by disparate development groups. Unfortunately, just because a component is built on UIMA does not mean it can be seamlessly integrated into an arbitrary UIMA pipeline. Generally speaking, easily mixing and matching components that do roughly the same work requires that they either conform to the same type system or be type system agnostic (i.e. able to work with many or any type systems). While I think it behooves the community to promote standard type systems, it is also helpful (and maybe better) to think about creating more components that are not bound to any type system at all.

Here I summarize four strategies for creating UIMA components that are type system agnostic.

1) View Abuse
This approach takes advantage of the ability to create new views where data can be placed. This is the only approach of the four that does not require any type system at all. Not surprisingly, this is the most limiting of the four strategies and the most "hackish" for most scenarios. Suppose you have some piece of information that you need to add to the CAS. One approach would be to extend your type system and add that piece of information to a new type or a new feature of an existing type. An alternative approach proposed here is to instead create a new view and put that piece of information into that view as e.g. a string. For example, you may want to attach an identifier such as a URI to each document that is run through your pipeline. Instead of creating a new type that has a feature called "URI" and putting it there you could instead create a new view called "URIView" and make the URI the document text of that view. A utility method for setting the URI of a CAS might look something like this:

// copied from org.cleartk.util.ViewURIUtil
// (subject to this copyright/license statement)
public static void setURI(CAS cas, String uri) {
  CAS view = cas.createView(ViewNames.URI);
  view.setSofaDataURI(uri, null);
}

I have also used this approach in the context of document classification by stuffing the "classification" into its own view rather than extending DocumentAnnotation and putting it there. Similarly, if you had miscellaneous comments or a description of a document, you could put them into a view rather than somewhere in the type system.

2) Type System Mapping

In this approach the structure of the type system is assumed but the specific type system is not. That is to say, this approach is type system agnostic so long as your type system can be directly mapped into what the component expects. The mapping takes place at initialization time via configuration parameters that give the names of the types and features the component will use. A nice example of this approach can be found in the OpenNLP UIMA wrappers project. Here we will look at their part-of-speech analysis engine: opennlp.uima.postag.POSTagger (subject to this copyright/license statement). In a typical part-of-speech tagging scenario, the tagger iterates through sentences, examines the tokens in each sentence, and determines the correct part-of-speech tag for each token. Therefore, it is reasonable to expect types corresponding to sentences and tokens, and a feature of the token type that holds the part-of-speech tag. Thus, the POSTagger class has the following three member variables:

private Type sentenceType;
private Type tokenType;
private Feature posFeature;

These are initialized in typeSystemInit(TypeSystem typeSystem), a method inherited from org.apache.uima.analysis_component.CasAnnotator_ImplBase. For example, sentenceType is initialized with the following:

this.sentenceType = AnnotatorUtil.getRequiredTypeParameter(this.context, typeSystem, UimaUtil.SENTENCE_TYPE_PARAMETER);

Here they have created their own utility method in the class AnnotatorUtil for looking up a type from a configuration parameter. Then, in the process method, iteration over sentences uses the type supplied at runtime through the UimaContext rather than one specific sentence type that is statically compiled into the code.
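To make this concrete, the mapping itself would be supplied in the analysis engine's XML descriptor. A sketch of what the relevant configuration parameter settings might look like, plugging in the ClearTK types as the user's type system (the parameter names here are my recollection of the constants defined in OpenNLP's UimaUtil class; check that class for the actual names):

```xml
<configurationParameterSettings>
  <nameValuePair>
    <name>opennlp.uima.SentenceType</name>
    <value><string>org.cleartk.type.Sentence</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>opennlp.uima.TokenType</name>
    <value><string>org.cleartk.type.Token</string></value>
  </nameValuePair>
  <nameValuePair>
    <name>opennlp.uima.POSFeature</name>
    <value><string>pos</string></value>
  </nameValuePair>
</configurationParameterSettings>
```

A user with a different type system would only need to change these three values, with no recompilation of the tagger.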

This approach is nice because it allows users to plug in their own type system when running the part-of-speech tagger rather than having to incorporate the type system provided with the OpenNLP UIMA wrappers. However, this approach suffers from the fact that the part-of-speech tagger assumes a specific structure for your type system, which may not hold. For example, the JULIE lab's type system is defined such that a single token may have multiple part-of-speech tags, which are defined as their own types (i.e. there is a type called POSTag). So, it is not possible to use the OpenNLP UIMA wrapper with the JULIE type system as is.

3) Generic Typing

In this approach the analysis engine is parameterized with Java generics. Generic type parameters are declared as part of the analysis engine's class definition (which is declared as an abstract class). An example of this approach can be found in ClearTK's part-of-speech tagger. Here the class is declared with the following:

public abstract class POSAnnotator<TOKEN_TYPE extends Annotation, SENTENCE_TYPE extends Annotation> ...

These generic type parameters must be bound to specific types, corresponding to the classes of a particular type system, by some subclass of the analysis engine. This requires the user to either subclass this part-of-speech tagger or use the default subclass provided by ClearTK. This approach provides a little more flexibility than the "type system mapping" approach because instead of requiring a string feature where the part-of-speech tag will go, we can define an abstract method for handling the generated part-of-speech tag. In the ClearTK part-of-speech tagger the following method is defined:

public abstract void setTag(JCas jCas, TOKEN_TYPE token, String tag);

It is up to the subclass to decide how to apply the part-of-speech tag to the token. So, an implementation using the JULIE type system could easily implement such a method to put the tag in the right place. The default implementation provided by ClearTK that uses the ClearTK type system looks like this:

...
import org.cleartk.type.Sentence;
import org.cleartk.type.Token;

public class DefaultPOSAnnotator extends POSAnnotator<Token, Sentence> ...

public void setTag(JCas jCas, Token token, String tag) {
  token.setPos(tag);
}

Of course, the obvious disadvantage of this approach is that you force the user to write some code in order to use your component (assuming they don't want to use your type system, which they don't). Also, it is not clear how widely applicable this approach is in more complicated scenarios involving many types.
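Stripped of the UIMA and ClearTK specifics, the pattern is plain generic programming. The following self-contained sketch illustrates it; all of the class and method names here are invented for illustration and do not correspond to the real ClearTK classes:

```java
import java.util.Arrays;
import java.util.List;

// A toy token class standing in for a type-system-specific annotation
// (hypothetical; not a real ClearTK or JULIE class).
class MyToken {
    String text;
    String pos;
    MyToken(String text) { this.text = text; }
}

// The tagging logic is written once against the generic TOKEN parameter;
// it never names a concrete type system class.
abstract class PosAnnotatorSketch<TOKEN> {
    // Subclasses decide where in their type system the tag goes.
    protected abstract void setTag(TOKEN token, String tag);

    public void applyTags(List<TOKEN> tokens, List<String> tags) {
        for (int i = 0; i < tokens.size(); i++) {
            setTag(tokens.get(i), tags.get(i));
        }
    }
}

// Binding the generic annotator to the toy token type, analogous to how
// a subclass binds the tagger to one type system's token class.
class MyPosAnnotator extends PosAnnotatorSketch<MyToken> {
    protected void setTag(MyToken token, String tag) {
        token.pos = tag;
    }
}

public class GenericTypingSketch {
    public static void main(String[] args) {
        List<MyToken> tokens = Arrays.asList(new MyToken("Dogs"), new MyToken("bark"));
        new MyPosAnnotator().applyTags(tokens, Arrays.asList("NNS", "VBP"));
        System.out.println(tokens.get(0).pos + " " + tokens.get(1).pos); // prints "NNS VBP"
    }
}
```

A JULIE-style binding would simply provide a different setTag implementation, e.g. one that creates a POSTag object and attaches it to the token.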

4) Interfaces

Suppose you wanted to write a UIMA wrapper for a widely-used open-source tool such as the Berkeley Parser. Your wrapper AE would likely have an instance of an implementation of edu.berkeley.nlp.PCFGLA.Parser such as edu.berkeley.nlp.PCFGLA.CoarseToFineMaxRuleParser. To invoke this parser you could call getBestConstrainedParse(words, tags), which takes two lists of strings corresponding to the words and part-of-speech tags of a single sentence. From this, you get back an edu.berkeley.nlp.syntax.Tree object. So, your job is to get words and part-of-speech tags out of your type system, call the Berkeley Parser, and then post the contents of the returned parse tree into your type system as you see fit. Instead of hard-wiring all of this code to your type system, you could define a simple interface such as:

// implementations of this method should expect two empty, modifiable
// lists, which they will populate
public void getWordsAndTags(List<String> words, List<String> tags);

// implementations of this method should add the contents of parseTree
// to the CAS as they see fit
public void postTree(JCas jCas, Tree parseTree);

This is a nice approach because it gives the user maximum flexibility to do whatever is desired with the resulting parse tree, posting as little or as much information from it as needed (a parse tree can carry a lot of information).

This approach can be generalized to any analysis engine by thinking of it in terms of a UIMA-independent analysis core and a UIMA wrapper that goes around it. The core implementation of your analysis engine is independent of any type system (i.e. it uses plain old Java objects, or POJOs) and would likely sit in some class apart from the actual analysis engine implementation. It may be worth pointing out that this approach could be combined with the "generic typing" approach described above. For example, instead of requiring the implementation to fill in the words and tags, that part could be accomplished by the analysis engine itself if it is given enough type information, similar to the part-of-speech tagging example above.
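A self-contained sketch of that core/wrapper split, with UIMA and the Berkeley Parser stripped away (all names here are invented; the "parser" is a stand-in that just produces a flat bracketing):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// The contract the UIMA wrapper would implement; the interface itself
// knows nothing about any particular type system.
interface SentenceAccess {
    // Implementations should populate the two empty, modifiable lists.
    void getWordsAndTags(List<String> words, List<String> tags);
    // Implementations decide what to do with the resulting parse.
    void postParse(String bracketedParse);
}

// A UIMA-independent core: it works only with POJOs and the interface above.
class ParserCore {
    public void parse(SentenceAccess sentence) {
        List<String> words = new ArrayList<String>();
        List<String> tags = new ArrayList<String>();
        sentence.getWordsAndTags(words, tags);
        // Stand-in for the real call, e.g. getBestConstrainedParse(words, tags);
        // here we just fabricate a flat bracketing from the words.
        StringBuilder sb = new StringBuilder("(S");
        for (String word : words) sb.append(" ").append(word);
        sb.append(")");
        sentence.postParse(sb.toString());
    }
}

// A toy implementation; a real one would read from and write to a CAS.
class ToySentence implements SentenceAccess {
    String posted;
    public void getWordsAndTags(List<String> words, List<String> tags) {
        words.addAll(Arrays.asList("Dogs", "bark"));
        tags.addAll(Arrays.asList("NNS", "VBP"));
    }
    public void postParse(String bracketedParse) { posted = bracketedParse; }
}

public class PojoCoreSketch {
    public static void main(String[] args) {
        ToySentence sentence = new ToySentence();
        new ParserCore().parse(sentence);
        System.out.println(sentence.posted); // prints "(S Dogs bark)"
    }
}
```

The UIMA wrapper AE would be a second SentenceAccess implementation that pulls words and tags out of the CAS and posts the parse back into it, leaving ParserCore untouched by any type system.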

While this approach may be the most flexible of the four, it also suffers from feeling a bit un-UIMA. It seems wrong to be constantly translating the contents of a CAS to POJOs, performing analysis on the POJOs, and then translating the results back into the CAS. There may be performance costs for all of this object translation, and it may introduce unnecessary complexity.

Conclusion
This post presents four strategies for creating type system agnostic UIMA components. Each strategy has advantages and disadvantages, and none may be appropriate for your particular task. Again, I am not advocating that all components need to be type system agnostic; I simply would like to see more of them that are.

Finally, this list is almost certainly not comprehensive given the dynamic nature of the CAS and the multitude of ways that type systems can be created and modified at run time. Please add your strategy if you have one.


3 comments:

  1. Nice post Philip! I'm glad to see more discussion of what seems to be an important, yet avoided topic.

    The 5th choice is, of course, to write type system adapter AEs. If you have the same logical annotation created in one of a number of type systems, the question is where does the code that converts between them live? In your solutions above, you preempt the problem by having the AE exist in a modified state so that it writes the annotation to the CAS in the desired type system. This is accomplished in a derived class in choice 3 and a separate class in choice 4. In choice 5, the code lives in a separate AE altogether, though the annotations have been reified/vivified in the CAS in the "wrong" type system. The only advantage I could see in this scenario is that the conversion code is contained in the converter AE instead of distributed among different AEs. This might be advantageous if you had a small target type system that would result in much duplicated code using the other methods.

    I'd like to find the time and more fully understand the differences holding the community back from standardization. Lots of other communities have found productivity or economy in a shared standard. Given the right motivations, this one will too.

    ReplyDelete
  2. Thanks a lot, Philip - Excellent post!

    BTW, I've reported on UIMAfit on my blog here:
    http://jochenleidner.posterous.com/are-you-fit-for-uima-uimafit-provides-support

    Keep up the good work,
    Jochen

    ReplyDelete
  3. An open-source project that I want to investigate related to the issues raised here is http://code.google.com/p/uima-type-mapper/.

    ReplyDelete