users@jaxb.java.net

"Easy" unmarshalling in JAXB 1

From: Aleksei Valikov <valikov_at_gmx.net>
Date: Sat, 25 Feb 2006 21:04:05 +0100

Hi.

I recently needed to unmarshal incomplete XML data with JAXB 1. The
situation is as follows. We have developed a relatively large XML-based
metadata management system on the basis of JAXB 1. Now we need to
import the existing data, and it appears that 80% of the documents are
"a bit" invalid. That is, sometimes a few elements or attributes are
missing. At the same time, the documents are structurally "almost"
correct.

What we needed was a way to import the invalid data with JAXB 1, as
much of it as possible. I have searched the web and found that this
issue is addressed in JAXB 2, but there's no solution for JAXB 1.

http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=5023635
http://www.thescripts.com/forum/threadnav84719-1-10.html

This is actually pretty bad for us, since I can't go out there and tell
my customers "sorry, guys, your 10000+ documents are invalid, we can't
import them".

Again, many thanks to Sun for making the JAXB RI open source. I've dug
around a bit and here are my results.

The JAXB RI builds unmarshallers on the basis of
com.sun.tools.xjc.generator.unmarshaller.automaton.Automaton. The
Automaton is produced by
com.sun.tools.xjc.generator.unmarshaller.AutomatonBuilder, which
examines the expression tree and creates a structure of States
(com.sun.tools.xjc.generator.unmarshaller.automaton.State).

For non-mandatory constructs, the generated states have so-called
"delegated states". As far as I understand, if a state is for some
reason not processed, processing switches to its delegated state. A
state with delegation is generated, for instance, for optional elements,
which are actually represented by "choice(element, epsilon)" structures.
In this case the expression is epsilon-reducible, so if the element
does not appear, the automaton switches to the delegated state.
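
To illustrate the delegation idea described above, here is a minimal toy
sketch. It is not the JAXB RI code: ToyState, its fields and the accepts
method are hypothetical names made up for this example, and the real
State class may well behave differently in detail. The sketch models
"sequence(choice(a, epsilon), b)", where the state expecting "a"
delegates to the state expecting "b".

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class DelegationSketch {
    /** Toy state for illustration only; NOT the JAXB RI State class. */
    static final class ToyState {
        final Map<String, ToyState> transitions = new HashMap<>();
        ToyState delegate; // consulted when no own transition matches

        /** Follows a matching transition, searching the delegate chain if needed. */
        ToyState step(String element) {
            for (ToyState s = this; s != null; s = s.delegate) {
                ToyState next = s.transitions.get(element);
                if (next != null) return next;
            }
            return null; // nothing in the chain accepts this element
        }
    }

    /** Hypothetical automaton for sequence(choice(a, epsilon), b). */
    static boolean accepts(List<String> elements) {
        ToyState end = new ToyState();
        ToyState expectB = new ToyState();
        ToyState expectA = new ToyState();
        expectB.transitions.put("b", end);
        expectA.transitions.put("a", expectB);
        expectA.delegate = expectB; // "a" is optional: fall through to expectB

        ToyState current = expectA;
        for (String e : elements) {
            current = current.step(e);
            if (current == null) return false;
        }
        return current == end;
    }

    public static void main(String[] args) {
        System.out.println(accepts(Arrays.asList("a", "b"))); // true
        System.out.println(accepts(Arrays.asList("b")));      // true, "a" skipped via delegation
        System.out.println(accepts(Arrays.asList("a")));      // false, mandatory "b" is missing
    }
}
```

In this toy model only the optional element can be skipped; the
mandatory "b" has no delegated state behind it, so a document without it
is rejected.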

So to allow "easy" unmarshalling, I needed to assign delegated states
even for non-epsilon-reducible expressions, sequences, and so on.

I've tried changing the code of onSequence, onChoice and _onRepeated
methods:

In the sequence, always delegate to the next state:

         public Object onSequence( SequenceExp exp ) {
             Expression[] children = exp.getChildren();

             State currentTail;

             for( int i=children.length-1; i>=0; i-- )
             {
                 currentTail = tail;
                 tail = (State)children[i].visit(this);
                 tail.setDelegatedState(currentTail);
             }

             return tail;
         }
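
The effect of this sequence change can be sketched with a toy model.
Again, this is only an illustration under my own assumptions: ToyState
and accepts are hypothetical names, and the chain is built by hand
rather than by AutomatonBuilder. With the "lenient" flag set, every
state delegates to the next one, mirroring the setDelegatedState call in
the loop above.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class LenientSequenceSketch {
    /** Toy state for illustration only; NOT the JAXB RI State class. */
    static final class ToyState {
        final Map<String, ToyState> transitions = new HashMap<>();
        ToyState delegate;

        ToyState step(String element) {
            for (ToyState s = this; s != null; s = s.delegate) {
                ToyState next = s.transitions.get(element);
                if (next != null) return next;
            }
            return null;
        }
    }

    /** Checks the input against sequence(names...), strict or lenient. */
    static boolean accepts(boolean lenient, List<String> input, String... names) {
        ToyState end = new ToyState();
        ToyState tail = end;
        // Build the chain back to front, as the onSequence loop does.
        for (int i = names.length - 1; i >= 0; i--) {
            ToyState currentTail = tail;
            tail = new ToyState();
            tail.transitions.put(names[i], currentTail);
            if (lenient) tail.delegate = currentTail; // always delegate to the next state
        }

        ToyState current = tail; // head of the chain
        for (String e : input) {
            current = current.step(e);
            if (current == null) return false;
        }
        // The end may also be reached through delegation (missing trailing elements).
        for (ToyState s = current; s != null; s = s.delegate) {
            if (s == end) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> missingB = Arrays.asList("a", "c");
        System.out.println(accepts(false, missingB, "a", "b", "c")); // false
        System.out.println(accepts(true, missingB, "a", "b", "c"));  // true
        // Order still matters: "a" cannot follow "c" even in lenient mode.
        System.out.println(accepts(true, Arrays.asList("c", "a"), "a", "b", "c")); // false
    }
}
```

In this model a missing element is skipped, but elements that appear in
the wrong order are still rejected, which matches the "structurally
almost correct" kind of document I need to import.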

In a choice, turn on delegation even if the expression is not
epsilon-reducible:

         public Object onChoice( ChoiceExp exp ) {
             Expression[] children = exp.getChildren();

             State currentTail = tail;
             State head = new State();

             for( int i=children.length-1; i>=0; i-- ) {
                 tail = currentTail;
                 State localHead = (State)children[i].visit(this);
                 if( localHead==currentTail )
                     continue; // use delegation to produce a smaller state machine
                 head.absorb( localHead );
             }

//lexi if( exp.isEpsilonReducible() ) {
                 // optimization
                 if( head.hasTransition() )
                     head.setDelegatedState(currentTail);
                 else
                     head = currentTail;
//lexi }

             return head;
         }

In repeated expressions, act as if zero was always allowed:

         private State _onRepeated( Expression itemExp, boolean isZeroAllowed ) {
             State _tail = tail;
             State newHead = (State)itemExp.visit(this);

             _tail.absorb(newHead);
// return isZeroAllowed?_tail:newHead;
             return _tail;
         }

Now, with classes generated with this code, I can unmarshal even
invalid XML.

Well, I understand that this is quite a hack, but it has worked for me.
I'd like to ask the JAXB developers what you think of it, and whether
there is any chance of getting these changes into the official code. Of
course, not in the default mode, but if I turn on something like
noValidatingUnmarshaller, JAXB could generate an "easy" unmarshaller.

Bye.
/lexi