Re: StAX features <was> Re: setFeature Bug?

From: Tatu Saloranta <cowtowncoder_at_yahoo.com>
Date: Thu, 5 Jan 2006 11:30:11 -0800 (PST)

--- Paul Sandoz <Paul.Sandoz_at_Sun.COM> wrote:

> Tatu Saloranta wrote:
> > Paul, since I'm quite interested in "all things stax",
> > I was curious about this comment in particular:
...
> > I was curious as to what features would these be, and
> > whether they'd be implementation-specific extensions
> > (custom properties for sjsxp), or actual additions
> > to a later stax api/specs revision (1.1 or 2.0)?
>
> Only the former, but specified on java.net. Kohsuke

Ok, even that might be useful. I have been wondering
whether it'd be possible to specify de facto standard
shared properties: this does not need spec updates, just
consensus on useful things. I think SAX also has a set
of agreed-upon standard feature/property ids.

I have added a few custom properties to the Woodstox stax
parser (http://woodstox.codehaus.org), and the ones I
thought might have the most general use I also added to
the "Stax2" API (more on this later).

I would be very interested in discussing the possibility
of coming up with a set of additional optional
properties that sjsxp, the ref. impl. and woodstox would
all at least recognize, if not support (along with anyone
else who wants to recognize them -- but these three are
the easiest to work on, I think).

> created a new project, see here [1], for the interface
> definitions. As for the spec stuff i really do not know,
> i do hope that something happens in this respect.

Yeah, me too. ;-)

> At the moment this project contains some basic stuff and
> there are a couple of documented TODOs.
>
> Fundamentally we want to be able to:
>
> 1) iterate on the in-scope namespaces; and

I think this is accessible via XMLStreamReader already
(to some degree -- a combination of getNamespaceContext()
for bindings from parents, and getNamespaceXxx() for ones
declared on the current element).
But it would be useful to be able to precisely iterate
through all active mappings, I agree. Not to mention
quite easy to implement -- the parser already has to keep
track of that info.
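
To illustrate what "to some degree" means with plain Stax 1.0: the
application can rebuild the full set of active bindings itself by
stacking the per-element declarations. This is only a rough sketch;
the class and method names (InScopeNamespaces, inScopeAt) are made up
for illustration and are not part of any API.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Stax or Stax2.
final class InScopeNamespaces {

    // Returns the prefix -> URI bindings in scope at the first element
    // with the given local name (default namespace stored under "").
    // The implicit "xml" prefix binding is not included.
    static Map<String, String> inScopeAt(String xml, String localName)
            throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        // One map per open element; each starts as a copy of its parent's.
        Deque<Map<String, String>> scopes = new ArrayDeque<>();
        try {
            while (r.hasNext()) {
                int event = r.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    Map<String, String> scope = new LinkedHashMap<>();
                    if (!scopes.isEmpty()) {
                        scope.putAll(scopes.peek());
                    }
                    // Only declarations made on this element are listed here.
                    for (int i = 0; i < r.getNamespaceCount(); i++) {
                        String prefix = r.getNamespacePrefix(i);
                        scope.put(prefix == null ? "" : prefix,
                                  r.getNamespaceURI(i));
                    }
                    scopes.push(scope);
                    if (localName.equals(r.getLocalName())) {
                        return scope;
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    scopes.pop();
                }
            }
            return new LinkedHashMap<>();
        } finally {
            r.close();
        }
    }
}
```

The copying per element is exactly the bookkeeping the parser already
does internally, which is why exposing an iterator from the reader
would be cheaper than doing it in the application.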

>
> 2) reading/writing of primitive types.

That would be interesting, and could allow for more
efficient access to data-oriented document content.
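
For comparison, here is roughly what an application has to do today
with Stax 1.0 to read an int: go through an intermediate String. A
native typed accessor could presumably skip that allocation. The
helper names below (TypedAccess, getElementAsInt, readFirstInt) are
hypothetical and only meant as a sketch.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Stax or Stax2.
final class TypedAccess {

    // Reader must point at the START_ELEMENT of a text-only element;
    // on return it points at the matching END_ELEMENT.
    static int getElementAsInt(XMLStreamReader r) throws XMLStreamException {
        // getElementText() builds a String first; surrounding white
        // space is trimmed before parsing.
        return Integer.parseInt(r.getElementText().trim());
    }

    // Reads the root element's content as an int.
    static int readFirstInt(String xml) throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            r.nextTag(); // advance to the root START_ELEMENT
            return getElementAsInt(r);
        } finally {
            r.close();
        }
    }
}
```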

> > There are lots of things that could be added, and many
> > maybe should be added... not to mention tons of
> > underspecified things in 1.0 specs.
>
> Totally agree on all three points!
>
> I would be interested in knowing your views on what you
> think should and may be added. Can you enumerate?

Sure! I think a reasonable overview is the "Stax2"
extension API I defined for Woodstox (2.0.x
originally, extending for 3.0.x); Javadocs for current
pre-3.0 are available at:

http://woodstox.codehaus.org/curr/javadoc/index.html

(package 'org.codehaus.stax2.*' and its subpackages).

But here is a grouping of the main things (this does
not contain all woodstox extensions, just the ones I
thought might be generally useful):

* XMLStreamReader:
  * Extra access to things Stax 1.0 has limited access to:
    * DTD information, via the DTDInfo interface (I named
      these interfaces XxxInfo; was considering XxxAccess;
      they are usually implemented by the reader, but need
      not be, same way as NamespaceContext may be). Things
      like public id, system id, root element name. And for
      the validation framework, the actual built schema
      object, to be reused if necessary (esp. for output
      validation).
    * Attribute type information: specifically knowing
      which of the attributes (if any) is of type ID. This
      is accessible via AttributeInfo. It could also
      accommodate typed value accessors.
    * Nested location information: Location doesn't
      support the concept of nested input, which I think is
      a flaw. It's important to know what has been expanded
      (entity, XInclude). Also, separate byte and char
      offsets would be neat (if the impl. supports providing
      them): it should not be unclear which is which.
  * Streaming access to text (mostly to support
    pass-through copying to stream writer); can avoid
    having to allocate memory buffers for long text
    segments.
  * Simple sub-tree skipping (when pointing to
    START_ELEMENT, can traverse to the matching END_ELEMENT).
    A convenience feature, but it can also be done more
    efficiently by the reader (can skip namespace
    bindings, all contained events etc). And trivially
    easy to implement if no optimizations are needed (just
    call the next() method until reaching the matching
    END_ELEMENT).
* XMLStreamWriter:
  * In general, making it more symmetric with the reader
    side: allow optional access to output location (if the
    writer is to keep track of rows, columns, etc; useful
    when debugging output side problems, and esp. with
    validation).
  * Configurable output escaping (which characters to
    output as char entities etc); separately configurable
    for attribute values and text.
  * Validation, using the same/similar framework as the
    parser. I think this is pretty neat; but this area is
    work-in-progress for 3.0 (I'm really hoping to connect
    MSV in the near future -- right now only DTD validation
    is implemented, but it works for XMLStreamWriter quite
    nicely).
  * Copy-through methods from XMLStreamReader; while
    this adds a bit of coupling, it allows for zero-copy (or
    at least minimal copy) output of input text (and some
    other element information too, attr values). It's also
    a convenience feature, to be able to just copy
    whatever the input stream reader points to.
* Factories:
  * Convenience constructors: URL and File should be
    usable as is, and they are better input source
    definitions than a String system id (easier and more
    reliable to dereference relative refs from DTDs etc).
* General fixes:
  * Stax 1.0 does not allow closing of the underlying
    input source. With javax.xml.transform.Source, for
    example, this is a bug: the reader should close it,
    and/or expose a method to close it. This is also
    necessary to support convenience input sources (URL,
    File) since the caller has no access to the underlying
    input source.
* Generic validation framework
  (org.codehaus.stax2.validation):
  * Bi-directional (reader and writer side)
  * Chainable (multiple validators; automatic ones for
    DTD, W3C Schema, manual ones for Relax NG; custom ones
    for business validation)
  * Modular between implementations and reader/writer
    side
* Misc other properties:
  * Whether to report ignorable prolog white space or
    not
  * Report CDATA as CHARACTERS (I think that's
    supported by sjsxp and ref. impl.)
  * Whether names, URIs are interned: the calling
    application can use fast identity comparison when
    they are interned
  * Whether the impl. is to keep track of locations for
    events or not (XMLEvent does support storing of
    Location info) -- turning this off is actually quite
    significant with the Event API (10-15% reduction, due
    to less GC I assume). For the stream reader the
    difference need not be big (Location objects can be
    constructed as needed).
  * Whether lazy parsing is allowed or not (it is good
    to defer reading of some things, like comment
    content, as it's quite often not used by the app --
    but doing this causes async/delayed exceptions for
    invalid docs).
  * XMLStreamWriter: is it namespace aware; whether to
    validate structural aspects, name validity (these are
    useful but also add overhead); how to add/change
    character data/attribute value escaper object.
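
As an aside on the sub-tree skipping item in the XMLStreamReader
group above: the unoptimized version really is just a loop over
next() with a depth counter. A minimal sketch using only Stax 1.0
calls follows; the class and method names (SubtreeSkip, skipElement,
nameAfterSkippingFirstChild) are invented for this example and are
not the Stax2 API.

```java
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper, not part of Stax or Stax2.
final class SubtreeSkip {

    // Reader must point at START_ELEMENT; on return it points at the
    // matching END_ELEMENT, with everything in between consumed.
    static void skipElement(XMLStreamReader r) throws XMLStreamException {
        if (r.getEventType() != XMLStreamConstants.START_ELEMENT) {
            throw new IllegalStateException("not at START_ELEMENT");
        }
        int depth = 1; // number of currently open elements
        while (depth > 0) {
            int event = r.next();
            if (event == XMLStreamConstants.START_ELEMENT) {
                depth++;
            } else if (event == XMLStreamConstants.END_ELEMENT) {
                depth--;
            }
        }
    }

    // Skips the first child of the root and returns the local name of
    // the element that follows it.
    static String nameAfterSkippingFirstChild(String xml)
            throws XMLStreamException {
        XMLStreamReader r = XMLInputFactory.newInstance()
                .createXMLStreamReader(new StringReader(xml));
        try {
            r.nextTag();    // root START_ELEMENT
            r.nextTag();    // first child START_ELEMENT
            skipElement(r); // now at the first child's END_ELEMENT
            r.nextTag();    // next sibling START_ELEMENT
            return r.getLocalName();
        } finally {
            r.close();
        }
    }
}
```

A reader-side implementation could do better than this, since it
would not need to build events or resolve namespace bindings for the
content it skips.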

So, anyway, I think it would be great if some subset
of these and other features could be discussed.
Perhaps Stax 1.1/2.0 could and should be built based
on experiences from implementations, things that more
than one implementation has and that people think are
genuinely useful general additions.

One final thing; although I am mostly working on
Woodstox, I also have committer rights to the ref.
impl., and could help in adding some of the simpler
features there.

-+ Tatu +-

