Proposal: XML Stream

Author(s): Sam Whited sam@samwhited.com

Last updated: 2017-03-09

Discussion at https://golang.org/issue/19480

Abstract

The encoding/xml package contains an API for tokenizing an XML stream, but no API exists for processing or manipulating the resulting token stream. This proposal describes such an API.

Background

The encoding/xml package contains APIs for tokenizing an XML stream and decoding that token stream into native data types. Once unmarshaled, the data can then be manipulated and transformed. However, this is not always ideal. If we cannot change the type we are unmarshaling into and it does not match the XML format we are attempting to deserialize, e.g. because the type is defined in a separate package or cannot be modified for API compatibility reasons, we may have to first unmarshal into a type we control and then copy each field over to the original type; this is cumbersome and verbose. Unmarshaling into a struct is also lossy. As stated in the encoding/xml package documentation:

Mapping between XML elements and data structures is inherently flawed: an XML element is an order-dependent collection of anonymous values, while a data structure is an order-independent collection of named values.

This means that transforming the XML stream itself cannot always be accomplished by deserializing into a struct and then reserializing the struct back to XML; instead, the XML tokens must be manipulated directly. This may require re-implementing parts of the XML package: for instance, when renaming an element, the start and end tags have to be matched in user code so that both can be transformed to the new name.
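
For illustration, the following rough sketch (using only the existing encoding/xml API; the helper name renameElement is hypothetical) shows the boilerplate this currently requires: every token must be copied from a Decoder to an Encoder by hand, and both the start and end tags must be rewritten.

// renameElement copies the XML stream from r to w, renaming every element
// named from to to. Both the start and end tags must be handled explicitly.
func renameElement(w io.Writer, r io.Reader, from, to string) error {
	d := xml.NewDecoder(r)
	e := xml.NewEncoder(w)
	for {
		tok, err := d.Token()
		if err == io.EOF {
			return e.Flush()
		}
		if err != nil {
			return err
		}
		switch t := tok.(type) {
		case xml.StartElement:
			if t.Name.Local == from {
				t.Name.Local = to
			}
			tok = t
		case xml.EndElement:
			if t.Name.Local == from {
				t.Name.Local = to
			}
			tok = t
		}
		if err := e.EncodeToken(tok); err != nil {
			return err
		}
	}
}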

To address these issues, an API for manipulating the token stream itself, before marshaling or unmarshaling occurs, is necessary. Ideally, such an API should allow for the composition of complex XML transformations from simple, well-understood building blocks. The transducer pattern, widely available in functional languages, matches these requirements well.

Transducers (also called transformers, adapters, etc.) are iterators that provide a set of operations for manipulating the data being iterated over. Common transducer operations include Map, Reduce, and Filter; these operations are already widely known and understood.

Proposal

The proposed API introduces two concepts that do not already exist in the encoding/xml package:

// A Tokenizer is anything that can decode a stream of XML tokens, including an
// xml.Decoder.
type Tokenizer interface {
	Token() (xml.Token, error)
	Skip() error
}

// A Transformer is a function that takes a Tokenizer and returns a new
// Tokenizer which outputs a transformed token stream.
type Transformer func(src Tokenizer) Tokenizer
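
To illustrate how these two types fit together, the following is a rough sketch, not part of the proposed API (the names renamer and RenameElement are hypothetical), of a Transformer that renames elements by wrapping the source Tokenizer:

// renamer wraps a Tokenizer and renames matching start and end elements as
// they pass through.
type renamer struct {
	src      Tokenizer
	from, to string
}

func (r renamer) Token() (xml.Token, error) {
	tok, err := r.src.Token()
	if err != nil {
		return nil, err
	}
	switch t := tok.(type) {
	case xml.StartElement:
		if t.Name.Local == r.from {
			t.Name.Local = r.to
		}
		return t, nil
	case xml.EndElement:
		if t.Name.Local == r.from {
			t.Name.Local = r.to
		}
		return t, nil
	}
	return tok, nil
}

func (r renamer) Skip() error { return r.src.Skip() }

// RenameElement returns a Transformer that renames elements.
func RenameElement(from, to string) Transformer {
	return func(src Tokenizer) Tokenizer {
		return renamer{src: src, from: from, to: to}
	}
}

Compared with the manual loop shown in the Background section, the renaming logic is unchanged, but the plumbing is written once and composes with other Transformers.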

Common transducer operations will also be included:

// Inspect performs an operation for each token in the stream without
// transforming the stream in any way.
// It is often injected into the middle of a transformer pipeline for debugging.
func Inspect(f func(t xml.Token)) Transformer {}

// Map transforms the tokens in the input using the given mapping function.
func Map(mapping func(t xml.Token) xml.Token) Transformer {}

// Remove returns a Transformer that removes tokens for which f matches.
func Remove(f func(t xml.Token) bool) Transformer {}

Because Go does not provide a generic iterator concept, this API (like other transducer-style APIs in the Go libraries) is domain specific, meaning that operations that only make sense when discussing XML tokens can also be included:

// RemoveElement returns a Transformer that removes entire elements (and their
// children) if f matches the element's start token.
func RemoveElement(f func(start xml.StartElement) bool) Transformer {}
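
A rough sketch of how such a Transformer might be implemented (the name elementRemover is hypothetical) also shows why the Tokenizer interface includes Skip: when the predicate matches a start element, the wrapped source can be asked to discard everything up to and including the matching end tag.

// elementRemover drops entire elements whose start token matches f.
type elementRemover struct {
	src Tokenizer
	f   func(start xml.StartElement) bool
}

func (er elementRemover) Token() (xml.Token, error) {
	for {
		tok, err := er.src.Token()
		if err != nil {
			return nil, err
		}
		if start, ok := tok.(xml.StartElement); ok && er.f(start) {
			// Skip the element's children and its end tag, then
			// continue with the next token.
			if err := er.src.Skip(); err != nil {
				return nil, err
			}
			continue
		}
		return tok, nil
	}
}

func (er elementRemover) Skip() error { return er.src.Skip() }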

Rationale

Transducers are commonly used in functional programming and in languages that take inspiration from functional programming languages, including Go. Examples include Clojure transducers, Rust adapters, and the various “Transformer” types used throughout Go, such as in the golang.org/x/text/transform package. Because transducers are so widely used (and already used elsewhere in Go), they are well understood.

Compatibility

This proposal introduces two new exported types and four exported functions that would be covered by the compatibility promise. A minimal set of Transformers is proposed, but others can be added at a later date without breaking backwards compatibility.

Implementation

A version of this API is already implemented in the mellium.im/xmlstream package. If this proposal is accepted, the author volunteers to copy the relevant parts to the correct location before the Go 1.9 (or 1.10, depending on the length of this proposal process) planning cycle closes.

Open issues

  • Where does this API live? It could live in the encoding/xml package itself, in another package (e.g. encoding/xml/stream), or, temporarily or permanently, in a subrepository such as golang.org/x/xml/stream.
  • A Transformer for removing attributes from xml.StartElement tokens was originally proposed as part of this API, but it is difficult to implement efficiently, since each use of RemoveAttr in a pipeline would need to iterate over the xml.Attr slice separately (see the sketch after this list).
  • Existing APIs in the XML package such as DecodeElement require an xml.Decoder to function and could not be used with the Tokenizer interface proposed here. A compatibility API may be needed to create a new Decoder with an underlying Tokenizer; this would require that the new functionality reside in the encoding/xml package. Alternatively, general Decoder methods could be reimplemented in a new package on top of the Tokenizer interface.
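
To illustrate the attribute-removal concern from the list above, a naive RemoveAttr (hypothetical, not part of this proposal) could be written in terms of Map, but each such stage in a pipeline re-scans and copies the attribute slice of every start element independently:

// RemoveAttr removes attributes matching f from start elements. Composing
// several RemoveAttr stages means several independent passes over each
// element's Attr slice.
func RemoveAttr(f func(attr xml.Attr) bool) Transformer {
	return Map(func(tok xml.Token) xml.Token {
		start, ok := tok.(xml.StartElement)
		if !ok {
			return tok
		}
		attrs := make([]xml.Attr, 0, len(start.Attr))
		for _, attr := range start.Attr {
			if !f(attr) {
				attrs = append(attrs, attr)
			}
		}
		start.Attr = attrs
		return start
	})
}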