Proposal: Natural XML

Author(s): Sam Whited sam@samwhited.com

Last updated: 2016-09-27

Discussion at https://golang.org/issue/13504.

Abstract

The encoding/xml API is arguably difficult to work with. In order to fix these issues, a more natural API is needed that acts on nodes in a tree like structure instead of directly on the token stream.

Background

XML parsers generally operate in one of two modes of operation, a “DOM style” mode in which entire documents are parsed into a tree-like data structure, the “Document Object Model” (DOM), and an event-driven “SAX style” mode (Simple API for XML) in which tokens are streamed one at a time and only handled if they would trigger a callback or event. The benefit of a DOM style node is that all information contained in the XML is rapidly accessible and can be accessed at will, whereas in a SAX style mode only information at the current parse location is readily available and other arrangements have to be made to store previously visible information. However, the SAX style mode generally provides a relatively small and stable memory footprint, while the DOM style mode requires parsers to load an entire document into memory.

Go currently supports a hybrid approach to this situation: entire documents or elements may be read into native data structures, or individual tokens may be read off the wire and handled directly by the application. This works well for simple elements where the entire structure is known, but for XML with an arbitrary format it forces use of the low-level token stream APIs directly which is error prone and cumbersome.

Proposal

Having a higher level tree-like API will allow users to manipulate arbitrary XML in a more natural way that is compatible with Go's hybrid SAX and DOM style approach to parsing XML.

Implementation

An interface originally suggested by RSC is proposed:

// An Element represents the complete parse of a single XML element.
type Element struct {
	StartElement
	Child []Child
}

// A Child is an interface holding one of the element child types:
// *Element, CharData, or Comment.
type Child interface{}

The *Element type will implement xml.Marshaler and xml.Unmarshaler to make it compatible with the existing (*xml.Encoder) Encode and (*xml.Decoder) Decode methods for situations where entire XML elements should be consumed. This makes it compatible with both styles of XML parsing in Go. For example, an entire element could be unmarshaled simply:

el := xml.Element{}
err := d.Decode(&el)

Or specific children could be unmarshaled:

tok, _ := d.Token()
el := xml.Element{StartElement: tok.(StartElement)}

// Only unmarshal the child named "body"
for ; err == nil; tok, err = d.Token() {
	if start, ok := tok.(StartElement); ok && start.Name.Local == "body" {
		child := xml.Child{}
		_ = xml.DecodeElement(&child, start)
		el.Child = append(el.Child, child)
	}
}

The author volunteers to complete this work in the next release cycle with enough time left after this proposal is accepted and conservatively estimates that a week of work would be required to complete the changes, including tests. The changes themselves are relatively easy and this lengthy estimate is mostly because the authors time is limited to evenings and weekends. If someone who's job permitted them to work on Go were to accept the task, the work could almost certainly be completed much quicker.

Rationale

For large XML documents or streams that cannot be parsed all at once, the given approach does make parsing less complicated since we still have to iterate over the token stream. It may be possible to fix this by adding new methods to the *xml.Encode and *xml.Decode types specifically for dealing with elements, but the author deems that the benefit is not worth the added complexity to the XML package. The current solution is simple and does not preclude adding a more robust Element based API at a later date.

Compatibility

This proposal does not introduce any changes that would break compatibility with existing code. It adds two types which would need to be covered under the compatibility promise in the future.

Open issues (if applicable)

  • For elements with large numbers of children, accessing a specific child via a slice may be slow. Using a map would be a simple fix, but this makes accessing arrays with few elements slower (the crossover is somewhere around 10 elements in a very informal benchmark). Using a trie or some other appropriate tree-like structure can give us the best of both worlds, but adds a great deal of complexity that is almost certainly not worth it. It may, however, be worth not making the children slice public (and using accessor methods instead) so that the implementation could easily be switched out at a later date.