All about XML








pykml.parser vs. lxml parser?


element = start-tag + end-tag, or empty-element tag

attribute = markup construct consisting of a name–value pair that exists within a start-tag or empty-element tag

node, element, attribute


    <child name="child1">


child1 = value

name = attribute

child = element

root = node
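
The vocabulary above can be tried out with a short Python sketch. It assumes the standard-library xml.etree.ElementTree module and a made-up document that wraps the child tag in a hypothetical root element.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the <child> tag from the note, wrapped in a <root>.
doc = '<root><child name="child1"/></root>'

root = ET.fromstring(doc)      # the root element of the tree
child = root.find("child")     # first <child> element directly under the root

print(root.tag)                # element name of the root
print(child.tag)               # element name of the child
print(child.attrib)            # all attributes of the child, as a dict
print(child.get("name"))       # value of the "name" attribute
```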

ElementTree, a DOM-like API, is part of the Python standard library (xml.etree.ElementTree).

lxml, which implements the ElementTree API and also provides XSLT, XPath, and more, is a separate third-party package.

XML parser = processor. Strings of characters that are not markup are content.

The XML specification defines a valid XML document as a well-formed XML document which also conforms to the rules of a Document Type Definition (DTD). In other words, validity goes beyond well-formedness: the document contains a reference to a DTD, and its elements and attributes are declared in that DTD and follow the grammatical rules the DTD specifies for them. The DTD, inherited from SGML, is the oldest schema language for XML.
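
A minimal sketch of the well-formedness half of that distinction, assuming the standard-library xml.etree.ElementTree parser. That parser is non-validating, so it only reports whether a document is well-formed; checking validity against a DTD needs a validating parser (lxml offers one). The helper name is_well_formed is made up for this note.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as well-formed XML (no DTD validation)."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<a><b/></a>"))   # tags are properly nested and closed
print(is_well_formed("<a><b></a>"))    # <b> is never closed
```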

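Existing APIs for XML processing tend to fall into these categories: stream-oriented APIs, tree-traversal APIs, data-binding APIs, and declarative transformation languages such as XSLT and XQuery.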

Stream-oriented facilities require less memory and, for certain tasks based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require the use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions.

XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases.

Document Object Model (DOM) is an API that allows for navigation of the entire document as if it were a tree of node objects representing the document's contents.

For Python: pyKML, SimpleKML, fastkml ("If available, lxml will be used to increase its speed.")


XML-RPC is an easier alternative to SOAP for building web services, i.e. writing routines accessible from a remote host over HTTP, carrying XML-formatted data. Take into consideration the latency of calling routines over a network link, especially a WAN link like the Internet, and also the fact that, in an interpreted language, the code that makes up the routines must be interpreted each time a routine is called; JIT compilers and cache managers can lower that cost.

Here's what a sample XML-RPC call looks like (as copied from here):

POST xmlrpcexample.php HTTP/1.0
User-Agent: xmlrpc-epi-php/0.2 (PHP)
Host: localhost:80
Content-Type: text/xml
Content-length: 191
<?xml version='1.0' encoding="iso-8859-1" ?>

Some solutions to access a web service through XML-RPC from VB and PHP code:



Here's a sample that worked using (web server) and vbXMLRPC (VB5 client):

//Server (PHP): /xmlrpc/timesrv.php
function getTime($args) {
        return "Yes! Ain't XML-RPC cool && dandy?";
}
$server = new IXR_Server(array('test.getTime' => 'getTime'));

'Client (VB5, vbXMLRPC)
Dim linsRequest As New XMLRPCRequest
Dim linsResponse As XMLRPCResponse
Dim linsUtility As New XMLRPCUtility
linsRequest.HostName = "localhost"
linsRequest.HostPort = 80
linsRequest.HostURI = "/xmlrpc/timesrv.php"
linsRequest.MethodName = "test.getTime"
Set linsResponse = linsRequest.Submit
Label1.Caption = linsResponse.Params(1).StringValue
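
For comparison, a minimal Python sketch of the XML payload behind such a call, assuming only the standard-library xmlrpc.client module. It serializes a call to the test.getTime method name from the sample above and decodes it again, without contacting any server; sending it for real would go through xmlrpc.client.ServerProxy with the server's URL.

```python
import xmlrpc.client

# Build the XML body of a call to "test.getTime" with no parameters.
payload = xmlrpc.client.dumps((), methodname="test.getTime")
print(payload)

# Decoding the body gives back the parameter tuple and the method name.
params, method = xmlrpc.client.loads(payload)
print(method, params)
```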

Reading Notes from "Beginning XML, 2nd Edition" By David Hunter, Kurt Cagle, Chris Dix et al., 2003

This is where the extensible in Extensible Markup Language comes from: anyone is free to mark up data in any way using the language, even if others are doing it in completely different ways.

There have already been numerous projects to produce industry-standard vocabularies to describe various types of data. For example, Scalable Vector Graphics (SVG) is an XML vocabulary for describing two-dimensional graphics.

XSLT was created for transforming XML documents from one format to another, and it can potentially make these kinds of transformations very simple.

What HTML does for display, XML is designed to do for data exchange.

XML also groups information in hierarchies. The items in our documents relate to each other in parent/child and sibling/sibling relationships. These "items" are called elements.

This structure is also called a tree; any parts of the tree that contain children are called branches, while parts that have no children are called leaves.

Because the <name> element has only other elements for children, and not text, it is said to have element content. Conversely, since <first>, <middle>, and <last> have only text as children, they are said to have simple content. Elements can contain both text and other elements. They are then said to have mixed content.
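
The three kinds of content can be told apart mechanically. The sketch below assumes xml.etree.ElementTree; the function name content_kind and the sample values are made up, while the name, first, middle and last element names come from the note above.

```python
import xml.etree.ElementTree as ET

def content_kind(elem: ET.Element) -> str:
    """Classify an element by what it contains."""
    has_children = len(elem) > 0
    # Text can sit before the first child (elem.text) or after any child (child.tail).
    texts = [elem.text] + [child.tail for child in elem]
    has_text = any(t and t.strip() for t in texts)
    if has_children and has_text:
        return "mixed content"
    if has_children:
        return "element content"
    if has_text:
        return "simple content"
    return "empty"

person = ET.fromstring(
    "<person>"
    "<name><first>John</first><middle>Q</middle><last>Doe</last></name>"
    "<note>Call <b>before</b> noon</note>"
    "</person>"
)
for path in ("name", "name/first", "note"):
    print(path, "->", content_kind(person.find(path)))
```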

Document type: structured in a specific way, to describe a specific type of information.

DTDs and Schemas provide ways to define our document types.

Namespaces provide a means to distinguish one XML vocabulary from another, which allows us to create richer documents by combining multiple vocabularies into one document type.
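
A small sketch of two vocabularies sharing one document, assuming xml.etree.ElementTree. Both namespace URIs and all element names are invented for illustration; the point is that the two title elements stay distinguishable.

```python
import xml.etree.ElementTree as ET

# Two hypothetical vocabularies, each identified by a made-up namespace URI.
doc = (
    '<order xmlns="http://example.com/orders" '
    'xmlns:p="http://example.com/people">'
    "<title>Order 42</title>"
    "<p:person><p:title>Dr.</p:title></p:person>"
    "</order>"
)
# Prefixes used in queries are local to this mapping, not to the document.
ns = {"o": "http://example.com/orders", "p": "http://example.com/people"}

root = ET.fromstring(doc)
order_title = root.find("o:title", ns).text
person_title = root.find("p:person/p:title", ns).text
print(order_title, "/", person_title)
print(root.tag)   # ElementTree stores names as "{namespace-uri}local-name"
```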

XPath describes a querying language for addressing parts of an XML document. This allows applications to ask for a specific piece of an XML document, instead of having to always deal with one large "chunk" of information.
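
As a rough illustration of asking for one piece instead of the whole chunk, the sketch below uses the limited XPath subset built into xml.etree.ElementTree (full XPath needs a library such as lxml). The catalog document is made up.

```python
import xml.etree.ElementTree as ET

catalog = ET.fromstring(
    "<catalog>"
    "<book id='b1'><title>First</title><price>10</price></book>"
    "<book id='b2'><title>Second</title><price>25</price></book>"
    "</catalog>"
)

# Ask for one specific piece: the title of the book whose id is "b2".
title = catalog.find("./book[@id='b2']/title").text

# Ask for every <price> element anywhere below the root.
prices = [p.text for p in catalog.findall(".//price")]
print(title, prices)
```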

For simpler cases, we can use Cascading Style Sheets (CSS) to define the presentation of our documents. And, for more complex cases, we can use Extensible Stylesheet Language (XSL), that consists of XSLT, which can transform our documents from one type to another, and Formatting Objects, which deal with display.

XLink and XPointer are languages that are used to link XML documents to each other, in a similar manner to HTML hyperlinks.

Two ways for traditional applications to interface with XML documents: document object model (DOM), and Simple API for XML (SAX).

XML is also used as a protocol for Remote Procedure Calls (RPC). A technology called the Simple Object Access Protocol (SOAP) allows this to occur even through a firewall, which would normally block such calls, providing greater opportunities for distributed computing.

The text between the start-tag and end-tag of an element is called the element content.

The root element contains the entire XML document.

An empty element can be written as a self-closing tag, e.g. <parody />.

Reading Notes from "Learning XML, 2nd Edition" By Erik T. Ray, O'Reilly, September 2003

XML's markup divides a document into separate information containers called elements.

If XML markup is a structural skeleton for a document, then tags are the bones. They mark the boundaries of elements, allow insertion of comments and special instructions, and declare settings for the parsing environment. A parser, the front line of any program that processes XML, relies on tags to help it break down documents into discrete XML objects. Inside element start tags, you sometimes will see some extra characters next to the element name in the form of name="value". These are attributes. They associate information with an element that may be inappropriate to include as character data.

An XML document has two parts. First is the document prolog, a special section containing metadata. The second is an element called the document element, also called the root element for reasons you will understand when we talk about trees. The root element contains all the other elements and content in the document. The prolog is optional. If you leave it out, the parser will fall back on its default settings.

The markup symbols are delineated by angle brackets (<>). <to> and </villain> are two such symbols, called tags. The data, or content, fills the space between these tags.

Document type definition (DTD). <!DOCTYPE...> is one example of a type of markup called a declaration. Declarations are used to constrain grammar and declare pieces of text or resources to be included in the document. This line isn't required unless you want a parser to validate your document's structure against a set of rules you provide in the DTD.

The document element is also sometimes called the root element.

The empty tag <graphic.../> represents an empty element. Rather than containing data, this element references some other information that should be used in its place, in this case a graphic to be displayed. Empty elements do not mark boundaries around text and other elements the way container elements do, but they still may convey positional information. Every element that contains data has to have both a start tag and an end tag or the empty form used for graphic. (It's okay to use a start tag immediately followed by an end tag for an empty element; the empty tag is effectively an abbreviation of that.)
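
The equivalence of the two spellings can be checked with a short sketch, assuming xml.etree.ElementTree. The fileref attribute and its value are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same empty element, written with the empty tag and with a start/end pair.
short_form = ET.fromstring('<graphic fileref="fig1.png"/>')
long_form = ET.fromstring('<graphic fileref="fig1.png"></graphic>')

for elem in (short_form, long_form):
    # Neither form has text or child elements.
    print(elem.tag, elem.attrib, elem.text, len(elem))

# Serializing both parsed elements gives identical output.
same = ET.tostring(short_form) == ET.tostring(long_form)
print(same)
```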

Strictly speaking, XML is not a markup language. A language has a fixed vocabulary and grammar, but XML doesn't actually define any elements. Instead, it lays down a foundation of syntactic constraints on which you can build your own language. So a more apt description might be to call XML a markup language toolkit.

Because XML doesn't have a predetermined vocabulary, it's possible to invent a markup language as you go along. Documents that follow the syntax rules of XML are well-formed XML documents. A document model is the blueprint for an instance of a markup language. It gives you an even stricter test than well-formedness. When a document instance matches a document model, we say that it is valid.

There are several ways to define a markup language formally. The two most common are document type definitions (DTDs) and schemas. Schemas are a later invention, offering more flexibility and a way to specify patterns for data, which is absent from DTDs.
One limitation of DTDs is that they don't do much checking of text content. An alternative document modeling scheme provides the solution. XML Schemas provide much more detailed control over a document, including the ability to compare text with a pattern you define.

The XPath language provides a convenient method to specify which nodes to return in a tree context. A parser written as a hybrid will only need to return a list of nodes that match an XPath expression. A stream parser efficiently searches through the document to find the nodes, then passes the locations to a tree builder that assembles them into object trees. XPath's advantage is that it has a very rich language for specifying nodes, giving the developer a lot of control and flexibility.

The two most popular stylesheet languages are Cascading Style Sheets (CSS) and the Extensible Stylesheet Language (XSL). The former is very simple and fine for most online documents. The latter is highly detailed and better for print-quality documents.

Extensible Stylesheet Language Transformations (XSLT) automates the task of converting one format into another, a process called transformation. XSLT is essentially a programming language optimized for transforming XML. It requires a set of transformation instructions, which happens to be called a stylesheet (not to be confused with a CSS stylesheet). An XSLT processor is a program that takes an XML document and an XSLT stylesheet as input and outputs a transformed document.

Most programming languages have support for parsing and navigating XML. They frequently make use of two standard interfaces. The Simple API for XML (SAX) is very popular for its simplicity and efficiency. The Document Object Model (DOM) outlines an interface for moving around an object tree of a document for more complex processing.

PyXML supports DTD validation, SAX2, DOM2, PullDOM.

To return to the ideals of generic coding, some people tried to adapt SGML for the Web—or rather, to adapt the Web to SGML. This proved too difficult. SGML was too big to squeeze into a little web browser. A smaller language that still retained the generality of SGML was required, and thus was born the Extensible Markup Language (XML).


DOM and SAX are often too complex for a simple query. XPath is a shorthand for locating a point inside an XML document. It is used in XPointers and also in places like XSLT and some DOM implementations to provide a quick way to move around a document.

STOPPED 2.4 Elements

XPath: Each step in a path touches a branching or terminal point in the tree called a node. In keeping with the arboreal terminology, a terminal node (one with no descendants) is sometimes called a leaf. In XPath, there are seven different kinds of nodes: root, element, attribute, text, comment, processing instruction, and namespace.

XPath uses chains of steps. The terms "child" and "parent" are still applicable. A location path is a chain of location steps that get you from one point in a document to another. If the path begins with an absolute position (say, the root node), then we call it an absolute path. Otherwise, it is called a relative path because it starts from a place not yet determined. A location step has three parts: an axis that describes the direction to travel, a node test that specifies what kinds of nodes are applicable, and a set of optional predicates that use Boolean (true/false) tests to winnow down the candidates even further.
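
The idea of chained steps with predicates can be sketched with the abbreviated XPath subset that xml.etree.ElementTree supports. The library document is made up, and only relative paths are shown because ElementTree does not accept absolute paths on an element.

```python
import xml.etree.ElementTree as ET

library = ET.fromstring(
    "<library>"
    "<shelf name='A'>"
    "<book lang='en'><title>One</title></book>"
    "<book lang='fr'><title>Deux</title></book>"
    "</shelf>"
    "<shelf name='B'>"
    "<book lang='en'><title>Three</title></book>"
    "</shelf>"
    "</library>"
)

# Three child steps; the middle one carries an attribute predicate.
english = [t.text for t in library.findall("./shelf/book[@lang='en']/title")]

# A position predicate: the second <book> on shelf A.
second = library.find("./shelf[@name='A']/book[2]/title").text

# ".." is a step along the parent axis: shelves holding a French book.
shelves = [s.get("name") for s in library.findall(".//book[@lang='fr']/..")]
print(english, second, shelves)
```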

XPath expressions are statements that can extract useful information from the tree. Instead of just finding nodes, you can count them, add up numeric values, compare strings, and more. They are much like statements in a functional programming language.

XML Pointer Language (XPointer) uses XPath expressions to find points inside external parsed entities, as an extension to uniform resource identifiers (URIs). It could be used, for example, to create a link from one document to an element inside any other.

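XSL is really three technologies rolled into one: XPath, for locating parts of a document; XSLT, for transformation; and XSL Formatting Objects (XSL-FO), for presentation.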

XSL Transformations (XSLT): An XSLT processor (I'll call it an XSLT engine) takes two things as input: an XSLT stylesheet to govern the transformation process and an input document called the source tree. The output is called the result tree.

The two main methods of working with XML files with computer languages are event streams (SAX) and object trees (DOM).

The stream approach treats XML content as a pipeline. As it rushes past, you have one chance to work with it, no look-ahead or look-behind. It is fast and efficient, allowing you to work with enormous files in a short time, but depends on simple markup that closely follows the order of processing. An XML stream emits a series of tokens or events, signals that denote changes in markup status. For example, an element has at least three events associated with it: the start tag, the content, and the end tag.
The XML stream is constructed as it is read, so events happen in lexical order. The content of an element will always come after the start tag, and the end tag will follow that. Somewhere between chopping up a stream into tokens and processing the tokens is a layer one might call an event dispatcher. It branches the processing depending on the type of token. The code that deals with a particular token type is called an event handler. There could be a handler for start tags, another for character data, and so on. A common technique is to create a function or subroutine for each event type and register it with the parser as a call-back, something that gets called when a given event occurs.
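
A minimal sketch of event handlers registered as callbacks, assuming the standard-library xml.sax module. The class name EventLogger and the tiny document are made up; the three handler methods are the ones xml.sax calls for start tags, character data, and end tags.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records each event the parser pushes, in the order it arrives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def characters(self, content):
        # Character data may arrive in several chunks; skip pure whitespace.
        if content.strip():
            self.events.append(("text", content))

    def endElement(self, name):
        self.events.append(("end", name))

handler = EventLogger()
xml.sax.parseString(b"<note><to>Bob</to></note>", handler)
for event in handler.events:
    print(event)
```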

SAX implements what we call push parsing. The parser pushes events at the program, requiring it to react. The parser doesn't store any state information, contextual clues that would help in decisions for how to parse, so the application has to store this information itself.
Pull parsing (eg. XMLPULL) is just the opposite. The program takes control and tells the parser when to fetch the next item. Instead of reacting to events, it proactively seeks out events. This allows the developer more freedom in designing data handlers, and greater ability to catch invalid markup.
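
A rough Python counterpart to pull parsing, assuming the standard-library xml.dom.pulldom module (this is not XMLPULL itself, just the same idea): the program asks for the next item when it is ready for it. The sample document is made up.

```python
from xml.dom import pulldom

stream = pulldom.parseString("<note><to>Bob</to><from>Alice</from></note>")

# Explicit pull: the program decides when to fetch the next event.
first_event, first_node = next(stream)

# Keep pulling, picking out only the events of interest.
names = []
for event, node in stream:
    if event == pulldom.START_ELEMENT:
        names.append(node.tagName)

print(first_event, names)
```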

The workhorse of SAX is the SAX driver. A SAX driver is any program that implements the SAX2 XMLReader interface. It may include a parser that reads XML directly, or it may just be a wrapper for another parser to adapt it to the interface. It may even be a converter, transmuting data of one kind (say, SQL queries) into XML.

Streams fail in situations where data is so complex that it requires a lot of searching around. For example, XSLT jumps from element to element in an order that may not match the lexical order at all. When that is the case, we prefer to use the tree model.

The tree method is luxurious in comparison to streams: the whole document is held in memory as a persistent data structure. This structure requires more resources to build and store, so you will only want to use it when the stream method cannot help. Persistence is the key reason for using trees. Since a tree is acyclic (it has no circular links), you can use simple traversal methods that won't get stuck in infinite loops. Like a filesystem directory tree, you can represent the location of a node easily in simple shorthand. Like real trees, you can break a piece off and treat it like a smaller tree. Most important, you have all the information in one place for as long as you need it. With streams, you are forced to work with events as they arrive, perhaps storing bits of data for later use. Tree processing is usually object-oriented. The data structure representing the document is composed of objects whose methods allow you to traverse in different directions, pull out data, or modify values.
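
A short sketch of those tree properties, assuming xml.etree.ElementTree and a made-up book document: a simple traversal, a piece broken off and treated as a smaller tree, and a value modified in place.

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book>"
    "<chapter n='1'><para>a</para></chapter>"
    "<chapter n='2'><para>b</para><para>c</para></chapter>"
    "</book>"
)

# Simple traversal in document order; an acyclic tree guarantees it ends.
tags = [elem.tag for elem in book.iter()]

# Break a piece off and work with it as a smaller tree.
chapter2 = book.find("chapter[@n='2']")
para_count = len(chapter2.findall("para"))

# The tree persists, so a change made now is visible to later queries.
chapter2.set("n", "2a")
print(tags, para_count, book.find("chapter[@n='2a']") is chapter2)
```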

While SAX defines an interface of handler methods, the DOM specification calls for a number of classes, each with an interface of methods that affect a particular type of XML markup. Thus, every object instance manages a portion of the document tree, providing accessor methods to add, remove, or modify nodes and data. These objects are typically created by a factory object, making it a little easier for programmers who only have to initialize the factory object themselves. In DOM, every piece of XML (the element, text, comment, etc.) is a node represented by a node object.
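
A minimal DOM sketch, assuming the standard-library xml.dom.minidom module and a made-up list document. Here the Document object plays the factory role: it creates the new element and text nodes, which are then attached through node methods.

```python
from xml.dom import minidom

doc = minidom.parseString("<list><item>one</item></list>")
root = doc.documentElement

# The Document object acts as the factory for new nodes.
item = doc.createElement("item")
item.appendChild(doc.createTextNode("two"))
root.appendChild(item)

# Every piece of the document is a node object with accessor methods.
texts = [node.firstChild.data for node in root.getElementsByTagName("item")]
result = root.toxml()
print(texts)
print(result)
```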

If streams and trees are the two extremes on a spectrum of XML processing techniques, then the middle ground is home to solutions we might call hybrids. They combine the best of both worlds, low resource overhead of streams with the convenience of a tree structure, by switching between the two modes as necessary. The idea is, if you are only interested in working with a small slice of a document and can safely ignore the rest, then you only need to work with a subtree. The parser scans through the stream until it sees the part that you want, then switches to tree building mode.
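
One way to approximate the hybrid approach in Python is xml.etree.ElementTree.iterparse, which reads the input as a stream but hands over each completed element as a small subtree. The sketch below is an assumption-laden illustration with a made-up log document, not the specific hybrid parser the book has in mind.

```python
import io
import xml.etree.ElementTree as ET

data = io.BytesIO(
    b"<log>"
    b"<entry level='info'><msg>started</msg></entry>"
    b"<entry level='error'><msg>disk full</msg></entry>"
    b"<entry level='info'><msg>stopped</msg></entry>"
    b"</log>"
)

errors = []
# Stream through the input; each "end" event delivers a finished subtree.
for event, elem in ET.iterparse(data, events=("end",)):
    if elem.tag == "entry":
        if elem.get("level") == "error":
            # Tree-style access, but only on the slice of interest.
            errors.append(elem.findtext("msg"))
        elem.clear()   # drop the subtree's contents to keep memory use low

print(errors)
```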

Data binding: Some developers don't need direct access to XML document structures—they just want to work with objects or other data structures. Data binding approaches minimize the amount of interaction between the developer and the XML itself. Instead of creating XML directly, an API takes an object and serializes it. Instead of reading an XML document and interpreting its parts, an API takes an XML document and presents it as an object. Data binding processing tends to focus on schemas, which are used as the foundation for describing the XML representing a particular object.
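
A hand-rolled sketch of the data-binding idea, assuming a dataclass and xml.etree.ElementTree. The Person class and the to_xml and from_xml helpers are hypothetical; real data-binding tools usually generate this kind of mapping from a schema instead of writing it by hand.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET

@dataclass
class Person:
    first: str
    last: str

def to_xml(person: Person) -> str:
    """Serialize the object; the caller never builds XML directly."""
    elem = ET.Element("person")
    ET.SubElement(elem, "first").text = person.first
    ET.SubElement(elem, "last").text = person.last
    return ET.tostring(elem, encoding="unicode")

def from_xml(text: str) -> Person:
    """Present an XML document as an object."""
    elem = ET.fromstring(text)
    return Person(first=elem.findtext("first"), last=elem.findtext("last"))

original = Person("John", "Doe")
xml_text = to_xml(original)
restored = from_xml(xml_text)
print(xml_text)
print(restored)
```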