XML-Coreutils: A Tutorial

by Laird A. Breyer

This tutorial will bring you up to speed on the xml-coreutils(7) command line tools. These are a collection of utilities very similar to the traditional Unix shell core utilities, but intended for reading and writing XML files.

There are many programs and libraries of code available today which can process XML files, but very few of them are targeted directly at users of the Unix shell (system administrators, developers and casual users), and practically none of them interact cleanly with the existing Unix shell tools. The xml-coreutils are intended to fill this gap.

You can learn about the initial design of xml-coreutils(7) here, or you can read the current manual pages. However, the most important thing you should know is this:

The xml-coreutils try to be as close as possible to the traditional Unix tools. Where it makes sense, they have the same names, the same short command line switches, and behave the same way, except that they work on XML files instead of ordinary text.

Let's begin. If you want to type along, you will need an installed copy of xml-coreutils version 0.8 or later. This tutorial was written with version 0.8; if you have a later version, you might see small differences in the output.

Here's how to check if the tools are installed. Open a terminal and type "xml-file" followed by the enter key as follows (do not type the %, it's only a placeholder for your shell prompt):


% xml-file
Usage: xml-file [OPTION]... FILE [FILE]...
Determine type of FILE(s).

      --help     display this help and exit
      --version  display version information and exit

If you see the usage message, then xml-coreutils is already installed and ready to be used. If you see an error message such as "No such file or directory", then you first have to download and install xml-coreutils, either from its website or by some other means.
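
The same check can be scripted for all the tools at once. The following is a minimal sketch and not part of xml-coreutils itself: it relies only on the standard shell builtin command -v, and the helper name check_xml_tools is made up for this illustration. The tool names are the ones that appear in this tutorial.

```shell
#!/bin/sh
# check_xml_tools is a hypothetical helper name, used only for this sketch.
# For each tool named in this tutorial, report whether the shell can find it.
check_xml_tools() {
    for tool in xml-file xml-cat xml-less xml-fixtags xml-echo xml-fmt \
                xml-strings xml-printf xml-find xml-ls xml-grep xml-head \
                xml-cut xml-rm xml-cp xml-mv xml-sed xml-unecho
    do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "$tool: found"
        else
            echo "$tool: missing"
        fi
    done
}

check_xml_tools
```

Any line ending in "missing" points at a tool that is either not installed or not on your PATH.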

Opening and viewing existing XML files

We will work below with a simple XML file called food.xml. What can we do with it? The simplest thing we can do is send it to the terminal for display:


% xml-cat food.xml
<?xml version="1.0"?>
<root>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>

</root>

Of course this is trivial, and we don't need a special command just to do this. We could simply use the ordinary cat(1) command. However, there are some differences here.

The first difference is that xml-cat(1) also checks the file food.xml for integrity, and whereas cat(1) prints whatever it finds in the file as-is, here xml-cat(1) will print an error (and refuse to continue) as soon as it finds that food.xml is not well formed XML. It actually is well formed, so we don't see an error message in this case.

Thus we get an implicit guarantee from xml-cat(1) that whatever it allows to be printed will be suitable for another XML processor to consume. The guarantee is weak, however: it is only a well formedness guarantee, not a full validity guarantee. All the xml-coreutils(7) commands process well formed XML documents and always ignore validity. This is because they are likely to be used on XML fragments, which don't usually carry their own validation specs.
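
To see the difference for yourself, you can feed both commands a file that is deliberately not well formed. This is only a sketch: broken.xml is a made-up file name, and since the exact error message and exit status of xml-cat(1) are not documented in this tutorial, the script just prints whatever status it gets.

```shell
# broken.xml is a made-up file name for this sketch.
# The second <product> tag is never closed, so the file is not well formed.
cat > broken.xml <<'EOF'
<products>
  <product price="3">Chicken</product>
  <product price="11.50">Lobster
</products>
EOF

# cat(1) prints the file as-is, broken markup included.
cat broken.xml

# xml-cat(1) is expected to complain instead. The guard lets the sketch run
# even where xml-coreutils is not installed.
if command -v xml-cat >/dev/null 2>&1; then
    status=0
    xml-cat broken.xml || status=$?
    echo "xml-cat exit status: $status"
else
    echo "xml-cat not found, skipping"
fi
```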

The second difference between cat(1) and xml-cat(1) is at first surprising: the existing top level element (called <products>) in the food.xml file is discarded, and replaced with a generic <root> tag. Why does this occur?

Just like with cat(1), the main task of xml-cat(1) is concatenation, i.e. taking two or more XML files as input and creating a single XML file which contains them all as output. But a well formed XML file must contain only a single top level tag, and therefore xml-cat(1) does the simplest thing it can to satisfy this constraint (as well as a few others we won't mention here): it removes the top level tag from each input file, and wraps the output in a single <root> tag. You'll see this in action below. The generic root tag is also a handy reminder that the output is no longer associated with a DTD.
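
If you don't have food.xml at hand, the sketch below recreates it with a plain here-document. The contents are reconstructed from the command outputs shown in this tutorial; the exact whitespace and the absence of an XML declaration are assumptions. Comparing its first line with the xml-cat(1) output above shows the <products> tag that was replaced by <root>.

```shell
# Recreate food.xml. Contents reconstructed from command output shown in this
# tutorial; exact whitespace and the missing XML declaration are assumptions.
cat > food.xml <<'EOF'
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>

</products>
EOF

# Show the top level tag of the file we just wrote.
head -n 1 food.xml
```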

Although xml-cat(1) is nice for inspecting small XML files, for larger files a specialized viewer is essential. The xml-coreutils(7) include such a viewer, called xml-less(1). This is a terminal based interactive viewer, which is inspired by less(1), but with some extra advantages: because it understands the structure of XML files, it can do things that less(1) cannot, such as folding (press the TAB key), word wrapping (press the W key), showing or hiding attributes (press the A key), etc. You can try it out as follows:


% xml-less food.xml

One more command should be discussed straight away, and that is xml-fixtags(1). This command takes an XML file which is not necessarily well formed, and repairs it so that it becomes well formed XML. It can be used to fix small problems, and can even convert an HTML file into XML. However, be warned that the repairs are "dumb", and the result will often not be what you expected.

Aside from xml-fixtags(1), all the other xml-coreutils(7) commands expect their input XML files to be well formed, or will signal an error. This follows the XML standard modus operandi, and also prevents duplication of functionality.


% xml-fixtags food.xml | xml-less
% xml-fixtags --html xml_coreutils_tutorial.html | xml-fmt

Writing small XML files

To create a small XML document on the fly, we can use the xml-echo(1) command.


% xml-echo -e "[products/product@price=2.70]Soft Drink"
<?xml version="1.0"?>
<products>
	<product price="2.70">
		Soft Drink
	</product>
</products>

There are a number of things to note here: the indentation is automatic (this could be disabled with the -n switch), and all the tags are properly closed when necessary. The argument string contains instructions for building the XML file structure (these instructions are surrounded by square brackets []).

One way to combine the previous example with the food.xml file is as follows (you don't need to type the backslash \ if you don't split the commands over several lines):


% xml-echo -e "[products/product@price=2.70]Soft Drink" \
        | xml-cat food.xml stdin
<?xml version="1.0"?>
<root>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>


	<product price="2.70">
		Soft Drink
	</product>
</root>

Note that in general, this output would be passed to a formatting tool to fix the final presentation. xml-coreutils(7) contains just such a tool, called xml-fmt(1).

Here is a second, more complicated example, which shows how xml-echo(1) builds up an XML file incrementally and demonstrates some other features.


% xml-echo -en "\i[People/Person@Name=Fred Davis/Address]\i" \
        "\I[LineOne]4 Bushy Street[..]\i" \
        "\I[LineTwo]Green Road[..]\i" \
        "\I[County]Mayo[..]\i" \
        "\I[Country]Ireland[..]" \
        "\I[..]\i" \
        "\I[TelNo]+353 96 45232[..]\i"
<?xml version="1.0"?>
<People>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			<LineTwo>Green Road</LineTwo>
			<County>Mayo</County>
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</People>

Just like with echo(1), several strings can be given on the command line, and they will be concatenated by xml-echo prior to being printed. In this example, the indentation of the output is controlled with the -n, \i and \I switches. The -n switch disables automatic indentation, and \i (resp. \I) turns indenting on (resp. off) for the subsequent characters. Although it isn't shown, direct indentation by inserting \t and \n characters is also possible. Finally, the [..] path closes the currently open tag.
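
Later sections of this tutorial operate on this document under the name People.xml. Redirecting the xml-echo(1) output above into a file is one way to create it. The sketch below instead writes the printed output with a plain here-document, so that it also works where xml-echo(1) is not yet installed. That People.xml holds exactly this output is an assumption based on the later examples.

```shell
# Write the output of the xml-echo example above to People.xml.
# Assumption: this is the file the later examples call People.xml.
cat > People.xml <<'EOF'
<?xml version="1.0"?>
<People>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			<LineTwo>Green Road</LineTwo>
			<County>Mayo</County>
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</People>
EOF

# Quick sanity check with an ordinary text tool.
grep -c '<' People.xml
```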

Extracting strings from an XML file

The xml-coreutils(7) are intended to work well with existing core utilities, which only understand freeform text. Thus there are a few commands which extract the data from an XML file.

The xml-strings(1) command simply removes all the markup (tags, comments, etc) from an XML file:


% xml-strings food.xml | grep Milk
Milk (2 litres)
% cat food.xml | grep Milk
  <product price="1.09">Milk (2 litres)</product>

If you have slightly more complex requirements, a good command to use is xml-printf(1). This is one of a family of commands which accept an XPATH, which you can learn about on the xml-coreutils(7) manpage. An XPATH represents a collection of elements within an XML document, and xml-printf(1) just prints the strings from those elements. Here are a few examples:


% xml-printf 'I like %s ~:>\n' food.xml :/products/product[1]
I like Chicken ~:>
% xml-printf 'The %s costs $%.2f\n' \
        food.xml :/products/product[3] \
        :/products/product@price[3]
The Apple costs $0.20
% xml-printf 'The products are:\n%30s\n' \
        food.xml :/*/product
The products are:
                       Chicken
                       Lobster
                         Apple
               Milk (2 litres)

The first argument of xml-printf(1) is a format string similar to the format string of printf(3). The remaining arguments are an XML file (food.xml) and various XPATHs, which start with a colon ':' to distinguish them from a file. In the first example, the XPATH selects the single string "Chicken". In the second example, the two XPATHs select the strings "Apple" and ".20" respectively (the %.2f conversion prints the latter as 0.20). In the last example, the XPATH represents all four tags named "product" in the food.xml document.
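
Because the format string follows printf(3) conventions, the shell's own printf(1) is a convenient way to preview a format before pointing xml-printf(1) at a file. In this small sketch the values are typed by hand rather than extracted from food.xml. Forcing the C locale is a precaution so that ".20" is parsed with a dot as the decimal point.

```shell
# Preview xml-printf format strings with the shell's printf(1).
# The values are typed by hand; xml-printf would take them from food.xml.
LC_ALL=C
export LC_ALL

printf 'I like %s ~:>\n' Chicken
printf 'The %s costs $%.2f\n' Apple .20

# %30s right-aligns each string in a field that is 30 characters wide.
printf '%30s\n' Chicken 'Milk (2 litres)'
```

The first two commands print "I like Chicken ~:>" and "The Apple costs $0.20", matching the xml-printf(1) output above.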

If you don't know what the W3C XPath specification is, then a good way to think of an XPATH is as a directory path, where each tag in an XML file is thought of as a directory, containing text or other tags. If you look at the food.xml file, then the "Chicken" string is contained in the first "product" tag, which is itself contained in the "products" top level tag.

If you're familiar with the W3C XPath specification, then you should know that, while the XPATH notation is inspired by the W3C XPath 1.0 specification, it is not a complete implementation, and likely never will be (namespaces, axes and functions are not very shell tool friendly).

Besides printing text in between the tags, you can print a list of the tags themselves by using the xml-find(1) command:


% xml-find food.xml 
/products
/products/product
/products/product
/products/product
/products/product

However, xml-find(1) is really much more useful than that. It is in fact a general purpose selection tool, which can extract XML fragments from a file using one or more XPATH(s). We'll show this later.

Extracting the structure of an XML file

The simplest structural information about an XML file is its type or file format. If this is all you wish to know, use xml-file(1):


% xml-file food.xml xml_coreutils_tutorial.html 
food.xml:                    XML text
xml_coreutils_tutorial.html: HTML text fragment

Just as file(1) uses heuristics to identify a file type from its binary contents, xml-file(1) uses various pieces of data, such as the DOCTYPE and the name of the root tag to (attempt to) identify an XML file. However, xml-file(1) is not a replacement for file(1), and will output "unrecognized file" if the file is anything other than XML. Moreover, it will not recognize broken (malformed) files if the break is below the tags it looks for.

Every shell user knows how to navigate their home directory using ls(1) and cd(1). In xml-coreutils(7), the command xml-ls(1) lets you navigate and list the "directory" structure of an XML file using XPATHs. Here's an example using a file called People.xml, which contains the output of the second xml-echo(1) example above.


% xml-ls People.xml :/
<?xml version="1.0"?>
<root>
	<People>
		<Person/>
	</People>
</root>
% xml-ls People.xml :/People/Person
<?xml version="1.0"?>
<root>
	<Person>
		<Address/>
		<TelNo/>
	</Person>
</root>
% xml-ls People.xml :/People/Person/Address
<?xml version="1.0"?>
<root>
	<Address>
		<LineOne/>
		<LineTwo/>
		<County/>
		<Country/>
	</Address>
</root>
% xml-ls People.xml :/People/Person/Address/Country
<?xml version="1.0"?>
<root>
	<Country>
		Ireland
	</Country>
</root>

The output of xml-ls(1) is XML. This makes sense if you recall that ls(1) prints both directory names and file names together. If we think of a tag as analogous to a directory, then text (such as the string "Ireland" in the last example) could be analogous to an ordinary file. To support well formed XML output, there must be some constraints, such as wrapping the output in a root tag. After all, the original doctype is not directly relevant.
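
To make the analogy concrete, the sketch below mirrors the tag structure of People.xml as real directories and lists them with ls(1). It does not read any XML and uses no xml-coreutils command; the variable names are made up for the illustration.

```shell
# Mirror the tag structure of People.xml as directories, to make the
# analogy between xml-ls(1) and ls(1) concrete. Nothing here reads XML.
workdir=$(mktemp -d)
mkdir -p "$workdir/People/Person/Address/LineOne" \
         "$workdir/People/Person/Address/LineTwo" \
         "$workdir/People/Person/Address/County" \
         "$workdir/People/Person/Address/Country" \
         "$workdir/People/Person/TelNo"

# Compare with: xml-ls People.xml :/People/Person
person_children=$(ls "$workdir/People/Person")
echo "$person_children"

# Compare with: xml-ls People.xml :/People/Person/Address
address_children=$(ls "$workdir/People/Person/Address")
echo "$address_children"

# Clean up the scratch directory.
rm -r "$workdir"
```

One difference is worth keeping in mind: ls(1) sorts the names it prints, whereas the xml-ls(1) output above lists the tags in the order in which they occur in the document.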

To extract a structure based upon the presence or absence of textual contents, use xml-grep(1). The output will again be an XML file (so it can be xml-grepped again!), but containing only the structure necessary to access the text. The following examples give an idea of how this works.


% xml-grep 'Green' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			
			<LineTwo>Green Road</LineTwo>
			
			
		</Address>
		
	</Person>
</root>
% xml-grep -E '(Fred|Ire*)' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			
			
			
			<Country>Ireland</Country>
		</Address>
		
	</Person>
</root>
% xml-grep -i --subtree 'fReD' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			<LineTwo>Green Road</LineTwo>
			<County>Mayo</County>
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</root>
% xml-grep -v 'o' People.xml 
<?xml version="1.0"?>
<root>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			
			
			<Country>Ireland</Country>
		</Address>
		<TelNo>+353 96 45232</TelNo>
	</Person>
</root>

Last but not least, there is xml-find(1), which we already mentioned earlier. Just as its namesake find(1) traverses a directory, looking for interesting files and executing actions, xml-find(1) traverses an XML file one node at a time, looking for (selecting) interesting tags and executing actions. This makes xml-find(1) into an iterator. Before we can illustrate this properly, we'll build up to it with a series of rather boring examples.

The simplest action is to search for a tag name and print it:


% xml-find People.xml -name 'Tel*' -print
/People/Person/TelNo

The tag name can also be passed to a program (or a script), for example echo:


% xml-find People.xml -name 'Tel*' \
        -exec echo 'The tag is ' '{}' ';'
The tag is  /People/Person/TelNo

If this were a tutorial on find(1), then the placeholder {} would be the name of a file, which the -exec'd program could open and read. However, this is not possible here because {} is only a tag name. So in xml-find(1), there are two more placeholders, {@} which expands to a list of attributes of the selected tag (if any), and {-} which expands to the name of a temporary XML file which contains everything that belongs to the current node. Thus:


% xml-find People.xml -name 'Tel*' \
        -exec cat '{-}' ';'
<?xml version="1.0"?>
<People>
<Person>
<TelNo>+353 96 45232</TelNo></Person>
</People>

It's time to combine all these ideas into a final example. We'll iterate through the food.xml file using xml-find(1) to stop at each product, and print the data we find using xml-printf(1).


% xml-find food.xml -name 'product' \
        -exec xml-printf 'Price of %-20s: %5.2f\n' \
        {-} ://product ://product@price ';'
Price of Chicken             :  3.00
Price of Lobster             : 11.50
Price of Apple               :  0.20
Price of Milk (2 litres)     :  1.09

Changing the structure of an XML file

Besides cat(1), one of the most useful shell commands for interactive use is head(1), which truncates its input after a few lines. There are multiple generalizations of this idea for XML documents.

The xml-head(1) command has three main switches. The -t switch truncates the tags, i.e. displays only the first few tags (but still generates well formed XML). The -c switch truncates the text fields, i.e. displays only the first few characters wherever text is present, but leaves the tags as is. The -n switch truncates lines, so that each text field does not exceed a certain number of lines. All three main switches can be combined.


% xml-head -t 3 People.xml
<?xml version="1.0"?>
<People>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
</Address>
</Person>
</People>
% xml-head -c 2 People.xml
<?xml version="1.0"?>
<People>
	<Person Name="Fred Davis">
		<Address>
		<LineOne>4 </LineOne>
		<LineTwo>Gr</LineTwo>
		<County>Ma</County>
		<Country>Ir</Country>
		</Address>
		<TelNo>+3</TelNo>
	</Person>
</People>

Another way to modify the structure of an XML file is with xml-cut(1). In traditional Unix, the cut(1) command prints columns from an input file that is viewed as a table (the exact meaning of a column is determined by switches). To understand xml-cut(1), think of a fully indented XML file, where each level of indentation is printed in its own column:


         0           | 1  | 2  | 3  | 4
----------------------------------------
<?xml version="1.0"?>|    |    |    |
                     |<a> |    |    |
                     |    |<b> |    |
                     |    |    |<c> |
                     |    |    |    |xyz
                     |    |    |</c>|
                     |    |</b>|    |
                     |</a>|    |    |

Now we can print only the columns 2 and 4 as follows:


% xml-echo -e '[a/b/c]xyz' | xml-cut -t 2,4
<?xml version="1.0"?>
<root>

	<b>
			xyz
		</b>

</root>

Note that the closing tag </b> in this example is out of alignment. This makes sense, once you realize that the "xyz" text field really begins with the first newline after <c> and contains all the whitespace before </c>. As usual, xml-fmt(1) can be used to align the tags if necessary.

Structural surgery can also be performed using xml-rm(1), xml-cp(1) and xml-mv(1). These commands remove, copy, and move entire subtrees of an XML document.


% xml-rm food.xml :/products/product[2]
<products>

  <product price="3">Chicken</product>
  
  <product price=".20">Apple</product>
  <product price="1.09">Milk (2 litres)</product>

</products>
% xml-cp food.xml :/products/product[2]/ \
        People.xml ://TelNo/
<?xml version="1.0"?>
<People>
	<Person Name="Fred Davis">
		<Address>
			<LineOne>4 Bushy Street</LineOne>
			<LineTwo>Green Road</LineTwo>
			<County>Mayo</County>
			<Country>Ireland</Country>
		</Address>
		<TelNo>Lobster</TelNo>
	</Person>
</People>
% xml-mv food.xml :/products/product[3] \
        food.xml :/products/product[1]/
<products>

  <product price="3"><product price=".20">Apple</product></product>
  <product price="11.50">Lobster</product>
  
  <product price="1.09">Milk (2 litres)</product>

</products>

Editing an XML stream

The last command to be discussed in this tutorial is xml-sed(1), which can be viewed as the Swiss Army knife of command line XML editing.

For search and replace operations, xml-sed(1) is invoked just like sed(1):


% cat food.xml | xml-sed 's/Apple/Orange/'
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">Orange</product>
  <product price="1.09">Milk (2 litres)</product>

</products>

Although this cannot be seen here, the two commands xml-sed(1) and sed(1) do differ. Whereas sed(1) will replace text anywhere within the XML file, even if it occurs within a tag or attribute name, xml-sed(1) as invoked above only replaces text that lies outside the markup. Moreover, xml-sed(1) understands editing constraints in the form of an XPATH. Compare:


% cat food.xml | sed 's/e/E/g'
<products>

  <product pricE="3">ChickEn</product>
  <product pricE="11.50">LobstEr</product>
  <product pricE=".20">ApplE</product>
  <product pricE="1.09">Milk (2 litrEs)</product>

</products>
% cat food.xml | xml-sed 's/e/E/' ://product[3]
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20">ApplE</product>
  <product price="1.09">Milk (2 litres)</product>

</products>

For 99% of editing tasks, the above is all you need to know about xml-sed(1). For the remaining 1%, we have to make a digression.

Consider your favourite text file in Unix. It consists of a number of lines, separated by the newline character '\n'. This character isn't directly visible, but it has an important structural function. Without it, all the lines would join and the text file would be one long stream of words and symbols.

Whenever the text is shown on a terminal, this newline character is interpreted, rather than merely displayed as an ordinary character. This distinction between '\n' and, say, the letter 'a' is what makes sed(1) useful as a way to alter the structure of a text document.

Think about what happens if you search and replace all the occurrences of the letter 'a' with the letter 'A'. You get the same structural document, but with altered letters. Now suppose you replace each 'a' with '\n'. You have a document with a completely different number of text lines. It is by altering the embedded meta information represented by the character '\n' (using ordinary editing commands), that a structural alteration is obtained.


% echo -e "Carol's cat carries carrots in a cart."
Carol's cat carries carrots in a cart.
% echo -e "CArol's cAt cArries cArrots in A cArt."
CArol's cAt cArries cArrots in A cArt.
% echo -e "C\\nrol's c\\nt c\\nrries c\\nrrots in \\n c\\nrt."
C
rol's c
t c
rries c
rrots in 
 c
rt.

What does all this mean for sed(1)? In principle, editing a text document can be done without specialized (meta) commands for inserting or deleting a line, i.e. the only things that are needed are commands for altering strings of characters.

The same principle also applies to xml-sed(1). There is no need for specialized commands that create or remove tags, attributes, subtrees etc, provided that the structural (meta) information which describes an XML document is embedded directly in the text being edited. The language used by xml-sed(1) to embed the structure of an XML document is the same one used by xml-echo(1).

If you have an existing XML file, you can feed it to xml-unecho(1) to recover the embedded structure:


% xml-unecho --xml-sed food.xml 
[/products]\n\n  
[/products/product@price=3]Chicken
[/products]\n  
[/products/product@price=11.50]Lobster
[/products]\n  
[/products/product@price=.20]Apple
[/products]\n  
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n

The --xml-sed switch tells xml-unecho(1) to print exactly what xml-sed(1) would see. Normally, xml-unecho(1) prints a slightly altered form which, if interpreted by xml-echo(1), would recover the original XML file. The --xml-sed form is preferable for stream editing, because the absolute path of the current node is always available, and this helps prevent side effects.

Now suppose we edit the above, using sed(1) (that's right, we're not using xml-sed(1) yet):


% xml-unecho --xml-sed food.xml \
        | sed 's/]Apple/@juicy=true]A [bold]big[..] orange/'
[/products]\n\n  
[/products/product@price=3]Chicken
[/products]\n  
[/products/product@price=11.50]Lobster
[/products]\n  
[/products/product@price=.20@juicy=true]A [bold]big[..] orange
[/products]\n  
[/products/product@price=1.09]Milk (2 litres)
[/products]\n\n

We've just inserted an extra attribute, and a new tag! But this isn't XML until we interpret it. Let's do everything at once using xml-sed(1) now:


% cat food.xml \
        | xml-sed 's/]Apple/@juicy=true]A [bold]big[..] orange/z'
<products>

  <product price="3">Chicken</product>
  <product price="11.50">Lobster</product>
  <product price=".20" juicy="true">A <bold>big</bold> orange</product>
  <product price="1.09">Milk (2 litres)</product>

</products>

The important ingredient here is the z flag in the s///z command. This flag tells xml-sed(1) to edit the full echo-leaf (the lines generated by xml-unecho(1) are called echo-leaves). If the z is missing, then the path and attribute information (which is surrounded by square brackets []) is not editable. This restriction is solely for the benefit of casual users' feet.

The remaining aspects of xml-sed(1) are not very surprising if you already know sed(1). There is a pattern space (which contains the current echo-leaf) and a hold space, and each editing command can be addressed individually. The available editing commands are the same as for sed(1), with minor (and rather obvious) alterations to accommodate the echo-leaf concept.

There is more to say but this tutorial is at an end. Happy hacking.
