Gorille - a tool for specifying XML Unicode usage


com.simonstl.gorille Code for testing character usage in XML documents against rules documents.



Introduction | Warnings | Future Directions | License | Acknowledgments | Download

Gorille is a small Java package designed to let developers of various kinds of XML processors test the content and names of XML structures in their XML documents. While Gorille ships with test files for both XML 1.0 and the draft XML 1.1, you can create your own configuration files as well.

Introduction - Why Gorille?

Gorille attempts to provide a standard means of addressing the complex issues that arise between XML and Unicode. The initial release of XML 1.1 has provoked discussion, much like the earlier Blueberry discussions, and the issues emerging from XML's lack of direct synchronization with Unicode do not appear likely to disappear any time soon. Gorille sidesteps the notion of fixed Unicode character assignments dictated by W3C specifications, and opens the field to character listings of whatever form seems necessary. XML 1.0 and XML 1.1 conventions are supported (in the xml10chars.xml and xml11chars.xml configuration files), but developers can go their own route as well (as demonstrated by the asciichars.xml file).

The main class is CharRules, where most of the processing logic takes place. The supporting classes provide containers for information (CharRange, CharRanges), file loading support (CharRulesLoader), or testing capabilities (the command-line CharTester and the simple test suite TestCharRules). Version 0.04 added the CharRulesGen class, which generates fixed Java classes from XML configuration files, along with Xml10Rules, Xml11Rules, and AsciiRules, the classes it generated.
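To make the idea of range-based checking concrete, here is a minimal sketch of testing text against a list of allowed code point ranges. It is not Gorille's actual API: the class RangeCheckSketch and its methods are hypothetical names invented for illustration, and the ASCII rule set is only a guess at what such a configuration might allow. The XML 1.0 ranges follow the Char production in the XML 1.0 Recommendation. The sketch walks the string by code point rather than by char, so characters outside the Basic Multilingual Plane (surrogate pairs) are tested as a single value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; these names are NOT part of Gorille.
final class RangeCheckSketch {

    // An inclusive range of Unicode code points.
    private static final class Range {
        final int start;
        final int end;
        Range(int start, int end) { this.start = start; this.end = end; }
        boolean contains(int codePoint) { return codePoint >= start && codePoint <= end; }
    }

    private final List<Range> allowed = new ArrayList<>();

    // Adds an inclusive range of permitted code points.
    RangeCheckSketch allow(int start, int end) {
        allowed.add(new Range(start, end));
        return this;
    }

    boolean isAllowed(int codePoint) {
        for (Range r : allowed) {
            if (r.contains(codePoint)) return true;
        }
        return false;
    }

    // Returns the char index of the first disallowed code point, or -1 if none.
    int firstViolation(String text) {
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);      // combines a surrogate pair into one value
            if (!isAllowed(cp)) return i;
            i += Character.charCount(cp);      // advance by 1 or 2 chars
        }
        return -1;
    }

    // Char production from XML 1.0:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static RangeCheckSketch xml10Chars() {
        return new RangeCheckSketch()
            .allow(0x9, 0xA).allow(0xD, 0xD)
            .allow(0x20, 0xD7FF).allow(0xE000, 0xFFFD)
            .allow(0x10000, 0x10FFFF);
    }

    // Assumed ASCII-only rule set: tab, LF, CR, and printable ASCII.
    static RangeCheckSketch ascii() {
        return new RangeCheckSketch()
            .allow(0x9, 0xA).allow(0xD, 0xD).allow(0x20, 0x7E);
    }

    public static void main(String[] args) {
        String sample = args.length > 0 ? args[0] : "caf\u00e9";
        System.out.println("XML 1.0 violation index: " + xml10Chars().firstViolation(sample));
        System.out.println("ASCII violation index:   " + ascii().firstViolation(sample));
    }
}
```

Because the rule set is just data, the same checking loop serves XML 1.0, XML 1.1, or any custom listing; only the ranges change.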

Gorille is named after an unruly brute who stars in Georges Brassens' Le Gorille (French, English). Gare au gorille!


Warnings

Gorille is thoroughly experimental and perhaps not even a good idea. Gorille will definitely see future use in Markup Object Events (MOE) for name and content checking, but its infinite configurability certainly opens enormous possibilities for very, very bad practice. Using Gorille you can, for instance, require that all content be represented as control characters and all names as ideographs.

Future directions

This is just getting started. Gorille will eventually sprout a SAX filter interface for checking content and names, as well as connections to MOE. A tool which performs Gorille checking on Java Readers seems like a good future idea as well.
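As a rough idea of what such a SAX filter might look like, the sketch below extends the standard org.xml.sax.helpers.XMLFilterImpl and checks element names and character content before passing events along. It is speculative and is not part of Gorille: the class CharCheckFilterSketch, its ASCII and ANY rules, and the passes helper are all hypothetical names, and the rules are plain java.util.function.IntPredicate values standing in for whatever Gorille's rule classes would supply. Note that a parser may deliver content in several characters() calls, so a surrogate pair split across two calls would need extra buffering that this sketch does not attempt.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.function.IntPredicate;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical illustration only; Gorille does not (yet) provide this class.
class CharCheckFilterSketch extends XMLFilterImpl {

    // Example rules (assumptions): ASCII-only, and "anything goes".
    static final IntPredicate ASCII = cp -> cp < 0x80;
    static final IntPredicate ANY = cp -> true;

    private final IntPredicate contentRule;
    private final IntPredicate nameRule;

    CharCheckFilterSketch(XMLReader parent, IntPredicate contentRule, IntPredicate nameRule) {
        super(parent);
        this.contentRule = contentRule;
        this.nameRule = nameRule;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts)
            throws SAXException {
        check(qName.toCharArray(), 0, qName.length(), nameRule, "element name");
        super.startElement(uri, localName, qName, atts);   // forward the event
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        check(ch, start, length, contentRule, "content");
        super.characters(ch, start, length);                // forward the event
    }

    // Tests each code point in the slice; throws on the first one the rule rejects.
    private static void check(char[] ch, int start, int length, IntPredicate rule, String what)
            throws SAXException {
        int limit = start + length;
        for (int i = start; i < limit; ) {
            int cp = Character.codePointAt(ch, i, limit);
            if (!rule.test(cp)) {
                throw new SAXException("Disallowed character U+"
                        + Integer.toHexString(cp).toUpperCase() + " in " + what);
            }
            i += Character.charCount(cp);
        }
    }

    // Convenience helper: true if the document parses and satisfies both rules.
    // Returns false for rule violations and for ordinary well-formedness errors.
    static boolean passes(String xml, IntPredicate contentRule, IntPredicate nameRule) {
        try {
            XMLReader parser = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
            CharCheckFilterSketch filter = new CharCheckFilterSketch(parser, contentRule, nameRule);
            filter.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Writing the check as a filter would let it sit in front of any existing ContentHandler, so an application could add or remove character checking without touching its own event-handling code.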


License

The contents of this package are subject to the Mozilla Public License Version 1.1 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at http://www.mozilla.org/MPL/.

Software distributed under the License is distributed on an "AS IS" basis, WITHOUT WARRANTY OF ANY KIND, either express or implied. See the License for the specific language governing rights and limitations under the License.

The Original Code is available at http://simonstl.com/projects/fragment/original.

The Initial Developer of the Original Code is Simon St.Laurent. Portions created by Simon St.Laurent are Copyright (C) 2001 Simon St.Laurent. All Rights Reserved.



Acknowledgments

Thanks to Elliotte Rusty Harold for pointing out surrogate pair issues and John Cowan for pointing the way to a solution.

Thanks to the xml-dev mailing list for its continuing discussions of all these issues.

Thanks to BBC correspondent John Simpson's book, A Mad World, My Masters, for leading me to the song Le Gorille.


Download

A download is available.
