Update - 2006

In 2000, when this code was originally written, I could not find any small open source C++ XML parser projects. As I write this in 2006 the situation is very different: these days I use the excellent TinyXml parser in my own work. The rest of this page remains only for sentimental reasons.

XML document object class

This C++ code lets you build a document object class whose instances can be easily parsed from an XML document. It uses James Clark's expat code as the core XML parsing engine. This code is recommended only when:
  • Your document type is relatively simple and will not change in the future (you must write some code specific to your document type)
  • The size of your built code is critical, for example because you are building downloadable client code (full DOM XML parsers are tens or hundreds of times larger)
If either of these conditions does not apply, using these classes will probably be more trouble than it's worth.

The object-oriented code consists of the parser class (expatParser.cpp and expatParser.h) and the interface (docObjectIface.h), for which you must implement a subclass specific to your document type. In the example below I've just put those three files into the expat directory with James Clark's code, but you could place them elsewhere as long as you set up the include paths properly. I also recommend defining the preprocessor symbol XML_MIN_SIZE (used by James Clark's code, not mine).
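To give a feel for the subclassing idea, here is a minimal sketch. It is not the contents of docObjectIface.h: the class name DocObjectIface is borrowed from the file name, and the three callback names (startElement, endElement, characterData) and the LineCollector subclass are assumptions made up for illustration. The real interface may look quite different.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the interface in docObjectIface.h.
// The method names and signatures here are assumptions, not the real ones.
class DocObjectIface {
public:
    virtual ~DocObjectIface() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characterData(const std::string& text) = 0;
};

// Document-specific subclass (also hypothetical): collects the text of
// every <LINE> element and ignores every other tag.
class LineCollector : public DocObjectIface {
public:
    LineCollector() : inLine_(false) {}

    void startElement(const std::string& name) override {
        if (name == "LINE") {
            inLine_ = true;
            current_.clear();
        }
    }

    void endElement(const std::string& name) override {
        if (name == "LINE" && inLine_) {
            lines_.push_back(current_);
            inLine_ = false;
        }
    }

    void characterData(const std::string& text) override {
        // A parser may deliver one text node in several pieces, so append.
        if (inLine_) current_ += text;
    }

    const std::vector<std::string>& lines() const { return lines_; }

private:
    bool inLine_;
    std::string current_;
    std::vector<std::string> lines_;
};
```

In a setup like this, the parser class would presumably forward expat's element and character data callbacks to whichever subclass it is given, so only the subclass needs to know anything about the document type.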

This code was originally written in 2000 and used successfully in two commercial implementations: one in a downloadable Windows application, the other in an ActiveX control. In both cases the convenience of a document object class and the small size of the compiled code were very advantageous, and the document types were simple enough that implementation using this code was easy.

Since this code was written, Tim Smith has done a similar project. There are probably some other ones out there too.

Example: Shakespeare plays dialog search

To demonstrate the use of this code, here's an example of a custom-built application for searching through Shakespeare's plays for certain dialog. We make use of Jon Bosak's XML markups, also available in a single zip file.

The task we'll demonstrate is this: parse all the plays, then let the user specify a string of dialog to search for. The application then finds every line of dialog containing that string and reports the play, act, scene, and speaker for each instance. For simplicity we'll just use the C strstr() function on each line as the author split them up (i.e. a match cannot span lines).
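The per-line search can be sketched as follows. This is an illustration only, not the code in shakespearesearch.cpp; the function name findInLines and the use of std::vector and std::string to hold the lines are assumptions.

```cpp
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Returns the (0-based) indices of all lines containing `needle` as a
// substring. Each line is searched on its own with strstr(), so a match
// can never span two lines, and the comparison is case-sensitive.
std::vector<std::size_t> findInLines(const std::vector<std::string>& lines,
                                     const char* needle) {
    std::vector<std::size_t> hits;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (std::strstr(lines[i].c_str(), needle) != nullptr) {
            hits.push_back(i);
        }
    }
    return hits;
}
```

Because strstr() matches plain substrings, a search string such as "to kiss you" also matches a line containing "to kiss your hand".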

Because we can ignore some features of the documents (e.g. we don't care about the Dramatis Personae information or the Scene Description) we have a pretty simple view of the documents:
<PLAY> 
  [we expect only one PLAY per document]
<TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
  [we only expect one TITLE tag per PLAY]

<ACT>
  [we expect multiple ACTs per PLAY]
<TITLE>ACT I</TITLE>
  [we expect only one TITLE tag per ACT]

<SCENE>
  [we expect multiple SCENEs per ACT]
<TITLE>SCENE I.  Elsinore. A platform before the castle.</TITLE>
  [we expect only one TITLE tag per SCENE]

<SPEECH> 
  [we expect multiple SPEECHes per SCENE]
<SPEAKER>BERNARDO</SPEAKER> 
  [we expect only one SPEAKER per SPEECH]
<LINE>Well, good night.</LINE> 
  [we expect multiple LINEs per SPEECH]
<LINE>If you do meet Horatio and Marcellus,</LINE>
<LINE>The rivals of my watch, bid them make haste.</LINE>
</SPEECH>

[more SPEECHes here]
</SCENE>

[more SCENEs here]
</ACT>

[more ACTs here]
</PLAY>
At any point, if we encounter any unrecognized tags we'll just ignore them. If we expected exactly one instance of a tag, but we encounter zero or more than one instance, we'll print a warning message (to the stream the user of the object requests, which could be NULL, stderr, or a file, for example) but will deal with the situation gracefully. (In fact, in the plays there are some SPEECHes with multiple SPEAKERs, but for our application noting only one of them is fine.)
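As a rough sketch of that policy (not the actual code in shakespearedoc.cpp), the simplified document view could be held in nested structs, with a small helper for the "exactly one" tags. All names here (Play, Act, Scene, Speech, setOnce, checkPresent) are invented for illustration, and keeping the first value when a tag repeats is an assumption; the real code may keep a different one.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Hypothetical in-memory view of a play, mirroring the tag nesting above.
struct Speech { std::string speaker; std::vector<std::string> lines; };
struct Scene  { std::string title;   std::vector<Speech> speeches; };
struct Act    { std::string title;   std::vector<Scene> scenes; };
struct Play   { std::string title;   std::vector<Act> acts; };

// Stores `value` in `field` the first time it is called for that field.
// On a repeat it leaves `field` alone, prints a warning to `warnTo`
// (which may be NULL to suppress warnings), and returns false.
bool setOnce(std::string& field, bool& seen, const std::string& value,
             const char* tag, std::FILE* warnTo) {
    if (seen) {
        if (warnTo) {
            std::fprintf(warnTo, "warning: more than one %s, ignoring \"%s\"\n",
                         tag, value.c_str());
        }
        return false;
    }
    field = value;
    seen = true;
    return true;
}

// Meant to be called when the enclosing element closes: warns if the
// expected tag never appeared, and returns whether it was seen.
bool checkPresent(bool seen, const char* tag, std::FILE* warnTo) {
    if (!seen && warnTo) {
        std::fprintf(warnTo, "warning: missing %s\n", tag);
    }
    return seen;
}
```

With helpers like these, a SPEECH containing two SPEAKER tags would keep one speaker, emit at most one warning line, and carry on parsing.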

The document class source code is shakespearedoc.cpp and shakespearedoc.h. The source code for the application itself is shakespearesearch.cpp. For convenience, you can download all this source code ready to be built in an MSVC project, xmlexample-MSVC.zip. If you'd just like to run it, here's a Windows executable: shakespearesearch.exe. I haven't gotten around to writing a makefile for make/gcc environments, but it should be quite easy to do and the code should be fully portable. Just be sure to define the XML_MIN_SIZE preprocessor flag as mentioned above.

Here's an example of running the application. Note that in this example we are using a shell that's capable of expanding wildcard filenames; if running under DOS you'd have to explicitly name all the XML files on the command line. On a 750 MHz Pentium it takes about 1 or 2 seconds to parse all the plays, and the searches are too fast to measure.
>shakespearesearch *.xml
Parsing file : a_and_c.xml
Parsing file : all_well.xml
Parsing file : as_you.xml
Parsing file : com_err.xml
Parsing file : coriolan.xml
Parsing file : cymbelin.xml
Parsing file : dream.xml
Parsing file : hamlet.xml
Parsing file : hen_iv_1.xml
Parsing file : hen_iv_2.xml
Parsing file : hen_v.xml
Parsing file : hen_vi_1.xml
Parsing file : hen_vi_2.xml
Parsing file : hen_vi_3.xml
Parsing file : hen_viii.xml
Parsing file : j_caesar.xml
Parsing file : john.xml
Parsing file : lear.xml
Parsing file : lll.xml
Parsing file : m_for_m.xml
Parsing file : m_wives.xml
Parsing file : macbeth.xml
Parsing file : merchant.xml
Parsing file : much_ado.xml
Parsing file : othello.xml
Parsing file : pericles.xml
Parsing file : r_and_j.xml
Parsing file : rich_ii.xml
Parsing file : rich_iii.xml
Parsing file : t_night.xml
Parsing file : taming.xml
Parsing file : tempest.xml
Parsing file : timon.xml
Parsing file : titus.xml
Parsing file : troilus.xml
Parsing file : two_gent.xml
Parsing file : win_tale.xml

Enter a search string, or RETURN to quit searching:
to kiss you
FOUND OCCURANCE number 1:
 PLAY 14: The Third Part of Henry the Sixth
 ACT 3: ACT III
 SCENE 3: SCENE III.  France. KING LEWIS XI's palace.
 SPEECH 17: WARWICK
 LINE 3: Humbly to kiss your hand, and with my tongue

FOUND OCCURANCE number 2:
 PLAY 15: The Famous History of the Life of Henry the Eighth
 ACT 1: ACT I
 SCENE 4: SCENE IV.  A Hall in York Place.
 SPEECH 42: KING HENRY VIII
 LINE 3: And not to kiss you. A health, gentlemen!

FOUND OCCURANCE number 3:
 PLAY 28: The Tragedy of King Richard the Second
 ACT 1: ACT I
 SCENE 3: SCENE III.  The lists at Coventry.
 SPEECH 13: Lord Marshal
 LINE 2: And craves to kiss your hand and take his leave.


Enter a search string, or RETURN to quit searching:
doughnut

Enter a search string, or RETURN to quit searching:
on the stream
FOUND OCCURANCE number 1:
 PLAY 1: The Tragedy of Antony and Cleopatra
 ACT 1: ACT I
 SCENE 4: SCENE IV.  Rome. OCTAVIUS CAESAR's house.
 SPEECH 6: OCTAVIUS CAESAR
 LINE 6: Like to a vagabond flag upon the stream,

FOUND OCCURANCE number 2:
 PLAY 23: The Merchant of Venice
 ACT 1: ACT I
 SCENE 1: SCENE I.  Venice. A street.
 SPEECH 4: SALARINO
 LINE 12: Would scatter all her spices on the stream,

FOUND OCCURANCE number 3:
 PLAY 35: The History of Troilus and Cressida
 ACT 2: ACT II
 SCENE 3: SCENE III.  The Grecian camp. Before Achilles' tent.
 SPEECH 57: ULYSSES
 LINE 2: But carries on the stream of his dispose


Enter a search string, or RETURN to quit searching:
(RETURN)

License

The source code files expatParser.cpp, expatParser.h, and docObjectIface.h, as well as the Shakespeare example source code files, are distributed under the BSD license.

The license for James Clark's expat is explained on his site.