Zillabit Projects
Geeky Tidbits
Update - 2006
In 2000, when this code was originally written, I could not find any tiny open source
C++ XML parser projects.
As I write this in 2006, the situation is very different.
These days I'm using the excellent TinyXml
parser in my own work. The rest of this page remains only for sentimental reasons.
XML document object class
This C++ code lets you build a document object class whose instances can be
easily populated by parsing an XML document. It uses
James Clark's expat code
as the core XML parsing engine.
This code is recommended only when:
- Your document type is relatively simple and is not expected to
change in the future, because you must write some code
specific to your document type
- The size of your built code is critical, for example you are building
downloadable client code (compare to full DOM XML parsers which are
tens or hundreds of times larger)
If either of these conditions does not apply, using these classes will probably be
more trouble than it's worth.
The object-oriented code consists of the parser class
(expatParser.cpp and
expatParser.h)
and the interface (docObjectIface.h),
for which you must implement a subclass
specific to your document type. In the example below I've just put those
three files into the expat directory with James Clark's code, but of course
you could place them elsewhere as long as you set up the include paths accordingly.
I also recommend defining the preprocessor symbol XML_MIN_SIZE
(used by James Clark's code, not mine).
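To give a rough idea of the pattern, here is a minimal sketch of what a document object interface and a document-specific subclass could look like. The names DocObjectSketch, NoteListDoc, and the three callback methods are hypothetical and invented for illustration; they are not the contents of docObjectIface.h, whose actual interface may well differ. The sketch only assumes that the parser reports start tags, end tags, and character data to the document object.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical interface (NOT the real docObjectIface.h): the parser is
// assumed to report each start tag, end tag, and run of text.
class DocObjectSketch {
public:
    virtual ~DocObjectSketch() {}
    virtual void startElement(const std::string& name) = 0;
    virtual void endElement(const std::string& name) = 0;
    virtual void characterData(const std::string& text) = 0;
};

// Hypothetical subclass for a toy document type that only cares about the
// text inside <NOTE> elements; every other tag is silently ignored.
class NoteListDoc : public DocObjectSketch {
public:
    void startElement(const std::string& name) override {
        if (name == "NOTE") {
            inNote_ = true;
            current_.clear();
        }
    }

    void endElement(const std::string& name) override {
        if (name == "NOTE" && inNote_) {
            notes_.push_back(current_);
            inNote_ = false;
        }
    }

    void characterData(const std::string& text) override {
        // Text may arrive in several pieces, so append rather than assign.
        if (inNote_) current_ += text;
    }

    const std::vector<std::string>& notes() const { return notes_; }

private:
    bool inNote_ = false;
    std::string current_;
    std::vector<std::string> notes_;
};
```

In this sketch all knowledge of the document type lives in the subclass, which is why the approach only pays off for simple, stable document types.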
This code was originally written in 2000 and used successfully in two
commercial implementations. One was a downloadable Windows application
and the other was an ActiveX control. In both cases the
convenience of a document object class and the small size of the compiled
code were very advantageous, and the specific document types were simple
enough that implementing them with this code was easy.
Since this code was written, Tim Smith has done a
similar project.
There are probably some other ones out there too.
Example: Shakespeare plays dialog search
To demonstrate the use of this code, here's an example of a custom-built
application for searching through Shakespeare's plays for certain dialog.
We make use of
Jon Bosak's
XML markups, also available
in a single zip file.
The task we'll demonstrate is this: parse all the plays, then allow a user to
specify a string of dialog for which to search. This application will then
find all instances of that string in a line of dialog, and let us know the
play, act, scene, and speaker for each instance.
For simplicity we'll just apply the C strstr() function to each line separately,
with the lines as they were split up by the author
(i.e. a match can't span lines).
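As an illustration of the per-line search, here is a minimal sketch that runs strstr() over a list of lines. The function name findLines and the flat vector of strings are assumptions made for this example; the real shakespearesearch.cpp may be organized differently. The returned line numbers are 1-based, which appears to match the numbering in the sample run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical helper: returns the 1-based numbers of all lines that
// contain `needle`. Each line is searched on its own, so a phrase that is
// split across two lines is never reported.
std::vector<std::size_t> findLines(const std::vector<std::string>& lines,
                                   const char* needle)
{
    std::vector<std::size_t> hits;
    if (needle == nullptr || *needle == '\0') {
        return hits;  // an empty search string matches nothing here
    }
    for (std::size_t i = 0; i < lines.size(); ++i) {
        if (std::strstr(lines[i].c_str(), needle) != nullptr) {
            hits.push_back(i + 1);
        }
    }
    return hits;
}
```

Note that strstr() is case-sensitive, so in this sketch "My watch" would not match "my watch".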
Because we can ignore some features of the documents (e.g. we don't
care about the Dramatis Personae information or the Scene Description)
we have a pretty simple view of the documents:
<PLAY>
  [we expect only one PLAY per document]
  <TITLE>The Tragedy of Hamlet, Prince of Denmark</TITLE>
  [we expect only one TITLE tag per PLAY]
  <ACT>
    [we expect multiple ACTs per PLAY]
    <TITLE>ACT I</TITLE>
    [we expect only one TITLE tag per ACT]
    <SCENE>
      [we expect multiple SCENEs per ACT]
      <TITLE>SCENE I. Elsinore. A platform before the castle.</TITLE>
      [we expect only one TITLE tag per SCENE]
      <SPEECH>
        [we expect multiple SPEECHes per SCENE]
        <SPEAKER>BERNARDO</SPEAKER>
        [we expect only one SPEAKER per SPEECH]
        <LINE>Well, good night.</LINE>
        [we expect multiple LINEs per SPEECH]
        <LINE>If you do meet Horatio and Marcellus,</LINE>
        <LINE>The rivals of my watch, bid them make haste.</LINE>
      </SPEECH>
      [more SPEECHes here]
    </SCENE>
    [more SCENEs here]
  </ACT>
  [more ACTs here]
</PLAY>
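One way to picture this simplified view in memory is as a set of nested structs, one per tag we care about. This is only a sketch: the struct names and the countLines helper are hypothetical, and the actual classes in shakespearedoc.h may look quite different.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical in-memory model mirroring the nesting
// PLAY > ACT > SCENE > SPEECH > LINE described above.
struct Speech {
    std::string speaker;              // the single SPEAKER we keep
    std::vector<std::string> lines;   // one entry per LINE
};

struct Scene {
    std::string title;
    std::vector<Speech> speeches;
};

struct Act {
    std::string title;
    std::vector<Scene> scenes;
};

struct Play {
    std::string title;
    std::vector<Act> acts;
};

// Walks the whole hierarchy and counts every LINE in the play.
std::size_t countLines(const Play& play)
{
    std::size_t total = 0;
    for (const Act& act : play.acts) {
        for (const Scene& scene : act.scenes) {
            for (const Speech& speech : scene.speeches) {
                total += speech.lines.size();
            }
        }
    }
    return total;
}
```

A search over such a model would use the same three nested loops, which is what makes it easy to report the play, act, scene, and speaker for each match.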
If we encounter an unrecognized tag at any point, we'll just ignore it.
If we expect exactly one instance of a tag but encounter zero or more than one,
we'll print a warning message (to whatever stream the user of the
object requests, which could be NULL, stderr, or a file, for example)
but will otherwise deal with the situation
gracefully. (In fact, in the plays there are some SPEECHes with
multiple SPEAKERs, but for our application noting only one of them is fine.)
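The warn-and-continue policy for tags that should appear exactly once could be sketched as follows. The class SingleValue is hypothetical and not taken from shakespearedoc.cpp. Two assumptions are built in: a NULL stream means warnings are suppressed, and when a tag appears more than once the first value is the one kept (the choice of which one to keep is arbitrary here).

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical holder for a tag that is expected exactly once per parent.
class SingleValue {
public:
    SingleValue(const std::string& tag, std::FILE* warnings)
        : tag_(tag), warnings_(warnings), count_(0) {}

    // Called once for each instance of the tag that the parser reports.
    void add(const std::string& text) {
        ++count_;
        if (count_ == 1) {
            value_ = text;               // assumption: keep the first one
        } else {
            warn("more than one");       // extra instances are dropped
        }
    }

    // Called when the parent element ends.
    void close() const {
        if (count_ == 0) warn("missing");
    }

    const std::string& value() const { return value_; }
    int count() const { return count_; }

private:
    void warn(const char* problem) const {
        // Assumption: a NULL stream means "do not print warnings".
        if (warnings_ != nullptr) {
            std::fprintf(warnings_, "warning: %s <%s>\n",
                         problem, tag_.c_str());
        }
    }

    std::string tag_;
    std::FILE* warnings_;
    int count_;
    std::string value_;
};
```

For example, a SPEECH with two SPEAKERs would call add() twice; in this sketch the second call produces a warning (if a stream was given) and parsing simply carries on.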
The document class source code is
shakespearedoc.cpp and
shakespearedoc.h.
The source code for the application itself is
shakespearesearch.cpp.
For convenience, you can download all this source code, ready to be built
as an MSVC project:
xmlexample-MSVC.zip.
If you'd just like to run it, here's a Windows executable:
shakespearesearch.exe.
I haven't gotten around to writing a makefile for make/gcc
environments, but it should be quite easy to do and the code should be
fully portable. Just be sure to set the XML_MIN_SIZE
preprocessor flag as mentioned above.
Here's an example of running the application. Note that in this
example we are using a shell that's capable of expanding wildcard filenames;
if running under DOS you'd have to explicitly name all the XML files on the
command line. On a 750 MHz Pentium it takes about 1 or 2 seconds to parse
all the plays, and the searches themselves are too fast to measure.
>shakespearesearch *.xml
Parsing file : a_and_c.xml
Parsing file : all_well.xml
Parsing file : as_you.xml
Parsing file : com_err.xml
Parsing file : coriolan.xml
Parsing file : cymbelin.xml
Parsing file : dream.xml
Parsing file : hamlet.xml
Parsing file : hen_iv_1.xml
Parsing file : hen_iv_2.xml
Parsing file : hen_v.xml
Parsing file : hen_vi_1.xml
Parsing file : hen_vi_2.xml
Parsing file : hen_vi_3.xml
Parsing file : hen_viii.xml
Parsing file : j_caesar.xml
Parsing file : john.xml
Parsing file : lear.xml
Parsing file : lll.xml
Parsing file : m_for_m.xml
Parsing file : m_wives.xml
Parsing file : macbeth.xml
Parsing file : merchant.xml
Parsing file : much_ado.xml
Parsing file : othello.xml
Parsing file : pericles.xml
Parsing file : r_and_j.xml
Parsing file : rich_ii.xml
Parsing file : rich_iii.xml
Parsing file : t_night.xml
Parsing file : taming.xml
Parsing file : tempest.xml
Parsing file : timon.xml
Parsing file : titus.xml
Parsing file : troilus.xml
Parsing file : two_gent.xml
Parsing file : win_tale.xml
Enter a search string, or RETURN to quit searching:
to kiss you
FOUND OCCURANCE number 1:
PLAY 14: The Third Part of Henry the Sixth
ACT 3: ACT III
SCENE 3: SCENE III. France. KING LEWIS XI's palace.
SPEECH 17: WARWICK
LINE 3: Humbly to kiss your hand, and with my tongue
FOUND OCCURANCE number 2:
PLAY 15: The Famous History of the Life of Henry the Eighth
ACT 1: ACT I
SCENE 4: SCENE IV. A Hall in York Place.
SPEECH 42: KING HENRY VIII
LINE 3: And not to kiss you. A health, gentlemen!
FOUND OCCURANCE number 3:
PLAY 28: The Tragedy of King Richard the Second
ACT 1: ACT I
SCENE 3: SCENE III. The lists at Coventry.
SPEECH 13: Lord Marshal
LINE 2: And craves to kiss your hand and take his leave.
Enter a search string, or RETURN to quit searching:
doughnut
Enter a search string, or RETURN to quit searching:
on the stream
FOUND OCCURANCE number 1:
PLAY 1: The Tragedy of Antony and Cleopatra
ACT 1: ACT I
SCENE 4: SCENE IV. Rome. OCTAVIUS CAESAR's house.
SPEECH 6: OCTAVIUS CAESAR
LINE 6: Like to a vagabond flag upon the stream,
FOUND OCCURANCE number 2:
PLAY 23: The Merchant of Venice
ACT 1: ACT I
SCENE 1: SCENE I. Venice. A street.
SPEECH 4: SALARINO
LINE 12: Would scatter all her spices on the stream,
FOUND OCCURANCE number 3:
PLAY 35: The History of Troilus and Cressida
ACT 2: ACT II
SCENE 3: SCENE III. The Grecian camp. Before Achilles' tent.
SPEECH 57: ULYSSES
LINE 2: But carries on the stream of his dispose
Enter a search string, or RETURN to quit searching:
(RETURN)
License
The source code files expatParser.cpp, expatParser.h,
and docObjectIface.h, as well as the
Shakespeare example source code files, are distributed under
the BSD license.
The license for James Clark's expat is explained on
his site.