Overview of XML

CS 284 (CSA), Spring 2005

On this page, we present an outline of markup languages in general and XML in particular, providing a framework for programming with XML. We assume only general prior exposure to HTML.

Markup

  • Terms: markup, markup languages, rendition ("source" text with markup), presentation (formatted view), style sheet (determines how to present rendition).

  • Markup languages such as HTML and XML use tags for the markup.

    • General form: <name attrib="value" ...> content... </name>.

    • Tags such as <name attrib="value" ...> are open tags; </name> is the corresponding close tag.

    • Elements of the document are indicated by tag pairs <name ...> content... </name>.
      The content of such an element is the (marked up) text between the open and close tags.

    • Attributes are options of the form attrib="value" ... within an open tag.

    • Entities are additional objects within the rendition. For example, the entity &nbsp; represents a "non-breaking space", and &lt; represents the less-than character. Other ideas for entities: an entity used to insert a company's logo graphic; and an entity used to insert a standard body of text, such as a copyright notice.
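To make the terms above concrete, here is a minimal Java sketch that parses a one-element rendition and reports its element name, an attribute value, and its content. The class name MarkupTerms, the method describe, and the sample <note> element with its priority attribute are illustrative assumptions, not part of the course materials. The sketch uses the entity &lt; because it is predefined in XML, whereas &nbsp; is defined by HTML and would have to be declared before an XML parser accepts it.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

class MarkupTerms {
    // Hypothetical sample rendition: one element named "note" with one
    // attribute (priority="high") and content that uses the entity &lt;.
    static final String RENDITION = "<note priority=\"high\">1 &lt; 2</note>";

    // Returns "element name | attribute value | content" for the root element.
    static String describe(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            return root.getTagName() + " | "
                + root.getAttribute("priority") + " | "
                + root.getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse rendition", e);
        }
    }

    public static void main(String[] args) {
        // The parser replaces the entity &lt; with the character '<'.
        System.out.println(describe(RENDITION)); // prints: note | high | 1 < 2
    }
}
```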

History of markup languages

  • SGML, Standard Generalized Markup Language, developed by Goldfarb et al. since the 1960s, beginning at IBM.

    • Goldfarb coined the term markup in 1970.

    • Standards in '86, '91.

    • Tags of the form <name attrib="value" ...> content... </name>. Properly bracketed, i.e., every open tag in the rendition has a close tag.

    • Fundamental goals:

      • Common rendition representation

      • Extensibility, i.e., ability to define new tags, etc.

      • Document type rules

    • Document type rules are represented in a separate DTD (Document Type Definition) language, using regular expressions. See the DTDs section below.

  • LaTeX, Leslie Lamport published 1985

    • Implemented as a macro package over TeX (Donald Knuth, 1978-81) typesetting language.

    • Some proper bracketing: e.g., "environments" have the form \begin{name}...\end{name}

    • Common rendition representation, extensible, but no explicit document type rules.

    • Commonly used in mathematics and CS research publications.

  • HTML, Tim Berners-Lee (creator of WWW) and Anders Berglund, '89

    • Tags <name attrib="value" ...> content... </name> as in predecessor language SGML

    • Common rendition, but no extensibility (at first), no (modifiable) document type rules.

    • Designed as a simplification of SGML for WWW authoring.

  • XML, Berners-Lee et al (WWW consortium, w3c.org) '96, standard '98.

    • Simplified subset of SGML, but with extensibility and document type rules

    • Document types may be expressed using DTDs or XML Schema (an XML form for specifying document types).

DTDs

  • Examples: Note.dtd, SpecML.dtd.

  • Uses a form of regular expressions to represent patterns:

    symbol  meaning                            example
    ,       sequence: items expected in order  (to,from,message)
    |       OR                                 (nosuperclass | (superclass | interface)+)
    *       0 or more                          var*
    +       1 or more                          var+
    ?       0 or 1                             var?
    ( )     grouping                           (nosuperclass | (superclass | interface)+)
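As a rough illustration of how such a pattern constrains a document, the following Java sketch validates a rendition against a small DTD built around the sequence pattern (to,from,message) from the table. The DTD text here is a hypothetical stand-in written as an internal subset; it is not the course's Note.dtd, and the class name DtdCheck and method isValid are likewise assumptions. Validation uses the standard DocumentBuilderFactory.setValidating(true) switch.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;

class DtdCheck {
    // Hypothetical DTD (internal subset): a note must contain exactly
    // to, from, message, in that order.
    static final String DTD =
          "<!DOCTYPE note [\n"
        + "  <!ELEMENT note (to,from,message)>\n"
        + "  <!ELEMENT to (#PCDATA)>\n"
        + "  <!ELEMENT from (#PCDATA)>\n"
        + "  <!ELEMENT message (#PCDATA)>\n"
        + "]>\n";

    // Returns true if the body parses and satisfies the DTD above.
    // For simplicity, malformed input is also reported as false.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Validity violations arrive through error(); rethrow them so
            // that parse() stops instead of continuing silently.
            builder.setErrorHandler(new ErrorHandler() {
                public void warning(SAXParseException e) { }
                public void error(SAXParseException e) throws SAXParseException { throw e; }
                public void fatalError(SAXParseException e) throws SAXParseException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String inOrder = "<note><to>Ann</to><from>Bob</from><message>Hi</message></note>";
        String swapped = "<note><from>Bob</from><to>Ann</to><message>Hi</message></note>";
        System.out.println(isValid(inOrder)); // children match (to,from,message)
        System.out.println(isValid(swapped)); // children are out of order
    }
}
```

Changing the content model to, say, (to,from?,message) would make the from element optional, following the ? row of the table.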

Processing XML

  • There are two main approaches to processing XML in a language such as Java:

    • SAX, the Simple API for XML, performs actions as an XML document is parsed (input as a rendition and prepared for processing as elements, attributes, and entities).

    • DOM, the Document Object Model, creates an internal data structure (a DOM tree) during parsing, a structure that can be manipulated later by the program code.

  • SAX

    • The API for SAX provides a parse() method for parsing XML input, and an interface ContentHandler whose methods specify the actions to be performed when certain tags (elements) are encountered in the XML input stream.

    • Example ContentHandler methods: startDocument(), endDocument(), startElement(), endElement(), characters().
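The event-driven style of SAX can be sketched in Java as follows. This is a minimal example under stated assumptions: the class name SaxEvents and the method trace are hypothetical, and the handler extends the standard helper class DefaultHandler (which implements ContentHandler with do-nothing methods) rather than implementing the interface directly.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class SaxEvents {
    // Returns the sequence of events reported while the input is parsed.
    static List<String> trace(String xml) {
        final List<String> events = new ArrayList<>();
        // Only the callbacks of interest are overridden.
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                events.add("start " + qName); // an open tag was read
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName); // a close tag was read
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A parser may deliver one run of text in several chunks;
                // chunks that are only whitespace are skipped here.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text " + text);
                }
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("SAX parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        for (String event : trace("<note><to>Ann</to><message>Hi</message></note>")) {
            System.out.println(event);
        }
    }
}
```

Note that no tree is built: each callback sees one piece of the document and then the parser moves on, which is why SAX suits large inputs that need only a single pass.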

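For comparison, the DOM approach described above can be sketched in Java like this: the whole document is parsed into a tree first, and the program walks that tree afterwards. The class name DomTree and the methods elementNames and collect are hypothetical names chosen for this illustration; the parsing itself uses the standard javax.xml.parsers and org.w3c.dom APIs.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class DomTree {
    // Parses the input into a DOM tree, then walks the finished tree
    // and returns the element names in document order.
    static List<String> elementNames(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            List<String> names = new ArrayList<>();
            collect(doc.getDocumentElement(), names);
            return names;
        } catch (Exception e) {
            throw new IllegalStateException("DOM parsing failed", e);
        }
    }

    // Depth-first traversal: record this node if it is an element,
    // then visit each of its children.
    private static void collect(Node node, List<String> names) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            names.add(node.getNodeName());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collect(children.item(i), names);
        }
    }

    public static void main(String[] args) {
        // prints: [note, to, message]
        System.out.println(
            elementNames("<note><to>Ann</to><message>Hi</message></note>"));
    }
}
```

Because the tree stays in memory after parsing, the same program could go on to revisit or modify nodes, at the cost of holding the entire document at once.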



rab@stolaf.edu, October 09, 2006