
Chapter 12
Next Page >
XML Support in SQL Server 2000
Microsoft SQL Server
has become a giant among the select group of enterprise-ready Relational
Database Management Systems, but as with those other RDBMSs, its roots
are in pre-Internet solutions.
The Internet
revolution has highlighted a set of old tactical and strategic challenges
for the Microsoft SQL Server development team. These challenges include:
- Storing the large
amounts of textual information that Web-based, user-friendly database
applications require
- Delivering that
textual (and other) stored information to the Web
- Sharing information
with other departments and organizations that do not use the same
RDBMS system
In earlier editions
of SQL Server, Microsoft has addressed these issues with such features
as Full Text Search, the Web Publishing Wizard, DTS, ADO, and OLE DB.
SQL Server 2000 introduces XML compatibility-the new holy grail of the
computing industry and the latest attempt to tackle the same old problems.
XML
(R)evolution
To communicate with customers in today's rich-content world, you need
to provide them with information. Until very recently, such information
was inevitably encapsulated in proprietary, document-based formats that
are not shared easily. For example, word processor documents are optimized
for delivery on paper, and relational databases are often structured and
normalized in formats unsuitable to end users.
The first step in the right direction was Standard Generalized Markup
Language (SGML). Although it was designed by Charles Goldfarb in the late
1960s, it became the international standard for defining markup languages
in 1986 after the creation of the ISO standard. In the late 1980s, companies
and government agencies started to adopt this tag-based language. It allowed
them to create and manage paper documentation in a way that was easy to
share with others.
Then in the 1990s, the Web appeared on the scene and our collective focus
shifted from isolated islands of personal computers and local networks
to a global network of shared information. SGML's tagged structure would
seem to make it a perfect candidate to lead the Internet revolution, but
the complexity of SGML makes it difficult to work with and unsuited to
Web application design.
Instead of SGML, the developers of the Internet adopted the Hypertext
Markup Language (HTML), a simple markup language used to create hypertext
documents that are portable from one platform to another. HTML is a simplified
subset of SGML. It was defined in 1991 by Tim Berners-Lee as a way to
organize, view, and transfer scientific documents across different platforms.
It uses HTTP (the Hypertext Transfer Protocol) to transfer information
over the Internet. This new markup language was an exciting development
and soon found nonscientific applications. Eventually, companies and users
started to use it as a platform for e-commerce-the processing of business
transactions without the exchange of paper-based business documents.
Unfortunately, HTML has some disadvantages. One of the biggest is a result
of its main purpose. HTML is designed to describe only how information
should appear-that is, its format. It was not designed to define the syntax
(logical structure) or semantics (meaning) of a document. It could make
a document readable to a user, but it required that user to interact with
the document and interpret it. The computer itself could not parse the
document, because the necessary "meta-information" (literally,
information about the information) was not included with the document.
Another problem with
HTML is that it is not extensible. It is not possible to create new tags.
HTML is also a "standard" that exists in multiple versions-and
multiple proprietary implementations. Web developers know that they have
to test even their static HTML pages in all of the most popular browsers
(and often in several versions of each), because each browser (and each
version of each browser) implements this "standard" somewhat
differently. Different development toolsets support different versions
of this standard (and often different features within a single standard).
I want "eXtensible" not "Extensible". It was not a
typo. -- Djn
In 1996, a group working under the auspices of the World Wide Web Consortium
(W3C) created a new standard tagged language called XML (Extensible Markup
Language). It was designed to address some of the problems of HTML and
SGML. XML is a standardized document formatting language, a subset of
SGML, that enables a publisher to create a single document source that
can be viewed, displayed, or printed in a variety of ways. As is the case
with HTML, XML is primarily designed for use on the Internet. HTML, however,
is designed primarily to address document formatting issues, while XML
addresses issues relating to data and object structure. XML provides a
standard mechanism for any document builder to define new XML tags within
any XML document. Its features lower the barriers for creation of integrated,
multiplatform, application-to-application protocols.
(1)Introduction to
XML
In today's world,
words such as "tag," "markup," "element,"
"attributes," and "schema" are buzzwords that you
can hear anywhere (well, at least in the IT industry), but what do these
terms mean in the context of markup languages?
(2)Introduction to
Markup Languages
In a broader sense,
a markup is anything that you place within a document that provides additional
meaning or additional information. For example, in this book we use italic
font to emphasize each new phrase or concept that we define or introduce.
I have a habit of using a highlighter when I am reading books. Each time
I use my highlighter, I change the format of the text as a means of helping
me find important segments later.
Markups usually define:
-
Formatting
-
Structure
-
Meaning
A reader has to have
an implicit set of rules for placing markups in a document-otherwise those
markups are meaningless to him. A markup language is a set of rules that
defines:
- What constitutes
a markup
- What a markup means
(2)Building
Blocks of Markup Languages
The syntax of markup
languages such as SGML, HTML, and XML is based on tags, elements, and
attributes.
A tag is a markup language building block that consists of delimiters
(angled brackets) and the text between them:
|<TITLE>
An element is a markup language building block that consists of a pair
of tags and the content between them:
|<TITLE>SQL Server 2000 Stored Procedure Programming</TITLE>
Each element has an opening tag and a closing tag. The text between these
tags is called the content of the element.
An attribute is a building block in the form of a name/value pair that
delimits a tag:
|<font size="2">
Okay, let's say that you have created a document and that you have marked
up some parts of it. Now what? You can share it with others. They will
use something called a user agent to review the document. In a broader
context, a user agent could be a travel agent that helps a customer buy
tickets for a trip. However, in the IT industry, a user agent is a program
that understands the markup language and presents information to an end
user. An example of such a program is a Web browser designed to presen
t documents created using HTML.
(2)XML
Let's take a
look at a simple example of an XML document:
<Inventory>
<Asset Inventoryid="5">
<Equipment>Toshiba Portege 7020CT</Equipment>
<EquipmentType>Notebook</EquipmentType>
<LocationId>2</LocationId>
<StatusId>1</StatusId>
<LeaseId>1234</LeaseId>
<LeaseScheduleId>1414</LeaseScheduleId>
<OwnerId>83749271</OwnerId>
<Cost>6295.00</Cost>
<AcquisitionType>Lease</AcquisitionType>
</Asset>
</Inventory>
(3)Elements
An XML document must contain one or more elements. One of
them is not part of any other element and therefore it is called the document's
root element.
It must be uniquely named. In the preceding example, the root element
is named Inventory.
Each element can contain one or more other elements. In the preceding
example, the Inventory
element contains one Asset
element. The Asset
element also contains other elements. The Equipment
element contains just its content-the text string "Toshiba Portege
7020CT".
Unlike HTML, XML is case sensitive. Therefore, <Asset>,
<asset>,
and <ASSET>
are different tag names.
It is possible to define an empty element. Such elements can be
displayed using standard opening and closing tags:
|<Inventory></Inventory>
or using special notation:
|<Inventory/>
If an element contains attributes but no content, an empty element is
an efficient way to write it.
| <Asset Inventoryid="5"/>
An element can have more than one attribute. The following example shows
an empty element that contains nine attributes:
<asset Inventoryid="12" EquipmentId="1" LocationId="2"
StatusId="1" LeaseId="1" LeaseScheduleId="1"
OwnerId="1" Lease="100.0000" AcquisitionTypeID="2"/>
You are not allowed to repeat an attribute in the same tag.
The following example shows a syntactically incorrect element:
|<Inventory Inventoryid="12" Inventoryid="13"/>
(3)Processing
Instructions
An XML document often starts with a tag that is called a processing
instruction. For example, the following processing instruction notifies
the reader that the document it belongs to is written in XML that complies
with version 1.0.
|<?xml version="1.0"?>
A processing instruction has the following format:
|<?name data?>
The name portion identifies the processing instruction to the application
that is processing the XML document. Names must start with XML. The data
portion that follows is optional. It could be used by the application.
Tip: It is not required but is recommended that you start an XML document
with a processing instruction that explicitly identifies that document
as an XML document defined using a specified version of the standard.
(3)Document
Type Definition
and Document Type Declaration
We mentioned earlier that markups are meaningless if it is not possible
to define rules for:
- What constitutes a markup
- What a markup means
A document type definition (DTD) is a type of a document that is often
used to define such rules for XML documents. The DTD contains descriptions
and constraints (naturally, not Transact-SQL constraints) for each element
(such as the order of element attributes and membership). User agents
can use theDTD file to verify that an XML document complies with its rules.
The DTD can be an external file that is referenced by an XML document:
|<!DOCTYPE Inventory SYSTEM "Inventory.dtd">
or it can be part of the XML document itself:
<?xml version="1.0"?>
<!DOCTYPE Inventory [
<!ELEMENT Inventory (Asset+)>
<!ELEMENT Asset (EquipmentId, LocationId, StatusId, LeaseId,
LeaseScheduleId, OwnerId, Cost, AcquisitionTypeID)>
<!ATTLIST Asset Inventoryid CDATA #IMPLIED>
<!ELEMENT EquipmentId (#PCDATA)>
<!ELEMENT LocationId (#PCDATA)>
<!ELEMENT StatusId (#PCDATA)>
<!ELEMENT LeaseId (#PCDATA)>
<!ELEMENT LeaseScheduleId (#PCDATA)>
<!ELEMENT OwnerId (#PCDATA)>
<!ELEMENT Cost (#PCDATA)>
<!ELEMENT AcquisitionTypeID (#PCDATA)>
]>
<Inventory>
<Asset Inventoryid="5">
<EquipmentId>1</EquipmentId>
<LocationId>2</LocationId>
<StatusId>1</StatusId>
<LeaseId>1</LeaseId>
<LeaseScheduleId>1</LeaseScheduleId>
<OwnerId>1</OwnerId>
<Cost>1295.00</Cost>
<AcquisitionTypeID>1</AcquisitionTypeID>
</Asset>
</Inventory>
The DTD document does not have to be stored locally. A reference can
include a URL or URI that provides access to the document:
<!DOCTYPE Inventory SYSTEM
"http://www.trigonblue.com/dtds/Inventory.dtd">
A universal resource identifier
(URI) identifies a persistent resource on the Internet. It is a number
or name that is globally unique. A special type of URI is a universal
resource locator (URL) that defines a location of a resource on the
Internet. A URI is more general because it should find the closest copy
of a resource or because it would eliminate problems in finding a resource
that was moved from one server to another.
(3)XML Comments and CDATA sections
It is possible to write comments within an XML document. The basic syntax
of the comment is:
|<!--commented text-->
Commented text can be any character string that does not contain two consecutive
hyphens (--) and that does not end with a hyphen (-). Comments can stretch
over more than one line:
<!-- This is a comment. -->
<!--
This is another comment.
-->
Comments cannot be part of any other tag:
<Order <!-- This is an illegal comment. --> OrderId = "123">
...
</Order>
You can use CDATA sections in XML documents to insulate blocks of text
from XML parsers. For example, if you are writing an article about XML
and you want also to store it in the form of an XML document, you can
use CDATA sections to force XML parsers to ignore markups with sample
XML code.
The basic syntax of a CDATA sections is:
|<![CDATA[string]]>
The string can be any character string that does not contain "]]>"
in sequence. CDATA sections can occur anywhere in an XML document where
character data is allowed.
<Example>
<Text>
<![CDATA[<Inventory Inventoryid="12"/>]]>
</Text>
</Example>
(3)Character and Entity References
Like HTML and SGML, XML also includes a simple way to reference characters
that do not belong to the ASCII character set. The syntax of a character
reference is:
The decimal (NNNNN) or hexadecimal (XXXX) code of the character must be
preceded by "&#" or "&#x", respectively, and
followed by a semicolon (;).
Entity references are used in XML to insert characters that would cause
problems for the XML parser if they were inserted directly into the document.
This type of reference is basically a mnemonic alternative to a character
reference. There are five basic entity references:
|
Entity |
Meaning |
|
& |
& |
|
' |
' |
|
< |
< |
|
> |
> |
|
" |
" |
Entity references are often used to represent characters with special
meaning in XML. In the following example, entity references are used to
prevent the XML parser from parsing the content of the <Text> tag:
<Example>
<Text>
<Inventory Inventoryid="12"/>
</Text>
</Example>
Structure of XML
Documents
XML Documents consists of the three parts that you can see in the following
illustration:
Illustration 1
The first part of the document is called the prolog or document type declaration
(not document type definition). It is not required. It can contain processing
instructions, a DTD, and comments. The body of the document contains the
document's elements. The data in these elements is organized into a hierarchy
of elements, their attributes, and their content. Sometimes an XML document
contains an epilog, an optional part that can hold final comments, or
processing instructions, or just white space.
(2)XML Document Quality
There are two levels of document quality in XML:
- Well-formed documents
- Valid documents
An XML document is said to be a well-formed document when:
The concept of a valid document has been imported to XML from SGML. In
SGML all documents must be valid. XML is not so strict. It is possible
to use an XML document even without a DTD document. If the user agent
knows how to use the XML document without the DTD, then the DTD need not
even be sent over the Net. It just increases traffic and ties up bandwidth.
(3)XML Schema
DTD is not the only type of document that can store rules for an XML document.
Several companies (including Microsoft) have submitted a proposal to W3C
for an alternative type of metadata document called an XML schema or XML
data. These are the major differences between a DTD and an XML schema:
- XML schemas support datatypes and range constraints.
- The language in which XML schemas are written is XML. Developers do
not have to read an additional language as they do with DTDs.
- XML schemas support namespaces (XML entities for defining context).
(3)XML-Data Reduced (XDR)
At the time of this writing, the W3C had not yet adopted the XML schema
as a standard. Microsoft has implemented a variation of XML schema syntax
called XML-Data Reduced (XDR) in the MSXML parser that is delivered as
a part of Internet Explorer 5.
Microsoft has promised complete support for the XML schema when the W3C
awards it Recommended status, but for the time being, more and more tools
and organizations are using XML-Data Reduced. It is also important to
note that Microsoft uses XDR in BizTalk, one of the most important initiatives
in the Web application market. It is an initiative intended to create
e-commerce vocabularies for different vertical markets. SQL Server 2000
also uses XDR for its XML schema.
Let's review an example of an XML schema document:
<Schema name="Schema"
xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="Inventory" content="empty" model="closed">
<AttributeType name="Inventoryid" dt:type="i4"/>
<AttributeType name="EquipmentId" dt:type="i4"/>
<AttributeType name="LocationId" dt:type="i4"/>
<AttributeType name="StatusId" dt:type="ui1"/>
<AttributeType name="LeaseId" dt:type="i4"/>
<AttributeType name="LeaseScheduleId" dt:type="i4"/>
<AttributeType name="OwnerId" dt:type="i4"/>
<AttributeType name="Rent" dt:type="fixed.14.4"/>
<AttributeType name="Lease" dt:type="fixed.14.4"/>
<AttributeType name="Cost" dt:type="fixed.14.4"/>
<AttributeType name="AcquisitionTypeID" dt:type="ui1"/>
<attribute type="Inventoryid"/>
<attribute type="EquipmentId"/>
<attribute type="LocationId"/>
<attribute type="StatusId"/>
<attribute type="LeaseId"/>
<attribute type="LeaseScheduleId"/>
<attribute type="OwnerId"/>
<attribute type="Rent"/>
<attribute type="Lease"/>
<attribute type="Cost"/>
<attribute type="AcquisitionTypeID"/>
</ElementType>
</Schema>
This schema describes the structure of an XML document that contains Inventory
information. Only one element in an XML document will be based on this
schema. It is defined in the <ElementType> tag. The definition also
specifies its name ("Inventory"), content (the tag is "empty"
because all information will be carried in attributes), and content model
("closed"-it is not possible to add elements that are not specified
in the schema).
The element contains multiple attributes. Each attribute is first defined
in an <AttributeType> tag and then instantiated in an <attribute>
tag:
<AttributeType name="Cost" dt:type="fixed.14.4"/>
...
<attribute type="Cost"/>
For each attribute, the schema defines a name and a datatype. You can
see a list of acceptable datatypes later in this chapter.
The following listing shows an XML document that complies with the previous
schema:
<Inventory xmlns="x-schema:Schema.xml"
Inventoryid="5"
EquipmentId="1"
LocationId="2"
StatusId="1"
LeaseId="1"
LeaseScheduleId="1"
OwnerId="1"
Cost="1295.0000"
AcquisitionTypeID="1"/>
(3)Schema Constraints
Let's review schema attributes that can be used to declare elements and
attributes. These can be classified as:
- Element constraints
- Attribute constraints
- XML datatypes
- Group constraints
(4)Element Constraints
Elements in a schema can be constrained using attributes of the <ElementType>
tag:
- name
- content
- model
- order
- group
- minOccurs
- maxOccurs
The name attribute defines the name of the subelement.
Possible values for the content attribute are listed in the Table 12-1.
|
Content |
Meaning |
|
"textOnly" |
Only text is allowed
as content |
|
"eltOnly" |
Only other elements
are allowed as content |
|
"empty" |
No content |
|
"mixed" |
Both text and elements
are allowed |
Table 1: Content Attribute Values
An important innovation in XML schemas (that was not available in DTDs)
is the capability to add nondeclared elements and attributes to an XML
document. By default, every element of every XML document has its model
attribute set to "open". To prevent the addition of nondeclared
elements and attributes, the model attribute has to be "closed".
It is also possible to define how many times a subelement can appear in
its parent element using the maxOccurs and minOccurs attributes. Positive
integer values and "*" (unlimited number) are allowed in the
maxOccurs attribute, and "0" and positive integer values are
allowed in the minOccurs attribute. The default value for minOccurs is
"0". The default value for maxOccurs is "1", except
that when the content attribute is "mixed", maxOccurs must be
"*".
An order attribute specifies the order and quantity of subelements (see
Table 12-2).
|
Order |
Meaning |
|
"seq" |
Subelements must
appear in the order listed in the schema. |
|
"one" |
Only one of
subelements listed in the schema can appear in the XML
document. |
|
"many" |
Any number of
subelements can appear in any order. |
Table 2: Order Attribute Values
The default value for order is "seq" when the content attribute
is set to "eltOnly" and "many" when the content attribute
is set to "mixed".
(4)Attribute Constraints
By their nature, attributes are more constrained than elements. For example,
attributes do not have subelements (or subattributes), and it is not possible
to have more than one instance of an attribute within the element.
The required attribute (constraint) in a schema specifies that the attribute
is mandatory in XML documents that follow the schema. The default attribute
(constraint) in a schema specifies the default value of the attribute
in an XML document (the parser will use that value if an attribute is
not present).
The schema can be set so that an attribute value is constrained to a set
of predefined values:
<AttributeType name="status"
dt:type="enumeration"
dt:values="open in-process completed" />
(4)XML Datatypes
The schema can also enforce the datatype of the attribute or element.
Table 2 in Appendix A lists datatypes and their meanings.
Table 3 in Appendix A maps XML datatypes to SQL Server datatypes.
(4)Group Constraints
The group element allows an author to apply certain constraints to a group
of subelements. In the following example, only one price (rent, lease,
or cost) can be specified for the Inventory element:
<Schema name="Schema" xmlns="urn:schemas-microsoft-com:xml-data"
xmlns:dt="urn:schemas-microsoft-com:datatypes">
<ElementType name="Inventory" content="eltOnly"
model="closed" order="many">
<element type="Inventoryid"/>
<element type="EquipmentId"/>
<element type="LocationId"/>
<element type="StatusId"/>
<element type="LeaseId"/>
<element type="LeaseScheduleId"/>
<element type="OwnerId"/>
<group order = "one">
<element type="Rent"/>
<element type="Lease"/>
<element type="Cost"/>
</group>
<element type="AcquisitionTypeID"/>
</ElementType>
</Schema>
The group constraint accepts order, minOccurs, and maxOccurs attributes.
(3)XML Namespaces
Some entities from different areas of a document can have the same name.
For example, you could receive a purchase order document that contains
a <name> tag for the customer and a <name> tag for the reseller
company. People reading this documents would be able to distinguish them
by their context. However, an application would need additional information
to correctly interpret the data.
An answer to this problem is to create XML namespaces to provide the XML
document with a vocabulary (that is, a context). After that, customer
and company names can be referenced using a context prefix:
<contact:name>Tom Jones</contact:name>
<Company:name>Trigon Blue</Company:name>
Naturally, before these prefixes can be used, they have to be defined.
The root element of the following document contains three attributes.
Each of them specifies a namespace and a prefix used to reference it:
<PurcheseOrders
xmlns:contact="http://www.trigonblue.com/schemas/Contact.xml"
xmlns:Company="http://www.trigonblue.com/schemas/Company.xml" xmlns:dsig="http://dsig.org">
<PurcheseOrder>
<Customer>
<contact:name>Tom
Jones</contact:name>
</Customer>
<PurcheseDate>2000-09-11</PurcheseDate>
<SalesOrganization>
<Company:name>Trigon
Blue</Company:name>
<Company:DUNS>817282919</Company:DUNS>
<Company:ID>1212</Company:ID>
</SalesOrganization>
<dsig:digital-signature>78901314</dsig:digital-signature>
</PurcheseOrder> </PurcheseOrders>
Next Page >
|