Incremental Maintenance of XML Documents in Relational Databases


ER Paper ID: 66

Kajal T. Claypool and Elke A. Rundensteiner
Department of Computer Science, Worcester Polytechnic Institute, Worcester, MA
kajal

Abstract

Today many application engineers struggle not only to publish their relational, object or ASCII file data on the Web but also to integrate information from diverse sources, often inventing and reinventing a suite of hard-wired integration tools. A key problem with such solutions is the subsequent maintenance of these integrated sources, i.e., a change in the source must be made visible in the target. Today there are numerous new approaches to translate XML to relational, and relational to XML format. However, there is no generic mechanism that allows us to deal efficiently with, say, the addition of a new attribute to a DTD and its subsequent propagation to its relational equivalent, irrespective of the translation between XML and relational schema. We present an integration framework that solves this problem. We first show how we can map XML to relational systems in a model-independent manner by modeling these translations. We term such a modeled transformation a mapping. We then show how a DTD change can be propagated to the relational schema independent of the mapping. For this we define a generic set of operators that modify a given map in place to produce a correctly updated output. We show that these incremental operators preserve the correctness of the mapping and are equivalent to a complete re-mapping between the source and the target.

Keywords: Metamodel, Model Management, Integration, Schema Transformation, XML Mapping, Change Propagation, XML Evolution, Maintenance

1 Introduction

Networked environments like the Internet have catalyzed a phenomenal growth in the publication of data, establishing XML as the new popular data exchange format and raising numerous new research problems. At the forefront is the problem of managing XML data, both in terms of storing it effectively [ZLMR01, FK99, SHT 99, CKS 00] and querying it [FK99, DFF 99, RLS99]. In an attempt to re-use well-established technologies, researchers and commercial vendors are focusing their efforts on utilizing relational and object database engines to store and manage XML data [Cor00, IBM00, CFLM00]. As an example, Figures 1 and 2 show a DTD and conforming XML documents respectively, while Figure 3 shows a possible way of storing these in relational tables.

Figure 1: A Sample Business DTD. [Two EMPTY elements: the first has the required CDATA attributes phonenumber and name, the second the required CDATA attributes dslnumber and lastname.]

Figure 2: A Business XML Document Conforming to the DTD in Figure 1. [Two phone client entries, named Joe Smith and Mia Weber, and two DSL client entries, for Acer Direct and Software Inc.]

Figure 3: Relational Schema BUSINESS-DB for Storing the Business DTD and the Business XML documents given in Figures 1 and 2 respectively. [Table PHONECLIENTS (phonenum, name) with rows for Joe Smith and Mia Weber; table INTERNETCLIENTS (dslnumber, clientname) with rows for Acer Direct and Software Inc.]

Problem Description. Unfortunately, any persistent information is subject to change, whether due to errors, to change in the real-world entity that is represented, or as part of incremental design. XML, due to the volatility of the Web and because of its heavy use on the Web, is perhaps more susceptible to change than other more traditional forms of persistent information.
However, while many researchers continue to develop strategies for storing XML in relational, extended relational or object databases [Cor00, IBM00, CFLM00, FK99], there has been little focus on maintaining these stores when the XML changes. The key to keeping XML and its stores consistent is to propagate every change from the XML to the store, so that any XML change is reflected in the store's schema and data. In this paper, we address this problem of maintaining consistency between XML and its storage format and data when structural (DTD) changes are applied to the XML. While our approach is applicable to relational, object and semi-structured systems, in this paper we present it using relational systems.

Figure 4: Updated Business DTD. [The phone client element now contains a billingdate sub-element of type #PCDATA; the attribute lists are unchanged.]

Figure 5: Updated XML Documents Conforming to the DTD in Figure 4. [Each phone client entry (Joe Smith, Mia Weber) now carries a billingdate of Jun 15; the DSL client entries are unchanged.]

Figure 6: Updated Relational Schema BUSINESS-DB for Storing the Updated Business DTD given in Figure 4. [Table PHONECLIENTS (phonenum, name, billingdate-r) with rows (Joe Smith, Jun 15) and (Mia Weber, Jun 15); table INTERNETCLIENTS (dslnumber, clientname) unchanged.]

Figure 7: Updated Relational Schema BUSINESS-DB for Storing the Updated Business DTD given in Figure 4. [Table PHONECLIENTS (phonenum, name, REF(BILLINGDATE)) with rows (Joe Smith, REF(Jun 15)) and (Mia Weber, REF(Jun 15)); table INTERNETCLIENTS (dslnumber, clientname) unchanged; new table BILLINGDATE (billingdate) with rows Jun 15 and Jun 15.]

Consider for example the DTD and XML documents in Figures 1 and 2, stored in a relational database as depicted in Figure 3. Now suppose that we insert a new sub-element billingdate into the DTD element phoneclient to produce a new DTD. The resulting DTD and the updated XML documents are depicted in Figures 4 and 5 respectively. Suppose further that the mapping scheme for this example maps a simple sub-element (with PCDATA) in an XML document to a relational attribute in a relational table. For this change and this mapping, the corresponding change to the relational schema is the addition of a new attribute billingdate-r of type VARCHAR (PCDATA is mapped to VARCHAR here) to the table PHONECLIENTS (Figure 6). As a next step, the relational data must be updated to match the changes in the XML documents, as shown in Figure 6.
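The schema and data change of Figure 6 can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The phone numbers, the choice of phonenum as key, and the in-memory database are assumptions made only for illustration; the paper's prototype uses Oracle 8i, and this is not its code.

```python
import sqlite3

# Hypothetical phone numbers: the values of Figure 3 are not legible in the source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE PHONECLIENTS (phonenum TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO PHONECLIENTS VALUES (?, ?)",
    [("555-0100", "Joe Smith"), ("555-0101", "Mia Weber")],
)

# Schema change: the new PCDATA sub-element becomes a VARCHAR attribute.
conn.execute('ALTER TABLE PHONECLIENTS ADD COLUMN "billingdate-r" VARCHAR')

# Data change: load the billingdate values found in the updated XML documents.
updated_docs = {"555-0100": "Jun 15", "555-0101": "Jun 15"}
for phonenum, billingdate in updated_docs.items():
    conn.execute(
        'UPDATE PHONECLIENTS SET "billingdate-r" = ? WHERE phonenum = ?',
        (billingdate, phonenum),
    )

# PRAGMA table_info yields one row per column; index 1 is the column name.
columns = [row[1] for row in conn.execute("PRAGMA table_info(PHONECLIENTS)")]
rows = conn.execute(
    'SELECT name, "billingdate-r" FROM PHONECLIENTS ORDER BY name'
).fetchall()
print(columns)
print(rows)
```

The point of the sketch is that a single DTD change implies two target-side steps, one on the schema and one on the data, and both depend on the chosen mapping.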
Now suppose that we had mapped the sub-element not to an attribute in a relational table, but instead to a relational table of its own, as illustrated in Figure 7. The change described above would then cause not the addition of a new attribute but rather the creation of a new relational table, and perhaps the building of a foreign-key relationship or the use of a reference attribute, as shown in Figure 7. This change in the mapping of the sub-element construct requires new operations to handle the propagation of change from XML to relational schemas.

One approach to solving this problem is to provide hard-coded translations of every change on XML to an equivalent change on the relational storage. A major drawback of this hard-coded approach is its lack of extensibility: (1) a new mapping between XML and relational models would require new code that translates the set of DTD changes to a new relational schema per the new mapping; (2) any new definition of change on the XML would require new software to propagate it to the relational schema; (3) a mapping from XML to, say, extended relational, object or semi-structured systems would require additional change management software for those targets; and (4) a mapping from relational or object systems would also need new software to propagate change from the relational or object systems to XML. In this paper we present a maintenance approach that achieves this extensibility.

Our Approach. Given that the underlying processes for maintenance in all of the above cases are very similar, we propose a model-driven approach for propagating a change on a DTD to an equivalent change on the relational database, independent of the data model and, more importantly, independent of the particular mapping of the DTD to the storage structure. For this, we define a meta-model-based integration framework. For this integration, we use the graph model proposed by Atzeni et al. [AT96] to express different data models, including but not limited to XML and relational models, and application schemas. More importantly, we extend the graph model to also express the actual mapping semantics between XML and other target data models. Figure 8 presents the three-layered architecture of our framework. Like other meta-modeling systems for data models [Boo94, AT96], the top layer models the data model, the middle layer the application schemas, and the bottom layer represents the application data. Similarly, for maps, the top layer models the map models, the middle layer the mapping between two application schemas, and the bottom layer maps the source data to the target data using the mapping defined in the middle layer.

Our framework offers semi-automatic yet declarative support for handling transformation strategies from one data model to another. These strategies are declared initially as map models in the model layer by a system administrator. A key advantage of the framework is that the mapping between application schemas, and the subsequent generation of the transformation code that translates source data to target data, can be almost completely automated, thereby increasing user productivity and making it easy to re-target for new data models and maps.

Figure 8: Three Layers of the Framework.

With integration between data models and application schemas modeled in our framework, we can translate a specific local source change into changes on the graph model and hence propagate it to the target. For example, we first translate the specific DTD changes into generic modification operations on the graph model and then propagate them through the graph map model (the mapping) to produce the updated target. These operators are thus defined only for the middle (application) layer of the framework, i.e., on application schemas and on the mappings between them. This generic structural modification in the middle layer in turn generates the code for transforming the input data to the output data. However, a complete re-mapping, and hence a complete re-loading of the target data, is inefficient. In this paper, we therefore develop these operators as incremental modification operations and show that they are equivalence-preserving, i.e., they produce the same output as the re-mapping.

A key advantage of our approach is that, because data-model-specific changes such as DTD changes are translated to these generic modification operators, it provides maintenance extensibility with very little user effort. Our approach can thus be used to maintain targets in the event of a source change irrespective of (1) the data model of the source or the target, and (2) the mapping between the source and the target.

In summary, in this paper we present an approach for the maintenance of targets when there is a source change. In particular, we apply our approach to maintain target relational databases when there is a DTD change, independent of the mapping of XML to relational systems. We are currently building a prototype of the framework. The system is being developed using Java 1.2, with Oracle 8i for persistently storing the modeling information. The system has been accepted as a demonstration at SIGMOD 2001 [CRZS01].

Overview. Section 2 presents the graph models used to model the data models, the application schemas, the mapping between the data models and the mapping between the application schemas. Section 3 presents the overall approach for change propagation, while Sections 4, 5 and 6 present details on the individual steps laid out in Section 3.
We present related work in Section 7 and conclude in Section 8.

2 Modeling in the Framework

In this section we briefly review the data and map modeling in our framework as needed for this paper, while full details can be found in [CR01].

2.1 Modeling Data Models and Application Schemas

We build our modeling of data models on the graph model presented by Atzeni et al. [AT96]. To express data models in the model layer and application schemas in the application layer (Figure 8), we utilize a graph-theoretic formalism. Our metamodel is based on a fixed set of metaconstructs that are either node types $\mathcal{N}$, namely complex and atomic node types, or edge types $\mathcal{E}$, namely containment and property edges. A containment edge exists between two complex nodes, while a property edge stems from a complex node and is incident on an atomic node.

Patterns and Structures. We use two main notions to describe data models as well as application schemas. A structure is a directed acyclic graph whose nodes and edges are metaconstructs of our metamodel and whose edges have order as labels. A monotonically increasing order, specified as an integer i, gives the local ordering for all outgoing edges from a given node n. Trees within the structure S composed of all the nodes and the edges reachable from the outer nodes¹ of S are referred to as components of S.

Definition 1 (Structure) A structure is a quadruple $S = (G, \tau_N, \tau_E, \omega)$ where $G = (N, E)$ is a directed acyclic graph; $\tau_N$ and $\tau_E$ are typing functions, $\tau_N : N \to \mathcal{N}$ and $\tau_E : E \to \mathcal{E}$; and $\omega$ is an ordering function, $\omega : E \to \mathbb{N}$, where $\mathbb{N}$ is the set of natural numbers.

A pattern is a rooted tree whose nodes and edges are metaconstructs of the metamodel and whose edges have quantifiers and order as labels. A quantifier, given as a pair of integers [x:y] with $0 \le x \le y \le \infty$, specifies the minimum and maximum number of times an edge with the same label can appear in a structure typed by this pattern. A pattern essentially describes a collection of structures that represent the same composition of the metaconstructs. The monotonically increasing order i associated with an edge in a pattern specifies the order of the edges in the pattern, which must also be preserved in a structure.

Definition 2 (Pattern) A pattern is a triple consisting of a structure $S = (G, \tau_N, \tau_E, \omega)$ such that G is a rooted tree, a function $q$ that associates a quantifier with each edge of G, and a function that associates a monotonically increasing order with each edge of G.

Two structures $S_1$ and $S_2$ are isomorphic if for each node $n$ in $S_1$ there is a node $n'$ in $S_2$ such that every edge $e$ outgoing from $n$ has a matching edge $e'$ stemming from $n'$ in $S_2$, and the order of $e$ is equal to the order of $e'$. Similarly, we define isomorphism for patterns [CR01]. A structure S matches a pattern P if it can be described by one of the structures that are in the pattern P [CR01]. A structure S is an instanceof a set of patterns C if for each component $S_i$ of the structure S there is a pattern $P_i \in C$ such that $S_i$ matches $P_i$. Figure 9 represents a set of patterns C and Figure 10 shows a structure S that is an instanceof patterns P1 and P3 in Figure 9. We use labels A, B, ..., H in Figure 9.
In Figure 10, the lower-case labels indicate an instanceof relationship with the node carrying the upper-case label in Figure 9. Hence, b1, b2 and b3 are all instanceof the node B. The structure S1 in Figure 10 is an instanceof patterns P1 and P3 in Figure 9. The order between c1 and d1 in Figure 10 must be the same as the order between C and D in Figure 9.

Data Model. A data model can be represented in our framework by a set of patterns C such that each element of C is assigned a label that represents a construct of the data model. For example, Figure 11 (a) shows a segment of the DTD model. Here the labels DTD, Element, E-Attribute and SubElement are assigned to the first, second, third and fourth complex node respectively in the pattern P1, and the labels SubElement and Element to the first and second complex node respectively in the pattern P3 given in Figure 9. Similarly, labels and order are assigned to all the edges.

Application Schema. Similar to the data model, an application schema can be represented in our framework via a structure such that each element of the structure is assigned a label that corresponds to the real-world entity modeled by the application schema. An application schema is said to conform to a data model if the structure S of the application schema is an instance of C, the set of patterns that describe the data model. Figure 11 (b) shows a slice of the application DTD Business that is obtained by assigning labels to the structure in Figure 10. This is the application layer representation of the sample DTD in Figure 1.

¹ We use the term outer nodes to refer to nodes that have no edges incident on them.
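To make the notions of pattern, quantifier and order more concrete, the following sketch checks a small structure against a flat pattern. It is a simplification under stated assumptions, not the formal matching of [CR01]: patterns are reduced to a table of allowed edges, and the constructs, quantifiers and orders in `PATTERN` are illustrative choices rather than the patterns of Figure 9.

```python
from collections import Counter

# Pattern edges: (parent construct, child construct) -> (min, max, order).
# max=None stands for an unbounded quantifier. All values are assumptions.
PATTERN = {
    ("Element", "E-Attribute"): (0, None, 1),
    ("Element", "SubElement"): (0, None, 2),
}


def matches(node_types, edges, pattern):
    """Return True if every edge is allowed, the quantifiers hold, and the
    local edge order of each node respects the order given in the pattern."""
    counts = Counter()
    outgoing = {}
    for parent, child, order in edges:
        key = (node_types[parent], node_types[child])
        if key not in pattern:
            return False
        counts[(parent, key)] += 1
        outgoing.setdefault(parent, []).append((order, pattern[key][2]))
    # Quantifier check for every node and every pattern edge leaving its type.
    for node, ntype in node_types.items():
        for key, (lo, hi, _) in pattern.items():
            if key[0] != ntype:
                continue
            n = counts[(node, key)]
            if n < lo or (hi is not None and n > hi):
                return False
    # Order check: read in local order, pattern orders must not decrease.
    for out in outgoing.values():
        pattern_orders = [p for _, p in sorted(out)]
        if pattern_orders != sorted(pattern_orders):
            return False
    return True


node_types = {"phoneclient": "Element", "phonenumber": "E-Attribute",
              "name": "E-Attribute", "billingdate": "SubElement"}
good = [("phoneclient", "phonenumber", 1), ("phoneclient", "name", 2),
        ("phoneclient", "billingdate", 3)]
bad_order = [("phoneclient", "billingdate", 1), ("phoneclient", "phonenumber", 2),
             ("phoneclient", "name", 3)]
print(matches(node_types, good, PATTERN), matches(node_types, bad_order, PATTERN))
```

In this sketch the second structure is rejected only because the sub-element precedes the attributes, which mirrors the requirement that the order given in a pattern be preserved in its structures.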

Figure 9: A Set of Patterns.

Figure 10: A Structure S1, an Instance of Patterns P1 and P3 given in Figure 9.

Figure 11: An Example Showing the DTD Model described in the Data Model Layer and a DTD in the Application Layer.

2.2 Modeling Maps in the Framework

In contrast to other metadata integration approaches [MR83, AT96, BR00, PR95], we also model the mapping between two data models or within one data model. We assume that there is a global universe of discourse of all patterns in the system, denoted by C, and a set of all structures in the system, denoted by Inst(C). This includes all sets of patterns that describe a particular data model and all structures that describe any application schema.

Metaconstructs for Modeling Maps

Our map metamodel is based on a set of seven map node types $\mathcal{R}$ and a set of four map edge types $\mathcal{S}$ [CR01]. With these map nodes and map edges we are able to define mappings across data model boundaries as well as re-structuring mappings within the same data model. Simple mappings between data models, i.e., translations of a construct in one data model to a construct in another data model, are modeled by cross and connect nodes, linked together via containment edges. These form a logical unit for performing cross data model mappings. Hence, in our discussion here we focus only on the cross and connect map nodes and on the containment, input and output edges. The cross node maps one node to another node, e.g., an element to a relation as shown in Figure 12 (a). The connect node maps one edge to another edge, e.g., the hasa edge into a has edge in a relational schema as depicted in Figure 12 (b).

Map Model and Application Maps

We now give a brief overview of the graph-based map metamodel we have designed to express map models and application maps. Throughout we use mn to refer to a map node and me to refer to a map edge.

Figure 12: The Cross and Connect Map Nodes.

Map Structure. Like the structures and patterns given in Section 2.1, we introduce the notions of map patterns and map structures for map modeling. A map structure is a directed acyclic graph whose nodes and edges are elements of our set of map metaconstructs $\mathcal{R} \cup \mathcal{S}$ and which maps an input structure to an output structure. In our context here, each map node can be connected to another map node in a map structure via a containment edge. A containment edge between two map nodes $mn_1$ and $mn_2$ implies that the map node $mn_2$ is contained in the map node $mn_1$, i.e., the mapping of $mn_1$ can be completed only if $mn_2$ has already been mapped. In addition, each map node must have an input and an output edge, which connect it to the input and output structure respectively. Moreover, the input and the output edges for all map nodes in the same map structure must be incident on some node or edge in the same input or output structure respectively. Figure 13 (b) depicts a map structure.

Definition 3 (Map Structure) A map structure is a five-tuple $MS = (MG, U, V, S_1, S_2)$ where $MG = (MN, ME)$ is a directed acyclic graph, $U$ and $V$ are typing functions, $U : MN \to \mathcal{R}$ and $V : ME \to \mathcal{S}$, and $S_1$ and $S_2$ are the input and output structures, respectively.

Properties of Map Structures. Consider two map structures $MS_1 = (MG_1, U_1, V_1, S_1, S_2)$ and $MS_2 = (MG_2, U_2, V_2, S_3, S_4)$. Then $MS_1$ can be mapped to $MS_2$ via a pair of functions $\gamma : MN_1 \to MN_2$ and $\delta : ME_1 \to ME_2$. Two map structures are equivalent if for each $mn \in MN_1$, $U_1(mn) = U_2(\gamma(mn))$, and for each edge $me : (mn, mn') \in ME_1$, $V_1(me) = V_2(\delta(me))$ and $\delta(me) = (\gamma(mn), \gamma(mn'))$. Two map structures $MS_1$ and $MS_2$ are isomorphic if $MS_1$ and $MS_2$ are equivalent and if their input structures $S_1$ and $S_3$, respectively, are isomorphic.

Map Pattern. Like a pattern (Section 2.1), a map pattern describes a collection of structures that represents a specific composition of the map metaconstructs.
The input and output of a map pattern are patterns as described in Section 2.1, implying that input and output edges from map nodes in the map pattern must be incident on nodes in those patterns. Figure 13 (a) depicts a map pattern. The instanceof relationship between a map structure and a map pattern is defined similarly to the instanceof relationship between a structure and a pattern [CR01].

Definition 4 (Map Pattern) A map pattern is a four-tuple $MP = (MS, d, P_1, P_2)$ where $MS = (MG, U, V, S_1, S_2)$ is a map structure such that MG is a rooted tree, $d$ is a function that associates a quantifier with each edge of MG, and $P_1$ and $P_2$ are the input and output patterns, respectively.

Map Model. Intuitively, a map model MM maps (a subset of) a data model $DM_1$ to (a subset of) another data model $DM_2$. A map model MM is composed of a set of map patterns in which each element of a map pattern is assigned a label. In addition, constraints can be defined on each map edge in the set of map patterns.
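The equivalence condition on map structures stated above can be written down directly. The sketch below assumes a hypothetical dictionary encoding of a map structure (map node types, and typed edges keyed by their endpoints) and takes the node mapping as given; the edge mapping is the one induced by the node mapping, as in the condition on edge images.

```python
def equivalent(ms1, ms2, node_map):
    """Check type preservation under node_map (nodes of ms1 -> nodes of ms2).

    Each map structure is encoded as
    {"nodes": {id: node_type}, "edges": {(src, dst): edge_type}},
    which is an assumed encoding chosen only for this sketch.
    """
    # Every map node must keep its type under the mapping.
    for mn, mtype in ms1["nodes"].items():
        if ms2["nodes"].get(node_map.get(mn)) != mtype:
            return False
    # Every map edge must map to an edge of the same type between the images.
    for (src, dst), etype in ms1["edges"].items():
        image = (node_map[src], node_map[dst])
        if ms2["edges"].get(image) != etype:
            return False
    return True


ms_a = {"nodes": {"x1": "cross", "x2": "cross", "c1": "connect"},
        "edges": {("x1", "x2"): "containment", ("x1", "c1"): "containment"}}
ms_b = {"nodes": {"y1": "cross", "y2": "cross", "d1": "connect"},
        "edges": {("y1", "y2"): "containment", ("y1", "d1"): "containment"}}
mapping = {"x1": "y1", "x2": "y2", "c1": "d1"}
swapped = {"x1": "y1", "x2": "d1", "c1": "y2"}
print(equivalent(ms_a, ms_b, mapping), equivalent(ms_a, ms_b, swapped))
```

Isomorphism would additionally require the two input structures to be isomorphic, which this sketch does not check.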

Figure 13: (a) Example Map Pattern and (b) Map Structure.

Constraints are conditions applied on the input and output edges of the map nodes. Figure 14 depicts a map model that assigns labels to the map pattern represented in Figure 13 (a), modeling one possible mapping of a DTD to the relational model. Here the element (E) in a DTD model is mapped via a cross node to a relation (R) in the relational model, and the attribute (EA) and the sub-element (SE) in the DTD model are each mapped via a cross node to an attribute (A) in the relational model. Connect nodes are used to translate relationships in the DTD model to relationships in the relational model.

Application Map. An application map defines a mapping between two application schemas. It is a map structure where each of its map nodes and map edges is assigned a label. Moreover, constraints may be specified on each map edge. Figure 15 shows a segment of a mapping between an element of the application DTD and the relation PHONECLIENTS of the relational schema.

Figure 14: Example Map Model to Map DTD to Relational Model.

Figure 15: Example Application Map to Transform Application DTD to Relational Application Schema.

3 The Approach to Change Propagation

In this section, we present a concrete example to illustrate the steps we take within our framework to propagate a change from a DTD to the target relational schema and data.

Modeling the DTD and Relational Schema. Ours is an integration framework in which the source, the target and the mapping between them are modeled in the application layer. To enable this, the source and the target data model, as well as the map model, must be defined a priori by a system administrator in the model layers of the framework. In the application layer, a user is then able to set up their source, target and the mapping between them by a semi-automated process. For this, the framework provides tools and graphical user interfaces (GUIs) to (1) import the source schema; (2) select and instantiate a map model to create a mapping from a given source to a target; and (3) generate the target application schema. Figure 11 (b) represents the modeling of the Business DTD in the application layer. Figure 15 represents a segment of the application map as modeled in the application map layer to transform the Business DTD representation to its relational equivalent. The relational schema represented in Figure 15 can now be utilized to generate a relational application schema in the data layer. As each map node contains fragments of code to perform the actual data transformation, XML documents can be transformed to relational data by a combination of these generated code fragments. The resultant set of relational tables and the corresponding relational data is as shown in Figure 3.

Propagating Change. Consider now that the Business DTD in the data layer (Figure 1) is modified by the addition of a new attribute to produce the updated DTD and XML documents given in Figures 16 and 21 respectively. In order to perform a structural change on a DTD, we use the evolution operator taxonomy from XEM, the XML Evolution Manager [SKC 01], which we have developed as part of our previous work. For a complete set of evolution operations in XEM the reader is referred to [SKC 01]. To maintain consistency, this change to the DTD must be propagated to all targets that are derived from it, as depicted in Figure 17. To achieve this consistency, we:

1. translate each DTD change into a sequence of one or more change operations on the graph model that can be applied to the input structure, i.e., the DTD modeled in the application layer. This sequence of graph operators is encapsulated into one atomic transaction. These graph operators are generic and hence applicable in the application layer to any application schema modeled in the framework, independent of the data model. (See Section 4);

2. this change on the input structure triggers a sequence of change operations on the map structure, i.e., the application map that maps the DTD to the relational schema. Again these operators are generic and hence applicable to any map structure modeled in the framework. These operators update both the application map and the output structure. (See Section 5);

3. a change on the output structure, the effect of step 2, must now be transformed into a local change on the application schema in the data layer, thereby updating both the output application schema and the output data. The change operations on both the input structure and the map structure drive the code generation that queries the input data and transforms it as per the updated mapping in the application layer to produce the updated target data. Once again this code generation process is generic and data model independent, driven by the models captured in the framework. (See Section 6).

At the end of this sequence of steps, all application maps and the corresponding targets are updated to reflect the change in the input DTD. The modification of the maps and the generation of the modified outputs are handled via automated steps and are triggered as a result of Step 1. However, the framework can also operate in a non-automatic mode, whereby the modify operators defined in Section 5 can be manually invoked by the user for each map structure whose input structure has been modified.
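The three steps above can be traced end to end in a small sketch. Everything here is hypothetical: the function names, the tuple encoding of operations, the flat `MAP_MODEL` table (loosely following the DTD-to-relational map model described for Figure 14), and the emitted SQL are assumptions for illustration, and only the schema part of step 3 is covered, not the reloading of data.

```python
# Assumed map model: DTD construct -> relational construct.
MAP_MODEL = {"Element": "Relation", "E-Attribute": "Attribute", "SubElement": "Attribute"}


def translate_dtd_change(change):
    """Step 1: turn a DTD-level change into generic graph operations."""
    if change["op"] != "addAttribute":
        raise ValueError("only attribute insertion is sketched here")
    return [("InsertNode", change["name"], change["parent"], "E-Attribute")]


def propagate_through_map(graph_ops, app_map):
    """Step 2: update the application map and derive output-structure changes."""
    target_ops = []
    for op, node, parent, construct in graph_ops:
        relation = app_map[parent]            # target of the parent element
        app_map[node] = f"{relation}.{node}"  # new map entry for the new node
        target_ops.append((op, MAP_MODEL[construct], relation, node))
    return target_ops


def generate_sql(target_ops):
    """Step 3: turn output-structure changes into local schema changes."""
    statements = []
    for op, construct, relation, name in target_ops:
        if (op, construct) != ("InsertNode", "Attribute"):
            raise ValueError("unsupported change in this sketch")
        statements.append(f"ALTER TABLE {relation} ADD COLUMN {name} VARCHAR")
    return statements


app_map = {"phoneclient": "PHONECLIENTS"}
change = {"op": "addAttribute", "name": "billingdate", "parent": "phoneclient"}
sql = generate_sql(propagate_through_map(translate_dtd_change(change), app_map))
print(sql)
```

Only the first function knows about DTDs and only the last knows about SQL; the middle step works on generic operations, which is the property the approach relies on for extensibility.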

Figure 16: Updated Business DTD. [The attribute list of the phone client element now includes a required CDATA attribute billingdate in addition to phonenumber and name.]

Figure 17: Propagating Change from the DTD to Relational Schema and Data.

4 Modifying a Structure

The goal of this section is to show how any change on the DTD can be translated into a change on the input structure in the application layer. For this we first define a set of primitive change operations on the input structure. We then show via an example how a DTD change can be translated into a sequence of these change operators. We define three basic change primitives: InsertNode, DeleteNode and AlterLabel. A key criterion for these primitives is to guarantee that, after their application, the updated input structure is correct. We define correctness as follows.

Definition 5 (Correctness of Structure Change Operators) Given an application schema AS in the application layer that is an instanceof some data model DM, a change operation op applied on AS is said to be correct if its output application schema AS' is an instanceof the same data model DM.

4.1 Structure Modification Primitives

InsertNode. Informally, the first modification primitive, denoted as InsertNode(Node: m, Node: r, Label: l, Edge Type: t, App. Schema: AS = (S, f)), inserts in place a new node m into the structure S in the application schema AS = (S, f) to produce an updated structure S' and hence an updated application schema AS'. However, this node can only be inserted into the input structure if it does not violate any constraints, such as the order or quantifier, of its model DM (AS is instanceof DM). Currently, the InsertNode operation appends the node, i.e., if $e'$ represents the edge from r to m, then $\omega(e') = \max(\omega(e_a)) + 1$. This operator allows the modeling of DTD changes such as the addition of an attribute or a sub-element to a DTD element.
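A minimal sketch of the append behaviour of InsertNode is given below. The dictionary encoding of a structure, the `max_allowed` parameter standing in for the upper bound of the pattern's quantifier, and the decision to take the maximum over all outgoing edges of the parent are assumptions of this sketch rather than the formal definition.

```python
def insert_node(structure, node, parent, label, edge_type, max_allowed=None):
    """Append `node` under `parent` in place and return the structure.

    structure = {"labels": {node_id: label}, "edges": [edge dicts]} is an
    assumed encoding; max_allowed models the quantifier's upper bound.
    """
    if parent not in structure["labels"]:
        raise KeyError(f"unknown parent node {parent!r}")
    if node in structure["labels"]:
        raise ValueError(f"node {node!r} already exists")
    siblings = [e for e in structure["edges"] if e["src"] == parent]
    same_type = [e for e in siblings if e["type"] == edge_type]
    # Reject the insertion if it would exceed the quantifier of the pattern.
    if max_allowed is not None and len(same_type) >= max_allowed:
        raise ValueError("insertion would violate the quantifier constraint")
    # Append: the new edge gets the highest local order plus one.
    order = max((e["order"] for e in siblings), default=0) + 1
    structure["labels"][node] = label
    structure["edges"].append(
        {"src": parent, "dst": node, "type": edge_type, "order": order}
    )
    return structure


schema = {
    "labels": {"r": "phoneclient", "a1": "phonenumber", "a2": "name"},
    "edges": [
        {"src": "r", "dst": "a1", "type": "hasa", "order": 1},
        {"src": "r", "dst": "a2", "type": "hasa", "order": 2},
    ],
}
insert_node(schema, "n1", "r", "billingdate", "hasa")
print(schema["edges"][-1])
```

Because the new edge is always appended, no existing edge needs to be renumbered in this simplified version.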
Definition 6 gives the formal definition of the InsertNode operator.

Definition 6 (InsertNode) Given $AS = (S, f)$ that conforms to some data model $DM = (C, h)$, such that $S = (N, E)$ conforms to some pattern $P \in C$. Let $r$ be a node in $S$. $InsertNode : AS \times \langle r, m \rangle \to AS'$, where $\langle r, m \rangle$ is a pair of nodes, is defined as the insertion of $m$ into $N'$ such that $e' : \langle \phi(r), m \rangle \in E'$. Let $E_{r,t}$ denote the set of edges stemming from node $r \in S$ with label $t$ (edge type), such that $|E_{r,t}|$ stays within the quantifier $q_i(e_i)$. Let $E'_{\phi(r),t}$ denote the set of edges stemming from node $\phi(r) \in S'$ such that $e' \in E'_{\phi(r),t}$. The default behavior for InsertNode appends the edge $e'$ to the (ordered) set $E'_{\phi(r),t}$, i.e., $\omega(e') = \max(\omega(e_a)) + 1$, $e_a \in E_{r,t}$. For all edges in the remaining edge sets of $r$ whose order follows that of $e'$, the order is incremented by 1.

12 Moreover, it inserts the new label l into the set of label L, i.e., L 6 L l, where L E AS such that f ( ) 6 l. Other than the addition of the new edge, no other modification is made to the AS. DeleteNode. The DeleteNode operation, denoted by DeleteNode(Node:, Node: r, App. Schema: AS 6 (S, f )), removes an existing node from an application schema AS, such that it removes the edge from the node r to the node, and produces as output an updated application schema AS. While many variations of a DeleteNode operation are possible, like other work in this area [Tec94, BW95], we present a simple delete that removes a node in a structure S only if has no outgoing edges. If occurs in the input structure such that if e is the edge from r to, then 9 (e) 6 max(9 (ea )) for all eaè v ˆ, the set of edges stemming from r, then the order of all nodes and hence all the edges with order 9 (ea ) l 9 (e) must be adjusted. Intuitively, if a sub-element is removed from a DTD element, then this operation will remove the sub-element, the link between the element and sub-element, and will fix the order of all sub-elements that occur after the removed sub-element. Hence, if a sub-element whose order is 3 is removed from a DTD element that has 4 sub-elements, then the order of the fourth sub-element must be changed to 3. However, the DeleteNode operation is disallowed if its deletion results in the removal of a required edge. Thus, the DeleteNode operation is disallowed if removal of edge e:< r, l, the set of edges stemming from E v ˆ r, violates the quantifier constraint, y v ˆ ~} x y? i (e). That is if the set of edges stemming from oyqsr"t does not meet the minimum quantifier given in the pattern. Definition 7 gives the formal definition of DeleteNode operation. Definition 7 (DeleteNode) Given AS 6 (S, f ) conforms to some data model DM 6 (C, h ) such that S 6 (N, E) conforms to some pattern P 6 (Si,? i ), P E C. The DeleteNode denoted by DeleteNode:AS 2 AS k < r, l, where < r, l E.šk. 
is defined as: 1. if n E N, then o (n) E N provided n if e:< na, nb l E E, then e :<œo (na ), o (nb ) l E E, provided nb 6ž. 3. if f ( ) 6 l, such that l E L, then L 6 L - l. 4. if edge e:< r, l EŸv ˆ, the set of edges stemming from r, and if there are edges ea, eb, ŒHŒHŒ, eeÿv ˆ (i.e., stemming from r) such that 9 (e) < 9 (ea ), 9 (eb ), ŒHŒHŒ, 9 (e ), then 9 (e ) 6 9 (e) 1, for e 6 ea, eb, ŒHŒHŒ, e. AlterLabel. The third modification primitive is the AlterLabel operation. Rename operations that are applied on the DTD such as rename of an element or of an attribute can be translated to equivalent operations on the input structure in!" # via the AlterLabel operation. For more details on this operation we refer the reader to [CR01]. 4.2 Evolving a DTD in the Application Layer - An Example Assume that the operation adddtdatt (billingdate, phoneclient, CDATA, true, Jun 15) is applied to the Business DTD given in Figure 1 producing an updated DTD as shown in Figure 16. In 11

our framework, each DTD entity is modeled by a node, and its relationships to other nodes are modeled by edges between the nodes. To successfully translate the DTD adddtdatt operation to a generic operation on the input structure, we must first model each parameter of adddtdatt by a node. An attribute is modeled in our framework by the pattern shown in Figure 18. Here, each parameter of adddtdatt is represented by a node: billingdate by node n1, phoneclient by node r, CDATA by node n2, true by node n3, and Jun 15 by node n4. To correctly apply adddtdatt to the input structure, we must thus apply four InsertNode operations. For example, InsertNode(n1, r, billingdate, hasa, AS) inserts the node n1 with label billingdate into the application schema AS such that there exists an edge with label hasa between the nodes billingdate and phoneclient, denoting that billingdate is an attribute. The other nodes are inserted similarly, resulting in the updated application schema shown in Figure 19. Each insertion of a node is checked against the pattern in Figure 18 to ensure its correctness.

Figure 18: A Pattern for Modeling an Attribute of a DTD.

Figure 19: The Updated Application Schema as Modeled in the Application Layer.

Similarly, other DTD update operations from XEM [SKC 01] can be translated into generic graph modification operations on the structure in the application layer.

5 Modifying Maps

When a change is made to a structure via the operations defined in Section 4, many maps for which that structure is an input structure may become invalid. To maintain consistency, the map and the final output of the map must be modified to reflect the change. In this section we first describe the generic map change operators, i.e., the insert and delete operations for a map structure. We then show how a map structure in general can be modified via these map change operators to reflect the addition or deletion of nodes in the input structure.
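The graph primitives of Section 4, on which the map operators build, can be illustrated with a small Python sketch. It is a toy model under stated assumptions, not the authors' implementation: the class `AppSchema`, its method names, and the `min_edges` parameter (standing in for the minimum quantifier of the pattern) are hypothetical, edge order is represented by list position, and attaching n2, n3 and n4 to n1 with the edge labels type, required and default is an assumption read off Figure 20.

```python
from dataclasses import dataclass, field


@dataclass
class AppSchema:
    """Toy application schema: labelled nodes and ordered, labelled edges."""
    labels: dict = field(default_factory=dict)  # node id -> label
    edges: dict = field(default_factory=dict)   # parent id -> [(edge label, child id)]

    def insert_node(self, node, parent, label, edge_label):
        """InsertNode: attach `node` to `parent` as its last child."""
        if node in self.labels or parent not in self.labels:
            raise ValueError("node must be new and parent must exist")
        self.labels[node] = label
        self.edges.setdefault(parent, []).append((edge_label, node))

    def delete_node(self, node, parent, min_edges=0):
        """DeleteNode: remove a leaf `node` and the edge from `parent` to it."""
        if self.edges.get(node):
            raise ValueError("node still has outgoing edges")
        siblings = self.edges.get(parent, [])
        remaining = [(lbl, child) for lbl, child in siblings if child != node]
        if len(remaining) == len(siblings):
            raise ValueError("no edge from parent to node")
        if len(remaining) < min_edges:
            raise ValueError("quantifier constraint would be violated")
        # List positions encode edge order, so later siblings shift down by one.
        self.edges[parent] = remaining
        del self.labels[node]
        self.edges.pop(node, None)

    def order(self, parent, node):
        """Return the 1-based order of the edge from `parent` to `node`."""
        return [child for _, child in self.edges[parent]].index(node) + 1


# The four InsertNode operations that model the adddtdatt example.
schema = AppSchema(labels={"r": "phoneclient"})
schema.insert_node("n1", "r", "billingdate", "hasa")
for node, label, edge_label in [("n2", "CDATA", "type"),
                                ("n3", "true", "required"),
                                ("n4", "Jun 15", "default")]:
    schema.insert_node(node, "n1", label, edge_label)
```

Storing the children of a node as a list keeps the order adjustment of DeleteNode implicit: removing one entry shifts every later sibling down by one position, which mirrors item 4 of Definition 7.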
5.1 Primitive Operations on Map Structures

We define two primitive map change operators: InsertMapNode, which inserts a new map node, and DeleteMapNode, which deletes a map node. These operators must guarantee that their application produces a correct map structure and hence a correct output structure. We define correctness for map change operators as follows.

Definition 8 (Correctness of Map Structure Operators) Let AM be an application map, AM instanceof some map model MM, such that it produces the output application schema AS, AS instanceof some data model

DM. A map change operator mop is said to be correct if mop(AM) produces (1) an updated map $AM'$ such that $AM'$ instanceof MM; and (2) an updated output application schema $AS'$ such that $AS'$ instanceof DM.

InsertMapNode. This operation is invoked to insert a new mapping between an input node and an output node in an existing map structure. For example, if a new attribute billingdate is added to the input structure, then this operation, invoked with the right parameters, inserts into the map structure a new mapping from the attribute billingdate in the DTD to the attribute billingdate-r in the relational schema. Hence, the InsertMapNode operation, denoted by InsertMapNode(Map Node: $mn$, Map Node: $mr$, Edge Type: $me$, App. Map: $AM = (MS, \cdot, \cdot)$), inserts a given map node $mn$ into the specified map structure MS. The InsertMapNode operation is allowed on an application map AM only if attaching the map node $mn$ to the map node $mr$ preserves the quantifier specified in the map pattern. Definition 9 gives the formal definition of the InsertMapNode operation.

Definition 9 (InsertMapNode) Let $AM = (MS, \cdot, \cdot)$ be an application map, AM instanceof map model MM. Let $mr$ be a map node in AM. The operation InsertMapNode: $AM \times \langle mr, mn \rangle \to AM'$ modifies AM in place and inserts $mn$ into $MN$ such that $me:\langle mr, mn \rangle \in ME$, and if $ME_{mr}$ denotes the set of edges stemming from node $mr \in MS$, then $|ME_{mr}| \leq d_i(me_i)$.

DeleteMapNode. The DeleteMapNode operation removes an existing mapping between two nodes. For example, suppose that a source attribute billingdate is mapped to a target attribute billingdate-r, and the source attribute billingdate is deleted. The mapping from the source attribute to the target attribute must now also be removed. This is made possible by the DeleteMapNode primitive. The DeleteMapNode(Map Node: $mn$, Map Node: $mr$, App.
Map: $AM = (MS, \cdot, \cdot)$) operation removes an existing map node $mn$ from an application map AM by removing the edge from $mr$ to the map node $mn$. The DeleteMapNode operation operates on a map node $mn$ in a map structure MS only if $mn$ has no outgoing edges. Definition 10 gives the formal definition of the DeleteMapNode operation.

Definition 10 (DeleteMapNode) Let $AM = (MS, \cdot, \cdot)$ conform to some map model MM. Let $MS = (MN, ME)$ conform to some map pattern $MP = (MS, d, P_A, P_B)$. Let $mr$ be a map node in MS. DeleteMapNode removes a node $mn \in MS$ in application map AM and produces as output a new application map $AM'$ with map structure $MS' = (MN', ME')$, i.e., DeleteMapNode: $AM \times \langle mr, mn \rangle \to AM'$, where $\langle mr, mn \rangle$ is a pair of map nodes, such that:

1. if $mn_a \in MN$, then its counterpart is in $MN'$, provided $mn_a \neq mn$.
2. if $me:\langle mn_a, mn_b \rangle \in ME$, then the edge $me'$ between the counterparts of $mn_a$ and $mn_b$ is in $ME'$, provided $mn_a \neq mr$ and $mn_b \neq mn$.

5.2 Modifying a Map Structure: The MS-Imodify Operator

In this section, we show how a change made to the input structure via an InsertNode operation can be translated to a change on the map structure using the basic map change primitives defined in Section 5.1. To translate a change on the input structure to a change on the map structure, we must first identify the right mapping, i.e., the map node(s) that must be either inserted into or removed from the map structure. The InsertMapNode or DeleteMapNode operations can then be invoked to perform the change on the map structure. For

example, if a DeleteNode operation is applied on the input structure to delete the attribute billingdate, we must first determine the map node that maps the attribute billingdate to the relational attribute billingdate-r. DeleteMapNode can then be invoked to remove that map node from the map structure. We now define two operators on a map structure, MS-Imodify and Dmodify, which in turn invoke the InsertMapNode and DeleteMapNode operations. For space reasons, we present only the MS-Imodify operation here.

Definition 11 (MS-Imodify) Let AM be an application map such that AM is instanceof MM, a map model. Let MS-Imodify be applied on AM, MS-Imodify: $AM \times \langle n, r \rangle \to MS'$, such that:

1. for $n \notin S_A$ and $n \in S_A'$, create a map node $M_n$ (a node map) such that $n' = o(n)$, only if there exists a pattern $P$ such that $M_n$ is an instanceof $P$;
2. for $e:\langle r, n \rangle \notin S_A$ and $e:\langle r, n \rangle \in S_A'$, create a map node $M_e$ (an edge map) such that $e'$ is the image of $e$;
3. let $M_r$ denote the map node that maps $r$, such that for $r \in S_A$, $o(r) \in S_B$. Invoke InsertMapNode: $MS \times \langle M_r, M_n \rangle \to MS'$ and InsertMapNode: $MS \times \langle M_r, M_e \rangle \to MS'$.

An Example. Consider the example given in Section 3. Assume here that the operation InsertNode(n1, r, billingdate, hasa, AS) is applied to the input application schema Business. We must now apply MS-Imodify on the map structure MS given in Figure 15. This results in two new map nodes, a node map $M_n$ and an edge map $M_e$, that are inserted into the map structure via the InsertMapNode operation. $M_n$ is inserted such that there is now a containment edge between it and the map node $M_p$ that maps the element phoneclient to the relation PHONECLIENT. $M_n$ then maps the newly added attribute billingdate to the relational attribute billingdate-r. The fact that the new relational attribute billingdate-r is an attribute of the relation PHONECLIENT is given by the map node $M_e$.
This maps the hasa edge between the element phoneclient and the attribute billingdate to the edge has between the relation PHONECLIENT and the attribute billingdate-r. The edge map node $M_e$ is also inserted into the map structure such that there is a containment edge between the map node that maps phoneclient and $M_e$.

5.3 Correctness and Equivalence of MS-Imodify

Correctness of MS-Imodify. Correctness of a map structure needs to be guaranteed after the application of the MS-Imodify operation. An MS-Imodify operation, MS-Imodify(MS, $n$, $r$), applied on a map structure MS is said to be correct if it produces a modified map structure $MS'$ such that if MS is instanceof $C_A$, then $MS'$ is also an instanceof $C_A$, for $C_A$ some set of patterns. Moreover, if the output $S$ of MS is instanceof some set of patterns $C$, then the modified output $S'$ produced by the modified map structure $MS'$ must also be an instanceof $C$.

Equivalence of MS-Imodify. Incremental modification implies that the MS-Imodify operators modify the target model, and the mapping between the source and the target, only minimally, as required by the update made to the source. However, this is not the only strategy for achieving the update. A more straightforward but much more time and resource intensive mechanism is a complete re-mapping.

Input item              Map node      Output item
phoneclient.hasa        edge map 1    PHONECLIENT.has
billingdate             node map 1    billingdate-r
billingdate.required    edge map 2    billingdate-r.not NULL
true                    node map 2    NOTNULL
billingdate.type        edge map 3    billingdate-r.domain
CDATA                   node map 3    VARCHAR(40)
billingdate.default     edge map 4    billingdate-r.default
Jun 15                  node map 4    Jun 15

Figure 20: The Map Update as Performed by the InsertMapNode Operation.

In our framework, each application map must conform to a map model defined in the map model layer, i.e., each map structure (application map) is an instanceof some map pattern (map model). Let us assume that there is a map structure $MS_A$ that maps the input structure $S_A$ to the output structure $S_B$ ($S_A \xrightarrow{MS_A} S_B$), such that $MS_A$ conforms to the map pattern MP with input pattern $P_A$ and output pattern $P_B$.² Now consider that a change is made to the input structure $S_A$ via one of the operations defined in Section 4, producing an updated input structure $S_A'$. We define complete re-mapping as the instantiation of a new map structure $MS_B$ which maps $S_A'$ to some new output structure such that $MS_B$ is an instanceof the map pattern MP. Such a complete re-mapping involves not only the instantiation of the new map structure $MS_B$ and the creation of the new output structure, but also the re-loading of the input schema and data into the output schema and data as per the new mapping $MS_B$ at the application and data layers. Using this definition we now define the equivalence of the incremental modification operators MS-Imodify and Dmodify with this re-mapping.

Definition 12 (Equivalence of MS-Imodify) The incremental modification operation MS-Imodify is equivalent to the complete re-mapping $MS_B$ if, after the application of the modification operation, the updated map structure $MS_A'$ and the new map structure $MS_B$ are isomorphic, and the updated output structure $S_B'$ and the new output structure produced by the re-mapping are isomorphic, as defined in Section 2.
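The equivalence stated in Definition 12 can be made concrete with a deliberately small Python sketch. Everything in it is an assumption made for illustration: a map structure is reduced to a set of tuples, the mapping rule (element name upper-cased to a relation name, attribute name suffixed with "-r") is hypothetical, and plain set equality stands in for isomorphism because the toy rule produces deterministic names. The function names `remap`, `ms_imodify` and `attribute_map_nodes` are not taken from the paper, and the sketch returns a new set rather than modifying the map in place.

```python
def attribute_map_nodes(element, attribute):
    """Node map and edge map for one attribute of one element."""
    relation = element.upper()
    column = f"{attribute}-r"
    return {
        ("node", attribute, column),
        ("edge", f"{element}.hasa", f"{relation}.has", column),
    }


def remap(elements):
    """Complete re-mapping: derive every map node from the input structure."""
    map_nodes = set()
    for element, attributes in elements.items():
        map_nodes.add(("node", element, element.upper()))
        for attribute in attributes:
            map_nodes |= attribute_map_nodes(element, attribute)
    return map_nodes


def ms_imodify(map_nodes, element, attribute):
    """Incremental update: add only the map nodes for the new attribute."""
    if ("node", element, element.upper()) not in map_nodes:
        raise ValueError("the parent element has no map node")
    return map_nodes | attribute_map_nodes(element, attribute)


before = {"phoneclient": ["phonenumber", "name"]}
after = {"phoneclient": ["phonenumber", "name", "billingdate"]}

incremental = ms_imodify(remap(before), "phoneclient", "billingdate")
complete = remap(after)
```

In this toy setting `incremental` and `complete` hold the same map nodes, while the incremental path touches only the two entries for billingdate. A real check of Definition 12 would compare graph structures up to isomorphism rather than compare sets of names.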
² Here $S_A$ must be an instanceof $P_A$ and $S_B$ an instanceof $P_B$ (Section 2).

6 Data Layer Transformations

In our change propagation approach, the final step in completing the propagation of the source change to the target is the update of the output data as per the modification applied on the input data. For this update of data, there are two considerations. One is the data update caused by the local change operation on the source schema and data in the data layer. The other is the subsequent propagation of any change made to the input data as a consequence of the local change operation. Here we discuss both kinds of updates.

6.1 Schema Changes in the Data Layer

Each InsertMapNode operation and the consequent change on the output structure drive the change of the output schema and data in the data layer. Figure 20 represents the map updates performed by the InsertMapNode operation as a result of the modification on the input structure.
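To make the role of the map update in Figure 20 concrete, the following Python sketch assembles a schema evolution statement from its entries. It is only an illustration under assumptions: the list `MAP_UPDATE`, the functions `segment` and `build_statement`, and the per-entry rules are hypothetical, and the entries are listed in a visiting order chosen so that the segments concatenate into a single column definition. The statement is emitted in the unquoted form used in this paper; a concrete SQL dialect would likely require quoting the hyphenated column name and the default value.

```python
# (input item, output item) pairs taken from Figure 20, in the visiting
# order assumed for this sketch.
MAP_UPDATE = [
    ("phoneclient.hasa", "PHONECLIENT.has"),
    ("billingdate", "billingdate-r"),
    ("billingdate.type", "billingdate-r.domain"),
    ("CDATA", "VARCHAR(40)"),
    ("billingdate.default", "billingdate-r.default"),
    ("Jun 15", "Jun 15"),
    ("billingdate.required", "billingdate-r.not NULL"),
    ("true", "NOTNULL"),
]


def segment(target):
    """Return the statement fragment contributed by one map entry."""
    if target.endswith(".has"):
        relation = target[: -len(".has")]
        return f"ALTER TABLE {relation} ADD COLUMN"
    if target.endswith(".domain") or target.endswith(".not NULL"):
        return ""  # the value is contributed by the entry that follows
    if target.endswith(".default"):
        return "DEFAULT"
    if target == "NOTNULL":
        return "NOT NULL"
    return target  # column name, domain, or default value


def build_statement(map_update):
    """Concatenate the non-empty segments in visiting order."""
    segments = [segment(target) for _, target in map_update]
    head, *body = [s for s in segments if s]
    return f"{head} ({' '.join(body)})"


statement = build_statement(MAP_UPDATE)
```

Keeping one small rule per map entry means that adding a further property of the attribute only requires one more entry and one more rule, while the traversal itself stays unchanged.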

Figure 21: Updated XML Documents. (The phone clients Joe Smith and Mia Weber now carry the billingdate values Jun 1 and Jun 20, respectively; the DSL clients Acer Direct and Software Inc. are unchanged.)

This map update is used to generate a statement that alters the application schema in the data layer. Each map node generates a segment of the statement. The entire statement is then built by concatenating the segments in a top-down fashion. Siblings are visited in the order specified in the map structure. Hence, for our example we start at edge map 1 (Figure 20), which produces the statement segment: ALTER TABLE PHONECLIENT ADD COLUMN (. Node map 1 produces the segment: billingdate-r. Traversing all the containment edges in the map structure from node map 1 results in the final evolution statement: ALTER TABLE PHONECLIENT ADD COLUMN (billingdate-r VARCHAR(40) DEFAULT Jun 15 NOT NULL). This statement can now be executed on the BUSINESS application schema in the data layer.

6.2 Data Updates in the Data Layer

Often, as a follow-up to the evolution of the XML documents, the XML data is updated to reflect information other than the default values. We now show how we can update the output data assuming that such data updates may have occurred along with the DTD evolution. For example, consider that a change adddtdatt is made to the DTD and all XML documents (Figures 1 and 2) are updated with the default value Jun 15. If a user now updates all XML documents such that the billingdate differs between clients, as shown in Figure 21, then these data updates must be propagated to the relational schema and data. These data updates are performed incrementally, i.e., the relational data is updated with only the slice of data added as part of the change operation, here the values for the attribute billingdate. For a delete operation this is trivial, as no data from the input must be transformed to update the output data.
Instead, the relational tuples are simply removed. To achieve this incremental update of the target relational database in the example mentioned above, we must do the following:

• Query the XML documents to collect the updated XML data, say with Quilt [CRF00]. Quilt queries are generated automatically by the system based on the InsertNode modification applied on the input structure in the application layer. Figure 22 shows the Quilt query generated as a result of the operation InsertNode(n1, r, billingdate, hasa, AS). The resultant XML documents

are shown in Figure 23. Here we use the phonenumber as our unique attribute (key) to identify the phoneclient.

• Transform the resultant XML documents returned by the Quilt query. This is a two-step process. First, we use the map structure in the application layer to determine which output application schema, relations and/or attributes are to be updated. We then use this information and the resultant XML documents to generate (1) a text file representing the data to be loaded and (2) appropriate SQL bulk update queries to update the relation(s) using the bulk loading facility. Figure 24 depicts a sample SQL update query that updates one record of the output relation PHONECLIENT in the data layer.

<business> FOR $client IN document("business.xml")// RETURN <phoneclient phonenumber billingdate = > </phoneclient> </business>
Figure 22: Sample Quilt Query to Get Updated Data.

Figure 23: Result of the Quilt Query given in Figure 22. (A business element containing one phoneclient element per client, each with its phonenumber and billingdate; the billingdate values shown are Jun 1 and Jun 20.)

UPDATE PHONECLIENT SET billingdate-r = Jun 20 WHERE phonenumber =
Figure 24: A Sample SQL Query Generated From the Map Structure and the Parsed XML Documents.

7 Related Work

XML Storage. Currently there are numerous projects that deal with the persistent storage of XML documents [ZLMR01, FK99, SHT 99, CFLM00]. Shanmugasundaram et al. [SHT 99] presented the first storage of XML documents with a DTD in a relational system and cataloged the problems and limitations of relational systems in doing so. In the same vein, Florescu et al. [FK99] presented eight mapping schemes for storing XML documents in relational databases, with experimental data to help select an ideal mapping scheme. Zhang et al. [ZLMR01] present one fixed mapping for storing XML documents (with DTD) in relational systems. Catania et al. [CFLM00] present the use of object databases to efficiently store XML documents with DTDs.
With the exception of the Clock project [ZLMR01], none of these projects looks at the maintenance of the relational or object storage when the source XML/DTD is modified. Zhang et al. [ZLMR01] address this problem, but only for their one fixed mapping. In our work, we provide a framework that allows us not only to map XML documents to relational, extended relational or object databases, but also to propagate any DTD change to these targets, independent of the mapping.