Enhancing the Functionality of the Web

P. Kutschera, R. Rantzau

University of Stuttgart

Institute of Parallel and Distributed High-Performance Systems (IPVR)

Breitwiesenstr. 20-22

D-70565 Stuttgart

Germany

E-mail: fkutschera,rantzaug@hermes.informatik.uni-stuttgart.de

Abstract

This paper presents an approach to combine the World Wide Web and 
database systems in a way that all of the Web based resources are stored 
and managed by the database system. Access to the information located in 
the database system is based on an URL notation which will be described 
in detail. Addionally, the database system is used to store all of the 
necessary information that is used to overcome the most serious 
limitations of the World Wide Web like existence of the "lost in 
hyperspace" syndrome, invalid links as well as the limited search 
facilities.

1 Introduction

The World Wide Web (WWW) is probably the most popular information system 
today. Its popularity is mainly based on the appealing graphical user 
interface of the browser, the availability of the software on almost 
every hard- and software platform as well as the huge framework of 
resources that are accessible world wide. But, despite of its 
popularity, the WWW comprises some well-known limitations like the "lost 
in hyperspace" syndrome, the existence of invalid links and the limited 
search facilities. Another feature of the World Wide Web which is 
currently missing is the ability to access information located in a 
database system in a standard fashion. These shortcomings limit the use 
of the World Wide Web in many different applications like on-line 
information services or learning environments where most of the relevant 
information is stored in a database and the integrity of the online 
information is very important. However, in combining the features of the 
World Wide Web with the functionality of database systems, it is 
possible to overcome these limitations and "enhance the functionality of 
the Web".

The approach presented in this paper combines the World Wide Web and 
database systems in a way that all of the Web based resources 
(documents, images, audio or video clips, Java applets, etc.) as well as 
any additional meta data are stored and managed by the database system. 
The meta data comprise all of the information that is necessary to 
provide link integrity, content-based searches and navigational aid for 
online access. Access to information located in the database system is 
based on an appropriate URL notation which will be introduced in this 
paper and can be used independent of the underlying database 
environment. Basically, this paper extends and further refines the work 
that is presented in [7].

2 Related Work

Today, almost every database vendor sells its own proprietary product 
that is capable of accessing its own database system via WWW. Most of 
these solutions use the API of the Web servers to provide this 
additional functionality and rely on vendor-specific extensions to the 
HTML language to include database queries directly into HTML pages. 
Based on these so-called application pages which are either stored in 
the Web server's file system [9] or in the database system, it is 
possible to enhance HTML pages with database queries. While almost all 
of these products are only capable of including relational or 
object-oriented queries into HTML pages, only one of them [3] is capable 
of accessing audio or video content, images, HTML documents or Java 
applets residing in the database as well. Another possibility for 
accessing a database system is to use a CORBA based Web environment with 
an integrated ORB like as it is described in [1]. In summary, it may be 
said that the database vendors provide a proprietary access to their own 
database product, but do not try to solve the problem of invalid links 
or the limited search capabilities for Web based resources.

The idea of storing information about hyperlinks in a database system to 
provide link integrity has been used in many hypermedia systems (see [5] 
for example). In [11] this approach has been applied to the World Wide 
Web, but because all of the Web based resources are still located in the 
Web server's file system, only a limited form of link consistency can be 
guaranteed in comparison with the approach presented in this paper. The 
same kind of limitations like those of the World Wide Web have already 
been encountered in other hypermedia systems as well and led to the 
development of the so-called second generation hypermedia systems like 
Hyper-G [8]. Because of the different hypertext models and communication 
protocols of these hypermedia systems, these solutions cannot be applied 
to the World Wide Web.

3 Concepts

3.1 Link integrity

In the World Wide Web, a link denotes a forward reference from a HTML 
document (link source) to any type of multi- or hypermedia content (link 
target) and is embedded directly into its source document. Thus, the 
term link is here not only used to denote hyperlinks, but for all kinds 
of references within a HTML page where an URL is permitted. The problem 
of the dangling links results from the WWW's hypertext model that only 
uses forward references

for links without maintaining the a?liated backward references. Thus, if 
a link becomes invalid due to changes of the target resource, there is 
no possibility to adapt the source document of the link according to 
these changes. The only possibility to detect an invalid link is to scan 
the HTML pages in the local file system using tools like the Web crawler 
on a regular basis and analyze the existing link structure. By combining 
the World Wide Web and database systems in a way that all of the Web 
based resources including documents, images, audio or video clips as 
well as Java applets are stored and managed by the database system, link 
integrity can be guaranteed by extracting the relevant link information 
during insertion of a HTML document and storing them along with the 
missing backward references in the database. Afterwards, this 
information has to be kept up to date by monitoring all of the changes 
that are applied to the document structure. Prior to retrieving a HTML 
document from the database, the validity of the links within the 
document has to be checked and invalid links have to be removed to 
shield the user from accessing dangling links.

3.2 Additional search facilities

Although there are many different resources available on the World Wide 
Web, it is very di?cult to locate resources about a certain topic. 
Currently, either a directory service such as Yahoo or one of the 
existing search engines such as AltaVista, Web Crawler, etc. [12] can be 
used. While this approach is feasible for searching larger parts of the 
Web, these methods are rather ineffective when someone wants to find 
relevant material about a certain topic at a particular Web site. Thus, 
by using a database system it is possible to extract and store 
additional meta data like keywords, headings, table or figure captions 
of HTML pages in the database prior to storing the HTML pages itself. 
Now, the built-in query language of the database system cannot only be 
used for full-text searches, but also to search specific parts of HTML 
documents very e?ciently.

3.3 Navigational aid

The World Wide Web provides the user with a navigational interface along 
the existing hyperlinks. While in principle, it is a very elegant way to 
link related information together, it has been observed that users often 
feel disoriented after following a few hyperlinks. This phenomenon which 
is often called \lost in hyperspace" results from a missing navigational 
aid similar to a table of contents in a book or a guided tour through 
the Web site. Based on the information about the existing link 
structure, it is possible to display the entire hypertext structure by a 
frontend tool in form of a map to indicate the users' current location 
while navigating through the hyperspace.

3.4 Accessing database content

As it has been demonstrated so far, by combining the WWW and database 
systems it is possible to overcome the most serious limitations of the 
World Wide Web. Because a lot of different Web applications benefit from 
eliminating these shortcomings, they should be accomplished in a way 
that is independent of the underlying database and Web environment. 
Unfortunately, references to resources located in a database system 
cannot be embedded into HTML pages as easily as references to resources 
located in the Web server's file system due to the fact that they cannot 
be denoted by an URL. Thus, an appropriate URL notation is developed

to denote resources located in the database as well. The advantage of 
using an URL, besides the fact that it is independent of the underlying 
database environment, is that all of the information that is accessible 
via World Wide Web is treated in a uniform fashion. Thus, the 
administration and development of the appropriate HTML pages is much 
simplified.

4 Combining databases and the World Wide Web

This section gives an overview in which way a World Wide Web based 
access to database systems can be accomplished and introduces the URL 
notation that has been developed for denoting information located in a 
database system. Finally, all of the details are revealed how link 
integrity, additional search facilities and navigational aid can be 
accomplished by using additional meta data.

4.1 Architecture

When developing a World Wide Web based access to database systems, 
various approaches have to be considered that all differ in regard of 
the e?ciency that can be achieved and the necessary software environment 
which is required. Generally, it is possible to use CGI scripts, extend 
the functionality of the Web server or using a CORBA based approach [1] 
to access the database system. Although, CGI scripts provide access to 
external data sources in a standard fashion, they lack e?ciency due to 
the overhead of starting a new process, namely the CGI script, for every 
single database request. An ORB based Web environment, on the other 
hand, offers a lot of interesting possibilities for distributed 
applications, but is currently not the state of the art. Thus, the only 
possibility that is left is to extend the functionality of the Web 
server itself. Although, currently every Web server offers its 
proprietary API for extensions, this approach lacks portability between 
different platforms as well as products. Fortunately, a new generation 
of extensible Web servers [2, 4] already exist that can be extended 
using the Java programming languages. These Java based extensions of the 
Web server which are called servlets are in fact server-side equivalents 
of Java applets. Compared to the other solutions they offer an enhanced 
portability, e?- cient database access as they are loaded during startup 
of the Web server and the security model of Java applets. Additionally, 
servlets may access the database by using the JDBC (Java Database 
Connection) [6] interface which is independent of the underlying 
database environment. In summary, for portability and security reasons 
the servlet approach has been chosen in favor of the other approaches 
described above.

After the general framework has been introduced, the interaction between 
the different components of the architecture as it is depicted in figure 
1 will be described. The retrieval of information located in the 
database system is possible with any ordinary Web browser whereas 
insertions, deletions, modifications and migrations of resources special 
frontend tools have to be used. During the insertion of HTML documents 
all of the necessary information is extracted by the link management to 
provide integrity of the links as well as additional search facilities. 
By monitoring all of the subsequent changes that take place within the 
document structure, it is possible to keep this information up to date. 
The access to resources in the database is accomplished by using an 
enhanced URL notation. A proposal for such a notation is introduced in 
the following section.

Web-Browser
multimedia
extensions
(plug-ins)
standard
capabilities
Java virtual
maschine
Web-Server
Database system
HTTP
JDBC
Frontend tools
standard capabilities
database access module link management
document parser
meta data management
graphical frontend
JDBC
traditional data
hyper-/multimedia data
Java applets
meta data
link information
search information
URL handler
link management
document parser
meta data management
URL handler

Figure 1: The general component architecture

Every time a HTML document residing in the database is requested, the 
Web server's URL handler has to transform the URL notation into a 
database query, retrieve the document from the database, use its link 
management to check the validity of the existing links and remove any 
invalid links during retrieval. If any server-side includes within the 
document exist, they have to be handled before sending back the document 
to the browser.

4.2 The URL notation for accessing information in a database system

As it has been mentioned already, using an URL notation to denote 
information residing in a database system has the following advantages 
over using special HTML tags to include database queries into HTML 
pages:

1. The same notation is used to access all kind of Web based resources 
independent of their location.

2. The notation that is used for embedding database queries into HTML 
pages is independent of the underlying database environment.

Thus, the proposed URL notation is capable of denoting links to multi- 
and hypermedia content as well as database queries including traditional 
data (relational or objectoriented) or multi- and hypermedia data 
respectively. As a result, an URL has a different semantical meaning 
based on the context that it is used in. In the following sections the 
terms link semantics and query semantics will be used to distinguish 
between these two semantical meanings. Based on the different semantical 
meanings, the link management has to perform different tasks. In case of 
an URL with link semantics, the link manager has to ensure the validity 
of the link, i. e. the link will be accessible as long as the 
corresponding target resource. On the other hand, an URL with

query semantics remains valid as long as all of the attributes and 
entities exist that are part of the query in the query.

So, how is it possible for the link manager to distinguish between these 
two semantical meanings of the URL notation? The answer to this 
questions is closely related to the way Web based resources are stored 
in the database. Although, it is possible to store any multi- and 
hypermedia content as well as traditional data in the same entities, the 
approach that has been used here is to store the different types of 
resources in a centralized fashion in the so-called resource entities. 
The motivation of using these special resource entities, besides the 
fact that access time for these resource entities can be adapted to the 
particular needs by assigning them to dedicated disks and/or partioning 
them across several disks is illustrated in figure 2.

The application developer interacts with the system by using special 
frontend tools to insert, delete or modify new as well as existing 
entities. In the example of figure 2 it is supposed that the application 
developer wants to extend an existing database schema of a learning 
environment with two new entities, namely the 'professor' and 'lecture' 
entities. The 'professor' entity consists of the professor's name, his 
location inside of the building as well as his image. The 'lecture' 
entity is used to store information about lectures as its name, the day, 
time and location it takes place as well as the script that is used 
during the lecture. Thus, the application developer interacts, defines 
and modifies an external schema description of the database as it is 
visible to the outside world. Because it is very di?cult to store and 
handle multiple interlinked resources as an attribute value as it is 
needed for the script attribute of the 'lecture' entity , the system 
uses a different internal schema description to store this type of 
information. The transformation between the external and internal schema 
descriptions is accomplished by the appropriate frontend tools. As a 
consequence, multiand hypermedia content is not stored as a part of the 
entity itself, but in the 'content' attribute of the appropriate re-

hDatabase URLi ::= http://hHosti[:hPorti]/hPath Prefixi/hDatabase Refi 
hDatabase Refi ::= hDatabase Namei/hEntity Namesi[/hChoicei[?hQuery 
Paramsi]] hEntity Namesi ::= hEntity Namei [ & hEntity Namesi] hEntity 
Namei ::= hSchemai.hEntityi hChoicei ::= hEntity Attributesi j hBytecode 
Refi j hPresentationi hEntity Attributesi ::= hEntity Attribute Namei [ 
& hEntity Attribute Namei] hEntity Attribute Namei ::= 
hSchemai.hEntityi.hAttributei hBytecode Refi ::= hEntity Attribute 
Namei/hClass namei hPresentationi ::= hPresentation Templatei'('hEntity 
Attributesi')' hQuery Paramsi ::= hQuery Parami [ & hQuery Paramsi] 
hQuery Parami ::= hEntity Attribute Namei hOperatori (hEntity Attribute 
Namei j hValuei)

Table 1: The syntax of the developed URL notation in a BNF-like syntax

Professor (name, location, image) Lecture (name, day, time, location, 
script) External schema description Internal schema description 
Professor (name, location, ) Image(id, mime-type, content) Lecture 
(name, day, time, location, ) Document(id, mime-type, content) 
transformation

Figure 2: The notion and the usage of resource entities

source entities which are depicted in figure 2 with an italic font. In 
the entity itself, the 'image' or 'script' attributes contain only 
references to the corresponding resource entries. Hypermedia content is 
further divided into its parts which in turn are stored in the 
appropriate resource entities with the links transformed according to 
the locations of their target resources as it is illustrated with the 
'script' attribute.

Therefore, all of the URLs with link semantics denoting Web based 
resources contain only references to resource entities whereas URLs 
denoting database queries only contain entities that are part of the 
external schema description. Based on this fact, the link manager is 
able to distinguish between the two semantical meanings of the URL 
notation. Examples which are illustrating these facts more detailed 
follow after the proposed URL notation has been revealed.

As one can see in table 1, the proposed URL notation contains a host 
name (Host) and an optional port number (Port) like every other URL 
notation followed by a prefix (Path Prefix) that is used to distinguish 
between resources residing in the database system and in the Web 
server's file system. Generally, the chosen path prefix should not 
denote a valid filename to avoid any possible conflicts with direc-

tory names in the local file system. Examining the notation in greater 
detail, it is revealed how an hierarchical naming schema like an URL can 
be applied to database systems. All the information in a database system 
is partitioned into databases (Database Ref ) which in turn consist of 
different entities (Entity Names). Entities, on the other hand, consist 
of attributes (Entity attributes). Instead of a long theoretical 
discourse, the usage of the URL notation will be illustrated with a few 
examples.

First of all, an URL with link semantics that denotes a hyperlink to a 
HTML document which is located on host `www.myhost.de' in the 
`education' database has the form 
http://www.myhost.de/db/education/Document/content? id=5147936 . The 
link semantics of the URL should be obvious, because only the resource 
entity 'Document' is referenced in the example above. Thus, the syntax 
for referencing Web based resources is identical for all of the 
different resource types, except for Java bytecode (Bytecode Ref ). The 
difference between denoting Java bytecode compared to other resources 
results from the way the bytecode references are handled by the Java 
virtual machine. Thus, the last part of an URL denoting a bytecode 
reference has to be a valid class name (Class Name). Generally, URLs 
denoting references to Web based resources are mostly generated 
automatically by the system during insertion. They can be used anywhere 
within a HTML document where an URL is permitted. On the other hand, a 
database query like "Display the name and script of all lectures that 
take place on Thursday" can be denoted by an URL of the form 
http://www.myhost.de/db/education/ lecture/name&script?day='Thursday' . 
This example shows how a database query is structured. The first part of 
the URL denotes the entity or entities (Entity Names) respectively to 
which the query will be applied. If a query contains multiple entities, 
they have to be concatenated with an ampersand. The next part of the URL 
denotes all of the attributes (Entity Attributes) that form the query's 
result set. In addition, it is possible to restrict the query result in 
such a way that only instances of entities are selected that meet 
specified conditions (Query Params). By the way, these conditions have 
to be specified as name/value pairs with an identical syntax as the 
query parameters of HTML forms. Unlike URLs with link semantics, URLs 
with query semantics can only be integrated into HTML pages in form of a 
hyperlink or can be used in the same way as a CGI script. Because of 
this rather limited form of embedding database queries into HTML pages, 
the capabilities of the database module has been enhanced to handle 
server-side includes of the following form

h! ?? # include Database URL ??i


as well. This feature is also found in almost all of todays Web servers.

Sometimes it is necessary to provide a unique presentation layout for 
all of the information that belongs to a certain topic or to display 
information using a particular form of presentation. It is possible to 
customize the presentation of the query results using a special 
presentation layout (Presentation Template) that is stored in the 
database as well. Such a presentation template is simply a HTML page 
with variables of the form %%hnumi%% which are substituted with the 
appropriate entities' attribute values ( Entity Attributes in 
Presentation) during runtime. Thus, presentation templates can be used 
to provide a unique layout for the same type of data as well as a means 
to store the information that has to be presented independent of the 
presentation's layout. Thus, for changing a presentation's layout only 
the corresponding presentation template has to be modified instead of 
adapting multiple HTML documents according to these changes manually.

4.3 Link management

This section describes the link management in greater detail. First, the 
fundamental issues of link management as well as an overview of the 
degree of link integrity is given that the link management is able to 
guarantee under certain circumstances. Then, the link management's meta 
data structures are introduced that are necessary to store all of the 
relevant link information. Finally, the architecture and the 
functionality of the link management is presented.

4.3.1 The integrity of links

Basically, the link management provides an abstraction on top of the 
storage system that is used to access all of the Web based resources. 
Its interface consists of operations to insert, modify, delete, retrieve 
as well as change the location of resources. In contrast to the storage 
system's interface to access Web based resources, the link management 
extracts and maintains relevant link information during these operations 
to guarantee the integrity of links. Thus, in an ideal environment where 
all of the accesses to Web based resources are performed by using the 
interface of the link management, it is possible to ensure link 
integrity at any time. In a real Web environment, on the other hand, a 
lot of different tools exist that help to administrate and manage a Web 
site. Thus, it is a rather unrealistic asumption that all of these tools 
use the link management's interface to access the Web based resources. 
Based on this fact, the following section investigates the effect of the 
two most important factors with regard to the degree of link consistency 
that the link management is able to ensure:

The storage system: It is possible to store Web based resources either 
in the Web server's file system or in the database.

Location of resources: A resource can either be stored locally on the 
Web server's host or on a remote host somewhere on the internet.

As it has already been mentioned, a link in the World Wide Web a forward 
reference from a HTML document to any type of multi- or hypermedia 
content. Without maintaining the a?liated backward references, there is 
no possibility to adapt the source documents according to changes of the 
target resources. By using a database system for maintaining link 
information, the missing backward references of

links can be derived. Thus, the survey in table 2 gives an overview 
which degree of integrity can be ensured for HTML documents that are 
stored on the local host. Generally, eight different cases have ot be 
distinguished:

Case 1-4: The links of the HTML documents that are located in the local 
file system cannot be automatically adapted to changes of the target 
resources. Thus, the link integrity is violated as long as the 
administrator of the local Web site is informed and changes the source 
documents manually. If the target resource is located in the local file 
system as well (case 1), it is possible to use tools like the Web 
crawler to analyse all of the documents that are locally available on a 
regular basis and identify all of the invalid links. In this case it is 
not possible to ensure the integrity of links too, but the period of 
time until invalid links are detected can be reduced. Thus, the degree 
of integrity that can be achieved in all of these cases is weak.

Case 5,6: Although the source documents are stored in the local 
database, the target resources are still located in the file system. 
Assuming that the link management is notified every time a target 
resource has been changed, the link management is able to adapt the 
source documents according to these changes instead of adapting the 
source documents manually as in case 1-4. Thus, only a weak form of link 
integrity can be achieved as well.

Case 7: This is the interesting case where all of the source documents 
as well as the target resources are located in the database. Thus, the 
interface of the link management is used to access all of the Web based 
resources and the integrity of the links can be ensured. The only 
problem that still may occur is due to the client side caching of 
resources of the Web browser and the stateless HTTP protocol. It is not 
possible to notify the Web browser of any changes that effect the link 
integrity of documents that are currently stored in the Web browser's 
cache.

Case 8: In the case that the target resource is stored in a remote 
database, it is possible to guarantee the integrity of links as well, if 
a protocol between the distributed link management components as 
described in [11] is used. The degree of link integrity that can be 
achieved depends on the underlying protocol that is used to propagate 
the changes of target resources. If the protocol uses a distributed 
transaction mechanism, i. e. an atomic update, to propagate these 
changes, it is possible to achieve a very strong degree of link 
integrity. On the other hand, if changes of target resources are 
propagated on a regular basis, i. e. a deferred update mechanism is 
used, only a weaker degree of link integrity can be ensured.

In summary, one can say that the integrity of links can only be 
guaranteed in an acceptable way when all of the Web based resources are 
stored in the local database.

4.3.2 Meta data structures

This section gives an overview of the meta data structures that are 
depicted in figure 3 and have to be maintained by the link management to 
ensure the integrity of links. Although the meta data structures of the 
link management may only used be to store link information where the 
source

Location of the target resource
local remote local remote
file system file system database database

Location local weak weak weak weak of the file system (case 1) (case 2) 
(case 3) (case 4) source local weak weak strong adaptable document 
database (case 5) (case 6) (case 7) (case 8)

Table 2: Overview about the degree of link integrity that can be 
achieved under the different conditions.

documents as well as the target resources are located in the database, 
it should be noted that they are capable of maintaining all kind of link 
information independent of the resources' location. Generally, every 
link consists of a source anchor that is directly embedded into the HTML 
document as well as a target anchor which denotes the reference to a 
target resource.

Apart from an unique identifier, namely the sourceId that is used to 
identify the source anchor unambigiously, the source anchor entity 
consists of an url that denotes the source document, a reference to the 
a?liated target anchor (targetId) that constitutes the actual link as 
well as the location of the link within the source document, i. e. a 
byte range denoted by a starting position and an end position. If the 
source document is located in the database, the targetId contains a 
reference to the appropriate resource entity, otherwise it is undefined.

According to the definition of the source anchor entity, the target 
anchor entity also consists of an unique identification, namely the 
targetId , as well as its location within the target resource that is 
denoted by a starting position and an end position. If a target resource 
contains more than one target anchor, the fragment part of the target 
resource's URL is used to denote the target anchor in a unique fashion. 
Because a target resource may contain multiple target anchors, all of 
the link information that belongs to the target resource itself is 
included in the resource description entity. Thus, the resource 
description entity consists of an unique identifier, namely the 
resDescId , and the base url (without the fragment part) of the target 
resource. If the target resource is located in the local file system and 
its location has been changed, the newUrl indicates its new location, 
otherwise it is undefined. If the target resource, on the other hand, is 
located in the database system, the appropriate resource entity consists 
of a reference in form of the resDescId to the corresponding entry in 
the resource description entity. So far, only URLs with reference 
semantics have been covered by the meta data structures. But, as it has 
been mentioned before, a hyperlink or server side include may also 
contain an URL with query semantics. Because the link management is able 
to check the validity of URLs with query semantics as well, the query 
description entity is used to store the database, entities and 
attributes that are part of the query. Additionally, the query 
description entity has been extended with a reference, namely the 
resDescId , to the corresponding resource description, as well, to 
achieve a standardized design for all types of URLs.

4.3.3 Architecture and functionality

Finally, this section describes the architecture of the database access 
module (see figure 1) and especially the link management in greater 
detail. Afterwards, the link management's interface and the usage of the 
meta data structures that have been described in the previous section 
will

be revealed.

The database access module which is used to extend the functionality of 
the Web server as well as for the frontend tools to manage a Web site 
consists of the following two modules:

- The URL handler is responsible for transforming the URL notation into 
an intermediate form that is used by the link manager. Addionally, its 
task comprise the access to all kind of information that resides in the 
database as well as formatting the query results.

- The link management consists of a document parser which is responsible 
for extracting the URLs of the HTML documents whereas the meta data 
management is responsible for the maintenance of the meta data 
structures to provide link integrity. Technically, the link management 
is embedded into the URL handler, because it uses the intermediate URL 
representation of the URL handler as well as the functionality of the 
URL handler to access the database.

Now, that the components of the link management have been introduced, 
its tasks in regard of maintaining the link management's data structures 
which are shown in figure 3 will be described in more detail. As it has 
already be mentioned in section 4.3.1, the interface of the link 
management consists of operations to insert, modify, delete, retrieve as 
well as change the location of resources. Thus, the following steps have 
to be taken to ensure integrity of links:

- Insertion: Every time new resources are inserted a resource 
description with a unique resDescId and the corresponding url have to be 
generated. Afterwards new target anchors are created containing a 
reference (resDescId) to the appropriate resource description, a unique 
targetId as well as the start - and end value initialized to zero. If a 
resource has to be inserted in the database as well, the appropriate 
resource entity is used to store the resource. As far as any HTML 
documents are inserted, the document parser has to extract all of the 
URL references as well as HTML anchor tags. For every anchor tag that is 
found, a new target anchor entry containing a reference to the HTML 
document, its position and the appropriate fragment id has to be 
created. For every URL that denotes a database query it has to be 
checked if the appropriate resource description already exists. If not, 
a new resource description, query description and target anchor have to 
be inserted. For every URL that denotes a reference to a Web based 
resource, a new source anchor with a unique sourceId , a reference to 
the HTML document (documentId) as well as the start - and end position 
of the URL has to be created, while the sourceIds of the corresponding 
target anchors are stored in main memory. After all of the resources 
have been inserted,

Target Anchor
targetId
resDescId
fragment
start
end
Source Anchor
sourceId
documentId
targetId
start
end
Resource  Description
resDescId
url
newUrl
Video
id
resDescId
mimeType
contents
Query Description
database
entity
attribute
resDescId
Document
id
resDescId
mimeType
contents
...
Explanation
:Unique identifiers
Referential constraints with different semantics on delete set null
on delete cascade
url

Figure 3: Meta data structures used to store all of the relevant link 
information

the link structure has to be valid. Thus, for every target anchor which 
is stored in main memory it has to be tested if a corresponding target 
anchor entity exists. If the test is successful, the corresponding 
targetId is stored in the appropriate source anchor . Otherwise, the 
link structure is inconsistent and all of the changes have to be undone.

- Deletion: Before the specified Web based resources can actually be 
deleted, their entries in the resource description entity with the 
appropriate URLs have to be deleted. Because of the referential 
integrity constraint between the resource description entities and the 
target anchor entities, the corresponding target anchors of these 
resources are removed as well. The removal of the target anchors, on the 
other hand, leads to the situation that the corresponding source anchors 
have to be adapted too, because of the existing referential integrity 
constraint. The semantic of removing target anchors is that the 
corresponding links are no longer valid. This situation is reflected in 
the meta data structures by changing the targetId of the source anchor 
entity to an undefined value. Addionally, if any of the Web based 
resources that are about to be deleted are located in the database, they 
are removed from the corresponding resource entities as well, because of 
the referential integrity constraint between the resource description 
entity and the resource entities. In case that any of the deleted 
resources is a HTML document, all of its source anchors are removed from 
the source anchor entity because of the referential integrity constraint 
between the resource entity Document and the source anchor entity. 
Addionally, it is possible that the deletion of a HTML document leads to 
the situation that the database queries which are part of the document 
may be obsolete too and

the corresponding query description entries have to be deleted. Such a 
situation can be discovered by counting the references from source 
anchor entries to target anchor entries which denote database queries. 
If no such references exist, the database query has become obsolete and 
the corresponding entry in the target anchor entity has to be removed. 
Because of the reference (resDescId) from the target anchor entry to the 
corresponding resource description, the resource description is removed 
as well. Finally, the query description is deleted, because of the 
referential integrity constraint between these two entities.

- Modification: In case of any modifications of resources which are not 
HTML documents nothing has to be done, because they do not consist of 
any source anchors and the target anchors are not affected by changing 
their content. The modification of a HTML document, on the other hand, 
is identical to applying a delete operation to the HTML document which 
is followed by an insert operation.

- Changing location: First, it should be noted that this operation can 
only be applied to resources located in the local file system, because 
for resources located in the database the notion of a location is not 
applicable. Thus, if a resource is relocated, the resource description 
entry has to be adapted accordingly by storing its new location as the 
newURL. Additionally, a new resource description entry has to be created 
with the new location as its url and an undefined newURL. The reason for 
maintaining multiple resource descriptions for a relocated resource is 
that the source documents are not adapted automatically to the new 
location of their target resources. Thus, a resource description entity 
has to be maintained as long as it is referenced

by any target anchor entry.

- Retrieval: Because a retrieval operation does not affect the integrity 
of links, only the validity of the resource's links have to be checked. 
Because Web based resources except for HTML documents do not contain any 
links, no actions have to be performed when retrieving such a Web based 
resource located in the database. If the Web based resource is located 
in the local file system, the corresponding resource description has to 
be checked if the resource has been relocated in the past. In case of a 
relocation, the actual url has to be determined by inspecting the 
corresponding resource description entries. Finally, the Web based 
resource has to be retrieved from the database or the local file system, 
respectively. On the other hand, prior to retrieving a HTML document the 
validity of the existing links has to be checked. Thus, all of the 
targetIds of the HTML document's source anchor entries containing an 
undefined value have to be retrieved. If such source anchor entries 
exist, its start - and end value have to be retrieved to remove the 
corresponding byte range from the HTML document during retrieval. For 
all of the database queries which are identified according to the 
reference from the document's source anchor to the corresponding target 
anchor and the target anchor to the resource description the existence 
of the corresponding database, entities and attributes have to be 
checked during runtime. This check has to be performed by retrieving the 
appropriate catalog information from the system catalog of the database 
system. If any invalid database queries exist, their position within the 
HTML document is determined based on the start - and end value of the 
corresponding target anchor entries and they are removed during 
retrieval.

Finally, it has to be added that all of the operations above are 
executed under transactional control. Thus, if any of the changes have 
to be undone, the transaction must perform a rollback, whereas 
committing the transaction makes all of the changes persistent.

4.4 Additional search facilities

Currently, the only possibilities to perform full-text searches within 
the local HTML documents are either to use one of the well-known search 
engines or to use tools that have been developed by oneself. On the 
other hand, in a Web environment that consists of a database system 
these queries can be performed much more e?ciently using the built-in 
query language of the database system. So far, it is already possible to 
do full-text searches of HTML documents that are located in the database 
system by using a database query that searches the contents of the 
Document resource entity. But, support for e?cient searches within 
certain relevant parts of the document like headings, captions of 
figures and tables as well any information that is included into the 
HTML document header, it is necessary to extend the functionality of the 
meta data management and document parser components (see figure 1) 
accordingly. Thus, the document parser has to extract the appropriate 
HTML tags during insertion or modification as well and to store them in 
an additional meta data structure of the following form:

Document Info (HTMLtag , content , documentId)

Using this kind of meta data, it is possible to e?ciently search the 
content of special HTMLtags in a certain document or all of the 
documents that are available. Just for

clarification, it has to be noted that the documentId denotes a 
reference to the corresponding Document resource entity that contains 
these HTML tags. Obviously, the information in the Document Info entity 
has to be adapted according to modifications or deletions of HTML 
documents as well.

4.5 Navigational aid

The "lost in hyperspace" syndrome which has been observed in a lot of 
hypermedia environments due to a disorientation of the user after 
following a few hyperlinks results from a missing navigational aid in 
form of a map that indicates the users' current location as well as the 
users' path through hyperspace. As it can be seen, all of the 
information that is needed to generate a map of the existing document 
structure is already available in the meta data structure of the link 
management whereas the path information can be derived from analyzing 
the Web server's access log. By the way, the Web server's log should be 
stored in the database as well to be able to use the query language of 
the database system to obtain various access statistics. Thus, a 
frontend tool may use this kind of information to generate a graphical 
output on demand of the user to help him navigating along the existing 
hypermedia information.

5 Implementation

A prototype implementation of the system as it is depicted in figure 1 
has been developed in a networked environment of heterogeneous Unix 
workstations and a relational database system. The general idea of the 
prototype implementation was the creation of a testbed to show the 
feasibility of the underlying design concepts and to build a basis for 
further development. In the following section the prototype 
implementation is revealed in greater detail:

Web-Browser: Any standard Web browser with Java support can be used to 
access the information residing in the database system. A number of 
different helper applications, like an MPEG audio and video player or a 
real audio player have to be installed to be able to playback audio and 
video clips.

Frontend tools: In the current state of the prototype implementation 
none of the graphical frontend tools have been developed yet. Thus, all 
of the resources hav to be inserted into the database using command line 
utilities.

Web-Server: As it has been described in section 4.1 an extensible Web 
server which implements the servlet interface has be chosen as the basis 
of the prototype implementation. Thus, the Java Web Server [4] from Sun 
Microsystems has been used as the software platform. The database access 
module (see figure 1) has been implemented as a servlet that is loaded 
by the Web server during startup. Instead of implementing the document 
parser manually, the Java Compiler Compiler (JavaCC) [10] has been used 
to be able to extend the document parser according to changes in the 
HTML standard. Java based access to the database system which is used to 
store all of the meta data is accomplished by using the JDBC [6] 
interface.

Database system: The current implementation is based on IBM's 
object-relational database system DB2 Version 2.1.1 on a Sun Solaris 
platform. All of the multi-

and hypermedia content is stored as LOB (long objects) data depending on 
its type either as CLOBs (character large objects) or BLOBs (binary 
large objects). In a further release of DB2, which has already be 
announced and is called DB2 Universal Database, it is will be possible 
to store all of the Web based resources in so-called relational 
extenders for text, images, audio or video respectively and to perform 
content based queries on multimedia data. Access to the database system 
is provided by DB2's native JDBC driver.

6 Conclusion

This paper presented an approach to combine the features of the World 
Wide Web with the functionality of database systems by storing all of 
the Web based resources in the database to overcome the most serious 
limitations of the Web. By using an URL notation to denote all of the 
information located in the database system, it has been demonstrated how 
easily references to these Web based resources as well as database 
queries can be embedded into HTML documents treating all kind of 
information in an uniform way independent of their location.

7 Future Work

The main goal is to use the prototype implementation during a practical 
course or a lecture to make all of the course material like scripts, 
programs, etc. on-line available for the students. The students' 
experiences will then be used for the evaluation of the prototype with 
respect to robustness and performance as well as missing features. 
Additionally, all of the frontend tools have to be developed which 
enable an easy administration of Web based resources. Another direction 
of future work is to develop an authentication model that is capable of 
managing Web based resources and traditional data in a more suitable 
manner as the World Wide Webs current authentication method that is 
based on protecting directory sub-trees. Thus, it would be advantageous 
to have a more fine-grained authentication model that is capable of 
granting or restricting access to single Web based resources or entity 
attributes.

References

[1] M. Anand, et. al. The Web Request Broker: A Framework for 
Distributed Web-Based Applications. Available at 
http://www.olab.com/beta/www6 1/paper.html.

[2] A. Baird-Smith. Jigsaw Overview. Available at http:// 
www.w3.org/pub/WWW/Jigsaw/.

[3] J. Gaffney. Illustra's Web DataBlade Module. SIGMOD Record, Vol. 25, 
No. 1, March 1996.

[4] The Java Server Product Family. Available at http:// 
jserv.javasoft.com/.

[5] B. J. Haan, et al. IRIS Hypermedia Services. Communications of the 
ACM 35(1), 1992.

[6] G. Hamilton, R. Cattell. JDBC: A Java SQL API. Available at 
http://splash.javasoft.com/jdbc/.

[7] P. Kutschera. Combining Database Technology with the World Wide Web 
for Tele-Teaching Environments. New

Media for Education and Training in Computer Science, infix, 99-108, 
1996.

[8] H. Maurer. HyperWave: The Next Generation Web Solution. Reading et. 
al.: Addison-Wesley, 1996.

[9] IBM Net.Data Programming Guide. Available at 
http://www.software.ibm.com/data/net.data/docs/ dtwdev.htm.

[10] Java Compiler Compiler. Available at http://www. 
suntest.com/JavaCC/features.html.

[11] J. E. Pitkow, R. K. Jones. Supporting the Web: A Distributed 
Hyperlink Database System. Computer Networks and ISDN Systems 28(7-11), 
981-991, 1996.

[12] W. R. Tuthill. Don't Get Caught in the Web: A Fieldguide to 
Searching the Net. Proceedings of the COMP- CON'96, Santa Clara, 
February 25-28, 1996, 77-83, Los Alamitos, IEEE Computer Society Press, 
1996.