Metatag Depositories: Dublin Core Metadata, Harvest and Resource Discovery in Law

1999 CALI Conference
Steven C. Perkins
and John Doyle

This session will discuss one solution for alleviating the problem of inefficient resource discovery of legal materials on the Internet. The proposed solution combines Dublin Core metadata tags embedded in electronic documents with a search engine, such as Harvest, that can recognize those tags.

What is Metadata?

Metadata is data about data. Common examples are catalogs, such as a mail-order catalog or a library catalog. Each catalog contains descriptions of items but not the items themselves, so each catalog is a metadata depository. By searching the catalog metadata you can locate the physical item. Because the metadata does not need to accompany the item, the metadata depository can be kept separate from the items it describes.

Why Metadata?

It is easier to search the metadata than the whole document.
Metadata can be held in a metadata depository.
Metadata allows focused searches on individual elements.
Metadata can be attached to images, databases, and PDF documents.
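
The points above can be illustrated with a minimal sketch of a metadata depository. The records, field names, and URLs below are invented for illustration; the sketch only shows that the descriptions live apart from the items and that a search can be limited to one element.

```python
# A toy metadata depository: each record describes an item stored elsewhere.
# Every record, field name, and URL here is a made-up example.
DEPOSITORY = [
    {"title": "Contracts Outline", "creator": "Roe, Richard",
     "subject": "contracts", "location": "http://example.org/contracts.pdf"},
    {"title": "Evidence Outline", "creator": "Doe, Jane",
     "subject": "evidence", "location": "http://example.org/evidence.html"},
]


def search(records, field, term):
    """Return the records whose named field contains term (case-insensitive)."""
    term = term.lower()
    return [r for r in records if term in r.get(field, "").lower()]


# A focused search on the "creator" element only; the hit points at the item.
for record in search(DEPOSITORY, "creator", "doe"):
    print(record["title"], "->", record["location"])
```

Because only the short records are searched, the items themselves (an image, a database, a PDF file) never have to be opened to be found.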

There are many metadata schemes:

The Dublin Core was developed by OCLC and has been extended with the Warwick Framework and the Canberra Qualifiers. It is based on common cataloging statements and has a core set of 15 elements.
The Text Encoding Initiative (TEI) has been developed for use in textual analysis.
The Platform for Internet Content Selection (PICS) was developed by the W3C to allow content to be rated. It can also support code signing, privacy, and intellectual property rights management.

Why use Dublin Core?

The Dublin Core metadata scheme has been developed with reference to the Machine Readable Cataloging (MARC) data format in international use for cataloging items in libraries.
It has been developed to allow for use by the creators of documents as well as by resource description experts.
It is the subject of an ongoing development effort led by Stuart Weibel of the Online Computer Library Center (OCLC). See, S. Weibel, J. Kunze, C. Lagoze, and M. Wolf, Dublin Core Metadata for Resource Discovery, Informational RFC 2413, September 1998, ftp://ftp.isi.edu/in-notes/rfc2413.txt.
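
As a hedged illustration of how a document creator might embed Dublin Core in a web page, the sketch below renders a record as HTML meta tags. The `DC.` name prefix, the `schema.DC` link, and the namespace URI follow a commonly used convention but are assumptions here, not something this session prescribes; the function name `dc_meta_tags` and the sample record are hypothetical.

```python
from html import escape

# Assumed namespace URI for the Dublin Core element set (not from this paper).
DC_SCHEMA = "http://purl.org/dc/elements/1.1/"


def dc_meta_tags(record):
    """Render a dict of Dublin Core element names and values as HTML meta tags."""
    lines = ['<link rel="schema.DC" href="{}">'.format(DC_SCHEMA)]
    for element, value in record.items():
        # Escape both parts so quotes and ampersands cannot break the markup.
        lines.append('<meta name="DC.{}" content="{}">'.format(
            escape(element), escape(value)))
    return "\n".join(lines)


# A made-up record using three of the 15 elements.
print(dc_meta_tags({
    "title": "Evidence Outline",
    "creator": "Doe, Jane",
    "date": "1999-06-01",
}))
```

The output is meant to be pasted into the head of an HTML document, where a metadata-aware indexer can read it without parsing the body text.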

Examples of Dublin Core Metadata Depositories

Tools to create Dublin Core and other types of Metadata

DC-Dot outputs Dublin Core in RDF, HTML 2, and HTML 4.
REGGIE is a Java application for generating metadata.

The User Guide Working Draft

The User Guide Working Group has developed a User Guide Working Draft that explains how to use Dublin Core version 1.

What is the Future of Metadata?

The future of metadata is the Resource Description Framework (RDF), an XML application that describes metadata and allows relationships between metadata to be expressed. See, R. Iannella, "Application of RDF for extensible Dublin Core metadata".
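
To make the idea concrete, here is a minimal sketch that writes a Dublin Core record as RDF/XML with Python's standard library. The Dublin Core namespace URI, the function name `dc_to_rdf`, and the sample URLs are assumptions for illustration; the `relation` element is used only to show one resource pointing at another.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC_NS = "http://purl.org/dc/elements/1.1/"  # assumed Dublin Core namespace

# Ask ElementTree to use readable prefixes instead of ns0, ns1.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("dc", DC_NS)


def dc_to_rdf(url, record):
    """Describe the resource at url with Dublin Core elements, as RDF/XML."""
    root = ET.Element("{%s}RDF" % RDF_NS)
    description = ET.SubElement(
        root, "{%s}Description" % RDF_NS, {"{%s}about" % RDF_NS: url})
    for element, value in record.items():
        ET.SubElement(description, "{%s}%s" % (DC_NS, element)).text = value
    return ET.tostring(root, encoding="unicode")


# Hypothetical record: an outline that is related to a separate casebook.
print(dc_to_rdf("http://example.org/evidence.html", {
    "title": "Evidence Outline",
    "creator": "Doe, Jane",
    "relation": "http://example.org/casebook.html",
}))
```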

What is RDF?

The article "What is RDF?" explains RDF in a readable manner. See, "RDF Tools" for an example of the Dublin Core in RDF.

Which Search Engines Support Metadata?

Search Engine Watch has a list of search engines that use metadata.

Using Harvest to Retrieve and Index Metadata

Harvest is an integrated set of tools to gather, extract, index and search Internet information. Harvest can use various indexing software, but by default it comes with the Glimpse indexer built in. Both Harvest and Glimpse can handle structured data organized into fields, so documents whose fields contain names and attributes (such as metadata) can be stored and retrieved using field-limited searches.

Harvest contains a "gatherer" and a "broker" component. The gatherer retrieves documents from the Internet using a list of URLs, and will recursively descend a site if its configuration requires it. Each retrieved document is passed through a summarizing program that extracts the required elements and stores the summary in a field-organized record. Once the gatherer has completed its retrieval and summarizing tasks, it is ready to receive requests from a broker. The broker periodically queries the gatherer(s) listed in its configuration file, retrieving the records added since the previous contact. The broker then indexes the data using Glimpse and, via a user query form and a query engine, accepts input from users, passes each query on to the external Glimpse search engine, and returns the results to the user as a web page.
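
The division of labor described above can be sketched in a few lines of Python. This is not Harvest code and does not use Harvest's record format or Glimpse: the classes `MetaSummarizer`, `Gatherer`, and `Broker` are hypothetical stand-ins, documents are passed in as strings rather than fetched from URLs, and a small in-memory index takes the place of the external indexer.

```python
import re
from html.parser import HTMLParser


class MetaSummarizer(HTMLParser):
    """Summarizer: keep only the meta name/content pairs of a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        name, content = attrs.get("name"), attrs.get("content")
        if name and content:
            self.fields.setdefault(name.lower(), []).append(content)


class Gatherer:
    """Stores one field-organized summary per retrieved document."""

    def __init__(self):
        self.summaries = []  # append-only, so a list position marks "since"

    def add_document(self, url, html_text):
        parser = MetaSummarizer()
        parser.feed(html_text)
        parser.close()
        self.summaries.append({"url": url, "fields": parser.fields})

    def updates_since(self, position):
        return self.summaries[position:]


class Broker:
    """Collects new summaries from gatherers and answers field-limited queries."""

    def __init__(self, gatherers):
        self.gatherers = gatherers
        self.seen = [0] * len(gatherers)  # summaries already collected
        self.index = {}                   # (field, word) -> set of URLs

    def collect(self):
        for i, gatherer in enumerate(self.gatherers):
            new = gatherer.updates_since(self.seen[i])
            self.seen[i] += len(new)
            for summary in new:
                for field, values in summary["fields"].items():
                    for word in re.findall(r"\w+", " ".join(values).lower()):
                        self.index.setdefault((field, word), set()).add(summary["url"])

    def query(self, field, word):
        return sorted(self.index.get((field.lower(), word.lower()), set()))


gatherer = Gatherer()
gatherer.add_document(
    "http://example.org/evidence.html",
    '<html><head><meta name="DC.Creator" content="Doe, Jane"></head></html>')
broker = Broker([gatherer])
broker.collect()
print(broker.query("DC.Creator", "doe"))
```

Calling `collect()` again after the gatherer has summarized more documents picks up only the new records, which mirrors the broker's periodic contact with its gatherers.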

For information on configuring Harvest see: http://www.wlu.edu/Harvest/docs/.