The University of Illinois Open Archives Initiative Metadata Harvesting Project

Proposal to Implement a Scholarly Information Portal
Using OAI Metadata Harvesting Protocols


Introduction

Limitations of Current Web Search Systems:

Current Web search tools index primarily high-visibility, freely accessible information resources, and most index only HTML content. Scholarly information resources of less visibility, of restricted access, or in other formats (including images / multimedia, databases, PDF & XML documents, and metadata describing physical archival holdings) are largely neglected. When harvested at all by Web search engines, item-level metadata describing resource content and context are not well utilized. The result is that users miss finding out about important and relevant resources housed in scholarly institutions. Because the harvesting process used by today's Web search engines is largely hidden from view and has the appearance of sophistication, students, faculty, and other researchers often think they are conducting more comprehensive searches than they really are. Search utilities indexing primary source materials in a wide range of formats and better exploiting item-level metadata are needed to provide more inclusive Web-based gateways and portals to scholarly information resources and to highlight the availability of these special collections.

At the same time, most existing Web information repositories and aggregation systems lack the degree of semantic interoperability needed to enable consistent and coherent access to similar and related classes of information objects distributed across heterogeneous information repositories. Variations in document authoring conventions and Web page construction and granularity make searching of Web-accessible collections by current techniques problematic. Users are faced alternately with too much and too little in the way of search results. Search precision and recall are generally poor. Quality, quantity, and fidelity of items found are ambiguous. Context regarding, and relationships between, the information objects identified are not provided. As the amount of content available online increases, users want and need better ways to search across heterogeneous information repositories. Search utilities that better exploit interoperability are needed to facilitate more effective discovery of information.

Content Considerations:

Current Web search tools do an especially poor job providing comprehensive intellectual access to many kinds and types of information resources found in scholarly archives and other special collection units. For example, academic libraries collectively hold a wealth of unique information in the form of manuscript archives and cultural heritage image, artifact, cartographic, and multimedia collections. Much of this material is of special local or regional importance (e.g., collections of local oral histories, collections of state or regional aerial photographs, collections of local historical images and artifacts). Much of this material is closely tied to the institutions holding the items (e.g., personal archives of famous faculty and alumni, archives associated with affiliated research centers). Increasingly this material (or at least the finding aids describing this material at the item level) is being digitized and made available online. These collections are unique and highly valued, but items in these collections typically are not individually represented in library online catalogs. Instead they are described at the item level only in a multiplicity of discrete and independent bibliographic databases and finding aids. Effective item-level aggregation across institutions and even within the same institution is still relatively uncommon. Item-level aggregation across metadata schemas is rare.

Effective aggregation and effective searching at the item level across collections has been a problem in particular for manuscript archives. The difficulties in providing cross-repository and cross-schema searching of manuscript archives are exemplary in this regard. Services such as the online version of the National Union Catalog of Manuscript Collections and for-fee search utilities such as Chadwyck-Healey's Archives USA and RLG's Archival Resources have made inroads of late but are still better at exploiting collection-level metadata than at item-level aggregation. In practice such services appear to index item-level metadata very inconsistently. In part, difficulties in this regard can be traced to variations in implementation of community-standard encoding schemas such as the Encoded Archival Description (EAD) metadata schema. Community-specific (i.e., schema-specific) research into ways to reduce and/or overcome such inconsistencies shows considerable promise, but many of these community-specific solutions are not immediately extensible, and in any case they do not address interoperability between metadata schemas (e.g., see the discussion below regarding the Distributed Finding Aid Server project). This last concern in particular is an issue we mean to address in this proposed project.

Two examples illustrate the need for better interoperability across metadata schemas commonly used to describe archival holdings. Indiana University's Hoagy Carmichael Collection consists of thousands of items (both textual and non-textual) drawn from three different repositories on the Bloomington campus (the Archives of Traditional Music, the Lilly Library, and the University Archives). The materials all relate to the life and career of Hoagy Carmichael and are all of possible interest to those researching Carmichael or related topics. Unfortunately, separate and distinct finding aids and databases describe the various components of these holdings, and three different metadata schemas (EAD, Text Encoding Initiative [TEI], and MARC) are used. Providing integrated browse and search access to EAD finding aids, MARC records, and TEI Header records remains an unresolved issue. Similarly, the University of Illinois at Urbana-Champaign has extensive archival holdings relating to the history of advertising, including the archives of the Advertising Council (approximately 100 cubic feet) and the D'Arcy Collection (approximately 2 million advertisements published between 1890 and 1970). Again, different library units (the Communications Library and the University Archives) hold the primary collections, and again different schemas were used to construct the extant finding aids. There is no way presently to search across both collections simultaneously.

There are unique challenges as well in the provision of what appears (to the end user) to be seamless access to diverse types of digitized cultural heritage information. This content category includes a wide range of special collections. Our collection of digitized historical maps of Illinois and the Northwest Territory, our collection of digitized historical aerial photographs, and the Illinois Digital Cultural Heritage Collection (digitized primary source materials supporting the Illinois Learning Standards Goals for Social Science) are examples of local collections in this content category. Museums, libraries and archives have traditionally used a variety of approaches to provide intellectual access to such content. Prior to the advent of Internet connectivity, the Web as a medium to convey graphical information, and the capability to search and retrieve information from databases, there was little overlap among cataloging, inventory, indexing practices, vocabulary tools, and access systems across cultural heritage institutions. The diversity of approaches across several disciplines poses challenges to the cohesive development of the categories of administrative, structural, and especially descriptive metadata for the combination of digitized visual, multimedia, and textual resources in the library, museum, and archival communities. A number of different metadata schemas are in common use (e.g., EAD, TEI, the schema developed by the Consortium for the Interchange of Museum Information [CIMI], the schema developed by the Federal Geographic Data Committee [FGDC], Categories for the Description of Works of Art [CDWA], Dublin Core [DC] and Dublin Core Qualified).

Need for Further Research:

With regard to these two categories of scholarly information resources (manuscript archives and digitized cultural heritage information) there has been considerable research into community-specific (i.e., metadata schema-specific) techniques and standards for description. The Distributed Finding Aids Server Project (DFAS), conducted jointly by Harvard, Oxford, Michigan, Columbia, and Indiana, demonstrated that it is possible to build a distributed search mechanism for searching folder and item level descriptions in archival finding aids stored in EAD format. However, the DFAS project uncovered significant limitations to the particular distributed approach used. DFAS required the purchase and use of specific indexing software by each participating institution. It also exposed only metadata available in native EAD (encoded in XML or SGML) assuming that all collections of interest will be described using this standard alone. In reality, many institutions provide only static HTML versions of "EAD" finding aids. As described above, many primary sources are described in non-EAD HTML, in non-EAD metadata schemas, or through dynamic HTML generated from descriptive information stored in a collection-specific database schema. Nevertheless, DFAS's development and use of "Common Access Points" (CAP) to support mapping of item-level descriptive information from EAD to a Z39.50-based description schema provides a promising model for further experimentation with other metadata harvesting protocols. Within the museum and cultural heritage community also, considerable research has been done examining the diversity of descriptive work that currently exists due to the uniqueness of collections and approaches to the cataloging, organization, description, and presentation of museum collections. 
Earlier work by Blackaby and Sandore has suggested potential in the approach of collecting or harvesting metadata from the diverse legacy systems that operate in museums, libraries, and archives. Bell, in writing about the descriptive work of art curators, comments that while technology has made it feasible to exchange information about art collections, not every curator chooses the same terminology to describe works of art. This makes for challenges to a scholarly metadata harvesting project that require creative solutions. In both domains a number of specialized controlled vocabularies and metadata formats have been developed to assist with intellectual organization, standardization, and limited interoperability.

This body of research paves the way for greater interoperability. The further introduction of the Open Archives Initiative (OAI) metadata harvesting protocol provides mechanisms for the sharing of item-level metadata across multiple metadata schemas in an automated manner. Used in concert, these conventions and protocols have the potential to enable creation of tools and utilities that better address the need for greater semantic interoperability and make possible easier, more comprehensive discovery of information resources, even across different metadata schemas. The question is how best to implement and exploit OAI for these categories of information resources. In order to exploit and prove the extent of the potential OAI offers, we see a requirement to address the following specific needs.

First, we need to integrate and further test OAI-based technologies and interoperability approaches in real-world context in order to validate their effectiveness and efficiency and establish their readiness for widespread adoption. In technology terms, there is a need to establish the viability, scalability, and utility of the OAI metadata harvesting protocol in the context of making available rich, important scholarly information resources such as those found in manuscript archives and collections of digitized cultural heritage information. The issues here are largely mechanistic, e.g., the harvesting function itself, automating the selection of records and the formats in which to harvest those records, scheduling full and incremental (refresh) harvests, utilizing the records management features of the OAI protocol, etc. There is a need to investigate and explore various harvesting service configurations and implementation approaches in order to learn how the protocol can be used to maximum advantage in this context.

Second, there is a need to demonstrate that state-of-the-art bibliographic control techniques, such as EAD and specialized controlled vocabularies, can be used effectively in concert with OAI metadata harvesting protocols. We need to demonstrate either that providers can map from EAD, TEI, FGDC, CDWA and other specialized formats into Dublin Core in a sufficiently consistent fashion to enable effective harvesting and cross-schema searching, or that an OAI harvesting service can harvest from multiple repositories in "native" repository metadata schemas and then map effectively and fairly to a common search index schema. This work will build on and eventually augment current guidelines and best practices for doing such mappings (e.g., the DFAS CAP approach). We also need to exploit, and integrate into the OAI metadata harvesting services we create, the best and most suitable of the available interface and search system technologies. Correct implementation of basic search system features such as automatic truncation, proximity and Boolean searching, noun-phrase free-text searching, result ranking, and customization of interface appearance and functionality for the device or browser being used is important and non-trivial. Investigation of more advanced techniques, e.g., automated indexing, classification, authority control, term suggestion, semantic mapping, query analysis and categorization, and term co-occurrence analysis also is needed to better understand the potential power of such techniques in an OAI context. There is the potential that implementation of advanced indexing and search interface features can mitigate interoperability problems associated with cross-schema harvesting. Retrieval and display technologies must mirror the sophistication of the information being sought. Current Web search systems lack clarity and context when displaying hits.
As digital information and relations between digital information objects become more complex, providing appropriate context and clearly showing relationships when displaying retrieved results becomes essential.
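
The harvester-side mapping option described above (harvest in "native" schemas, then map to a common search index schema) can be illustrated with a small crosswalk table. This is only a sketch under stated assumptions: the function name `to_dublin_core`, the flat dictionary record layout, and the particular element pairings are hypothetical choices made for illustration. They are not the DFAS CAP mappings and not a recommended crosswalk.

```python
# Illustrative crosswalks from native element names to Dublin Core elements.
# The pairings below are assumptions for this sketch, not an endorsed mapping.
CROSSWALKS = {
    "ead": {
        "unittitle": "title",
        "origination": "creator",
        "unitdate": "date",
        "scopecontent": "description",
    },
    "marc": {
        "245": "title",
        "100": "creator",
        "520": "description",
    },
}


def to_dublin_core(schema, record):
    """Map a flat native record (element name -> value) to Dublin Core.

    Elements with no entry in the crosswalk are dropped, which is exactly
    the kind of information loss a real mapping study would need to measure.
    """
    crosswalk = CROSSWALKS[schema]
    dc = {}
    for element, value in record.items():
        dc_element = crosswalk.get(element)
        if dc_element is not None:
            # Dublin Core elements are repeatable, so collect values in lists.
            dc.setdefault(dc_element, []).append(value)
    return dc
```

Because both schemas land in the same element names, a single index can then be built over the mapped records regardless of the schema each provider exposed.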

Finally, we need to address technical issues associated with sustainability and provide constructive feedback on the whole process to the community of librarians and metadata providers. Demonstrating effective management of metadata harvested is crucial to demonstrating sustainability. Metadata staleness must be minimized for a harvesting approach to be effective. There's a need to address de-duplication of records and to recognize related records describing different manifestations of the same intellectual object. Resources and expenditures involved in setting up and maintaining harvesting and search services of the kinds described need to be documented and widely reported. Feedback on the value of the services implemented will be most credible if based on real-life experiences with implementations designed to test and exploit semantic interoperability on a large scale across a number of diverse institutions. Information and evaluative data regarding the utility and usability of harvesting and search services developed can be acquired in several ways. Formal group and informal one-on-one usability testing of services developed with faculty and graduate students, who are the most common users of primary sources contained in archives and manuscript libraries, can provide one perspective. Similar solicitation of response from faculty, students, and off-campus users of digitized cultural heritage information resources also should be done. Detailed analysis of search service transaction logs should be conducted. Lastly, we see the need for systematic consultation with librarians and archivists from peer institutions (e.g., through a project steering committee). The evaluative feedback developed can then inform future developments and implementations of OAI in particular and metadata harvesting and interoperability in general.
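
One simple way to approach the de-duplication need mentioned above is to group harvested records under a normalized key. The sketch below assumes records are dictionaries with hypothetical `id`, `title`, and `creator` fields. Matching on normalized title and creator is a heuristic chosen for illustration only. It would not by itself recognize different manifestations of the same intellectual object.

```python
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and collapse punctuation/whitespace so trivial variants match."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def group_possible_duplicates(records):
    """Return lists of record ids that share a normalized (title, creator) key."""
    groups = defaultdict(list)
    for rec in records:
        key = (normalize(rec.get("title", "")), normalize(rec.get("creator", "")))
        groups[key].append(rec["id"])
    # Only keys seen more than once are candidate duplicates.
    return [ids for ids in groups.values() if len(ids) > 1]
```

Candidate groups produced this way would still need review, since two distinct items can legitimately share a title and creator.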


Work Plan--Activities & Goals:

Overview

The Library of the University of Illinois at Urbana-Champaign (UIUC) proposes to create and implement a suite of OAI-based metadata harvesting services, search services, and tools designed to facilitate discovery and retrieval of certain classes of scholarly information (i.e., metadata describing manuscript archives and digitized cultural heritage information). This work will be complementary with and carried on in concert with a related project being proposed by the University of Michigan. An overview describing the relationship between the two projects is given in Attachment I. The two institutions will utilize common metadata harvesting technologies (developed at Illinois) and common indexing metadata brokering tools (developed at Michigan). Both will work together to encourage and facilitate participation (through the revealing of metadata) by peer institutions (notably, other institutions within the CIC). But while the Michigan search service will be global in scope, providing information about all kinds of publicly available digital library objects revealed by academic and scholarly institutions, the Illinois project will focus on creating a deeper, domain-specific portal designed to search metadata describing selected manuscript archives and digitized cultural heritage information resources. The Michigan approach is akin to the creation of an "academic Lycos" while the Illinois approach is more analogous to the creation of a "vertical community" portal.

Both approaches exercise the technologies involved, both approaches lead to a better understanding of the benefits and pitfalls of developing cross-repository search and retrieval systems, and both approaches help make visible portions of the currently "hidden" Web of scholarly information resources. Different audiences are served by the two approaches, and different needs met. The Illinois project will serve scholars looking to use the power of the latest technologies to discover the existence of specialized in-depth information resources, even if significant effort is then involved in obtaining access to the primary source materials. Our service will enable researchers to find information objects and relationships between information objects that currently are difficult if not impossible to find in any systematic way.

The Illinois project will have the following overarching objectives

Work will commence July 1, 2001 and be complete by December 31, 2002. The University of Illinois Library will provide facilities and workspace for project staff in the Grainger Engineering Library Information Center.

Selection of Metadata to Harvest

We will focus on enhancing access to scholarly information from archives and other special collection units that traditionally has been difficult for end-users to find - e.g., boxes and folders in manuscript archives and images, maps, and multimedia content related to cultural heritage topics. While the harvesting middleware application we develop will be general purpose and capable of harvesting metadata broadly across all disciplines, we propose to focus our harvesting and portal development on specific domains. In particular we have two related metadata-based projects currently ongoing at the University of Illinois. The first, funded under a National Leadership Grant awarded us by the Institute of Museum and Library Services, is entitled Teaching with Digital Content: Describing, Finding and Using Digital Cultural Heritage Materials. This project focuses on the development of innovative approaches to presenting and teaching with digitized primary source materials. Content for this project is drawn not only from University of Illinois collections but also from 10 other libraries and museums (mostly from within the state of Illinois). Content for this project has historical significance, is in a variety of original formats, and is deemed particularly relevant to Illinois Learning Standards goals for Social Science. In a second, separate project we have undertaken to translate the major print finding aids describing contents held by our University Archives into EAD. This work is being funded from a mix of internal and external sources. The content is typical of that found in a large university archive. Representative of finding aids so far converted to EAD are the James B. Reston Papers (1935-95), the John Bardeen Papers (1910-1991), and the Third Armored Division Association Archives (1941-).

We propose to take advantage of our expertise in these domains and also take advantage of established local user communities interested in these subject areas by selectively harvesting materials similar or complementary in topic, coverage, and format. The exact scope and selectivity of what we harvest can't be determined for sure until we know precisely what materials will be made available by OAI providers. However, there are numerous examples of appropriate materials that have been suggested by potential OAI provider institutions. For instance, we have had preliminary indications of interest from Harvard, Wisconsin, Indiana, Minnesota, and the University of Chicago regarding the possibility of revealing EAD finding aids to manuscript archives held by those institutions. We have identified cultural heritage information collections at Indiana (e.g., Hoagy Carmichael Collection, Wright's Bibliography of American Fiction), Minnesota (Historical Maps, Architects and Architecture of Minneapolis, History of Computing: Burroughs Corporation Photographs), and Harvard (VIA--visual arts, architecture, material culture, and history) that would be of interest. We anticipate that similar cultural heritage content (i.e., with a focus on regional history and/or Americana) will be made available by other institutions.

Defining the scope of the content harvested and the portal developed in this manner will have two advantages. First, the content harvested, though focused in topic, will be diverse enough in format and metadata schema to allow us to attack a range of specific problems associated with metadata harvesting. We will give the protocol a good workout. For example, EAD documents may include nesting of important metadata about the individual folder of material in which a researcher is interested. How should such information be harvested and represented to the user? For metadata natively presented as dynamic HTML content (i.e., extracted in real time from a database), how does the manner by which specific pieces of metadata are extracted from the background database and presented for harvesting affect the utility of the harvested metadata? Second, limiting the scope of the portal developed will facilitate the in-depth investigation of interface and best practice issues specific to these domains as described above. To what extent can interface and indexing techniques mitigate difficulties caused by variant and inconsistent provider procedures and conventions?
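
To make the nesting question above concrete, the following sketch shows one possible way to represent nested EAD components: each component is flattened into a row that carries the titles of its ancestors as context. It assumes un-namespaced EAD with titles in `did/unittitle` and components tagged `c` or `c01` through `c12`. The function name `flatten_components` and the " > " path format are hypothetical. Whether this representation serves users well is precisely the open question posed above.

```python
import re
import xml.etree.ElementTree as ET

# EAD component elements are <c> or the numbered forms <c01> ... <c12>.
COMPONENT = re.compile(r"^c(0[1-9]|1[0-2])?$")


def flatten_components(element, ancestors=()):
    """Return one 'Series > Folder > Item' style path per nested component."""
    rows = []
    for child in element:
        if COMPONENT.match(child.tag):
            title = (child.findtext("did/unittitle") or "").strip()
            path = ancestors + (title,)
            rows.append(" > ".join(path))
            # Recurse so deeper components inherit this component's context.
            rows.extend(flatten_components(child, path))
        else:
            # Pass through wrapper elements (e.g. <dsc>) without adding context.
            rows.extend(flatten_components(child, ancestors))
    return rows


sample = (
    "<dsc><c01><did><unittitle>Correspondence</unittitle></did>"
    "<c02><did><unittitle>Letters, 1941</unittitle></did></c02></c01></dsc>"
)
rows = flatten_components(ET.fromstring(sample))
```

Here `rows` holds one entry for the series and one for the folder, with the folder entry prefixed by the series title so that it remains meaningful when retrieved on its own.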

We will therefore harvest selectively as to topic and source. Metadata describing content not in digital format will be considered for harvest. Metadata describing content having access restrictions will be considered for harvest. Metadata which itself has access restrictions may be considered for harvesting. Limited facilities for authentication and access control (e.g., IP address checking, support for proxy services, and Basic HTTP/HTTPS user authentication) will be implemented at provider collection (set) level as required by specific providers. The impact of such restrictions on sustainability will be evaluated in the last phase of this project.

Task Details -- Phase 1 Construction of Baseline Harvesting Service (July 1 - December 31, 2001):

Our initial task will be to create a robust and optimized application that efficiently harvests metadata. In particular this discrete and standalone middleware application will provide harvested metadata for indexing by Michigan's XPAT-based search system. The Illinois harvesting application will run on Microsoft Windows NT and Windows 2000 platforms. Object metadata will be stored in discrete XML files. Information necessary for harvesting and indexing related management and control of metadata objects will be stored in a configurable ODBC-accessible database resource (e.g., a Microsoft Access database or an Oracle database). The harvesting application will allow for both (or either) manual and automated identification of metadata provider sites (automated presumes availability of one or more machine-readable registries of OAI-compliant metadata providers). The harvesting application will be per site configurable as regards automatic harvest refresh frequency. It will be able to perform complete site harvesting, partial site harvesting (i.e., by designated sets), and incremental site / partial site harvesting (i.e., by date range). Developmental and production copies of the Illinois metadata harvesting application will be made available to the University of Michigan throughout the project period.
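
The three harvesting modes described above (complete, by set, and incremental by date range), together with per-site refresh scheduling, can be sketched as follows. This is a minimal illustration, not the planned application: the function names and the `example.org` base URL are hypothetical, and the request parameters (`verb=ListRecords`, `metadataPrefix`, `set`, `from`, `until`) follow the published OAI-PMH 2.0 specification, which may differ in detail from the protocol version available at the time of this proposal.

```python
from datetime import datetime, timedelta
from urllib.parse import urlencode


def build_list_records_url(base_url, metadata_prefix="oai_dc",
                           set_spec=None, from_date=None, until_date=None):
    """Build a ListRecords request URL for a full, set, or incremental harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec        # partial harvest of one designated set
    if from_date:
        params["from"] = from_date      # incremental harvest: lower date bound
    if until_date:
        params["until"] = until_date    # incremental harvest: upper date bound
    return base_url + "?" + urlencode(params)


def refresh_due(last_harvest, refresh_days, now):
    """True when a site's configured refresh interval has elapsed."""
    return now - last_harvest >= timedelta(days=refresh_days)


# Hypothetical provider: complete harvest, then an incremental refresh.
full = build_list_records_url("http://example.org/oai")
incremental = build_list_records_url("http://example.org/oai",
                                     set_spec="maps", from_date="2001-07-01")
```

Keeping URL construction separate from scheduling means the per-site refresh frequency can live in the control database while the request logic stays the same for every provider.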

During this phase of activity, Illinois will purchase and implement Michigan's XPAT-based search system (derived from version 5 of the Open Text search engine), including extensions to be provided by Michigan to enable both the indexing and exposure of metadata using OAI protocols. (The latter feature will allow the Illinois harvesting service also to be used, incidentally, as an OAI metadata brokerage service. As schedule and resources permit, implementation issues related to brokering function will be pursued as part of overall performance, scaling, and sustainability investigations.) Illinois will construct a simple, baseline search service to enable end-user searching of metadata indexed using our XPAT implementation. This baseline search implementation will be a starting point for enhancements developed and investigations done as part of phase 2 of this project.

During this phase also, generic metadata provider tools already developed by Illinois will be packaged and made available to peer institutions interested in setting up OAI metadata provider services. Illinois will host a workshop in August of 2001 for interested staff from potential metadata provider sites in the region, and especially will work closely with other CIC institutions interested in revealing existing metadata using OAI protocols (including possible travel to one or more nearby institutions). Tools developed and made available for this purpose will rely on and utilize community standards (e.g., XML, XSLT, SQL). This work is ancillary to primary project tasks, but will be useful in encouraging, early on, the breadth of participation and diversity of metadata desirable to support the project's primary research aims and objectives.

A number of basic and applied research questions will be investigated during this phase of the project. Among these are:

Outcomes from this phase of work will include: the harvesting middleware application (including source code); a working, baseline search service implementation; preliminary baseline performance measures; a preliminary report on harvesting refresh frequency issues; and a summary report addressing the rest of the research issues described above.

Phase 2: Portal Creation and Development (September 1, 2001 - December 31, 2002)

The OAI metadata harvesting protocol is only of value if harvested metadata can be effectively searched and retrieved across heterogeneous repositories and heterogeneous metadata schemas (whether mapping occurs on the provider side or the harvesting service side). During phase 2 of this project we will identify and test potentially desirable functions and value-added features that future OAI harvesting services might implement. Semantic interoperability requires meaningful integration of harvested information so that a cross-institutional collection of metadata can be effectively searched through a single, appropriate access point (i.e., client implementation). Once a critical mass of relevant metadata has been harvested, we will develop one or more domain-specific portal-style search interfaces for search and retrieval of harvested metadata. Through this portal, end-users will search the metadata we have selectively harvested. A variety of basic and advanced search interface techniques and approaches will be explored to ascertain the suitability of such techniques in the context of OAI-compliant metadata harvesting. In creating this portal we will build especially on the large-scale XML testbed and "federated search" research previously accomplished here at Illinois under the auspices of the DLI-I and DLib Test Suite grant projects (1994 - 2001).

During this phase we will investigate in particular the use of specialized controlled vocabularies to facilitate searching (e.g., LC Thesaurus for Graphic Materials [TGM], Nomenclature for Museum Cataloguing, Art & Architecture Thesaurus, Getty Thesaurus of Geographic Names [TGN], USGS Geographic Names Information System [GNIS], etc.). Such controlled vocabularies have the potential to facilitate resource discovery in multiple ways. Used at the search interface level they can provide search term suggestions. Analyzed and used in conjunction with existing noun-phrase extraction and vocabulary co-occurrence techniques (such as developed during the Illinois DLI-I project), controlled vocabularies and taxonomies can be used to enrich and add value to metadata records at the point of indexing. Term co-occurrence analyses can suggest automated assignment of controlled vocabulary terms or application of taxonomies to metadata records not explicitly cataloged or classified by hand. Where two or more controlled vocabularies have been used to describe overlapping collections of objects, co-occurrence generated mappings between controlled vocabularies can be further augmented by direct mappings, allowing for further enrichment of metadata records. These approaches can help enhance the effective interoperability of otherwise dissimilar information repositories.
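
The idea of using term co-occurrence to suggest controlled vocabulary terms for uncataloged records can be sketched with simple counts. This is an illustration only and is not the technique developed in the Illinois DLI-I project: the function names are hypothetical, records are assumed to arrive as pairs of (free-text terms, controlled terms), and raw co-occurrence counts stand in for whatever weighting a real analysis would use.

```python
from collections import Counter, defaultdict


def build_cooccurrence(records):
    """Count how often each free-text term appears with each controlled term.

    `records` is an iterable of (free_terms, controlled_terms) pairs taken
    from metadata records that were cataloged by hand.
    """
    table = defaultdict(Counter)
    for free_terms, controlled_terms in records:
        for free_term in set(free_terms):
            for controlled_term in set(controlled_terms):
                table[free_term][controlled_term] += 1
    return table


def suggest_terms(table, free_terms, top_n=3):
    """Suggest controlled terms for a record that has only free-text terms."""
    scores = Counter()
    for free_term in set(free_terms):
        # Unknown free-text terms contribute nothing to the scores.
        scores.update(table.get(free_term, {}))
    return [term for term, _ in scores.most_common(top_n)]
```

The same table could also drive search term suggestions at the interface level, since it links the words users type to the vocabulary catalogers applied.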

During this phase of work we also will investigate issues specific to the use (in the context of OAI) of metadata describing manuscript archives and digitized collections of cultural heritage information. For example, while EAD is designed to improve semantic interoperability when searching across multiple scholarly archives, there remain a number of issues with how it is used. Guides and resources, such as the "EAD Cookbook" (based on community practice and the recommendations of the EAD Application Guidelines) and the DFAS CAP mappings, help to facilitate interoperability, but there continue to be differences in local implementations. Authority control and encoding format for common elements such as personal names vary. Tag usage and use of optional syntax vary. We do not intend to investigate in depth or resolve these issues, but we will investigate the magnitude of the impact of these local differences when trying to federate not only EAD metadata but also metadata from other schemas such as TEI, FGDC, and CDWA. We will investigate the impact of these variations on the usability of metadata harvested and identify technologies and techniques that help normalize OAI harvested content derived from EAD finding aids and enhance interoperability. The intent is not to reinvent conventions for use of metadata schemas natively, but rather to parameterize the effect of local implementation differences and describe how decisions made with regard to mapping from native schemas to DC affect the effectiveness of aggregation for search and retrieval of these information resources.

There is also a large body of research regarding provision of intellectual and inherent feature (color, shape, texture, and composition) access to graphical image objects on which we can draw. Studies by Armitage and Enser, Keister, and Collins have analyzed user queries of picture collections (not image databases). These studies suggest that a general framework can be applied to the various query types that users generate. They also report that user queries include a preponderance of subject terms, both specific and non-specific. This phenomenon is not unlike that which the library community discovered with the advent of online catalog searching. A recent study of image queries across a number of picture libraries by Armitage and Enser further validates a query matrix constructed earlier by Shatford, based on work by Panofsky. Shatford proposed that picture queries could be categorized into a matrix of three types of information - iconography (specifics), pre-iconography (generics) and iconology (abstract concepts) - combined with the query features of who, what, where, and when. Armitage and Enser's study evaluated queries from seven picture libraries of diverse subject matter such as a film and television archive, a local history collection, a collection of art images, and an archive of aerial photographs. Collins, in a study that analyzed user queries of two historical picture collections, reported that 86% of all queries submitted by users consisted of subject terms. Further, Collins notes that of these subject searches, over half represented various categories of "generic" terms (e.g., fire brigades, families, downtown, football, racism), and just under half comprised specific terms (proper names, corporate, and geographic names).
Collins' work with historical picture collections suggests that for this particular subject domain, the "aboutness" of the picture proved far more important in user queries than other commonly-indexed attributes such as the picture creator, genre, or support materials. The implications of this research will be considered when developing the search interface to our hybrid collection of harvested metadata describing both textual and non-textual information objects. Again, the intent is not to reinvent or directly further research already done into how end-users search homogeneous collections and classes of information objects. Rather, our intent is to further understanding of how such community-specific, format-specific, and metadata schema-specific research can be exploited when creating a portal designed to facilitate search of aggregated, heterogeneous collections of information objects.

Finally, during this phase of the project we also will investigate and test techniques to enhance presentation of results. We will investigate ways of providing meaningful context when displaying retrieved records (e.g., showing the relation of a retrieved item to its larger collection, inclusion of links to items related by reference, etc.). Links to full content will be implemented where such links were provided as part of the metadata harvested and where conditions of use allow. We will investigate inclusion of emerging standards that support persistent object identifiers and local reference resolution. We also will investigate methods to facilitate end-user navigation of search results. As time and resources permit, we will investigate methods to dynamically update link relationships with links to annotations and the automatic conversion of one-way links into bi-directional links.
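The conversion of one-way links into bi-directional links can be sketched as building a reverse index over the link relationships found in harvested records. The following Python fragment is a minimal illustration under assumed inputs: the record identifiers and the dictionary-of-lists link representation are hypothetical, not a format defined by this project.

```python
from collections import defaultdict

def add_backlinks(links):
    """Given {source_id: [target_id, ...]}, record every link in both directions."""
    both = defaultdict(set)
    for source, targets in links.items():
        for target in targets:
            both[source].add(target)
            both[target].add(source)
    return {item: sorted(related) for item, related in both.items()}

# Hypothetical one-way links extracted from harvested metadata
one_way = {
    "letter-12": ["collection-3"],
    "photo-7": ["collection-3", "letter-12"],
}

two_way = add_backlinks(one_way)
print(two_way["collection-3"])
```

With the reverse entries in place, a collection-level record can list the items that point to it even though its own metadata carries no such links.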

A number of research questions will be investigated during this phase of work. Among these are:

There is much prior work on all of these issues, but it needs to be extended and examined in the context of OAI metadata harvesting and search. The portal described will be developed iteratively over the course of this project. The primary output of this phase of work will be a usable portal for cross-institution search and retrieval of metadata describing scholarly manuscript archives and digitized collections of cultural heritage information. Summary reports describing the research investigations listed above and strategies for transfer of the technologies developed also will be produced.

Phase 3 Sustainability and Feedback (July 1, 2002 through December 31, 2002):

In addition to scalability issues (which we anticipate will be investigated more thoroughly by the Michigan project), the long-term usefulness of OAI metadata harvesting protocols will depend on the sustainability of the systems developed. Technical sustainability will depend largely on the ability to efficiently update and maintain harvested metadata and on the development and widespread adoption of appropriate community standards and consensus. During the final phase of work we will investigate these issues and, in the context of revealing holdings of scholarly manuscript archives and selected cultural heritage information resources, identify areas where community consensus or best practice recommendations are most needed. While in-depth economic sustainability studies are beyond the scope of this proposal, we will document resource costs associated with creating and maintaining a harvesting service and allied portal, and we will describe the kinds and magnitude of benefits possible using OAI.
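Efficient updating of harvested metadata typically relies on selective harvesting, in which a harvester asks a repository only for records changed since the last harvest. As a minimal sketch, the Python function below builds such a request using the OAI protocol's ListRecords verb with its from, set, and metadataPrefix arguments. The function name, the repository base URL, and the date are assumptions for illustration; the harvester described in this proposal is not written in Python.

```python
from urllib.parse import urlencode

def list_records_url(base_url, last_harvest=None, set_spec=None, metadata_prefix="oai_dc"):
    """Build a ListRecords request, optionally limited to records changed since last_harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if last_harvest:
        params["from"] = last_harvest   # datestamp of the previous harvest
    if set_spec:
        params["set"] = set_spec        # restrict to one set, if the repository defines sets
    return base_url + "?" + urlencode(params)

# Hypothetical repository base URL and harvest date
url = list_records_url("http://repository.example.org/oai", last_harvest="2002-07-01")
print(url)
```

Storing the datestamp of each successful harvest per repository is what makes the next refresh incremental rather than a full re-harvest.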

In particular, we will define and implement sustainable refresh procedures and processes to de-duplicate harvested records using standard identifiers (e.g., DOIs, OCLC numbers, OAI identifiers). In developing de-duplication algorithms we will investigate the potential to automatically ascertain which version of a record to keep (i.e., the richer one) and which record to discard. We also will investigate more sophisticated techniques for differentiating between exact duplication and multiple metadata records pointing to different manifestations of the same intellectual object. We also will report on the amount of effort required to sustain the harvesting service and associated portal once created.
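A simple form of identifier-based de-duplication can be sketched as follows. This Python fragment keeps one record per identifier and prefers the record with more non-empty metadata values. The record layout, the sample identifiers, and the use of a value count as a measure of richness are all assumptions made for illustration; the sketch handles exact identifier matches only and does not attempt to distinguish different manifestations of the same intellectual object.

```python
def richness(record):
    """Count non-empty metadata values; an assumed, crude proxy for a 'richer' record."""
    return sum(1 for values in record["fields"].values() for value in values if value.strip())

def deduplicate(records):
    """Keep one record per identifier, preferring the richest; ties keep the first seen."""
    best = {}
    for record in records:
        key = record["identifier"].strip()
        if key not in best or richness(record) > richness(best[key]):
            best[key] = record
    return list(best.values())

# Hypothetical harvested records; the same item arrives from two sources
harvested = [
    {"identifier": "oai:example.org:ms-001", "source": "site-a",
     "fields": {"title": ["Letters"], "creator": [""]}},
    {"identifier": "oai:example.org:ms-001", "source": "site-b",
     "fields": {"title": ["Letters"], "creator": ["Smith, Jane"], "date": ["1862"]}},
    {"identifier": "oai:example.org:ms-002", "source": "site-a",
     "fields": {"title": ["Diary"]}},
]

unique = deduplicate(harvested)
print([(r["identifier"], r["source"]) for r in unique])
```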

During this phase of work we also will study the impact of system performance (e.g., harvest refresh rate, search response time), interface feature set, and data quality (e.g., tagging and encoding consistency) on end-user success and satisfaction measures. Anonymous transaction log data will be analyzed and used to help parameterize important influences. Relevant literature on past library and archival usability studies will be consulted. Throughout the project, but particularly during this phase of activity, we will consult with groups of end-users. As appropriate, online user surveys and informal one-on-one (and possibly more formal focus group) usability testing will be undertaken involving University of Illinois humanities faculty and graduate students. Together these studies will help identify where community agreement on practice is especially important. Transaction log analyses also will be performed to better understand how end users make use of discipline-specific portals and cross-institutional search systems.

A number of research issues will be investigated during this phase of work. Among these are:

These results will be reported in the literature and directly to metadata providers, along with summaries highlighting the disagreements in local encoding practices noted during phase 2 that are significant in the context of OAI. A second workshop for metadata providers will be held in August of 2002. The feedback provided at that time will encourage dialog aimed at developing and propagating recommended best practices for future OAI implementations.

Development Platforms:

The bulk of development and implementation work on this project will be done on the Microsoft Windows NT/2000 platform. The Linux platform will be used for Michigan's XPAT database management system component. The harvester middleware application will be developed in Microsoft Visual Basic and/or Microsoft Visual C++ and will utilize components unique to those languages and to the Windows platform for connectivity to the network and to database systems other than XPAT. Harvested metadata objects will be stored as XML on the Windows file system and made accessible to the XPAT application running under Linux via the NFS protocol and/or FTP. Primary metadata indices will be built using XPAT. Metadata object management information (e.g., information needed for management of incremental harvesting, sets, and de-duplication, and for tracking of sites harvested) and data to be analyzed for creation of value-added content will be stored in a relational database (e.g., Microsoft SQL Server). The Illinois end-user search portal will be hosted on a Windows 2000 platform running Microsoft Internet Information Server version 5. Portal components, search interface features, and metadata enrichment tools will be written using Visual Basic, Visual C++, and/or VBScript. The University of Illinois Library will supply the servers used in this project. Two PC workstations and additional server hard drive space will be purchased for this project using grant funds.

Staffing & Administration:

This project will be administered by the University of Illinois Library. The Project Principal Investigator (PI) will be Timothy W. Cole, Mathematics Librarian, Associate Professor of Library Administration and Adjunct Associate Professor of Library and Information Science. Co-PIs will be: Christopher Prom, Assistant University Archivist and Assistant Professor of Library Administration; William H. Mischo, Engineering Librarian, Professor of Library Administration and Adjunct Professor of Library and Information Science; Beth Sandore, Head, Digital Imaging and Media Technology Initiative and Professor of Library Administration; and Thomas G. Habing, Research Programmer, Grainger Engineering Library Information Center. (See Staff/Contacts page.)

University of Illinois at Urbana-Champaign
Library Gateway Homepage
Comments to: Tom Habing
Updated on: 9/03/02 Sshreeve