The University of Illinois Open Archives Initiative Metadata Harvesting Project

Illinois OAI Protocol Metadata Harvesting Project

Status Report Covering Quarters 3 and 4 of the Project (30 June 2002)

Harvest Activities to Date

The harvested record count increased from 500,000 in December 2001 to 980,000 as of June 2002. Of these, approximately 8,878 were EAD records, which were expanded to generate an additional 2.5M discrete items. The repository currently holds over 3,588,000 searchable items. A monthly breakdown of harvesting activities follows. (An asterisk * denotes institutions that are not yet registered OAI Data Providers.)

January:
Added institutions:
American Museum of Natural History*
Harvard University Libraries*
Northwestern University*
OpenVideo

March:
Added institutions:
Illinois Alive!*
Indiana University (previously surrogate provider)

April:

Added institutions:

Ohio Historical Society*
National Library of Australia*
University of Chicago*
University of Wisconsin, Madison
Ibiblio

May:
Added institutions:

University of Minnesota Images Database
American Numismatic Society

June:
Added institutions:

AIM25 – Archives in London
Alex Catalog of Electronic Text
Rumsey Collection
Ackerman Archives

Collections added prior to January 2002 include:

     

Continued Solicitation / Acquisition of Metadata

After the OAI Metadata Provider workshop in September 2001, the Illinois OAI Harvesting Steering Committee suggested that a letter be sent jointly from the Illinois and Michigan Library Directors to all CIC library directors encouraging institutional participation in the Illinois and Michigan projects. As of June 2002 the participating CIC institutions are Illinois, Michigan, Wisconsin, Indiana, Iowa, Michigan State, Minnesota, Northwestern, and the University of Chicago. Of these, only four are registered OAI data providers: Illinois, Michigan, Indiana, and Minnesota. However, we expect the sites for which we provide surrogate data provider services to become registered data providers as OAI PMH continues to gain wider acceptance as a viable tool for enabling cross-collection resource discovery. We continue to solicit metadata from cultural institutions and regularly add registered OAI data providers as they appear on the Open Archives Initiative site (http://www.openarchives.org/Register/BrowseSites.pl).

Other Project Activities

Dissemination of Project Information

Beginning in March 2002 and continuing through June, numerous presentations outlining the Illinois OAI PMH project and its preliminary findings have been made. These have included a live videoconference for OCLC highlighting OAI, several presentations to the University of Illinois Graduate School of Information Science researchers and students, and presentations at the Spring CNI Task Force meeting, the Museums and the Web Conference, the DLF Spring Forum, and the E-Text discussion group at ALA. In addition, the project was highlighted in the "In Brief" section of D-Lib Magazine in April. The paper presented at the Museums and the Web Conference was published in the print proceedings (see citation at the end of the report). Announcements have been submitted to various listservs, and the project’s own listserv was set up to disseminate project updates to interested parties.

Current dissemination efforts include a paper to be presented at the JCDL Conference in July and a fall workshop, held in conjunction with the CIC DLIOC meeting, to develop a strategic plan for investigating ways to incorporate OAI into efforts to improve e-resource discovery across CIC institutions. Also under consideration is an information session for potential data providers at the DLF Fall Forum. A special issue of Library Hi-Tech on OAI, edited by the Illinois OAI PI, will be published in early 2003. Illinois, in conjunction with the Michigan and Emory University OAI projects, has also submitted an application for a panel discussion on OAI at the Spring 2003 ACRL Conference.

Working with the Metadata

Building a cross-collection repository has involved significant effort in working with the metadata provided to the project. Three areas have required a substantial amount of time: converting various metadata schemas into unqualified Dublin Core as part of our surrogate services; normalizing specific elements in the metadata in order to collocate like items; and working with the EAD metadata to provide more access to content embedded within a single EAD finding aid while simultaneously displaying an EAD record alongside other one-level records. The efforts in each area are briefly outlined below.

Surrogate Services

Providing surrogate Metadata Provider services to our contributing institutions has involved significant efforts to convert metadata into appropriate unqualified Dublin Core elements. This has allowed us to investigate and evaluate common metadata crosswalk practices. Mapping standard and local metadata schemas to Dublin Core required that we work closely with the data provider and develop a good understanding of how various elements were used within the schema. We developed a stylesheet to map the Harvard VIA records into Dublin Core. We also worked with the database administrator of the Spurlock Museum to establish a crosswalk between Spurlock’s local metadata format and Dublin Core. By applying the Library of Congress MARC to Dublin Core crosswalk, we converted 900,000 MARC records from the Illinois State Library to Dublin Core with just several days' effort by one programmer. We then filtered these records according to their Dewey Decimal Classification numbers to extract just over 300,000 records that represent materials broadly related to cultural heritage.

Normalization

To aid in collocation of like materials and increase the discoverability of resources, the project team has spent time examining elements that would be suitable for applying a standard controlled vocabulary. The analysis focused on five elements: type, coverage, date, format, and subject. Normalizing these was felt to have the potential to add the most value for users of the repository. Upon further examination, the project discarded the format element because it was inconsistently used and seemed less useful than the type element for discovery and collocation. The project closely examined use of the type element and developed a type vocabulary containing both general and slightly narrower terms. The date element and the temporal aspect of the coverage element were explored next. The normalization here involved grouping resources into discrete time periods based on the information in the metadata. In both these cases the decision was made to add the additional vocabulary and not to replace or delete any existing metadata.

Our first approach to this process involved applying some of these normalization scripts directly during the harvesting process and others after harvesting but before indexing. We later decided to do all normalization after harvesting as a separate process. This allows for better control of the processes and keeps a "clean" copy of harvested records in our harvester database and file system.

Other potential candidates for normalization are the spatial aspects of the coverage field, the subject field, and possibly some content commonly found in the description field. Normalizing or coordinating controlled vocabulary found in the subject and description fields is obviously an extremely complex proposition. We hope that the data mining research conducted in conjunction with NCSA will aid us in finding novel ways to cluster like resources together and leverage the limited use of controlled vocabulary by metadata providers. (In our collection of metadata, only 20-25% of metadata providers use a recognized controlled vocabulary in the subject element.)
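The additive approach to date normalization described above can be sketched roughly as follows. This is an illustration only, not the project's scripts: the period boundaries in `PERIODS`, the function names, the `normalized_period` field, and the dictionary record layout are all assumptions made for the example, and the sketch treats multiple years in one value as the ends of a range.

```python
import re

# Hypothetical period vocabulary: (label, first year, last year), inclusive.
# None means the period is open-ended on that side.
PERIODS = [
    ("Before 1800", None, 1799),
    ("1800-1899", 1800, 1899),
    ("1900-1999", 1900, 1999),
    ("2000 and later", 2000, None),
]

# A four-digit run that is not part of a longer number.
YEAR = re.compile(r"(?<!\d)(\d{4})(?!\d)")


def periods_for(date_value):
    """Return the labels of all periods overlapping the years found in the value."""
    years = [int(y) for y in YEAR.findall(date_value)]
    if not years:
        return []
    lo, hi = min(years), max(years)
    labels = []
    for label, start, end in PERIODS:
        starts_before_hi = start is None or start <= hi
        ends_after_lo = end is None or end >= lo
        if starts_before_hi and ends_after_lo:
            labels.append(label)
    return labels


def normalize(record):
    """Return a copy of the record with period terms added.

    The harvested values are copied unchanged; nothing is replaced or deleted.
    """
    result = {key: list(values) for key, values in record.items()}
    added = []
    for value in record.get("date", []):
        for label in periods_for(value):
            if label not in added:
                added.append(label)
    result["normalized_period"] = added
    return result
```

Because `normalize` returns a new record and leaves its input alone, it can run as a separate step after harvesting, with the harvested copy kept untouched.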

Processing EAD Metadata

The EAD records provided to the project have offered a significant opportunity to test the value of exposing the details of strongly hierarchical records alongside simple one-level item records. The EAD records are broken into the <co> levels. Applying a stylesheet allows the discrete components of a finding aid to be searched easily alongside other one-level item records and displayed in the context of the whole finding aid to which they belong. The stylesheet provides a separate pop-up window in which the native EAD structure is graphically displayed. A paper discussing this work in more detail will be presented at the JCDL conference in Oregon in July 2002.
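The idea of breaking a finding aid into component-level items that keep a pointer back to their context can be sketched as below. The project used a stylesheet for this; the Python version here is only an illustration, and the sample finding aid, the function names, and the output fields (`title`, `context`, `finding_aid`) are invented for the example. It relies on the EAD convention that components are tagged `<c>` or `<c01>` through `<c12>`.

```python
import xml.etree.ElementTree as ET

# Invented, heavily trimmed finding aid used only for illustration.
SAMPLE_EAD = """<ead>
  <archdesc level="collection">
    <did><unittitle>Smith Family Papers</unittitle></did>
    <dsc>
      <c01 level="series">
        <did><unittitle>Correspondence</unittitle></did>
        <c02 level="file">
          <did><unittitle>Letters, 1901</unittitle></did>
        </c02>
      </c01>
    </dsc>
  </archdesc>
</ead>"""

# EAD components are tagged <c> (unnumbered) or <c01> ... <c12> (numbered).
COMPONENT_TAGS = {"c"} | {"c%02d" % n for n in range(1, 13)}


def title_of(element):
    """Return the text of the element's own did/unittitle, or an empty string."""
    node = element.find("did/unittitle")
    return "".join(node.itertext()).strip() if node is not None else ""


def flatten(element, ancestors, finding_aid_id, out):
    """Walk the tree depth-first, emitting one flat item per component."""
    for child in element:
        if child.tag in COMPONENT_TAGS:
            title = title_of(child)
            out.append({
                "title": title,
                "context": list(ancestors),     # titles of enclosing levels
                "finding_aid": finding_aid_id,  # link back to the whole aid
            })
            flatten(child, ancestors + [title], finding_aid_id, out)
        else:
            flatten(child, ancestors, finding_aid_id, out)
    return out


def ead_to_items(xml_text, finding_aid_id):
    """Turn one finding aid into a list of one-level item records."""
    archdesc = ET.fromstring(xml_text).find("archdesc")
    return flatten(archdesc, [title_of(archdesc)], finding_aid_id, [])
```

Calling `ead_to_items(SAMPLE_EAD, "fa-001")` yields one record per component, each searchable on its own while the `context` and `finding_aid` fields let an interface display it within the whole finding aid.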

Experimental Work with MARC and Z39.50

To extend the use of the OAI-PMH so that it can easily include MARC records, we have developed a tool, ZMARCO, that allows MARC records made available through a Z39.50 server to be included in our repository. ZMARCO was released on SourceForge in June 2002. Because Z39.50 and MARC are ubiquitous in the library community, an OAI tool that harvests MARC records directly from Z39.50 servers can provide a convenient means of adding huge sets of publicly available MARC records to an OAI repository. ZMARCO is an attempt at providing this service. The Illinois State Library has provided the Illinois project access to nearly 45 million MARC records to use for testing ZMARCO as well as for testing the scalability of the XPAT indexing tools.

Progress & Additional Details Regarding Specific Tasks

Task One – Construction of Baseline Harvesting Service – (Sept. 1, 2001 – Dec. 31, 2001)

(See first six-month report for details: http://oai.grainger.uiuc.edu/Papers/ProjectReportFinal.htm)

A white paper describing the functionality of the harvester was put online in January 2002. The Harvester was ported to Java and delivered to the Michigan OAI team in February. Enhancements and bug fixes to both the VisualBasic (VB) and Java Harvester tools have been ongoing through June 2002. Many suggested enhancements are related to the end-user interface for the Harvester and will continue to be included as time allows.

In addition to providing the Java Harvester tools to Michigan, the Illinois team agreed to develop OAI 2.0 versions of both Provider and Harvester tools and make them available as Open Source via the SourceForge website (see http://sourceforge.net/projects/uilib-oai/). Both Java and VB (Windows only) versions of both harvester and provider tools are available on the site.
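For readers unfamiliar with what a harvester does at the protocol level, the core OAI-PMH 2.0 exchange can be sketched as follows. This is not the project's Harvester (which is available in Visual Basic and Java); it is a minimal Python illustration, and the base URL, the function names, and the trimmed sample response are assumptions made for the example. It shows a `ListRecords` request being built and a response page being read for record identifiers and a `resumptionToken`.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Invented response page, trimmed to the elements this sketch reads.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><header>
      <identifier>oai:example.org:1</identifier>
      <datestamp>2002-06-01</datestamp>
    </header></record>
    <record><header>
      <identifier>oai:example.org:2</identifier>
      <datestamp>2002-06-02</datestamp>
    </header></record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a data provider's base URL."""
    if resumption_token:
        # In OAI-PMH 2.0 resumptionToken is an exclusive argument:
        # it is sent with the verb and nothing else.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_page(xml_text):
    """Return (record identifiers, resumption token or None) for one page."""
    root = ET.fromstring(xml_text)
    identifiers = [el.text for el in root.iter(OAI_NS + "identifier")]
    token_el = root.find(OAI_NS + "ListRecords/" + OAI_NS + "resumptionToken")
    # An empty or absent resumptionToken means the list is complete.
    token = token_el.text if token_el is not None and token_el.text else None
    return identifiers, token
```

A harvester would fetch the first URL, call `parse_page` on the response, and keep requesting `list_records_url(base, resumption_token=token)` until no token is returned.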

Task Two – Portal Creation and Development – (Sept. 1, 2001 – Dec. 31, 2002) (70% complete)

Our current search portal can be viewed at: http://oai.grainger.uiuc.edu/oai/search/. We conducted a series of usability tests in which testers were given specific search instructions. Testers were instructed to verbalize their thoughts and decision-making processes throughout the test to an observer, who recorded them. After completing the tasks outlined by the test, testers were asked to offer general comments about the interface or the collections represented in the repository, or to ask any questions they had about the system. In general the reaction to the interface and the repository was positive, although there seemed to be some confusion about the different collections and about whether online access to the resources described was available in all cases. The project team attempted to address this confusion and other issues through changes to the interface. In addition, the interface was examined by students in the GSLIS class Interfaces to Information Systems. Again the project team analyzed the recommended changes and adjusted the interface as appropriate.

An annotation box was added to the interface in May 2002. This box is designed to let users contribute input that enhances the records. When a user annotates a record, the annotation is stored in a separate database. The project team reviews each annotation and adds it to the record as appropriate. The project team foresees this as a useful tool for adding value to records in the repository. Specifically, K-12 teachers may use this feature to add information about records they believe can be used to develop lesson plans that meet state-specified learning criteria for particular aspects of their curricula. The usefulness and practicality of the annotation box concept will be explored further in the fall in conjunction with a K-12 education class at Illinois.

The final phase of the project (July 1, 2002 – Dec. 31, 2002) will provide an opportunity to explore further the usefulness and practicality of end-user search interface features such as the annotation box. We hope to involve students from an education class at Illinois in the design of an interface focused specifically on the needs of K-12 teachers. This community of users has demonstrated an interest in using a cross-collection cultural heritage repository to meet their curriculum development needs through another project at Illinois, Teaching with Digital Content (http://tdc.uiuc.edu). We also anticipate a greater volume of site use, which will allow meaningful analyses of anonymous transaction logs.

Updated on: 7/23/02 Sshreeve