The University of Illinois Open Archives Initiative Metadata Harvesting Project
Illinois OAI Protocol Metadata Harvesting Project
Status Report Covering Quarters
1 and 2 of the Project
(17 January 2002)
Summary of Accomplishments & Events 1 July 2001 through 31 December
2001:
July
- Project Start
- Project Research Programmer hired (YuPing Tseng)
- Work begins on OAI Harvesting tools
- Work begins on updating of OAI Provider tools developed during Alpha Test
August
- Project Website established
(http://oai.grainger.uiuc.edu/)
- Meeting in Ann Arbor with Michigan Project Team
- Test harvesting of Illinois
& selected OAI-Registered Provider sites
- Project Research Assistant hired (Sarah Shreeves 50%)
September
October
- Letter to CIC Library Directors from Paula Kaufman & Bill Gosling
- Began setting up surrogate OAI Provider Sites (see narrative)
- Began acquiring representative EAD finding aids from sites nationwide
- Created XSLT stylesheet (simplistic version) to transform EAD to DC
- Began production harvesting of OAI-registered & surrogate Providers
- Test Harvest of relevant sites registered with http://www.openarchives.org.
November
December
Harvest Activities to Date:
Preliminary investigation into appropriate OAI Providers to harvest for this project began in September. We first selected Data Provider sites that appear
to provide materials significant to cultural heritage from the list of registered
OAI Data Providers offered through the Open Archives Initiative website (http://www.openarchives.org).
(We continue to monitor http://www.openarchives.org
for new sites providing content relevant to this project.) Only sites that
contain at least some cultural heritage records are being harvested.
In addition to harvesting OAI-registered Data Providers, we also are harvesting
surrogate OAI Data Provider sites containing snapshots of metadata provided
to us by a means other than OAI. These surrogate sites are maintained on Illinois
servers. The institutions providing this metadata are not yet OAI-registered
Data Providers, but have been very cooperative with this research project and
have expressed their intention to make their data available directly via OAI
at a later date. Illinois will maintain surrogate sites until such time as owning
institutions are ready to make their records directly available for harvesting
as OAI-registered Data Providers.
To date, Illinois has harvested over one million unique records from the institutions
listed below. For scalability testing, all sites are fully harvested. Complete
site harvests are done monthly; incremental harvests (harvesting only records
that have been added to a site or that have changed since the last harvest) are
done at shorter intervals, as appropriate for each site. We estimate that about 500,000
records are relevant to the cultural heritage domain.
- University of Illinois Library
- Library of Congress
- Perseus Digital Library
- University of Michigan Library
- University of Tennessee Libraries
- Formations
- CIMI
- University of Pennsylvania
- American Philosophical Society
- Tacoma Public Library
- University of Illinois Spurlock Museum*
- Michigan State University*
- University of Texas*
- University of Michigan Bentley Collection*
- Minnesota Historical Society*
- University of Minnesota*
- Online Archive of California*
- Colorado Digitization Project*
- Illinois State Library (Alliance & Lincoln Trail Library Systems)*
- University of Iowa*
- Cornell*
(* Denotes institutions that are not yet registered OAI Data Providers
and for whom we are hosting snapshots of their metadata on Illinois servers.)
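The difference between the complete and incremental harvests described above can be sketched as two OAI ListRecords requests, one without and one with a `from` date. This is a minimal illustration, not the project's Harvester: the function name and the provider URL are hypothetical, and the argument names (`verb`, `metadataPrefix`, `from`, `set`) are taken from the OAI protocol.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI ListRecords request URL (hypothetical helper).

    Omitting from_date asks for every record (a complete harvest);
    supplying it asks only for records added or changed since that date.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date is not None:
        params["from"] = from_date  # e.g. "2001-12-01"
    if set_spec is not None:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)


# Hypothetical provider address, used only for illustration.
BASE = "http://provider.example.org/oai"
full_harvest = list_records_url(BASE)
incremental_harvest = list_records_url(BASE, from_date="2001-12-01")
```

A scheduler would issue the first form monthly and the second form at whatever shorter interval suits each site, passing the date of the previous successful harvest.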
Harvested metadata spans a wide range. We have metadata representing collections
of cultural and natural history materials, early motion pictures, sheet music,
photographs, poetry, letters and manuscripts, finding aids, biographical and
bibliographical information, books, and scholarly papers related to cultural
heritage. Specific collection emphases:
- We have actively solicited EAD finding aids in native format from a number
of the institutions listed above. Finding the best ways to map EAD into DC and provide
access via OAI harvesting services, in ways that maintain the integrity and
richness of the materials referenced by EAD finding aids, is a major research
interest of this project.
- One of our largest collections of metadata has been provided by the Spurlock
Museum. The metadata from Spurlock represents a rich spectrum of artifacts
from the areas of natural history and cultural heritage. The Spurlock collection
gives us the opportunity to evaluate metadata of a truly cultural nature
and is a realistic example of the challenges faced by a small museum.
- Through assistance from the Illinois State Library, we have obtained access
to over 700,000 MARC records describing traditional, non-digitized library
collection materials. We are interested in testing strategies for pulling
relevant cultural heritage materials out of a large traditional online library
catalog so that they can be included in a cross-collection repository and
retrieved appropriately.
- By design we include content from several different kinds of smaller, non-academic,
non-governmental data providers, including historical societies, public libraries,
statewide digitization projects, and museums. While many of the collections
harvested from these institutions are small, they allow us to work with a
wider, more heterogeneous collection of metadata.
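To make the EAD-to-DC mapping question concrete, here is a minimal sketch of a collection-level-only crosswalk. It is written in Python for illustration and is not the project's XSLT stylesheet; the element-to-element pairs in `EAD_TO_DC` and the sample finding aid are assumptions chosen for the example.

```python
import xml.etree.ElementTree as ET

# Assumed crosswalk for illustration: EAD element path -> Dublin Core element.
EAD_TO_DC = [
    ("archdesc/did/unittitle", "title"),
    ("archdesc/did/origination", "creator"),
    ("archdesc/did/unitdate", "date"),
    ("archdesc/did/repository", "publisher"),
    ("archdesc/did/abstract", "description"),
]


def ead_to_dc(ead_xml):
    """Map collection-level EAD description to a dict of DC element values."""
    root = ET.fromstring(ead_xml)
    record = {}
    for path, dc_name in EAD_TO_DC:
        for element in root.findall(path):
            # Flatten nested markup (e.g. <persname>) and normalize whitespace.
            text = " ".join("".join(element.itertext()).split())
            if text:
                record.setdefault(dc_name, []).append(text)
    return record


# Invented sample finding aid, not taken from any harvested collection.
SAMPLE = """<ead>
  <archdesc level="collection">
    <did>
      <unittitle>Example Family Papers</unittitle>
      <origination><persname>Example, Jane</persname></origination>
      <unitdate>1890-1920</unitdate>
      <repository>Example University Archives</repository>
    </did>
    <dsc><c01><did><unittitle>Correspondence</unittitle></did></c01></dsc>
  </archdesc>
</ead>"""

record = ead_to_dc(SAMPLE)
```

Note that the component-level title ("Correspondence") never reaches the DC record, because only the collection-level `did` is read. A fuller mapping would have to decide how to represent each component, which is the open research question.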
Progress & Additional Details Regarding Specific Tasks:
Task One (85% Complete):
Construction of Baseline Harvesting Service (July 1 – December 31, 2001)
Solicitation / acquisition of Metadata:
- Immediately upon acceptance of the grant, the PI and co-PIs began planning
an OAI Data Provider workshop focused on CIC participation, which was
held in September 2001. Indiana University, Northwestern University,
Purdue University, Michigan State University, University of Iowa, University
of Wisconsin, University of Minnesota, Illinois, Michigan, University
of Texas, OhioLINK, and the American Museum of Natural History were represented
at the workshop. Representatives from OCLC and CIC were also present. There
were a total of 33 registered participants, and 11 presenters and project
team members. This workshop introduced Data Provider tools and strategies,
and included time for a moderated discussion for institutions expressing an
interest in participating in the OAI initiative. Immediately following the
workshop an Illinois OAI Harvesting Steering Committee meeting was held.
The Committee agreed to have a letter sent jointly from Illinois and Michigan
Head Librarians to the Head Librarians of CIC institutions encouraging their
institutional participation.
- The Data Provider tools previously developed by Illinois during OAI Protocol
for Metadata Harvesting Alpha Phase testing (Fall 2000) were refined, updated
and made publicly available for download and use by peer institutions. Three
variations of these tools were created. The first option is for an institution
storing its metadata in a database. The second option is for an institution
storing its metadata in discrete XML files. The third option is for an institution
that includes DC metadata within Web-published primary source materials (i.e.,
using HTML <meta> tagging). Data Provider tools were publicly
released in September 2001. We completed an enhancement update to the Data
Provider tools in December and made revised tools available from our website
at that time. These enhancements incorporate user feedback and are intended
to make the installation and use of the tools easier by eliminating the need
to manually configure certain parameters and by allowing XML metadata files
to be named according to user-specified naming conventions.
- Through several discussions with museum curators at Spurlock, the American
Museum of Natural History, Mystic Seaport, and the Minnesota Historical Society,
we believe museums may have the most difficulty establishing OAI Data Provider
services. Some of these difficulties arise from the human resources available
for the technical tasks involved in participating in OAI. Another limiting
factor is the lack of clearly defined best practices for how the metadata
about museum collections should be assigned and stored. There is also a subtle
yet very real limitation due to the focus of museum curators on protection
and local, in-person use of their collections (sometimes at the expense of
resource discovery and exploitation by remote users). Despite these limitations,
we hope to be harvesting records from the American Museum of Natural
History and the Mystic Seaport Museum in the near future. We have procured
data from Spurlock Museum and Minnesota Historical Society but they have not
yet set up their own Data Provider services.
- The Colorado Digitization Project (CDP) team has shown great interest in
participating in OAI projects. Their project is an effort of Colorado’s archives,
historical societies, libraries, and museums to make state cultural heritage
materials available to the people of Colorado. We hope to see the CDP collections
ready for harvesting within the first quarter of 2002.
- The Illinois State Library has provided access to over 900,000 MARC records
from two of the state’s library systems. We are exploring ways to parse these
records according to their cultural heritage significance in order to assess the value
of including union catalog records with special collections.
- We have EAD finding aid records from the Bentley Historical Library, University
of Texas, Online Archive of California, Michigan State, and Illinois. Our
current conversion template allows display of only the top-level
administrative information about each finding aid. This does not seem to provide
enough information about materials referenced by the finding aids to justify
compliance with the OAI protocol. We are exploring strategies for mapping
the entire content of EAD files into DC. Although numerous tagging protocols
and best practices guidelines exist for EAD, we are hoping to develop a general
purpose mapping that can allow for more complete discovery of the richness
in EAD finding aids.
- We applied various preprocessing techniques to metadata received for surrogate
Data Provider sites to render it OAI compliant. These techniques included writing
scripts to check for and replace non-Unicode characters with their appropriate
Unicode counterparts. We also built scripts to create XML files from each
metadata record and subsequently convert the raw XML into DC XML. This required
building various scripts to work on MARC records as well as raw EAD files.
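The preprocessing steps described above (repairing characters, then writing each record out as DC XML) might look roughly like the following sketch. It is an illustration under stated assumptions, not the project's scripts: the Windows-1252 fallback encoding, the function names, and the `record` wrapper element are all hypothetical, and a real OAI response would use the protocol's own DC container.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"


def to_unicode(raw_bytes):
    """Decode as UTF-8; fall back to Windows-1252 (an assumed legacy encoding)."""
    try:
        return raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        return raw_bytes.decode("cp1252", errors="replace")


def strip_control_chars(text):
    """Drop ASCII control characters, which XML 1.0 forbids, except tab, LF, CR."""
    return "".join(ch for ch in text if ch in "\t\n\r" or ord(ch) >= 0x20)


def record_to_dc_xml(fields):
    """Serialize (dc_element_name, raw_bytes) pairs as a small DC XML record."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")  # hypothetical wrapper element
    for name, raw_value in fields:
        child = ET.SubElement(root, "{%s}%s" % (DC_NS, name))
        child.text = strip_control_chars(to_unicode(raw_value))
    return ET.tostring(root, encoding="unicode")


# Invented sample values: a Latin-1 style byte and a stray control character.
xml_text = record_to_dc_xml([
    ("title", b"Caf\xe9 menus"),
    ("description", b"Scanned\x0b menu cards"),
])
```

Trying UTF-8 first means already-clean records pass through unchanged, and only records that fail to decode are reinterpreted under the fallback encoding.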
Harvesting Service Technical Infrastructure
- The development of an OAI harvesting implementation began in September.
These tools include three separate modules: the Harvester, the Manager Service,
and the Harvest User Interface. The Harvester sends requests to Data Provider
services to get their appropriate metadata. Multiple instances of the Harvester
may be distributed and may run simultaneously. The Manager Service “manages”
scheduling and tracking of harvest jobs. The Harvest User Interface allows
harvesting configuration to be managed by the OAI Harvest Service administrators.
The Harvest User Interface is Web browser based. A preliminary version of these tools was
released for internal use in late October and delivered to Michigan on November
5th. We intend to modify the tools to suit Michigan’s
needs as requests are outlined in future discussions. Enhancements were
made to the Harvest User Interface to provide tracking of scheduled harvest
jobs without requiring direct access to the database.
- We installed Michigan’s DLXS/XPat software on a Linux server. This product
is the indexing and search engine for our repository. For each collection
we harvest, a tag file, a region file, and a data dictionary file are created and configured.
XPat independently indexes each set of XML files for each collection. An Apache
web server was set up to enable the XPat search function. Modifications to
the Bibliographic class of DLXS were made to include our collections. Modifications
to the Perl Collection, Query, and Search modules of DLXS were made to accommodate
our collections. We also rewrote the CGI programs to handle OAI Metadata searching
and modified HTML templates to more appropriately display OAI records.
- In addition to the DLXS/XPat approach, we are also investigating the use
of a standard relational database, such as Microsoft SQL Server, for indexing
and storage of the metadata. This approach would require additional
customized software for search and retrieval, but may offer some performance
or customization advantages. These will be explored as the project progresses.
- In September we applied the XPat search template to create a simple tool
for internal use and testing of the repository. The search interface has been
modified and will be made publicly available from our website in mid-January
2002.
- Stress tests for the harvester have been developed and initiated. These
tests comprise a series of harvesting strategies applied simultaneously and
over a variety of time intervals. Initial results indicate the harvester is
robust and capable of running twenty simultaneous instances efficiently.
Harvest tools continue to be tested and enhanced, with a
second release intended for February. A report on performance & refresh issues will
be completed and posted on our website. (Note: most of our testing has
involved harvesting metadata that we provide as a surrogate Data Provider,
and we have not yet fully tested the harvester on externally hosted metadata.)
- Detailed research into using OAI to harvest non-DC metadata directly has
been deferred pending additional availability of such metadata. Currently,
MARC is the only alternative format generally available via OAI, due in part
to the requirement that an XML Schema (XSD) be supplied for each metadata format used in OAI. Ongoing related work
investigating advanced techniques for mapping from EAD to DC will continue
to inform this project.
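The shape of a simultaneous-instance stress test like the one described above can be sketched as follows. This is only an illustration: the `harvest` function is a stand-in that sleeps instead of issuing OAI requests, and the names and timings are assumptions rather than the project's test harness.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def harvest(site_id):
    """Stand-in for one Harvester instance; a real one would issue OAI requests."""
    started = time.perf_counter()
    time.sleep(0.01)  # simulated network and parsing time
    return site_id, time.perf_counter() - started


def run_stress_test(instance_count=20):
    """Run instance_count harvests at once and return (site_id, seconds) pairs."""
    with ThreadPoolExecutor(max_workers=instance_count) as pool:
        return list(pool.map(harvest, range(instance_count)))


results = run_stress_test()
slowest = max(elapsed for _, elapsed in results)
```

Recording per-instance elapsed time makes it possible to compare runs at different levels of concurrency and over different time intervals.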
Task Two (50% Complete):
Portal Creation and Development (September 1 – December 31, 2001)
- Various meetings were held
with the PIs to prioritize the types of metadata collected and to suggest
strategies and design ideas for developing the end user search interface.
We consulted with several faculty members in the School of Library and Information
Science at Illinois.
- After conducting a review of the literature, the project team decided to
proceed with the design of the user interface while drawing on the expertise
and theories of Ben Shneiderman, Jakob Nielsen, and John Carroll. Shneiderman’s
work focuses on building effective user interfaces specifically for information
retrieval systems. Nielsen looks specifically at the usability of web interfaces,
and Carroll describes how scenario-based design can be used to create a user-centered
interface. The project team found that these three complemented each other
nicely and would provide a framework of best practices in moving forward with
the design. Using scenario-based design strategies, three separate scenarios
representing different users and uses of the system were developed. A very
preliminary front screen was created for presentation at the D-Lib Forum in
November.
- Early interface design and testing strategies will include over-the-shoulder
usability testing, analysis of search logs, iterative design, and evaluative
comments solicited from faculty and graduate students.
- Interface elements include basic and advanced search screens and field-specific,
keyword, and phrase searching. We are exploring methods of using collection
and sub-domain search filters, stored expert searches, and the possible use
of visual search interface techniques.
- A review of the values entered by Data Providers into the various DC element
tags makes it apparent that normalization of the data will be important to enhance
discoverability. Another pre-search processing task will be to selectively
harvest metadata using techniques other than those directly supported by OAI.
For example, the MARC records supplied by Alliance and Lincoln Trail Library
Systems through the Illinois State Library are not organized by subject headings
or classification schemes. Our Harvester requests all records from these collections
but then saves only those records indicating the disciplines of interest (items
related to cultural heritage) as noted by the Dewey Decimal Classification
numbers assigned to each record.
- One issue related to discoverability is the method by which EAD is mapped
to DC. Our current method of mapping loses the richness contained within the
raw EAD files. We are exploring methods of including links between multiple
DC items generated from a single EAD finding aid.
- We have deferred work related to investigating de-duplication and creating linkages
between items gathered from different repositories, in part because best practice
recommendations are still under discussion with the OAI Technical Committee.
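The selective-harvest step described above, in which all records are requested but only those with relevant Dewey Decimal Classification numbers are saved, can be sketched as a simple post-harvest filter. The classification ranges, record layout, and function names below are assumptions made for illustration; the report does not state which Dewey ranges the project treats as cultural heritage.

```python
import re

# Assumed ranges for illustration only: 700s (arts) and 900s (history, geography).
CULTURAL_HERITAGE_RANGES = [(700.0, 800.0), (900.0, 1000.0)]

# Leading three-digit class number with optional decimal part, e.g. "977.3 L638".
DDC_PATTERN = re.compile(r"^\s*(\d{3}(?:\.\d+)?)")


def dewey_number(classification):
    """Return the numeric part of a Dewey call number, or None if absent."""
    match = DDC_PATTERN.match(classification or "")
    return float(match.group(1)) if match else None


def is_relevant(classification, ranges=CULTURAL_HERITAGE_RANGES):
    """True if the Dewey number falls inside any of the configured ranges."""
    number = dewey_number(classification)
    if number is None:
        return False
    return any(low <= number < high for low, high in ranges)


# Invented sample records standing in for harvested MARC-derived metadata.
harvested = [
    {"id": "rec-1", "ddc": "977.3 L638"},
    {"id": "rec-2", "ddc": "516.2"},
    {"id": "rec-3", "ddc": "781.64"},
    {"id": "rec-4", "ddc": None},
]
kept = [record["id"] for record in harvested if is_relevant(record["ddc"])]
```

Records with no classification number are dropped in this sketch; whether to drop or keep such records is a policy decision the filter would need to make explicit.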