The University of Illinois Open Archives Initiative Metadata Harvesting Project
Illinois OAI Protocol Metadata Harvesting Project
Status Report Covering Quarters 3 and 4 of the Project (30 June 2002)
January
February
- Provided feedback on design of Michigan online survey directed at end-users:
http://oaister.umdl.umich.edu/surveyreport.html
- Created stylesheet to convert Harvard VIA record format to DC.
- Began usability testing for the repository interface.
- Released advanced search interface for the repository.
- Ported VisualBasic OAI harvester to Java and delivered to Michigan.
March
April
May
- Presented project overview at DLF Spring Forum.
- Developed alpha version 2.0 of OAI Harvester tools for both VisualBasic
and Java.
- Provided initial data dump for data mining research to NCSA.
- Added annotation box to end-user interface.
June
- Presented at ALA Etext Discussion Group in Atlanta, GA: http://oai.grainger.uiuc.edu/ALA.ppt.
- Released OAI 2.0 of Metadata Provider and Harvester tools on SourceForge.
- Released Z39.50 OAI Metadata Provider tool.
- Developed and applied Date and Coverage normalization scripts as a post-harvesting
process.
- Developed "exhibits" search interface model.
- Began scalability testing.
- Added item-level records from the EAD files to the repository. As a result,
the number of records in the repository is over three times what it previously
was.
Harvest Activities to Date
The harvested record count increased from 500,000 in December 2001 to 980,000 as
of June 2002. Of these, approximately 8,878 were EAD records, which were
expanded to generate an additional 2.5M discrete items. The repository currently
holds over 3,588,000 searchable items. A monthly breakdown
of harvesting activities follows. (* denotes institutions that are not yet
registered OAI Data Providers.)
January:
Added institutions:
American Museum of Natural History*
Harvard University Libraries*
Northwestern University*
OpenVideo
March:
Added institutions:
Illinois Alive!*
Indiana University (previously surrogate provider)
April:
Added institutions:
Ohio Historical Society*
National Library of Australia*
University of Chicago*
University of Wisconsin, Madison
Ibiblio
May:
Added institutions:
University of Minnesota Images Database
American Numismatic Society
June:
Added institutions:
AIM25 – Archives in London
Alex Catalog of Electronic Text
Rumsey Collection
Ackerman Archives
Collections added prior to January 2002 include:
- University of Illinois Library
- Library of Congress
- Perseus Digital Library
- University of Michigan Library
- University of Tennessee Libraries
- Formations
- CIMI
- University of Pennsylvania
- American Philosophical Society
- University of Illinois Spurlock Museum*
- Illinois State Library*
- Michigan State University*
- University of Texas*
- University of Michigan Bentley Collection*
- Minnesota Historical Society*
- University of Minnesota (previously surrogate provider)
- Online Archive of California*
- Colorado Digitization Project*
- University of Iowa*
- Cornell*
Continued Solicitation / Acquisition of Metadata
After the OAI Metadata Provider workshop
in September 2001, the Illinois OAI Harvesting Steering Committee suggested
that a letter be sent jointly from the Illinois and Michigan Library Directors
to all CIC library directors encouraging institutional participation in the
Illinois and Michigan projects. As of June 2002, the participating CIC institutions
are Illinois, Michigan, Wisconsin, Indiana, Iowa, Michigan State, Minnesota,
Northwestern, and the University of Chicago. Of these, only four are registered
OAI data providers: Illinois, Michigan, Indiana, and Minnesota. However, we
expect the sites for which we provide surrogate data provider services to
become registered data providers as OAI PMH continues to gain wider acceptance
as a viable tool for enabling cross-collection resource discovery. We are
continuing to solicit metadata from cultural institutions and regularly add
registered OAI data providers as they appear on the Open Archives Initiative
site (http://www.openarchives.org/Register/BrowseSites.pl).
Other Project Activities
Dissemination of Project Information
Beginning in March 2002 and continuing through
June, numerous presentations outlining the Illinois OAI PMH project and its preliminary
findings have been made. These have included a live videoconference for OCLC
highlighting OAI, several presentations to University of Illinois Graduate
School of Library and Information Science researchers and students, and presentations at the
Spring CNI Task Force meeting, the Museums and the Web Conference, the DLF Spring
Forum, and the E-Text discussion group at ALA. In addition, the project was
highlighted in the "In Brief" section of D-Lib Magazine in
April. The paper presented at the Museums and the Web Conference was published
in the print proceedings (see citation at the end of the report). Announcements
have been submitted to various listservs, and the project’s own listserv was
set up to disseminate project updates to interested parties. Current dissemination
efforts include a paper to be presented at the JCDL Conference in July and a fall
workshop, held in conjunction with the CIC DLIOC meeting, to develop a strategic plan
for investigating ways to incorporate OAI into efforts to improve e-resource
discovery across CIC institutions. Also under consideration is an information
session for potential data providers at the DLF Fall Forum. A special issue
of Library Hi-Tech on OAI will be published in early 2003 and will be edited by
the Illinois OAI PI. Illinois, in conjunction with the Michigan and Emory University
OAI projects, has also submitted an application for a panel discussion on OAI
at the Spring 2003 ACRL Conference.
Working with the Metadata
Building a cross-collection repository has
involved significant effort in working with the metadata provided to the project.
The three areas that have required a substantial amount of time are the conversion
of various metadata schemas into unqualified Dublin Core as part of our surrogate
services, the normalization of specific metadata elements in order to collocate
like items, and work with the EAD metadata to provide more access to content
embedded within a single EAD finding aid while simultaneously displaying an EAD record
alongside other one-level records. The efforts in each area are briefly outlined
below.
Surrogate Services
Providing surrogate Metadata Provider services
to our contributing institutions has involved significant efforts to convert
metadata into appropriate unqualified Dublin Core elements. This has allowed
us to investigate and evaluate common metadata crosswalk practices. Mapping
standard and local metadata schemas to Dublin Core required that we work closely
with the data provider and develop a good understanding of how various elements
were used within the schema. We developed a stylesheet to map the Harvard VIA
records into Dublin Core. We also worked with the database administrator of
the Spurlock Museum to establish a crosswalk between Spurlock’s local metadata
format and Dublin Core. By applying the Library of Congress MARC
to Dublin Core crosswalk, we converted 900,000 MARC records from the Illinois
State Library to Dublin Core with just several days’ effort by one programmer.
We then filtered these records according to their Dewey Decimal Classification
numbers to extract just over 300,000 records that represent materials broadly
related to cultural heritage.
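The convert-then-filter step described above can be sketched roughly as follows. This is a minimal illustration, not the project's code: it assumes each MARC record has already been parsed into a dictionary of tag to list of values, it uses only a small illustrative subset of a MARC to Dublin Core mapping, and the Dewey ranges treated as "cultural heritage" are hypothetical, since the report does not state which ranges were used.

```python
# Illustrative subset of a MARC-to-Dublin-Core mapping (not the full crosswalk).
MARC_TO_DC = {"245": "title", "100": "creator", "650": "subject", "520": "description"}

# Hypothetical Dewey ranges (start inclusive, end exclusive); an assumption for this sketch.
HERITAGE_RANGES = [(700, 800), (900, 1000)]


def marc_to_dc(record):
    """Map a parsed MARC record (tag -> list of values) to unqualified DC elements."""
    dc = {}
    for tag, values in record.items():
        element = MARC_TO_DC.get(tag)
        if element:
            dc.setdefault(element, []).extend(values)
    return dc


def dewey_class(record):
    """Return the first usable Dewey number from field 082, or None."""
    for value in record.get("082", []):
        try:
            # Drop the "/" segmentation mark sometimes present in 082 values.
            return float(value.split()[0].replace("/", ""))
        except (ValueError, IndexError):
            continue
    return None


def is_heritage(record):
    """True when the record's Dewey number falls inside one of the assumed ranges."""
    number = dewey_class(record)
    return number is not None and any(lo <= number < hi for lo, hi in HERITAGE_RANGES)


records = [
    {"245": ["Illinois history"], "082": ["977.3"]},
    {"245": ["Calculus"], "082": ["515"]},
]
kept = [marc_to_dc(r) for r in records if is_heritage(r)]
```

Filtering on the source MARC record before conversion keeps the classification number out of the Dublin Core output unless it is explicitly mapped.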
Normalization
In order to aid in collocation of like materials
and increase the discoverability of resources, the project team has spent time
examining elements that would be suitable for applying a standard controlled
vocabulary. The analysis focused on five elements: type, coverage, date, format,
and subject. The normalization of these was felt to have the potential to add
the most value for users of the repository. Upon further examination, the project
discarded the format element because it was inconsistently used and seemed less
useful than the type element for discovery and collocation. The project closely
examined use of the type element and developed a type vocabulary containing
both general and slightly narrower terms. The date element and the temporal
aspect of the coverage element were explored next. The normalization here involved
grouping resources into discrete time periods based on the information in the
metadata. In both these cases the decision was made to add the additional vocabulary
and not to replace or delete any existing metadata. Our first approach to this
process involved applying some of these normalization scripts directly during
the harvesting process and others after harvesting but before indexing. We later
decided to do all normalization after harvesting as a separate process. This
allows for better control of the processes and keeps a "clean" copy
of harvested records in our harvester database and file system. Other potential
candidates for normalization are the spatial aspects of the coverage field,
the subject field, and possibly some content commonly found in the description
field. Normalizing or coordinating controlled vocabulary found in the subject
and description fields is an extremely complex proposition. We hope
that the data mining research conducted in conjunction with NCSA will help us
find novel ways to cluster like resources together and to leverage the limited
use of controlled vocabulary by metadata providers. (In our collection, only
20-25% of metadata providers use a recognized controlled vocabulary in
the subject element.)
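As an illustration of date normalization run as a separate post-harvest step, the sketch below groups records into time periods and adds the period labels without changing the harvested values. The 50-year period width, the date_normalized element name, and the year-matching pattern are all assumptions made for the example; the report does not specify the periods or scripts actually used.

```python
import re

# Matches four-digit years 1000-2099 that are not part of a longer number (an assumption).
YEAR = re.compile(r"(?<!\d)(1[0-9]{3}|20[0-9]{2})(?!\d)")


def period_label(year, span=50):
    """Return a label such as '1850-1899' for the period containing the year."""
    start = year - year % span
    return f"{start}-{start + span - 1}"


def normalize_dates(record, span=50):
    """Return a copy of the record with period labels added under a new key.

    The original date and coverage values are left untouched, mirroring the
    decision to add vocabulary rather than replace existing metadata.
    """
    labels = []
    for value in record.get("date", []) + record.get("coverage", []):
        for match in YEAR.findall(value):
            label = period_label(int(match), span)
            if label not in labels:
                labels.append(label)
    enriched = dict(record)
    enriched["date_normalized"] = labels  # hypothetical element name
    return enriched


harvested = {"date": ["ca. 1865"], "coverage": ["Illinois; 1861-1865"]}
enriched = normalize_dates(harvested)
```

Because the function returns a copy, the harvested record stays "clean" and the normalization can be rerun with different period boundaries at any time.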
Processing EAD Metadata
The EAD records provided have allowed significant
opportunity to test the value of exposing the details of strongly hierarchical
records alongside simple one-level item records. The EAD records are broken
up at the component (<c> or <c01> through <c12>) levels. Applying a stylesheet allows
the discrete components of a finding aid to be easily searched and displayed alongside
other one-level item records and in the context of the whole finding aid to
which they belong. The stylesheet provides a separate pop-up window in which
the native EAD structure is graphically displayed. A paper discussing this work
in more detail will be presented at the JCDL conference in Oregon in July 2002.
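The idea of exposing each component of a finding aid as its own searchable record, while keeping its place in the hierarchy, can be sketched as below. This is only an approximation under stated assumptions: the EAD is taken to use no XML namespace, component titles are read from did/unittitle, and the output record shape (title, finding_aid, path) is hypothetical. The project itself used a stylesheet rather than this code.

```python
import xml.etree.ElementTree as ET

# EAD component elements: unnumbered <c> and numbered <c01> through <c12>.
COMPONENT_TAGS = {"c"} | {f"c{n:02d}" for n in range(1, 13)}


def component_records(ead_xml):
    """Flatten an EAD finding aid into one record per component.

    Each record keeps the finding aid title and the chain of ancestor
    component titles so it can be shown in context.
    """
    root = ET.fromstring(ead_xml)
    finding_aid = root.findtext(".//archdesc/did/unittitle", default="").strip()
    records = []

    def walk(node, ancestors):
        for child in node:
            if child.tag in COMPONENT_TAGS:
                title = (child.findtext("did/unittitle") or "").strip()
                path = ancestors + [title]
                records.append({"title": title, "finding_aid": finding_aid, "path": path})
                walk(child, path)
            else:
                walk(child, ancestors)

    walk(root, [])
    return records


sample = (
    "<ead><archdesc><did><unittitle>Smith Papers</unittitle></did><dsc>"
    "<c01><did><unittitle>Correspondence</unittitle></did>"
    "<c02><did><unittitle>Letters, 1901</unittitle></did></c02>"
    "</c01></dsc></archdesc></ead>"
)
items = component_records(sample)
```

A single finding aid can yield hundreds of such records, which is consistent with the sharp growth in searchable items once the EAD files were expanded.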
Experimental Work with MARC and Z39.50
To extend the use of
OAI-PMH to easily include MARC records, we have developed a tool, ZMARCO, that
allows MARC records made available through a Z39.50 server to be included in
our repository. ZMARCO was released on SourceForge in June 2002. Because Z39.50
and MARC are ubiquitous in the library community, an OAI tool to harvest MARC
records directly from Z39.50 servers can provide a convenient means of including
huge sets of publicly available MARC records in an OAI repository. ZMARCO is
an attempt at providing this service. The Illinois State Library has provided
the Illinois project access to nearly 45 million MARC records to use for testing
ZMARCO as well as for testing the scalability of the XPAT indexing tools.
Progress & Additional Details Regarding Specific Tasks
Task One – Construction of Baseline Harvesting Service – (Sept. 1, 2001
– Dec. 31, 2001)
(See first six-month report for details:
http://oai.grainger.uiuc.edu/Papers/ProjectReportFinal.htm)
A white paper describing the functionality of the
harvester was put online in January 2002. The Harvester was ported to Java and
delivered to the Michigan OAI team in February. Enhancements and bug fixes to
both the VisualBasic (VB) and Java Harvester tools have been ongoing through June
2002. Many suggested enhancements are related to the end-user interface for the
Harvester and will continue to be included as time allows.
In addition to providing the Java Harvester
tools to Michigan, the Illinois team agreed to develop OAI 2.0 versions of both
Provider and Harvester tools and make them available as Open Source via the
SourceForge website (see http://sourceforge.net/projects/uilib-oai/).
Both Java and VB (Windows only) versions of both harvester and provider tools
are available on the site.
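For readers unfamiliar with what an OAI harvester does, the following sketch shows the core OAI-PMH 2.0 loop: issue a ListRecords request, read the records, and follow the resumptionToken until the provider returns none. It is not the project's VB or Java harvester. The function names, the returned record shape, and the fetch callable (any function that takes a URL and returns the response XML as text) are assumptions for illustration.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, token=None):
    """Build a ListRecords request; a resumptionToken replaces all other arguments."""
    if token:
        params = {"verb": "ListRecords", "resumptionToken": token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": "oai_dc"}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) parsed from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI + "record"):
        identifier = record.findtext(f"{OAI}header/{OAI}identifier")
        titles = [e.text for e in record.iter(DC + "title")]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(f".//{OAI}resumptionToken")
    return records, (token or None)


def harvest(base_url, fetch):
    """Yield records page by page until no resumptionToken is returned."""
    token = None
    while True:
        records, token = parse_list_records(fetch(list_records_url(base_url, token)))
        yield from records
        if not token:
            break
```

In real use, fetch could wrap urllib.request.urlopen and decode the response body; passing it in as a parameter keeps the paging logic testable without a network connection.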
Task Two – Portal Creation and Development (Sept. 1, 2001 – Dec. 31, 2002)
(70% complete):
Our current search portal can be viewed
at: http://oai.grainger.uiuc.edu/oai/search/.
We conducted a series of usability tests with specific search instructions given
to the testers. Testers were instructed to verbalize their thoughts and decision-making
processes throughout the test to the observer who recorded these thoughts and
processes. After completion of the tasks outlined by the test, testers were
asked to express general comments about the interface, the collections represented
in the repository, or to ask any questions they had about the system. In general,
the reaction to the interface and the repository was positive, although there
seemed to be some confusion about the different collections and whether or not
online access to the resources described was available in all cases. The project
team attempted to address this confusion and other issues through changes to
the interface. In addition, the interface was examined by students in the GSLIS
class Interfaces to Information Systems. Again the project team analyzed the
recommended changes and adjusted the interface as appropriate.
An annotation box was added to the interface in May
2002. This box is designed to allow for user input to enhance the records. When
a user makes an annotation to a record, the annotation is recorded in a separate
database. The project team reviews the annotation and will add it to the record
as appropriate. The project team foresees this as a useful tool for adding value
to records in the repository. Specifically, K-12 teachers may use this feature
to add information about specific records they believe can be used to develop
lesson plans that meet state-specified learning criteria for particular aspects
of their curricula. The usefulness and practicality of the annotation box concept
will be explored further in the fall in conjunction with a K-12 education class
at Illinois.
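The annotation workflow described above (store user input separately, review it, then attach approved notes to the record) might look something like the sketch below. The table layout, status values, and function names are hypothetical; the report does not describe the actual database, and SQLite is used here only because it ships with Python.

```python
import sqlite3


def open_annotation_db(path=":memory:"):
    """Open the annotation store, kept separate from the harvested records."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS annotations (
               id INTEGER PRIMARY KEY,
               record_id TEXT NOT NULL,
               note TEXT NOT NULL,
               status TEXT NOT NULL DEFAULT 'pending'
           )"""
    )
    return conn


def submit(conn, record_id, note):
    """Save a user's annotation as pending and return its id."""
    cur = conn.execute(
        "INSERT INTO annotations (record_id, note) VALUES (?, ?)", (record_id, note)
    )
    conn.commit()
    return cur.lastrowid


def review(conn, annotation_id, approve):
    """Record the project team's decision on a pending annotation."""
    status = "approved" if approve else "rejected"
    conn.execute("UPDATE annotations SET status = ? WHERE id = ?", (status, annotation_id))
    conn.commit()


def approved_notes(conn, record_id):
    """Return the approved annotations to display with a record."""
    rows = conn.execute(
        "SELECT note FROM annotations WHERE record_id = ? AND status = 'approved' ORDER BY id",
        (record_id,),
    )
    return [row[0] for row in rows]
```

Keeping annotations in their own table means the harvested metadata is never modified by end users, and unreviewed input never appears in search results.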
The final phase of the project (July
1, 2002 – Dec. 31, 2002) will provide an opportunity to explore further the
usefulness and practicality of end-user search interface features such as the
annotation box. We hope to involve students from an education class at Illinois
in the design of an interface focused specifically on the needs of K-12
teachers. This community of users has demonstrated an interest in using a cross-collection
cultural heritage repository to meet their curriculum development needs through
another project at Illinois, Teaching with Digital Content (http://tdc.uiuc.edu).
We also anticipate a greater volume of site use, which will allow meaningful analyses
of anonymous transaction logs.
Updated on:
7/23/02
Sshreeve