The University of Illinois Open Archives Initiative Metadata Harvesting Project

UIUC OAI Harvester Architecture

The current harvester consists of five components:

1)      Database
2)      File System
3)      Harvester Program
4)      Manager Program
5)      Web User Interface (Java Applet and ASP/JSP Scripts)

Database

For our current VB implementation, any relational database system that supports the Microsoft OLE DB standard will work, including Oracle, MS SQL Server, and Access.  So far we have tested with Access and MS SQL Server.  For the Java implementation, MySQL is currently used, although any JDBC-compliant database will work.

 

The database consists of tables used for storing information about repositories to be harvested and for scheduling and monitoring harvest jobs, plus tables for keeping track of harvested records.  Optionally, the database could also be used for storing the actual metadata, although we currently use the File System for that.

 

Tables and Columns

Table: OAIRepository
Columns: repositoryID, repositoryName, baseURL, protocolVersion, adminEmail, RegistryPrefix, DateAdded, baseDIR
 
Table: OAISet
Columns: repositoryID, setSpec, setName
 
Table: OAIMetadataFormat
Columns: prefixID, metadataPrefix, metdataSchema, metadataNamespace
 
Table: OAIPrefixMapping
Columns: repositoryID, prefixID
 
Table: OAIRecord
Columns: recordID, repositoryID, identifier, filename, harvestTime, datestamp, metadataPrefix, status, updateFlag
 
Table: OAISetMapping
Columns: recordID, setSpec
 
Table: OAIRecurringJobs
Columns: id, repositoryID, startDate, endDate, harvestFrom, harvestTo, frequency, setSpec, metadataPrefix, state, method, dateAdded, incremental, filterFilePath, validation, normalizeFilePath
 
Table: OAIJob
Columns: repositoryID, setSpec, harvestStartTime, status, jobID, harvestMethod, processID, harvestFrom, harvestTo, metdataPrefix, recurringID, filterFilePath, validation, normalizeFilePath
 
Table: OAIHistory
Columns: repositoryID, recordCount, harvestStartTime, harvestEndTime, jobID, status, failureReason, recurringID, setSpec, historyID
 

 

 

Figure 1.           Database Schema

File System

All harvested metadata is stored on the file system as well-formed, Unicode XML files.  The root file storage location can be specified for each harvested repository, and is stored in the OAIRepository database table in the baseDIR column.  Subdirectories from this root may be constructed based on the OAI identifier.  For example, the metadata record with an OAI Identifier of oai:ilLib:alliance:1234 will be stored at this file path: baseDIR\ilLib\alliance\1234.xml. 

 

In addition, for the convenience of directory browsing, no more than 5,000 files will be placed in any one directory.  This limit is configurable.  Files beyond the limit are stored in subdirectories, such as baseDIR\ilLib\alliance\starting_at_10000\12345.xml
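The identifier-to-path mapping described above can be sketched in Python as follows. This is only an illustration, not the harvester's actual code: the function name record_path is hypothetical, and the bucketing rule (numeric local IDs rounded down to a multiple of the per-directory limit, with IDs below the limit staying in the parent directory) is an assumption inferred from the single example path given here.

```python
from pathlib import PureWindowsPath


def record_path(base_dir, identifier, files_per_dir=5000):
    """Map an OAI identifier to a storage path (illustrative sketch only)."""
    parts = identifier.split(":")
    if parts[0] == "oai":
        parts = parts[1:]  # drop the leading "oai" scheme
    *dirs, local_id = parts
    path = PureWindowsPath(base_dir).joinpath(*dirs)
    # Assumption: only numeric local IDs at or above the limit are bucketed.
    if local_id.isdigit() and int(local_id) >= files_per_dir:
        bucket = (int(local_id) // files_per_dir) * files_per_dir
        path = path / f"starting_at_{bucket}"
    return str(path / f"{local_id}.xml")
```

Under these assumptions, the two identifiers used as examples in this section map to the two paths shown in the text. Identifiers containing characters that are not legal in file names would need extra handling that is not shown.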

 

In our case we have set up NFS (Network File System) sharing between one of our Windows NT machines and our Linux machine.  This allows us to run our harvesting software on our NT machines while our baseDIRs point to directories on the Linux machine, where we also host the XPAT software, so XPAT can index the XML files directly.  However, this is entirely transparent to our code, so other arrangements are easily possible.

Harvester Program

The VB Harvester Program is a standalone executable that will, depending on different command line parameters, harvest metadata from a given repository.  It communicates with the Database to discover information about the repository it is to harvest, and also to update the database with the status of completed harvests.

 

One version of the harvester is written in Visual Basic.  It makes extensive use of a custom ActiveX DLL for the actual harvesting over HTTP.  Another version of the harvester is a component of the OAI manager written in Java.  Both versions use a common API, one through a Java class library and one through a VB class library.  These libraries are available as stand-alone packages.

 

The supported command line parameters are listed below.  Note that in the Java version these parameters are bundled into a single class, an instance of which is passed from the manager to the harvester:

 

repositoryID

            The ID of the repository to be harvested, from the OAIRepository table

recurringID

            The ID of the originally scheduled job in the OAIRecurringJobs table

jobID

            The ID of this specific harvest job in the OAIJob table

harvestMethod

            The method to use for the harvest: 0 means to use the ‘verb=ListRecords’ parameter; 1 means to use the ‘verb=ListIdentifiers’ parameter

setSpec

            The OAI ‘set’ parameter

harvestFrom

            The OAI ‘from’ parameter

harvestTo

            The OAI ‘until’ parameter

metadataPrefix

            The OAI ‘metadataPrefix’ parameter

filterFile

            The path to an XML file that contains XPath and Regular Expressions used to filter the results of the harvest. See the Filtering section below.

normalizeFile

            The path to an XML file that contains XPath and Regular Expressions used to add normalized content to an XML metadata file. See the Normalizing section below.

validation

            A setting that determines how strictly harvested XML files will be validated. There are three possible values:

            0 = strict, meaning that the XML must be well-formed

            1 = very strict, meaning that the XML will be validated against the appropriate XML Schemas

            2 = loose, meaning that little or no validation is performed; bad records will be skipped over without causing a harvest to fail

parseID

            If this is true, OAI Identifiers will be parsed in order to create a path hierarchy for storage of the metadata files. If it is false, all the files will be stored in the one base directory (baseDIR) specified in the OAIRepository table.

filesPerDIR

            The maximum number of files that will be stored in a single directory before a subdirectory is automatically created.
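To show how the harvest parameters above relate to an OAI request, here is a minimal Python sketch that turns them into a request URL. It is not the harvester's own code: the function name build_request_url and the base URL in the usage note are hypothetical, and the sketch assumes OAI-PMH 2.0, where both ListRecords and ListIdentifiers accept metadataPrefix, set, from and until.

```python
from urllib.parse import urlencode

# harvestMethod values as described in the parameter list above
VERBS = {0: "ListRecords", 1: "ListIdentifiers"}


def build_request_url(base_url, harvest_method, metadata_prefix,
                      set_spec=None, harvest_from=None, harvest_to=None):
    """Build the first request URL of a harvest (illustrative sketch only)."""
    params = [("verb", VERBS[harvest_method]),
              ("metadataPrefix", metadata_prefix)]
    # Optional arguments are sent only when a value was supplied.
    if set_spec:
        params.append(("set", set_spec))
    if harvest_from:
        params.append(("from", harvest_from))
    if harvest_to:
        params.append(("until", harvest_to))
    return base_url + "?" + urlencode(params)
```

For example, build_request_url("http://example.org/oai", 0, "oai_dc", set_spec="alliance", harvest_from="2002-01-01") yields a ListRecords request restricted to one set and a starting date.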

 

When the program completes a harvest job, it adds a record to the OAIHistory table with the number of records harvested.  If it fails in a predictable fashion, for example because a repository has returned XML that is not well-formed, the Harvester also adds a record to the OAIHistory table with a status of failed and a failureReason string.

Filtering

In addition to standard harvesting, the harvester also supports record filtering.  This allows records to be either accepted or rejected at the point of harvest based on different filtering criteria.  For example, if you were harvesting MARC records you could accept only records with certain Dewey Decimal Classifications.

 

Example Filter:

 

<filters match="any">

<selectionNamespaces>xmlns:dc='http://purl.org/dc/elements/1.1/'</selectionNamespaces>
<filter invert="false" xpath="//dc:subject" regexp="cultural" ignoreCase="true" />
<filter invert="false" xpath="//dc:subject" regexp="heritage" ignoreCase="true" />

</filters>

 

The match attribute indicates whether ‘all’ of the filters must be matched to include a record or whether ‘any’ of them matching is enough.

 

The selectionNamespaces element contains a space-separated list of namespace declarations, such as used with the MS DOMDocument selectionNamespaces property.  This is required in order to use namespace prefixes in the XPath expressions.

 

The invert attribute indicates whether that filter should be inverted (boolean NOT).  The xpath attribute contains an XPath expression used to determine the nodes to which to apply the Regular Expression in the  regexp attribute.  The ignoreCase attribute indicates whether the Regular Expression should ignore case or not.

 

NOTE: The match attribute not only applies to every filter element, but it also applies to every node which matches a given XPath expression.  Therefore, if the XPath evaluates to multiple nodes, the regular expression will be applied to all of them, so if match="any", the record will be returned if any one of the nodes matches the RegExp.  If match="all" the record will only be returned if all of the nodes match the expression.
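The filter semantics described above can be sketched in Python using only the standard library. This is an illustration under several assumptions, not the harvester's implementation: the function names are hypothetical; ElementTree supports only a subset of XPath, so a leading "//" is rewritten to ".//" (which searches descendants of the root only); Python's regular expression dialect differs in details from the one used by the VB code; and a filter whose XPath selects no nodes is treated as a non-match before inversion.

```python
import re
import xml.etree.ElementTree as ET

EXAMPLE_FILTERS = """<filters match="any">
<selectionNamespaces>xmlns:dc='http://purl.org/dc/elements/1.1/'</selectionNamespaces>
<filter invert="false" xpath="//dc:subject" regexp="cultural" ignoreCase="true" />
<filter invert="false" xpath="//dc:subject" regexp="heritage" ignoreCase="true" />
</filters>"""


def parse_namespaces(text):
    """Turn "xmlns:dc='...' xmlns:x='...'" into {"dc": "...", "x": "..."}."""
    return dict(re.findall(r"xmlns:([\w.-]+)=['\"]([^'\"]*)['\"]", text or ""))


def record_passes(filter_xml, record_xml):
    """Return True if the record should be kept (illustrative sketch only)."""
    filters = ET.fromstring(filter_xml)
    record = ET.fromstring(record_xml)
    namespaces = parse_namespaces(filters.findtext("selectionNamespaces"))
    results = []  # one boolean per (filter, matching node) pair
    for f in filters.findall("filter"):
        invert = f.get("invert", "false") == "true"
        flags = re.IGNORECASE if f.get("ignoreCase", "false") == "true" else 0
        pattern = re.compile(f.get("regexp"), flags)
        xpath = f.get("xpath")
        if xpath.startswith("//"):
            xpath = "." + xpath  # ElementTree rejects absolute paths
        nodes = record.findall(xpath, namespaces)
        if not nodes:
            results.append(invert)  # assumption: no nodes means "no match"
        for node in nodes:
            matched = bool(pattern.search(node.text or ""))
            results.append(matched != invert)
    if not results:
        return True  # assumption: no filters means every record is kept
    combine = all if filters.get("match") == "all" else any
    return combine(results)
```

With the example filter, a record with the subjects "Cultural studies" and "Physics" is kept when match="any", but rejected when match="all", because not every dc:subject node matches both expressions.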

 

Normalizing

Record normalizing is also supported.  This allows fields to be added to a record at the point of harvest based on different normalizing criteria.  For example, if a record contains a type field with the value "jpeg", you can add a normalized type field with the value "image" to the record.

Example Normalize:

xmlns:dc='http://purl.org/dc/elements/1.1/' text book image

The selectionNamespaces element contains a space-separated list of namespace declarations, such as used with the DOM selectionNamespaces property.

The mapping/@xpath attribute contains an XPath expression used to determine the nodes to which to apply normalization.  The newElement element describes the element to be added to the document if the node value matches newElement/@regexp.  The newElement/@ignoreCase attribute indicates whether the regexp should ignore case or not.  The newElement/@name attribute is the name of the newly added element.  The newValue element is the value of the new element.  If multiple newValue elements are specified, multiple new elements will be created.
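The normalizing rules can be illustrated with a short Python sketch. Because the layout of the normalize file is not reproduced here, the rules are passed in as plain Python objects rather than parsed from XML. Everything named below is hypothetical, including the NewElement class, the normalize function, and the element name normalizedType in the usage note; appending the new elements to the record's root element is also an assumption.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class NewElement:
    name: str          # corresponds to newElement/@name
    regexp: str        # corresponds to newElement/@regexp
    new_values: list   # one entry per newValue element
    ignore_case: bool = True


def normalize(record, xpath, new_elements, namespaces):
    """Add normalized elements to `record` in place; return how many were added."""
    added = 0
    # findall returns a list, so appending to the record below is safe.
    for node in record.findall(xpath, namespaces):
        for rule in new_elements:
            flags = re.IGNORECASE if rule.ignore_case else 0
            if re.search(rule.regexp, node.text or "", flags):
                # One new element is created per configured value.
                for value in rule.new_values:
                    ET.SubElement(record, rule.name).text = value
                    added += 1
    return added
```

For instance, applying the rule NewElement("normalizedType", r"jpe?g|gif", ["image"]) with the XPath ".//dc:type" to a record whose dc:type is "jpeg" appends one normalizedType element with the value "image".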

Validation

The Harvester also supports a ‘loose’ validation model.  It will attempt to continue harvesting records even if it receives an XML record that is not well-formed or that does not conform to the XML Schema.  Such records will cause a warning record to be added to the OAI history table, but if possible the harvester will continue to the next block of records (using resumptionTokens).

 

Loose validation can be useful when harvesting large repositories where it is very inconvenient for a large harvest to be interrupted right in the middle because of one bad record (usually some non-Unicode character).
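The difference between strict and loose validation can be sketched as follows. This Python example is a simplification, not the harvester's logic: it treats each record as a separate XML string, the function name check_records is hypothetical, and the ‘very strict’ level (XML Schema validation) is not implemented because the Python standard library has no schema validator.

```python
import xml.etree.ElementTree as ET

# Values of the validation parameter as described earlier
STRICT, VERY_STRICT, LOOSE = 0, 1, 2


def check_records(raw_records, validation=STRICT):
    """Return (good_records, warnings) after a well-formedness check."""
    good, warnings = [], []
    for position, raw in enumerate(raw_records):
        try:
            ET.fromstring(raw)  # raises ParseError if not well-formed
        except ET.ParseError as err:
            if validation != LOOSE:
                raise  # strict modes: one bad record fails the harvest
            # loose mode: note a warning and move on to the next record
            warnings.append(f"record {position} skipped: {err}")
            continue
        good.append(raw)
    return good, warnings
```

In loose mode the caller would write the warnings to the history table and keep the good records; in the strict modes the exception would end the harvest with a failure.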

Manager Program

The Manager Program is another standalone executable, available in both Visual Basic and Java versions.  Its purpose is to periodically poll the database and start new scheduled harvesting jobs.  The intention is to start this program and let it run continuously on the harvesting machine.

 

The Manager essentially performs three interrelated functions:

 

1)      It polls the OAIRecurringJobs table looking for newly added records.  These are identified with a value of 0 in the state column.  When it finds a new record in the OAIRecurringJobs table it will create a new record in the OAIJob table with appropriate values to start the job at the appointed time and with the appropriate parameters.  It will also update the state column to either 1 for a recurring harvest job or 2 for a non-recurring harvest job.

2)      It also polls the OAIJob table looking for harvest jobs that need to be started.  If it finds a job with a status of ‘Scheduled’ (0 in the status column) and a harvestStartTime earlier than the current datetime, it will start a new harvest.  It starts a new harvest by creating a new instance of the Harvester Program and passing it the appropriate command line parameters.  Once it has started a new Harvester Program, it updates the OAIJob status column to ‘In Progress’ (1) and sets the processID column to the system Process ID of the Harvester Program instance.

3)      Finally, it monitors the status of previously started Harvester Programs.  It checks the OAIHistory table to determine whether previously started Harvester Programs have successfully completed or not.  If a Harvester Program has not updated the OAIHistory table, the Manager will check whether the ProcessID is still active in the system.  If it is not, the Manager assumes the harvest has failed, so it updates the OAIHistory table itself with a status of failed.  It also deletes the record from the OAIJob table.  (In the future we may program the Manager to reschedule the job for a later time.)

For recurring harvests, if the Harvester Program has successfully completed, the Manager Program will add a new record to the OAIJob table with a new harvestStartTime based on the frequency column in the OAIRecurringJobs table.  For incremental harvests it will also update the harvestFrom or harvestTo columns of the new OAIJob record.
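The first two polling steps of the Manager can be sketched against an in-memory SQLite database. This Python example is only an approximation of the behavior described above: the tables carry a small subset of the real columns with assumed types, timestamps are assumed to be ISO 8601 strings, the function names are hypothetical, and the rule that a non-zero frequency marks a recurring job is an assumption, since the text does not say how the two kinds of job are told apart.

```python
import sqlite3


def create_tables(conn):
    """Create a reduced version of the two job tables (assumed column types)."""
    conn.executescript("""
        CREATE TABLE OAIRecurringJobs (id INTEGER PRIMARY KEY, repositoryID INTEGER,
            startDate TEXT, frequency INTEGER, state INTEGER DEFAULT 0);
        CREATE TABLE OAIJob (jobID INTEGER PRIMARY KEY, repositoryID INTEGER,
            recurringID INTEGER, harvestStartTime TEXT, status INTEGER DEFAULT 0,
            processID INTEGER);
    """)


def schedule_new_jobs(conn):
    """Step 1: turn state-0 rows in OAIRecurringJobs into OAIJob rows."""
    rows = conn.execute("SELECT id, repositoryID, startDate, frequency "
                        "FROM OAIRecurringJobs WHERE state = 0").fetchall()
    for rec_id, repo_id, start, frequency in rows:
        conn.execute("INSERT INTO OAIJob (repositoryID, recurringID, "
                     "harvestStartTime, status) VALUES (?, ?, ?, 0)",
                     (repo_id, rec_id, start))
        new_state = 1 if frequency else 2  # assumption: see lead-in
        conn.execute("UPDATE OAIRecurringJobs SET state = ? WHERE id = ?",
                     (new_state, rec_id))
    return len(rows)


def start_due_jobs(conn, now, launch):
    """Step 2: start Scheduled jobs whose harvestStartTime has passed.

    `launch` is any callable that starts a harvester and returns its process ID.
    """
    due = conn.execute("SELECT jobID, repositoryID FROM OAIJob "
                       "WHERE status = 0 AND harvestStartTime < ?",
                       (now,)).fetchall()
    for job_id, repo_id in due:
        pid = launch(job_id, repo_id)
        conn.execute("UPDATE OAIJob SET status = 1, processID = ? WHERE jobID = ?",
                     (pid, job_id))
    return [job_id for job_id, _ in due]
```

A real manager would call both functions in a loop with a sleep between polls, and would pass a launch function that starts the Harvester Program as a child process. The monitoring step and the rescheduling of recurring jobs are left out of this sketch.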

 

If the Manager Program is terminated, any running Harvester Programs will also be terminated.

Web User Interface

The web user interface consists of a Java Applet and several Active Server Pages running on a web server.  The intention of the web interface is to provide a convenient way for an administrator to schedule and monitor harvest jobs, short of actually selecting, inserting, or updating records directly in the database.

 

The Java Applet is represented by a tabbed dialog with four tabs.  The first tab, Repositories, is a representation of the OAIRepository table.  Repositories can be added or deleted from this table.  To add a repository, the only required data are the baseURL and the baseDIR.  All other data is retrieved from the repository itself using an Identify request.  See Figures 2 and 3.

 

Figure 2.                     Repository Tab

 

Figure 3.                     Add Repository Dialog

 

In the Add repository dialog, the baseURL drop-down list is populated with a list of all registered repositories.  The list at http://www.openarchives.org/Register/ListFriends.pl is maintained by the Open Archives Initiative organization.  New baseURLs can also be typed directly into the field for unregistered repositories.

 

The Schedules tab is a representation of the OAIRecurringJobs table.  The OAIRecurringJobs for the Repository selected in the Repository tab are displayed here.  Harvest Jobs can be added, modified or deleted from this tab.  See Figures 4 and 5.

Figure 4.                     Schedules tab

 

Figure 5.                     Add a scheduled harvest dialog

 

The History tab is a representation of the OAIHistory table.  If a Repository has been selected in the Repository tab, only the History for that Repository will be displayed here.  Otherwise, the History for all Repositories will be displayed.  Records from the History table can be deleted from this tab.  This would be used primarily to clean up old or outdated records that are no longer needed or relevant.  See Figure 6.

Figure 6.                     History tab

 

The Current Jobs tab is a representation of the OAIJob table.  It displays all scheduled or in-progress jobs.  Jobs may be deleted from this tab.  A Scheduled job will simply be deleted from the table, so that it is never run.  However, if an in-progress job is deleted, the Manager Program will first terminate the job, and then delete the record from the database.  This could be used to stop jobs which seem to be taking too long to complete or are hung up.  See Figure 7.

 

Figure 7.                     Current Jobs tab

 

Integration of the Harvester with other Services

Other services, such as indexers, may be integrated with the Harvester applications in a number of ways.  The two most obvious ways would be through the Database or through the File System.

 

Our current approach to indexing is very simple.  We harvest all OAI records in a given repository to a common base directory, such as \\libgrloki\OAIData\uiLib\, and we periodically do a directory walk of all XML files in that directory and subdirectories and create a searchable index for the repository using XPAT.  For us indexing is currently a manual process, so we can perform an index build when we know there are no scheduled harvests in progress.  However, our intention is that once we start automatically scheduling index builds we will use the OAIHistory and OAIJob tables to schedule the index builds.  This will ensure that no harvest jobs are scheduled at the same time as an index build for the same repository.
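The directory walk mentioned above is simple to sketch. The following Python generator is an illustration only; the function name xml_files is hypothetical, and it says nothing about how XPAT itself is invoked.

```python
import os


def xml_files(base_dir):
    """Yield the path of every XML file under base_dir, including subdirectories."""
    for dirpath, _dirnames, filenames in os.walk(base_dir):
        for name in sorted(filenames):
            # Compare case-insensitively so ".XML" files are also picked up.
            if name.lower().endswith(".xml"):
                yield os.path.join(dirpath, name)
```

An index build would pass the repository's base directory to this function and hand the resulting file list to the indexer.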

 

If an indexer or other service requires closer integration with the harvester there are other possibilities.

 

1)      Instead of using the File System directly to retrieve records, the OAIRecord table could be used.

2)      The service could assume (or augment) the scheduling functions that are currently performed by a human administrator using the Java Applet.  The service could add or modify records in the various database tables directly, as its requirements dictate.

3)      A service could replace the Harvest Manager program with a custom application that performs a similar function, but is more closely integrated with other service functions, such as indexing.

4)      The service could abandon the Harvester programs entirely, but still potentially use the low-level harvesting code.  Currently our Harvester Program uses an ActiveX object library for interfacing with OAI providers via HTTP.  This code hides most of the OAI protocol details such as the HTTP response header handling and resumptionTokens.  This object library could be documented and repackaged for separate use, if desired.
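As an illustration of the protocol detail that such low-level harvesting code hides, here is a minimal Python sketch of a resumptionToken loop. It is not the ActiveX library's API: the function name harvest_all is hypothetical, the HTTP layer is replaced by a caller-supplied fetch function, OAI-PMH 2.0 is assumed (including its rule that a follow-up request carries only the verb and the resumptionToken), and retry handling for HTTP errors is omitted.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def harvest_all(base_url, first_params, fetch):
    """Yield every <record> element, following resumptionTokens.

    `fetch` is any callable that maps a request URL to the response XML text.
    """
    params = dict(first_params)
    while True:
        root = ET.fromstring(fetch(base_url + "?" + urlencode(params)))
        yield from root.iterfind(".//oai:record", OAI_NS)
        token = root.find(".//oai:resumptionToken", OAI_NS)
        # A missing or empty token marks the end of the list.
        if token is None or not (token.text or "").strip():
            return
        params = {"verb": first_params["verb"],
                  "resumptionToken": token.text.strip()}
```

A caller would iterate over harvest_all(baseURL, {"verb": "ListRecords", "metadataPrefix": "oai_dc"}, fetch), where fetch wraps whatever HTTP client the service already uses.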