Steps of the Normalization Process:
Cole et al. (2002) have outlined the general steps of the normalization process. These are:
The elements initially identified for normalization were type, format, date together with the temporal aspect of the coverage element, and the subject and description elements. After closer examination the format element was discarded: finding aids used it primarily to store the extent of the collection (for example, 3 cubic feet), so it was not a useful element to search on except perhaps for archivists. (See summary on http://oai.grainger.uiuc.edu/projectinfo.htm) The type element was a prime candidate for normalization because data providers used it fairly consistently, although the vocabulary itself was inconsistent. The date element and the temporal aspect of the coverage element appear to have been used interchangeably and were good candidates for normalization, though the use of two elements is an additional challenge to be addressed in the process. (See summary on http://oai.grainger.uiuc.edu/projectinfo.htm) Similarly, text describing the aboutness of items has been found in both the subject and description elements and, to a much lesser extent, in the coverage element. The text varies from subject headings (some using the Library of Congress Subject Headings) to paragraphs describing, for instance, the contents of a finding aid. While normalization of the type and date/coverage elements has been initiated, the project team has not yet embarked on the exceedingly complex task of applying controlled vocabulary to the subject and/or description elements.
For the remainder of the white paper, I will outline the process of normalizing the type element.
The Dublin Core Metadata Initiative (DCMI) defines type to be:
the nature or genre of the content of the resource. Type includes terms describing general categories, functions, genres, or aggregation levels for content. Recommended best practice is to select a value from a controlled vocabulary (for example, the working draft list of Dublin Core Types [DCT1]). (Dublin Core Element Set, 1997)
Analysis of Repository Metadata and Use of Vocabulary:
To aid examination of the content of the type element, the value (or content) of the type element in each record was extracted from the repository and organized by data provider (data sets) to facilitate discovery of controlled vocabularies or patterns within institutions. Each distinct value was listed along with the number of times it appeared within the data set. All but two of the data providers used the type element. (All Dublin Core elements are optional and repeatable.) Those that did used it as defined by Dublin Core, except that several did not use a controlled vocabulary. Approximately 1400 different type values appear in the data as a whole, with some providers using only one value and others using over 800. See Table 1 for all data providers as of May 31, 2002 and their use of the type element. Table 2 is an aggregate analysis of the use of the type element across all data providers. It also provides a breakdown by surrogate and registered data providers.
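The tallying described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the project's actual tooling: the (provider, value) input shape, the function name tally_type_values, and the sample data are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def tally_type_values(records):
    """Count each distinct type value per data provider.

    `records` is an iterable of (provider, type_value) pairs. This input
    shape is an assumption for the sketch, not the project's data model.
    """
    tallies = defaultdict(Counter)
    for provider, type_value in records:
        tallies[provider][type_value] += 1
    return tallies


# Hypothetical sample data, for illustration only.
sample = [
    ("Provider A", "image"),
    ("Provider A", "image"),
    ("Provider A", "text"),
    ("Provider B", "photograph"),
]

tallies = tally_type_values(sample)

# List every distinct value with its frequency, grouped by provider.
for provider, counts in sorted(tallies.items()):
    for value, n in counts.most_common():
        print(f"{provider}\t{value}\t{n}")
```

Keeping one counter per provider, rather than a single global count, is what makes institution-level patterns and local vocabularies visible.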
Of the thirty-two providers who did use the type element, eight used the Dublin Core Type Vocabulary (DCT1) recommended by the DCMI. Three data providers (including CIMI) used the controlled vocabulary recommended by CIMI, a consortium of museums that researches and promotes standards for the museum and cultural heritage communities. Eleven data providers had the UIUC-assigned EAD vocabulary in the type element. The Library of Congress American Memory Project used the LC Thesaurus of Graphic Materials Genre Headings as values. Seven data providers appeared to use a local vocabulary (or a published vocabulary that could not be located). Five of the data providers used a fairly specific or hierarchical vocabulary; the remainder used a general vocabulary. The distinction between general and specific vocabulary is made by examining the level of granularity used. For instance, a data provider whose collection consists of photographs but who uses the value ‘image’ in the type element uses a general vocabulary. If the data provider had used ‘photograph’ or a more specific term like ‘black and white photograph’, it would be using a specific vocabulary.
Other issues arose from the analysis of the data sets. Several data providers used vocabulary inconsistently. In one case, ‘[computer file].’, ‘computer file/’, ‘Computer file’, and ‘computer file’ were all used. At least three collections (CIMI, the Library of Congress American Memory Project, and the Colorado Digitization Project) are aggregations of metadata from multiple sources. The CIMI Demonstration Repository alone contains records from approximately five hundred different institutions. The vocabulary used across these institutions might be mandated, but inconsistencies are difficult to control. This compounded the issue of normalization for the Illinois project. Another issue was the use of vocabulary that described the nature of the resource but not in a physical sense. For instance, values such as cultural, western, educational, and adaptation were found in several of the data sets. Because the data sets were divorced from the context of their metadata, it was impossible to see what these terms referred to. On further examination, it was found that some data providers placed type information into description elements. This information is difficult to pull out in an automated manner, and too time-consuming to pull out manually given limited staff.
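Surface inconsistencies of the ‘computer file’ kind can be collapsed mechanically before any vocabulary decisions are made. The following Python sketch shows one possible approach; the function names and the choice to strip only leading and trailing punctuation are assumptions for illustration, not a description of the project's filter.

```python
import string
from collections import defaultdict


def variant_key(value):
    """Reduce a type value to a comparison key.

    Lowercases the value, removes leading and trailing punctuation and
    whitespace, and collapses runs of inner whitespace to one space.
    """
    stripped = value.strip(string.punctuation + string.whitespace)
    return " ".join(stripped.lower().split())


def group_variants(values):
    """Group raw values that share the same comparison key."""
    groups = defaultdict(list)
    for value in values:
        groups[variant_key(value)].append(value)
    return dict(groups)


variants = ["[computer file].", "computer file/", "Computer file", "computer file"]
groups = group_variants(variants)
print(groups)
```

A key of this kind handles only differences of case and punctuation. Values such as ‘cultural’ or ‘western’, whose referent cannot be determined without the rest of the record, still require human judgment.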
What Vocabulary to Use?
The next step was the choice and application of a controlled vocabulary. At this point the subject matter of the collection, cultural heritage, became an issue. A survey of the available controlled vocabularies revealed that no one vocabulary could cover all of the permutations of the collection (physical objects, textual objects, images, etc.). The project team decided to create a vocabulary specific to the UIUC repository but to base it on a combination of other controlled vocabularies.
The DCMI Type Working Group offers a document entitled "Guidance for Domains and Organizations Developing Vocabularies for Use with Dublin Core: Outline". Several steps to developing a vocabulary are outlined:
These steps resonate with Aitchison’s manual on thesaurus construction, though she adds many more elements. (Aitchison, 2000) The scope and purpose of the UIUC vocabulary is to normalize the type values used by data providers in the realm of cultural heritage in order to aid collocation and resource discovery. Because of the sheer number of different resources and because several of the data providers were already providing specific vocabulary, it was determined that a two-level hierarchical vocabulary was needed: a main type drawn from a general vocabulary equivalent or similar to DCT1, and subtypes drawn from a more specific vocabulary that would group similar sets of resources in a less overarching way than the main type. The hierarchy would exist only at these two levels. However, the two levels would not be expressed as a hierarchy within a single Dublin Core element but would live in separate elements. Instead of <meta name = "DC.Type" content = "image; advertisement">, the value would be divided into two tags: <meta name = "DC.Type" content = "image"> <meta name = "DC.Type" content = "advertisement">.
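As a minimal illustration of the two-element representation, the Python sketch below emits one DC.Type tag per vocabulary level. The helper name type_meta_tags is hypothetical, and the output omits the spaces around the equals signs shown in the tags above.

```python
from html import escape


def type_meta_tags(main_type, subtype=None):
    """Return one DC.Type meta tag per vocabulary level.

    The subtype is optional because some main types have no subtypes.
    """
    values = [main_type]
    if subtype:
        values.append(subtype)
    return [
        f'<meta name="DC.Type" content="{escape(value, quote=True)}">'
        for value in values
    ]


tags = type_meta_tags("image", "advertisement")
for tag in tags:
    print(tag)
```

Because Dublin Core elements are repeatable, emitting the main type and the subtype as two occurrences of the same element requires no extension to the element set.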
The Dublin Core Type Vocabulary (DCT1) was examined for its applicability to the types seen in the examined data sets. DCT1 includes nine values: collection, dataset, event, image, interactive resource, service, software, sound, and text. (Dublin Core Type Vocabulary, 2000) While many of these mapped reasonably well to the metadata (photographs to image, monograph to text), others in the metadata had no place in this schema, most notably the physical objects and artifacts which made up a large portion of the repository. Further investigation revealed that model, party, physical object, and place had been part of the initial recommendation for the DC Type Vocabulary. (Guenther, 1998) It was also discovered that the Dublin Core Type Working Group was currently working on a recommendation to include physical object in the DCT1 vocabulary. These two factors convinced the project team that the DCT1 vocabulary with the addition of physical object, place, and organization (for party) would meet the needs of the repository for the first level of the type vocabulary.
The next step was to create a more specific vocabulary to supplement the general modified DC type vocabulary. The project team consulted several controlled vocabularies in the area of arts and graphic materials, such as the Getty Art and Architecture Thesaurus and the Library of Congress Thesaurus for Graphic Materials II: Genre and Physical Characteristic Terms. As a result the project team came up with several terms, particularly in the areas of image and text, which would aid in collocating like items and increasing the discoverability of objects. For instance, ‘photograph’ was the term assigned to items variously described as ‘photos’, ‘slide’, ‘negative’, and ‘snapshot’. To the ‘collection’ type, the project added several terms which were useful in describing different parts of EAD files, such as ‘EAD’, ‘file’, and ‘Item’. The specific vocabulary was limited by what might be called ‘type warrant’. The project team was guided by what was already contained in the repository; no attempt was made to project what might be included in the future. In addition, values which were used a minimal number of times (fewer than ten) and were unique were not provided a counterpart in the controlled vocabulary. This was due to staff and time constraints. Likewise, values that were descriptive and had no physical counterpart (such as ‘cultural’ or ‘western’) were not normalized. The full controlled vocabulary is included in Appendix A.
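The mapping and the ‘type warrant’ rule can be sketched in Python as follows. The mapping excerpt reflects the ‘photograph’ example above; everything else (the names MAPPING, DESCRIPTIVE, MIN_COUNT, normalize and needs_review, and the sample counts) is hypothetical. The sketch applies only the frequency threshold and does not attempt to model the uniqueness condition.

```python
from collections import Counter

# Hypothetical excerpt: provider value -> (main type, subtype).
MAPPING = {
    "photos": ("image", "photograph"),
    "slide": ("image", "photograph"),
    "negative": ("image", "photograph"),
    "snapshot": ("image", "photograph"),
}

# Descriptive values with no physical counterpart are not normalized.
DESCRIPTIVE = {"cultural", "western"}

# Values used fewer than ten times receive no controlled counterpart.
MIN_COUNT = 10


def normalize(value):
    """Return (main type, subtype) for a raw value, or None if unmapped."""
    return MAPPING.get(value.strip().lower())


def needs_review(counts):
    """List unmapped, non-descriptive values frequent enough to consider."""
    candidates = []
    for value, n in counts.items():
        key = value.strip().lower()
        if n >= MIN_COUNT and key not in MAPPING and key not in DESCRIPTIVE:
            candidates.append(value)
    return sorted(candidates)


# Hypothetical frequencies, for illustration only.
counts = Counter({"photos": 120, "daguerreotype": 3, "western": 40, "postcard": 25})

print(normalize("Snapshot"))
print(needs_review(counts))
```

Separating the lookup from the review list mirrors the division of labor in the text: matching is mechanical, while deciding whether a frequent unmapped value deserves a new term is an intellectual task.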
Applying the Vocabulary and Assessment:
The decision was made early on not to replace values already found in the metadata. Instead the controlled vocabulary would be added to the metadata as an additional element. In order to clarify that the terms added to the metadata came from the UIUC project, the qualifier ‘(uiLib)’ was added to each term in the controlled vocabulary. The controlled vocabulary was then matched to the values already contained in the metadata. A filter was built by the research programmer on the project team so that, as collections are harvested, the controlled vocabulary is applied as well. Because the controlled vocabulary is based on ‘type warrant’, as new collections are added to the repository, the values used in the type element are examined, and it is determined whether to add a new term to the controlled vocabulary.
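The additive approach might look like the Python sketch below. It is not the project's filter: the record structure, the function name add_controlled_types, and the exact way the qualifier is attached to a term (here, term followed by a space and ‘(uiLib)’) are assumptions made for illustration.

```python
def add_controlled_types(record, mapping, qualifier="(uiLib)"):
    """Return a copy of `record` with controlled terms appended.

    Original type values are kept untouched; each matching controlled
    term is added as a further type value carrying the qualifier.
    """
    original = list(record.get("type", []))
    added = []
    for value in original:
        for term in mapping.get(value.strip().lower(), ()):
            qualified = f"{term} {qualifier}"
            if qualified not in original and qualified not in added:
                added.append(qualified)
    return {**record, "type": original + added}


# Hypothetical record and mapping excerpt, for illustration only.
mapping = {"snapshot": ("image", "photograph")}
record = {"identifier": "example:1", "type": ["snapshot"]}

enriched = add_controlled_types(record, mapping)
print(enriched["type"])
```

Returning a new record rather than modifying the harvested one keeps the provider's original metadata recoverable, and the qualifier makes the project-assigned terms easy to distinguish or remove later.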
Assessment of the normalization of the type element is still ongoing at this time. The project team plans to compare searches with and without the application of the controlled vocabulary, measuring both the number of resources recalled in the search and how well the filtering mechanism mapped the controlled vocabulary to matching terms. Concerns include how well the general and specific vocabularies are balanced and what has not been covered. The team needs to examine further how to tease out type vocabulary contained in other elements in the metadata (as in the description element mentioned earlier). A cost/benefit analysis must also be conducted. The work that went into the normalization of the type element was substantial, greater than expected in fact.
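One simple form the planned comparison could take is sketched below in Python. The assessment method has not been fixed, so this is purely illustrative: the record layout, the exact-match search function, and the sample records are all hypothetical.

```python
def search(records, term, use_controlled=False):
    """Return the ids of records whose type values match `term` exactly.

    When `use_controlled` is true, the added controlled terms are
    searched in addition to the provider's original values.
    """
    wanted = term.lower()
    hits = set()
    for rec in records:
        values = list(rec["type"])
        if use_controlled:
            values += rec.get("controlled", [])
        if any(wanted == value.lower() for value in values):
            hits.add(rec["id"])
    return hits


# Hypothetical records, for illustration only.
records = [
    {"id": 1, "type": ["photograph"], "controlled": ["photograph"]},
    {"id": 2, "type": ["snapshot"], "controlled": ["photograph"]},
    {"id": 3, "type": ["Photos"], "controlled": ["photograph"]},
    {"id": 4, "type": ["map"], "controlled": ["map"]},
]

without_cv = search(records, "photograph")
with_cv = search(records, "photograph", use_controlled=True)
gained = with_cv - without_cv

print(len(without_cv), len(with_cv), sorted(gained))
```

The set difference shows which resources are recalled only because of the controlled vocabulary. Checking those records by hand would indicate how accurately the filter mapped terms.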
It may be that the normalization of the type element will not appear to be cost effective in the short term given the other major challenges facing this experimental repository. The date/coverage normalization appears to have potentially more gratifying results given that the user will be able to group resources by century. Certainly the normalization of the subject and description elements would have the highest return given their primacy as search categories. The work that would go into the normalization of these elements is considerable, and it is difficult to imagine what the team’s strategy might be, although currently NCSA’s data mining tools are analyzing the metadata. However, the need for intellectual labor, as amply demonstrated by the process of normalizing the type element, remains.
Appendix A:
1. Collection
Definition: A collection is an aggregation of items.
Subtypes: (for EAD files)
· EAD
· Series
· Collection
· File
· Fonds
· Item
· Otherlevel
· Recordgrp
· Subgrp
· Subseries
2. Event
Definition: An event is a non-persistent, time-based occurrence.
3. Image
Definition: An image is a primarily symbolic visual representation other than text.
Subtypes:
· Cartoon
· Design Drawing
· Diagram
· Drawing
· Illustration
· Map
· Moving Image
· Painting
· Photograph
· Poster
4. Software
Definition: Software is a computer program in source or compiled form which may be available for installation non-transiently on another machine.
5. Sound
Definition: A sound is a resource whose content is primarily intended to be rendered as audio. For example - a music playback file format, an audio compact disc, and recorded speech or sounds.
Subtypes:
· Music
· Speech
6. Text
Definition: A text is a resource whose content is primarily words for reading. For example - books, letters, dissertations, poems, newspapers, articles, archives of mailing lists. Note that facsimiles or images of texts are still of the genre text.
Subtypes:
· Article
· Book
· Business document
· Catalog
· Correspondence
· Homepage
· Manuscript
· Magazine
· Newsletter
· Newspaper
· Pamphlet
· Personal paper
· Poem
· Thesis
7. Physical object
Definition: a non-human object or substance. This category includes objects that do not fit into any of the other categories on this list. In addition these objects must be approached physically to make use of them. For example - a computer, the great pyramid, a sculpture, wheat.
Subtypes:
· Realia (See GEM list)
· Artifact (See GEM list)
8. Place
Definition: a geographic area.
9. Organization
Definition: institution or cultural group
References:
Aitchison, J., Gilchrist, A. & Bawden, D. (2000). Thesaurus construction and use: a practical manual. Chicago: Fitzroy Dearborn.
Apps, A. (Ed.) (2001). Guidance for Domains and Organizations Developing Vocabularies for Use with Dublin Core: Outline. Working Draft. Retrieved February 4, 2002 from: http://epub.mimas.ac.uk/DC/typeguide.html.
Cole, T. W., Kaczmarek, J., Marty, P. F., Prom, C. J., Sandore, B. & Shreeves, S. L. (2002). Now that we've found the 'hidden web' what can we do with it? The Illinois Open Archives Initiative metadata harvesting experience. In D. Bearman & J. Trant (Eds.), Museums and the Web 2002: selected papers from an international conference (pp. 63-72). Pittsburgh: Archives and Museum Informatics. Also available online at: http://www.archimuse.com/mw2002/papers/cole/cole.html.
Dublin Core Metadata Initiative Type Working Group homepage. (1998). Retrieved February 4, 2002 from: http://dublincore.org/groups/type/.
Dublin Core Metadata Element Set, Version 1.1: Reference Description. (1997). Retrieved February 1, 2002 from: http://dublincore.org/documents/dces/.
Dublin Core Type Vocabulary. (2000). Retrieved February 1, 2002 from: http://dublincore.org/documents/2000/07/11/dcmi-type-vocabulary/.
GEM Resources Type Vocabulary. (2000). Retrieved February 4, 2002 from: http://geminfo.org/Workbench/Metadata/Vocab_Type.html.
Guenther, R. (1998). List of Resource Types. Working Draft. Dublin Core Metadata Initiative. Retrieved February 4, 2002 from: http://dublincore.org/documents/1999/08/05/resource-typelist/.
Lagoze, C., Van de Sompel, H., Nelson, M., & Warner, S. (2002). The Open Archives Initiative Protocol for Metadata Harvesting Beta Version. Version 2.0. Retrieved June 7, 2002 from: http://www.openarchives.org/OAI/2.0/openarchivesprotocol.htm.
Svenonius, E. (2001). The intellectual foundation of information organization. Cambridge, MA: MIT Press.
Sarah Shreeves
Posted 7/2/02