Preservation Metadata for Institutional Repositories: applying PREMIS

Steve Hitchcock, Tim Brody, Jessie M.N. Hey and Leslie Carr

Preserv Project, IAM Group, School of Electronics and Computer Science,
University of Southampton, SO17 1BJ, UK

Preserv is a JISC-funded project within the programme Supporting Digital Preservation and Asset Management in Institutions.

This is a draft paper, 25 January 2007. It includes edited and updated material from an earlier paper Preservation Metadata for Institutional Repositories (February 2006), here focussing on preservation metadata and omitting the coverage of preservation service models. There is a companion paper on Digital Preservation Service Provider Models for Institutional Repositories: towards distributed services.


Abstract

Metadata designed for managing digital content over a long period of time is commonly referred to as 'preservation metadata', and typically informs, describes and records a range of activities concerned with preserving specific digital objects. Currently, the authoritative reference on preservation metadata is the PREMIS Data Dictionary (2005). This is based, as the full name (PREservation Metadata: Implementation Strategies) indicates, on the idea of implementation, and the paper seeks to develop an implementation involving institutional repositories (IRs). This analysis attempts to map the five entity types identified in the PREMIS Data Dictionary -- intellectual entities, objects, events, agents and rights -- to potential metadata sources identified in an IR-preservation service provider model: author/IR submitter (via the repository deposit interface); IR software (in this case EPrints); associated tools (in this case file format ID tool PRONOM-DROID); IR policy; preservation service providers. An additional source of metadata is environment registries. Although it has not yet been possible to test the mapping of the preservation metadata elements to these models in examples using real preservation services, the approach has been tested in another form, as the basis of a survey of repository managers of larger IRs. The interim findings are that PREMIS appears to provide an excellent basis on which to assess the needs of IRs with respect to preservation metadata, and it is possible to map the PREMIS elements to an extended model incorporating preservation services and registries. Preliminary evidence, based on the survey of repositories, shows that most data can be provided by the sources identified, although some elements may need to be adapted or omitted. More implementation and testing are required, especially to validate the allocation of elements to preservation service providers and environment registries.


Introduction

Metadata designed for managing digital content over a long period of time is commonly referred to as 'preservation metadata', and typically informs, describes and records a range of activities concerned with preserving specific digital objects.

The broad aim of digital preservation is to ensure that content remains accessible regardless of changes in hardware and software technologies, notably presentation formats, and of changes in organisational responsibilities for managing the content, and to mitigate environmental risks (e.g. Bradley 2005).

What characterises many approaches to digital preservation is the implicit assumption that, from data generation to input to archiving, preservation is managed within an expert and specialised preservation environment. Digitisation projects are good examples of this approach. This paper considers the case where data creation, deposit and content management in a repository will be performed by a range of players, many non-specialists from a preservation viewpoint. Institutional repositories (IRs) are such a case, where authors of research papers, for example, 'self-archive' their works in the IR, and where the management of the IR often has no specialist preservation skills (see Hitchcock et al. 2007b). It is proposed that in such cases the IR might contract preservation requirements to an external service provider (Hitchcock et al. 2007a). Thus preservation metadata here must inform not just long-term management of the data but also the relationship between the IR and the service provider.

This paper provides a brief overview of preservation metadata, which we then seek to develop for an application involving IRs. Currently, the authoritative reference on preservation metadata is the Preservation Metadata Implementation Strategies Data Dictionary (PREMIS 2005), on which we have focussed our initial investigation. Hitchcock et al. (2007a) explain some background that establishes the role of IRs, and place IRs in a preservation context leading to the introduction of three OAIS-based models. In this case we focus on one of those models, the service provider model, to get a handle on the analysis of the PREMIS metadata set for this application.

"It is difficult to anticipate the metadata needed to support technical and administrative processes that are not fully developed, are not fully tested, and in some ways, are not even fully understood. Compounding the problem is the proviso that preservation metadata recommendations must be restrained by economic realities. Creating and maintaining metadata is expensive, so any recommended preservation metadata elements should be backed by persuasive evidence of necessity, as well as practical means for populating them" (Lavoie and Gartner, 2005).

In this paper we begin our investigation of supporting preservation metadata within the practical, very real and growing content of IRs.

What is preservation metadata?

"Preservation metadata is the information necessary to carry out, document, and evaluate the processes that support the long-term retention and accessibility of digital materials." (PREMIS 2005)

In terms of digital technology and the widespread creation of digital materials, preservation metadata has had a lengthy period of gestation and development. Various generic approaches have been identified within projects, with the baton seemingly passing periodically from one group to another (e.g. National Library of Australia 1999; NEDLIB, Lupovici and Masanès 2000; CEDARS 2002). The OCLC/RLG Working Group on Preservation Metadata (2002) introduced an international consensus, while implementations that emerged such as at the National Library of New Zealand (2002) were largely application-specific. According to Lavoie and Gartner (2005) the earlier efforts "largely were speculative in nature, seeking to anticipate the metadata needs of programmatic digital preservation initiatives that would emerge in the future. On the other hand, development of the more recent element sets, such as OCLC, NLNZ were more closely aligned with planning and implementation of 'production' digital archiving systems."

Given the changing descriptions, slightly different terminologies and the fuzziness of the overlap between preservation metadata and other forms of more widely used metadata, such as metadata for resource discovery, administrative metadata, etc., it can be quite hard to unravel the different perspectives, although a review by Day (2003) makes a good attempt.

Fortunately, a more coherent view has appeared in what is, currently, the authoritative reference on preservation metadata, the PREMIS Data Dictionary (2005). This is based, as the full name indicates, on the idea of implementation, and for the first time provides, on examination, a thorough, rigorous and comprehensive set of preservation metadata elements that a "working archive needs to support the functions of ensuring viability, renderability, understandability, authenticity, and identity in a preservation context." (Guenther 2004)

Developing a preservation metadata set for IRs

In the PREMIS data model there are five types of entity: intellectual entities, objects, events, agents and rights. While "Intellectual entities and agents are not fully described", the majority of the entries in the data dictionary involve objects and events (Guenther 2004):
Selection and ongoing refinement of the preservation metadata set is not only concerned with managing preservation but should also consider minimising the costs of preservation actions and services. The Objects and Events types in PREMIS, and the rights entity, can be aligned with what according to James et al. (2003) are the most significant factors affecting costs in eprint repositories:
Particular attention, then, should be paid to elements in these categories that will assist the efficient collection, generation and delivery of the necessary metadata to the service provider to control these costs.

With regard to the Rights elements, according to PREMIS the minimum core rights information that a preservation repository must know is the permissions that have been granted to the repository itself to carry out actions related to objects within the repository. PREMIS considered only rights required for preservation activities; Coyle (2006) investigated how this might be expanded, including rights for access.

It should be noted that IRs could be considered a special case as far as the deposit of papers published elsewhere, in journals and proceedings for example, is concerned. In these cases the IR might require only a simple licence agreement with the author. Several IRs have publicly available agreements, e.g. Caltech (see Hitchcock et al. 2007b). More formal model author agreements in the form of 'author addenda', for use with publishers, have recently been produced by organisations such as MIT, Science Commons (through its Scholar's Copyright project), and SPARC (Hirtle 2006), although it is not clear how these have been tested with authors and publishers.

For preservation purposes any rights statement should be extended to allow copying and uses prescribed by the service provider, e.g. the DSpace at MIT licence includes the following clauses relating to possible preservation actions within the IR, illustrated by MacColl (2004):

"You agree that MIT may, without changing the content, translate the submission to any medium or format for the purpose of preservation. You also agree that MIT may keep more than one copy of this submission for purposes of security, back-up and preservation."

The latest version of EPrints (v3, just released) includes a Preservation Rights Declaration in the deposit interface to provide selectable options. Further optional licences can be added to the drop-down list by repository administrators.

None of the author addenda include provision for preservation. If an IR is to use an external preservation service provider, then preservation-aware agreements will need to be developed further in conjunction with the service provider.
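
As a minimal sketch of how the minimum core rights information might be consulted before a preservation action, the following records the actions granted to the repository and checks a proposed action against them. The class name, field names, identifiers and the controlled values for actions are illustrative assumptions, not the PREMIS schema or an EPrints feature.

```python
from dataclasses import dataclass
from typing import FrozenSet

# Hypothetical controlled values for permitted actions; a real schema would fix these.
REPLICATE = "replicate"      # keep more than one copy
MIGRATE = "migrate"          # translate to another format without changing content
DISSEMINATE = "disseminate"  # provide access copies


@dataclass(frozen=True)
class PermissionStatement:
    """Illustrative record of permissions granted to a preservation repository."""
    permission_statement_identifier: str
    granting_agent: str
    linking_object: str
    permission_granted: FrozenSet[str]


def is_permitted(statement: PermissionStatement, action: str) -> bool:
    """Return True if the recorded permissions cover the proposed action."""
    return action in statement.permission_granted


# A statement modelled loosely on the licence clauses quoted above:
# format translation and multiple copies are allowed, nothing else is recorded.
example_statement = PermissionStatement(
    permission_statement_identifier="perm-0001",  # made-up identifier
    granting_agent="author",
    linking_object="eprint-123",                  # made-up identifier
    permission_granted=frozenset({REPLICATE, MIGRATE}),
)
```

A service provider could call is_permitted(example_statement, MIGRATE) before scheduling a format migration; an action that was never recorded, such as DISSEMINATE here, is refused by default.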

Mapping PREMIS elements to the IR-service provider model

"in contrast to the support for resource discovery metadata, managers of e-print repositories have practically no preservation metadata support provided by the common repository software packages." (James et al. 2003)

IR software is only one player in our preservation service provider scenario. In the detailed PREMIS Data Dictionary the five entity types -- intellectual entities, objects, events, agents and rights -- are described by entries for the main elements and subelements. In this analysis Tables 1-5 attempt to map these elements to the potential metadata sources identified in our IR-service provider (IR-SP) model outlined by Hitchcock et al. (2007a): the author/IR submitter (via the repository deposit interface); the IR software (in this case EPrints); associated tools (in this case the file format ID tool PRONOM-DROID); IR policy; and preservation service providers.

A possible additional source of metadata is environment registries (Table 6), which are recognised in PREMIS and other preservation activities, although there are not yet any concrete examples of such registries based on the broadest, most ambitious designs that could support and source the elements identified here. PRONOM, the Global Digital Format Registry (GDFR) and the JSTOR/Harvard Object Validation Environment (JHOVE) are examples of more specific registry types (e.g. for file format ID and validation) that might be the basis of more expansive implementations. Representation Networks are another ongoing preservation development that may inform environment registries, e.g. the DCC Representation Information Registry.

The version of the mapping presented here includes the main PREMIS metadata elements without expanding on the subelements where these are assumed to follow into the same source categories unless indicated.

Since PREMIS is oriented towards implementation, this requires the development of schema to define the use of certain elements within this application. According to PREMIS, the schema should, where possible, provide controlled vocabularies or codes for populating elements, rather than relying on “free text”. In addition, the schema should be adaptable to automated workflows for metadata collection and management. This analysis does not extend to identifying, building or including controlled vocabularies or schema that may be required by some elements within the preservation metadata set.

Tables 1-6 map the PREMIS elements to the principal sources in the IR-SP model. It should be noted that elements are not fixed in these tables. Some may apply to more than one table, especially where related subelements are simply wrapped in a single entry, but for clarity we have not duplicated elements between tables. For example, it may be necessary for the relationship element to be informed by the IR author, but subsequent use of that information in related subelements may be the responsibility of other sources or services.

Key to tables: O optional, R required, R* conditionally required, + includes related subelements from PREMIS Data Dictionary

Table 1: From the IR submitter/author (via EPrints interface)

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources
creatingApplication +
probably needs author (e.g. would an MS-Word file generated from OpenOffice be flagged by ID tool?) PRONOM?

this refers to files uploaded with the eprint submission, named by the author and recorded by EPrints as part of the upload directions; could be a DOI? Will files be renamed by SP?
R* dependency + environment
e.g. schema; while the submitter must indicate a relationship, the related subelements may be generated elsewhere (see also table for environment registries)
R* relationship +
relationship between objects, high-level categorization, e.g. structural, transformation, and other types to be determined by SP SP
R* linkingIntellectualEntityIdentifier +

e.g. collection; while evidence of a higher entity is provided by the submitter, the ID (e.g. URI) may be generated by another service

Table 2: From within-code EPrints

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources
objectIdentifier +
identifier of the eprint record (identifiers of the digital objects are their URLs)

fixity +
verifies if an object has been altered. Where is fixity check first performed? Not within EPrints currently, but a script that crawls the archive comparing files with checksums is possible

objectCharacteristics Object
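
The fixity entry in Table 2 notes that a script crawling the archive and comparing files with checksums is possible. A minimal sketch of such a script follows, assuming the files sit under a single directory and that digests are kept in a simple dictionary; the function names and the choice of SHA-256 are assumptions for illustration, not part of EPrints.

```python
import hashlib
from pathlib import Path
from typing import Dict, List


def file_digest(path: Path, algorithm: str = "sha256", chunk_size: int = 65536) -> str:
    """Compute a message digest for one file, reading it in chunks."""
    digest = hashlib.new(algorithm)
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def record_fixity(root: Path) -> Dict[str, str]:
    """Crawl the directory tree and record a digest for every file found."""
    return {
        str(path.relative_to(root)): file_digest(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def check_fixity(root: Path, recorded: Dict[str, str]) -> List[str]:
    """Return the relative paths that are missing or whose digest has changed."""
    failures = []
    for relative_path, expected in recorded.items():
        target = root / relative_path
        if not target.is_file() or file_digest(target) != expected:
            failures.append(relative_path)
    return failures
```

In this sketch record_fixity would run once at deposit and check_fixity on a schedule afterwards; whether the first check belongs to the IR or to the service provider is the open question raised in the table.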

Table 3: From PRONOM-DROID file format ID tool

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources

bitstream, file, representation; this will be "implicit" in the harvesting service IR policy

compositionLevel objectCharacteristics Object
e.g. compression, encryption, zip; EPrints won't tell you this, but a file format ID tool might IR policy
format + objectCharacteristics Object

software + environment
software to render or use the object; SP decides which software environments are to be supported

Table 4: From IR policy

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources
Depends on "preservability", cost, etc.

significantProperties objectCharacteristics Object
e.g. pdf + links IR submitter
inhibitors + objectCharacteristics
inhibit access, use or migration, e.g. encryption IR submitter, IR policy
R* signatureInformation +
validates submitter, for IRs e.g. identifying authors among services, authenticating material coming from a repository, etc. These appear to be fairly 'weak' needs assuming the repository and preservation services are 'secure'. The signature itself and associated elements (e.g. keyInformation) would be generated by an appropriate tool to be decided by IR/SP policy
SP policy
permissionStatement +
while the author is the ultimate arbiter of which permissions to grant, the IR policy sets a framework for standardising permissions by type of object to cover preservation requirements; the SP records and formalises the management of this information (see permissionStatementIdentifier) IR submitter, SP
permissionGranted + permissionStatement Rights
actions the grantingAgent allows the preservation repository, using controlled values

Table 5: From preservation service provider

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources
storage +

direction to locate object stored in preservation repository


e.g. tape, hard disk, CD-ROM, DVD

relatedEventIdentification +
relates objects after an event, e.g. migration
linkingEventIdentifier +

Use to link to events not associated with relationships, e.g. format validation, virus checking

linkingPermissionStatementIdentifier +
identifier for permission statement associated with the object (see permissionStatementIdentifier below)
eventIdentifier +

Events are e.g. SP actions. Each event must have unique, locally-generated ID


define controlled vocabulary, e.g. capture, compression, migration, decryption




e.g. why the event occurred

eventOutcomeInformation +


linkingAgentIdentifier +

Event about an agent associated with an event

linkingObjectIdentifier +
about an object associated with an event
agentIdentifier +

identifies the agent uniquely within the preservation repository system




from controlled vocabulary, e.g. person, organisation, software

permissionStatementIdentifier +
permissionStatement Rights
designation used within the preservation repository system
permissionStatement Rights
objects to which permission pertains, e.g. by IR

permissionStatement Rights
identifying designation for agent (IR?) granting permission, if agent is described as entity, e.g. agentIdentifier

permissionStatement Rights
agreement between IR and SP, as recorded by SP
IR policy
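
Two requirements stated in Table 5 lend themselves to a short illustration: each event needs a unique, locally generated identifier, and event and agent types should come from controlled vocabularies. The sketch below assumes UUIDs as the local identifier scheme and uses only the example terms given in the table; the class layout and the detail and timestamp fields are hypothetical.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Iterable, List


class EventType(Enum):
    """Example vocabulary only; the terms are the examples listed in Table 5."""
    CAPTURE = "capture"
    COMPRESSION = "compression"
    MIGRATION = "migration"
    DECRYPTION = "decryption"


class AgentType(Enum):
    PERSON = "person"
    ORGANISATION = "organisation"
    SOFTWARE = "software"


@dataclass
class Event:
    event_type: EventType
    linking_object_identifiers: List[str]
    linking_agent_identifiers: List[str]
    detail: str = ""  # hypothetical field, e.g. why the event occurred
    # Assumption: a random UUID serves as the unique, locally generated ID.
    event_identifier: str = field(default_factory=lambda: str(uuid.uuid4()))
    # Hypothetical field: time the record was created, in UTC.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def new_event(event_type: str, objects: Iterable[str],
              agents: Iterable[str], detail: str = "") -> Event:
    """Create an event; a term outside the vocabulary raises ValueError."""
    return Event(EventType(event_type), list(objects), list(agents), detail)
```

Rejecting unknown terms at creation time is one way to avoid the free-text values that PREMIS advises against.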

Table 6: From environment registries

PREMIS metadata elements Part of (if not main element) PREMIS entity type Comment Other IR-SP sources


omit if bit-level preservation storage

environment Object
assessment of the described environment
IR policy, SP

environment Object
uses supported by the environment, e.g. render, edit

hardware +
hardware components needed by software, e.g. hardware performance required, does this object require a minimum hardware level?
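
The allocation of elements to sources in Tables 1-6 can be pictured as a simple lookup, which a workflow might use to decide where to request each element. The sketch below transcribes only a sample of the elements named in the tables; the source labels and function names are assumptions for illustration, and secondary sources are ignored.

```python
from typing import Dict, List

# Primary source for a sample of PREMIS elements, transcribed from Tables 1-6.
PRIMARY_SOURCE: Dict[str, str] = {
    "creatingApplication": "IR submitter",
    "relationship": "IR submitter",
    "objectIdentifier": "EPrints",
    "fixity": "EPrints",
    "compositionLevel": "PRONOM-DROID",
    "format": "PRONOM-DROID",
    "significantProperties": "IR policy",
    "permissionStatement": "IR policy",
    "storage": "service provider",
    "eventIdentifier": "service provider",
    "agentIdentifier": "service provider",
    "hardware": "environment registry",
}


def elements_from(source: str) -> List[str]:
    """List the sampled elements whose primary source is the one given."""
    return sorted(e for e, s in PRIMARY_SOURCE.items() if s == source)


def unmapped(elements: List[str]) -> List[str]:
    """Return the elements that have no source allocated in the lookup."""
    return [e for e in elements if e not in PRIMARY_SOURCE]
```

A check such as unmapped(required_elements) would flag elements that still need a source before metadata collection is automated.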

Testing the mappings

Modelling preservation scenarios and mapping preservation metadata elements from an authoritative source to these models informs development but ultimately needs to be tested in examples using real preservation services. Given that the underlying preservation service provider models have been evolving in Preserv this has so far not been possible. This approach to preservation metadata has been tested in another form, however, as it was used as a basis for an objective survey of repository managers of larger IRs with known content profiles (Hitchcock et al. 2007b). This mapping gave us the opportunity to place the emphasis on what repositories do, and the implications for preservation, rather than on what they may plan to do or what repository managers think about preservation. Below are some findings from the survey that may affect the proposed mappings:
A survey of repositories by PREMIS, despite using a more leading questionnaire, discovered (Caplan 2004):
The validity of, and need for, the elements in Table 3 can additionally be informed by PRONOM-ROAR format profiles. The Preserv project has presented format profiles ('Preserv profiles') of over 200 IRs through the Registry of Open Access Repositories (ROAR) by applying the PRONOM-DROID format recognition tools from the National Archives of the UK to OAI data harvested from the repositories (see Preserv Format Profiling: PRONOM-ROAR An illustrated guide).
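
A format profile of this kind is essentially a tally of identified formats per repository. The sketch below assumes the identification step has already produced (file URL, format name) pairs; the sample data, URLs and function name are made up for illustration and do not reproduce DROID output or any real repository.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def format_profile(identifications: Iterable[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Count files per identified format, most common format first."""
    counts = Counter(format_name for _url, format_name in identifications)
    return counts.most_common()


# Made-up identification results for a hypothetical repository.
sample = [
    ("http://repo.example.org/1/paper.pdf", "PDF"),
    ("http://repo.example.org/2/paper.pdf", "PDF"),
    ("http://repo.example.org/2/slides.ppt", "PowerPoint"),
    ("http://repo.example.org/3/index.html", "HTML"),
    ("http://repo.example.org/4/thesis.pdf", "PDF"),
]
```

Calling format_profile(sample) gives a ranked list that an IR manager could compare against the formats the service provider has agreed to support.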

Given the inclusion of a History Module in EPrints v3, it is possible that some of the Event elements from Table 5 could be generated within the IR and shared with the service provider, depending on the nature of the services provided, and the number of service providers contracted to provide them (Hitchcock et al. 2007a).

The Rights elements in Table 5 suggest a greater degree of granularity may be required than even the most preservation-aware examples among current author agreements.

Are the elements allocated to Tables 5 and 6 viable and useful for service providers? Determining this will require the setting up of realistic service provider testbeds and has not been performed so far in Preserv.

Mapping PREMIS to repositories: the PRESTA example

Lee et al. (2006) have also mapped PREMIS to a repository service framework in PRESTA - PREMIS Requirement Statement, an Australian Partnership for Sustainable Repositories (APSR) project. In PRESTA the submission system (c.f. an IR in Preserv) and archive (c.f. a preservation service provider in Preserv) are less clearly defined than in Preserv, although the test repositories are IRs at the Australian National University (ANU) and the University of Queensland (UQ). In addition the framework includes a preservation monitoring and management system (c.f. a distributed service in Preserv) and a partner archive (no immediate equivalent in Preserv). Recognising that preservation services are likely to be supplementary to repositories rather than part of the core definition, in PRESTA: "It was decided there would be more emphasis on what metadata was collected than how it was collected."

PRESTA is wider than the analysis presented here. PRESTA considered "all metadata, including PREMIS, necessary to support long term sustainability", including descriptive metadata (describes content including metadata providing context or meaning to a digital object) and structural metadata (how parts relate to the whole and to each other), as well as inclusion of PREMIS in a METS profile for exchanging preservation metadata. In this paper only PREMIS is considered, without an exchange profile.

The result of PRESTA is detailed and justifies close study for those implementing preservation metadata for repositories, but the summary recommendations tend to emphasise the role of repositories to a greater degree than in Preserv's service provider model, with specific actions required of repositories while the role of the National Library of Australia seems to be as a general support framework rather than an active service provider. This perception may be due to the nature of the findings with respect to the two target IRs, which found gaps in the collection of preservation metadata:
Appendix 4 (Gap reports for ANU DSpace and UQ Fez/Fedora repositories) in Lee et al. (2006) provides a useful point of reference for Tables 1-6, although important differences in the underlying models mean that direct comparison is not possible.

Selection metadata

We had expected, given an apparently umbilical connection between preservation and selection, to find selection factors included in preservation metadata. This is not the case. PREMIS is not concerned with selection for preservation, but with content to be preserved.

Selection is generally regarded as a vital element of preservation services, for reasons of cost. In the general, hypothetical case, selection may be principled but impractical. In terms of digital content, especially Web content, new forms of content such as email lists, blogs and wikis -- and there are many others -- raise new questions about selection that simply cannot be answered from the standard reference points. In fact, the selection question presents an inverted logic as far as preservation is concerned: it is about first deciding what not to preserve in order to identify what to preserve. Put simply, if used inappropriately, selection could easily be counter-productive, demanding greater analysis, adding cost and risking mis-selections.

Fortunately, IRs present a more concrete example. It appears many IRs commit to a responsibility for all content they admit (Hitchcock et al. 2007b). At the least, defining the types of content that can be deposited in an IR will make the selection issue more tractable.

It is also possible to identify other factors in terms of selection. Other parties may be interested in preservation of certain materials found in IRs, for example, research funders who may wish to set up alternative preservation services for these materials (e.g. UK PubMed Central has been set up by the Wellcome Trust to provide a stable, permanent, and free-to-access online digital archive).

Authors may be invited to identify content, or special features within content, for preservation. This more subjective approach has been applied to artistic and multimedia works in the PANIC project (Hunter and Choudhury 2003). Anderson et al. (2005) describe the TAG Team Questionnaire, which tries to build the views of creators into preservation decisions, and NLM has devised a set of permanence ratings (Byrnes 2000), informed to some extent by creators and authors, to guide selection and preservation decisions. Both of these latter examples appear more suited to management of in-house digital library materials than to IR authors and submitters, but could be adapted.

In the digital environment, selection questions raise new issues that remain to be framed. It is likely that any analysis of preservation metadata will eventually need to be extended to 'selection metadata' to assist the efficiency and automation of preservation workflows for IRs.

Conclusion

"The set of core elements in the PREMIS Data Dictionary has been widely accepted, at least in principle, but it has not yet proven itself through experience in operational repositories." (Caplan 2006)

PREMIS provides an excellent basis on which to assess the needs of institutional repositories with respect to preservation metadata, and it appears to be possible to map the PREMIS elements to an extended model incorporating preservation services and registries. Preliminary evidence, based on a survey of repositories informed by this analysis of preservation metadata, shows that most data can be provided by the sources identified, although some elements may need to be adapted or omitted. The lack of formal preservation policies among repositories is a limitation that we expect to be rectified in time. More implementation and testing are required, especially to validate the allocation of elements to preservation service providers and environment registries.

It should be noted that the PREMIS Data Dictionary for Preservation Metadata and its related XML schemas are currently being reviewed.


References

Anderson, Richard, Hannah Frost, Nancy Hoebelheinrich, and Keith Johnson (2005) The AIHT at Stanford University: Automated Preservation Assessment of Heterogeneous Digital Collections, D-Lib Magazine, Vol. 11, No. 12, December

Bradley, Kevin (2005) APSR Sustainability Issues Discussion Paper, Australian Partnership for Sustainable Repositories - National Library of Australia, 28 January

Byrnes, Margaret (2000) Assigning Permanence Levels to NLM's Electronic Publications, 2000 Preservation: An International Conference on the Preservation and Long Term Accessibility of Digital Materials, York, England, December 6-8

Caplan, Priscilla (2006) Preservation Metadata, DCC Digital Curation Manual, 1 August

Caplan, Priscilla (2004) PREMIS - Preservation Metadata - Implementation Strategies Update 1. Implementing Preservation Repositories for Digital Materials: Current Practice and Emerging Trends in the Cultural Heritage Community, RLG DigiNews, Vol. 8, No. 5, October

Cedars (2002) Guide To Preservation Metadata, March

Coyle, Karen (2006) Rights in the PREMIS Data Model, Report for the Library of Congress, December

Day, Michael (2003) Preservation metadata initiatives: practicality, sustainability, and interoperability, ERPANET Training Seminar on Metadata in Digital Preservation, Marburg, Germany, 3-5 September (revised)

Guenther, Rebecca (2004) PREMIS - Preservation Metadata Implementation Strategies Update 2: Core Elements for Metadata to Support Digital Preservation, RLG DigiNews, Vol. 8, No. 6, December

Hirtle, Peter B. (2006) Author Addenda: An Examination of Five Alternatives, D-Lib Magazine, Vol. 12, No. 11, November

Hitchcock, Steve, Tim Brody, Jessie M.N. Hey and Leslie Carr (2007a) Digital Preservation Service Provider Models for Institutional Repositories: towards distributed services, Preserv project, January

Hitchcock, Steve, Tim Brody, Jessie M.N. Hey and Leslie Carr (2007b) Survey of Repository Preservation Policy and Activity. Preserv project, January

Hunter, Jane and Sharmin Choudhury (2003) Implementing Preservation Strategies for Complex Multimedia Objects. Seventh European Conference on Research and Advanced Technology for Digital Libraries, ECDL 2003, Trondheim, Norway, August

James, Hamish, Raivo Ruusalepp, Sheila Anderson and Stephen Pinfield (2003) Feasibility and Requirements Study on Preservation of E-Prints, JISC, 29 October

Lavoie, Brian, and Richard Gartner (2005) Preservation metadata. DPC Technology Watch Series Report 05-01, September

Lee, Bronwyn, Gerard Clifton and Somaya Langley (2006) Australian Partnership for Sustainable Repositories PREMIS Requirement Statement Project Report, National Library of Australia, July

Lupovici, Catherine and Julien Masanès (2000) Metadata for long term-preservation, Nedlib Consortium, July

MacColl, John (2004) DSpace Institutional Repositories and Digital Preservation, DPC Forum on Digital Preservation in Institutional Repositories, London, 19th October, slide 5

National Library of Australia (1999) Preservation Metadata for Digital Collections, 15 October

National Library of New Zealand (2002) Metadata Standards Framework – Preservation Metadata, November

OCLC/RLG Working Group on Preservation Metadata (2002) Preservation Metadata and the OAIS Information Model: A Metadata Framework to Support the Preservation of Digital Objects, June

PREMIS (2005) PREservation Metadata: Implementation Strategies Working Group Data Dictionary for Preservation Metadata: Final Report of the PREMIS Working Group, May