
Metadata for the Nature Digital Archive

Ann Apps and Ross MacIntyre

Manchester Computing, University of Manchester,
Oxford Road, Manchester, M13 9PL, UK.
Email: ann.apps@man.ac.uk, r.macintyre@man.ac.uk

Abstract: During the development of a prototype digital archive of the journal Nature, issues of the journal are digitised into single page PDF files. To effect the display of the journal by article, article and journal metadata is specified, the article header information being described using Dublin Core metadata. This paper describes the metadata definition, and the data handling processes involved in creating the metadata XML files which are employed to make the Nature articles available to an end-user. Some indication is given of further functionality in the archive, including "Nature Trails", and of possible future developments.
Keywords: Nature; metadata; Dublin Core; digital archive.

Introduction

The publishers Macmillan have made available issues from the prestigious weekly scientific journal Nature (1869-1991) (Nature, 1999) to develop a prototype digital archive for use by academics in UK Higher Education. A digital archive of the vast backstore of source material within Nature would provide a valuable research tool, easier access to Nature papers as a teaching aid, and a unique view of scientific history as an aid to historians and sociologists of science. The actual journals used for digitisation are being provided by The Royal Society. The initial conversion and digitisation is being performed by the UK Higher Education Digitisation Service (HEDS) (HEDS, 1999), based at the University of Hertfordshire, whilst the data management and journal access is provided by Manchester Computing (MC, 1999) at the University of Manchester.

The digitised material is provided to Manchester Computing as a single file for each page of an issue of Nature in PDF "image and text" format (MacIntyre and Tanner, 1998). In order to identify the separate articles of an issue, rather than single pages, it was necessary to define metadata for the articles. The Nature application utilises this metadata to allow access to individual articles by an end-user.

Objectives of the Prototype

The prototype digital archive was developed with the following objectives:

Dublin Core Metadata

Dublin Core (DC, 1999), which is a developing World Wide Web standard for metadata specification, was chosen to describe Nature articles. Dublin Core is primarily concerned with the semantics of the metadata rather than the syntax for its inclusion with an information resource. It was possible to specify the requisite metadata of the journal articles within Nature using the fifteen basic elements of Simple Dublin Core, though some elements were ignored because they seemed irrelevant, and some required further specialisation to capture more detail. Dublin Core allows all elements to be optional and repeatable. For the Nature archive metadata, this is true for some elements but not for others. For instance an article may have several authors, or none indicated, whereas each article must have a single title.

The Dublin Core elements for an article within Nature are defined as:

Title
The article title. There is only one title and it is mandatory.
Creator
In the context of Nature articles this is an author. There may be several authors, or no author. Each Creator is further specialised to indicate first name or initials, surname (family name), suffix (eg. FRS), and affiliation. Each of these may appear only once. Only the first name is mandatory. This may seem a strange decision, but it was found that in the early editions of Nature some articles were attributed simply by initials, in some cases by a single initial. Information about the actual persons indicated by these initials is not accessible.
Subject
According to Dublin Core semantics, Subject should contain the topic, which in the context of serial articles would be the keywords. There are no keywords defined for articles within Nature. However this element is used internally to capture information about the type of article.
Description
Dublin Core uses the term Description for the abstract of an article. Articles within Nature do not include an author-supplied abstract, so generally this element contains a copy of the title. It is however used for articles which include a list of items, to indicate to the end-user the article content. For instance an item entitled "Societies and Academies", which is a regular feature within Nature with news from various societies, may have a Description which includes the names of the societies in the report, eg. "Royal Society, London; Geological Society, London; ...".
Publisher
This will always be "Macmillan Publishers Ltd, Crinan St, London".
Contributor
This element is not used.
Date
This is the cover date of the issue containing the article. It is captured in ISO 8601 format (yyyy-mm-dd), eg. 1877-05-03. A second internal Date element is defined which captures the date in "words", eg. "3 May 1877", for end-user display. This second date element is obviously unnecessary but it avoids "on-the-fly" conversions during article metadata display.
Type
This element captures the category of the article. It may be one of a defined list of "DCObjects", most obviously "Article", or it may be an internal type, for example "Book Review", or "Societies and Academies".
Format
The data format of the article. This is always "application/pdf" because all full articles within the Nature archive are in PDF format.
Identifier
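An article identifier. Each article within the Nature archive has two identifiers defined: a SICI (Serial Item and Contribution Identifier) (SICI, 1999); and an internal identifier, eg. "NATAV16I392A19".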
Source
This element is used to capture information about the position of the article within the source journal. It is further specialised to include: the journal title, always Nature; the volume number; the issue number; the start page number for the article; the end page number for the article.
Language
The language of the article. Within Nature, articles are written in English. The language is captured in ISO 639 format as "en".
Relation
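Within Dublin Core, this element is the identifier of a secondary resource and its relationship to the present resource. For the Nature archive two instances of Relation are defined: "IsPartOf", which gives the ISSN of the journal containing the article; and "IsAbstractOf", which gives the location of the full article PDF file along with its file size.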
Coverage
This Dublin Core element is not used.
Rights
A copyright statement which is always "Macmillan Publishers Ltd. Year".

Dublin Core appears to be a good choice for the capture of serial article metadata. It has been possible to include all the required information about an article using the set of simple Dublin Core elements, though some have needed specialisation. One apparent inadequacy in Dublin Core metadata may be its lack of any obvious way to capture version numbering and change history for a document, but this is not a requirement within an archive of already published journal articles.

XML Syntax for Article Metadata

The syntax used for the Nature archive article metadata is XML (XML, 1999). A Document Type Definition (DTD) is defined which reflects the Dublin Core metadata elements listed above. An example of article metadata in XML is:

<NATART>
<TITLE>Sound-Vibrations of Soap-Film Membranes</TITLE>
<CREATOR SCHEME="INTERNAL">
    <FNMS>Edward B.</FNMS>
    <SNM>Tylor</SNM>
    <SFX>F.R.S.</SFX>
    <AFF>Wellington, Somerset</AFF>
</CREATOR>
<SUBJECT SCHEME="INTERNAL">Nature Article</SUBJECT>
<DESCRIPTION>Sound-Vibrations of Soap-Film Membranes</DESCRIPTION>
<PUBLISHER>Macmillan Publishers Ltd, Crinan St, London</PUBLISHER>
<DATE SCHEME="ISO-8601">1877-05-03</DATE>
<DATE SCHEME="INTERNAL">3 May 1877</DATE>
<TYPE SCHEME="DCObjects">Article</TYPE>
<FORMAT SCHEME="IMT">application/pdf</FORMAT>
<IDENTIFIER SCHEME="SICI">0028-0836(18770503)16:392</IDENTIFIER>
<IDENTIFIER SCHEME="INTERNAL">NATAV16I392A19</IDENTIFIER>
<SOURCE SCHEME="INTERNAL">
    <JTL>Nature</JTL>
    <VID>16</VID>
    <IID>392</IID>
    <PPF>12</PPF>
    <PPL>12</PPL>
</SOURCE>
<LANGUAGE SCHEME="ISO-639">en</LANGUAGE>
<RELATION SCHEME="ISSN" RELATION="IsPartOf">0028-0836</RELATION>
<RELATION SCHEME="INTERNAL" RELATION="IsAbstractOf"
    PDFSIZE="322">NATA/V16I392/01603920012a.pdf</RELATION>
<RIGHTS>Macmillan Publishers Ltd. 1877</RIGHTS>
</NATART>
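
As an illustration only, the following Python sketch reads an article header of this form using the standard library. It is not the code used in the project; the function name and the shape of the returned dictionary are assumptions, and the sample is an abbreviated copy of the example above.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the article header example above.
SAMPLE = """<NATART>
<TITLE>Sound-Vibrations of Soap-Film Membranes</TITLE>
<CREATOR SCHEME="INTERNAL">
    <FNMS>Edward B.</FNMS>
    <SNM>Tylor</SNM>
    <SFX>F.R.S.</SFX>
    <AFF>Wellington, Somerset</AFF>
</CREATOR>
<DATE SCHEME="ISO-8601">1877-05-03</DATE>
<DATE SCHEME="INTERNAL">3 May 1877</DATE>
<SOURCE SCHEME="INTERNAL">
    <JTL>Nature</JTL>
    <VID>16</VID>
    <IID>392</IID>
    <PPF>12</PPF>
    <PPL>12</PPL>
</SOURCE>
<RELATION SCHEME="ISSN" RELATION="IsPartOf">0028-0836</RELATION>
<RELATION SCHEME="INTERNAL" RELATION="IsAbstractOf"
    PDFSIZE="322">NATA/V16I392/01603920012a.pdf</RELATION>
</NATART>"""


def read_article(xml_text):
    """Return a small dictionary of fields from one article header (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    authors = []
    # Creator is repeatable and may be absent altogether.
    for creator in root.findall("CREATOR"):
        parts = [creator.findtext(tag, default="").strip()
                 for tag in ("FNMS", "SNM", "SFX")]
        authors.append(" ".join(p for p in parts if p))
    source = root.find("SOURCE")
    # The two Relation elements are told apart by their RELATION attribute.
    pdf = root.find("RELATION[@RELATION='IsAbstractOf']")
    return {
        "title": root.findtext("TITLE"),
        "authors": authors,
        "iso_date": root.findtext("DATE[@SCHEME='ISO-8601']"),
        "pages": (int(source.findtext("PPF")), int(source.findtext("PPL"))),
        "pdf": pdf.text if pdf is not None else None,
    }


article = read_article(SAMPLE)
```

Selecting elements by their SCHEME or RELATION attribute, rather than by position, keeps the sketch independent of the order in which repeated elements appear.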

Article Header File Metadata

The XML article metadata files, or article header files, detailed above, themselves contain metadata. This metadata is captured in a HEAD section of the file using HTML-style META tags. It contains similar information to the article header, the difference being that this is the metadata for the article header XML file not for the article itself. Thus the Format element contains "text/xml" and the internal identifier is the file path to this XML file. This article header metadata does not include the internal version of the date nor a second Relation element indicating the full article.

In addition, an experimental RDF (WWW Resource Description Framework (RDF, 1999)) metadata file is created for each article header. This allows the article header metadata to be defined in a more rigorous fashion than simply using the HTML-style META tags. Dublin Core elements which are specialised to include extra detail, such as Creator and Source, are described by a Nature RDF schema which specialises these elements over their Dublin Core RDF schema descriptions. The location of this RDF metadata file is defined by an HTML-style LINK element within the HEAD section of the article header file.

Because the purpose of metadata on WWW files is primarily information discovery, the addition of Dublin Core metadata, both META tags and RDF, was strictly unnecessary: the articles within the Nature archive are not available for discovery by web search engines, but only through login to the Nature archive application. However, it provides cataloguing of the articles which will be available for future utilisation, and it is good practice to include metadata with all files intended for WWW access. The RDF metadata files are regarded as experimental because RDF is an emerging standard which is still undergoing specification and change, this being particularly true of RDF schemas.

PDF Full Article Metadata

Using the XML article metadata it is possible to determine which pages correspond to which articles, so that the supplied single page PDF files may be composed into PDF articles. Obviously this involves repeating some of the PDF pages, some several times, where more than one article requires a particular page.
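
The page arithmetic can be sketched as follows, assuming only that each article carries the start and end page numbers recorded in its Source element. The article identifiers and page ranges in the example are hypothetical, and the actual composition of the PDF files is not shown.

```python
from collections import Counter


def pages_for_article(first_page, last_page):
    """List the page numbers to be composed into one article PDF."""
    return list(range(first_page, last_page + 1))


def page_usage(articles):
    """Count how many articles need each page of an issue."""
    usage = Counter()
    for first_page, last_page in articles.values():
        usage.update(pages_for_article(first_page, last_page))
    return usage


# Hypothetical issue: three articles that all touch page 12.
articles = {"A18": (10, 12), "A19": (12, 12), "A20": (12, 14)}
shared = {page: n for page, n in page_usage(articles).items() if n > 1}
```

Here page 12 would be copied into three article PDFs, which is the repetition described above.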

Metadata is added to these PDF articles by way of the PDF Document Information. This consists of four basic simple text fields: Title; Subject; Author; Keywords. The title field is copied from the XML article metadata. The author field captures the list of authors unless the article is unattributed. The subject and keywords fields are left blank. Addition of this metadata to the PDF file identifies the article after discovery by the Nature archive search engine, allowing for a user-friendly list of search results. Without this metadata, search results would be listed by file name.

Manchester Computing developed the code to insert the PDF metadata and to combine the PDF files into articles, using the Acrobat run-time libraries obtained through their membership of the Adobe Developers' Association (ADA, 1999).

Tables of Contents

XML files describe Tables of Contents within the "Nature/issue" hierarchy effectively providing metadata for, and so cataloguing, the archive as a whole. The elements defined for this Nature archive metadata are not Dublin Core, but elements specific to the Nature application defined by XML DTDs. All of these XML Tables of Contents files include information about the journal: journal title, ie. Nature; publisher; ISSN; copyright statement.

The issue metadata, in addition to specifying the volume and issue numbers and the cover date, lists the articles included within the issue. The entry for each article includes: the article title; the authors, if known; the article start and end page numbers; the location of the article metadata XML file; and the location of the PDF full article along with its file size.

The Nature metadata lists the available issues by year. The entry for each issue includes: volume number; issue number; cover date; start and end page numbers for the issue, ie. the first page of the first article and the last page of the last article.

As for the article headers, all the XML Tables of Contents files include metadata describing themselves, both as HTML-style META tags and with separate RDF files.

Nature Archive Data Handling Process

The data for an issue of Nature supplied to Manchester Computing is processed by several data conversion programs before it becomes instantiated as a new issue available via the Nature application.

Nature by Page

Nature articles are supplied to Manchester Computing as a single PDF file for each page within an issue of Nature. These files are named according to a standard convention which uniquely defines for each file its volume, issue and page numbers. Using this filename, basic XML "article" metadata is generated, considering each page as a separate article with a "title" composed from the issue and page number. This page metadata and associated "issue by page" metadata would allow for a simple Nature application to provide the digitised Nature pages to be viewed without any knowledge of the article content. For article discovery, a user would have to know the exact issue and page number required, or would use a search engine. But this simple application would allow for digitised Nature content to be mounted quickly and before any article metadata were provided.

During this initial page data handling process, some validation of the supplied PDF files is performed by running them through a utility which extracts the text from them. This utility would report any damaged PDF files. The single page PDF files are retained as a backup of the supplied data until the article PDF manipulation has been performed.

Nature by Article

The article metadata is created in plain tagged format by an external "keying agency", Saztec Europe Ltd. (Saztec, 1999), which specialises in bibliographic record creation. This task involves more than just re-keying the tables of contents for each issue, which in many cases contain insufficient information, especially for the early issues. The metadata for each article must be identified from the actual printed copy of the issue, which is loaned by John Rylands University Library of Manchester for this purpose. Only the variable items of metadata are supplied in this way in plain tagged format. The fixed items, such as publisher name, are included automatically during generation of the XML metadata files.

This plain tagged metadata is supplied in a single file for each issue. At the head of the file is the information common to all articles in the issue: volume number; issue number; cover date. For example:

VO 016
IS 0392
CD Thursday, May 3, 1877
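
A minimal sketch of reading such a header, and of deriving the two Date forms described earlier, might look like the following. It assumes that the cover date always follows the English "weekday, month day, year" pattern of the example; the function name and dictionary keys are hypothetical.

```python
import calendar
from datetime import datetime


def parse_issue_header(lines):
    """Parse the VO / IS / CD lines at the head of an issue file."""
    # Each line is a two-letter tag, a space, then the value.
    fields = dict(line.split(" ", 1) for line in lines if line.strip())
    cover = datetime.strptime(fields["CD"].strip(), "%A, %B %d, %Y").date()
    return {
        "volume": int(fields["VO"]),    # "016" becomes 16
        "issue": int(fields["IS"]),     # "0392" becomes 392
        "iso_date": cover.isoformat(),  # ISO 8601 form, yyyy-mm-dd
        # Display form in "words", built without a zero-padded day.
        "display_date": f"{cover.day} {calendar.month_name[cover.month]} {cover.year}",
    }


header = parse_issue_header(["VO 016", "IS 0392", "CD Thursday, May 3, 1877"])
```

The month name is taken from the calendar module rather than from strftime, which avoids platform differences in formatting nineteenth century dates.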

An entry for each article specifies: article title; authors; description; article type; start and end page numbers for the article. For example, noting that this particular short article starts and ends on the same page:

TI Sound-Vibrations of Soap-Film Membranes
AU Edward B./Tylor/F.R.S./Wellington, Somerset
DE Sound-Vibrations of Soap-Film Membranes
TY Article
PP 012/012
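
The conversion of one such entry into XML could be sketched as below. This is an assumption-laden illustration rather than the project's conversion program: it supposes that the AU fields are separated by "/" in the order first name, surname, suffix, affiliation, as the example suggests; that any type other than "Article" is an internal type; and that only a few of the fixed items are added.

```python
import xml.etree.ElementTree as ET

# The article entry from the example above.
RECORD = [
    "TI Sound-Vibrations of Soap-Film Membranes",
    "AU Edward B./Tylor/F.R.S./Wellington, Somerset",
    "DE Sound-Vibrations of Soap-Film Membranes",
    "TY Article",
    "PP 012/012",
]


def record_to_xml(record_lines, volume, issue):
    """Build a NATART element from one plain tagged entry (hypothetical helper)."""
    art = ET.Element("NATART")
    for line in record_lines:
        tag, _, value = line.partition(" ")
        value = value.strip()
        if tag == "TI":
            ET.SubElement(art, "TITLE").text = value
        elif tag == "AU":
            creator = ET.SubElement(art, "CREATOR", SCHEME="INTERNAL")
            # Assumed field order: first name / surname / suffix / affiliation.
            for name, part in zip(("FNMS", "SNM", "SFX", "AFF"), value.split("/")):
                if part:
                    ET.SubElement(creator, name).text = part
        elif tag == "DE":
            ET.SubElement(art, "DESCRIPTION").text = value
        elif tag == "TY":
            # Assumption: only "Article" is treated as a DCObjects type here.
            scheme = "DCObjects" if value == "Article" else "INTERNAL"
            ET.SubElement(art, "TYPE", SCHEME=scheme).text = value
        elif tag == "PP":
            first, last = value.split("/")
            source = ET.SubElement(art, "SOURCE", SCHEME="INTERNAL")
            ET.SubElement(source, "JTL").text = "Nature"
            ET.SubElement(source, "VID").text = str(volume)
            ET.SubElement(source, "IID").text = str(issue)
            ET.SubElement(source, "PPF").text = str(int(first))
            ET.SubElement(source, "PPL").text = str(int(last))
    # Fixed items are added automatically rather than keyed.
    ET.SubElement(art, "PUBLISHER").text = "Macmillan Publishers Ltd, Crinan St, London"
    ET.SubElement(art, "LANGUAGE", SCHEME="ISO-639").text = "en"
    return art


article = record_to_xml(RECORD, volume=16, issue=392)
xml_text = ET.tostring(article, encoding="unicode")
```

The element order produced here differs from the example header file, and the dates, identifiers and Relation elements are omitted for brevity.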

Data handling to provide "Nature by article" involves the following steps:

Nature Application Browsing

The browsing interface to the Nature archive displays a list of the available issues by year. Selection of an issue causes the display of the list of articles within the issue, with the option for each article to either view the article metadata or to directly view the full article in PDF format.

The Nature application which implements this interface is a C program which processes the information within the XML metadata files for the Nature archive and for each issue, displaying the tables of contents in HTML. This C program also controls access to the archive, currently by IP address checking. Because the "Nature/issue" hierarchy metadata has been captured in XML it would be possible for the program used for its HTML display to be written in an XML aware language such as OmniMark or Perl. It is expected that future web browsers will offer native support of XML, making the conversion to HTML unnecessary.

Nature Archive Article Metadata Display

The metadata for an article is displayed to an end-user using an OmniMark program which generates HTML from the article's XML metadata file "on the fly". The article metadata display includes the article title, authors and description, and provides a hypertext link to the full article PDF file. Generating the HTML dynamically allows further links to be included to improve functionality, while obviating the need to regenerate the base article metadata files. Possibly in the future web browsers could display the XML article metadata directly.
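
The kind of transformation involved can be sketched in Python, although the archive itself uses OmniMark for this step. The function name, the HTML layout and the base_url parameter are all hypothetical, and the sample is an abbreviated copy of the earlier article header example.

```python
import html
import xml.etree.ElementTree as ET

# Abbreviated copy of the article header example given earlier.
EXAMPLE = """<NATART>
<TITLE>Sound-Vibrations of Soap-Film Membranes</TITLE>
<CREATOR SCHEME="INTERNAL">
    <FNMS>Edward B.</FNMS>
    <SNM>Tylor</SNM>
</CREATOR>
<DESCRIPTION>Sound-Vibrations of Soap-Film Membranes</DESCRIPTION>
<RELATION SCHEME="INTERNAL" RELATION="IsAbstractOf"
    PDFSIZE="322">NATA/V16I392/01603920012a.pdf</RELATION>
</NATART>"""


def article_to_html(xml_text, base_url=""):
    """Render a minimal HTML fragment from one article header file."""
    root = ET.fromstring(xml_text)
    title = html.escape(root.findtext("TITLE", default=""))
    authors = []
    for creator in root.findall("CREATOR"):
        names = [creator.findtext(tag, default="").strip() for tag in ("FNMS", "SNM")]
        authors.append(html.escape(" ".join(n for n in names if n)))
    description = html.escape(root.findtext("DESCRIPTION", default=""))
    lines = [f"<h1>{title}</h1>"]
    if authors:  # unattributed articles have no Creator element
        lines.append(f"<p>{'; '.join(authors)}</p>")
    lines.append(f"<p>{description}</p>")
    # The full article location is held in the "IsAbstractOf" Relation.
    pdf = root.find("RELATION[@RELATION='IsAbstractOf']")
    if pdf is not None and pdf.text:
        href = html.escape(base_url + pdf.text.strip(), quote=True)
        lines.append(f'<p><a href="{href}">Full article (PDF)</a></p>')
    return "\n".join(lines)


page = article_to_html(EXAMPLE, base_url="/archive/")
```

Because the HTML is produced on each request, extra links can be added by changing only this rendering step, leaving the stored XML files untouched.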

Development of the Nature Archive Application

The current Nature archive application has been developed via several experimental versions as the possibilities for its implementation and functionality were explored.

The initial version of the Nature application displayed each issue as separate pages with no knowledge of the articles within the pages. Provision of article metadata was then explored. Initially the article metadata was typed at Manchester Computing but it quickly became apparent that this was not a viable option because of the volume of data required especially for more recent issues. Even after article metadata was included, the PDF files were retained as separate pages with the application interface providing hypertext links to each PDF file required for an article.

The early versions of the Nature application were based on the SuperJournal (SuperJournal, 1999) application which was previously developed at Manchester Computing. Initially this was controlled by an object database (Fujitsu/ICL ODB-II) to provide a browsing interface, before the application was altered to be controlled by a simpler program processing the archive metadata files. In these versions of the application, article metadata was held as SGML conformant to the SuperJournal Header DTD (Apps and MacIntyre, 1999), a DTD developed at Manchester Computing specifically for the SuperJournal project. This implied duplication of the article metadata files which were also held in an "in house" Dublin Core SGML format.

Other Application Features

The Nature archive application includes the Verity (Verity, 1999) search engine. Both fielded, bibliographic searching over the article metadata, and full-text searching over the PDF article files are supported. The PDF pages are only ever displayed as images, even though OCR'd text is also included within the PDF files, because this OCR'd text is not corrected during the production process. But when an article is discovered via searching the search term is highlighted in the displayed PDF file. The choice of the search engine, and additional lexical analysis tools, took into account the nature of the text produced.

Experiments have been made in including "themed" access within the application. So-called "Nature Trails" have been supported by the insertion of links into the PDF image files as they are served. This can be used to link together articles relating to a common theme, eg. "Early Man". It is planned to include data extracted from the XML article metadata files to better describe these links. This method avoids changing the image files themselves: given the variety of possible eventual uses for the archive, "hard-linking" would be undesirable.

Conclusion

The development of the prototype Nature digital archive has been challenging and interesting work, the interest enhanced by the fascinating historical scientific content of the journal itself. Development of the enhanced functionality within the archive will continue, particularly in the area of themed access.

Currently the prototype includes 682 articles within 32 issues over 5 volumes, including one complete volume, with dates in 1874, 1877, 1937, 1985 and 1987. Thus the archive covers a range of layout styles and graphical content including some in colour. A further 4 complete volumes are expected shortly, comprising a total of 55 issues from years 1929, 1965, 1979, 1989.

It is hoped that this prototype digital archive will be expanded to become a complete archive for the issues of Nature from 1869 to 1991. Nature is one of the most prestigious science journals in the world and a digital archive would provide a valuable resource to UK academia for research and teaching.

The Nature Digital Archive project has provided a testbed for using Dublin Core, as well as experiments with XML and RDF, for metadata specification and manipulation. The results of this project indicate that Dublin Core is a good choice for capturing serial article metadata. From this experience, Manchester Computing are continuing to use Dublin Core metadata semantics within an XML syntax for other electronic journal applications.

Acknowledgements

The project to develop the prototype Nature digital archive is funded by the Joint Information Systems Committee (JISC) of the UK Higher Education Funding Councils.
The journal content has been made available to UK academia by Macmillan Publishers Ltd., physical volumes being supplied by The Royal Society and John Rylands University Library of Manchester.

References

[ADA, 1999]
ADA, Adobe Developers Association. http://partners.adobe.com/supportservice/devrelations
[Apps and MacIntyre, 1999]
Apps A, MacIntyre R. (1999). Proceedings of ICCC/IFIP Third Conference on Electronic Publishing '99: Redefining the Information Chain - New Ways and Voices, May 1999, Ronneby, Sweden, ICCC Press. (The SuperJournal Project: Data Handling Using SGML.)
[DC, 1999]
Dublin Core metadata. http://purl.org/DC/
[HEDS, 1999]
HEDS, UK Higher Education Digitisation Service. http://heds.herts.ac.uk
[Nature, 1999]
Nature. http://www.nature.com
[MacIntyre and Tanner, 1998]
MacIntyre R, Tanner S. (1998). DELOS 6th Workshop, June 1998, Lisbon, Portugal. (Nature - A Prototype Digital Archive.)
[MC, 1999]
Manchester Computing (at The University of Manchester). http://www.mcc.ac.uk
[OmniMark, 1999]
OmniMark Technologies. http://www.omnimark.com
[RDF, 1999]
Resource Description Framework. http://www.w3.org/RDF
[Saztec, 1999]
Saztec Europe Ltd. http://www.saztec.com
[SICI, 1999]
SICI, Serial Item and Contribution Identifier. http://sunsite.berkeley.edu/SICI/
[SuperJournal, 1999]
The SuperJournal Project. http://www.superjournal.ac.uk/sj/
[SX, 1999]
SX (XML parser and validator). http://www.jclark.com/sp/sx.htm
[Verity, 1999]
Verity. http://www.verity.com
[XML, 1999]
XML. http://www.w3.org/XML

31 May 1999 (xhtml update 27 June 2002)
