"epub@mimas" Report


XML in Publishing

SGML UK / BCS Electronic Publishing Specialist Group Meeting
Wednesday 8th November 2000
St John's College, Cambridge
Report by: Ann Apps


This seminar covered an interesting selection of issues concerned with the use of XML in the commercial publishing sector, and included some lively discussion. The meeting was held in St John's College, Cambridge, with lunch in the impressive college hall requiring a saunter through the college cloisters and across the Bridge of Sighs. The seminar was accompanied by a Technology Showcase where several software vendors demonstrated their wares. This was a popular, well-attended seminar, especially considering the difficulties with travelling at the time because of rail speed restrictions and flooding.

XML and publishing: an overview

Alex Brown, Technical Director, Griffin Brown Digital Publishing Ltd.

This talk looked at how XML has evolved from previous standards and how these changes have affected the publishing industry.

SGML became a standard in 1986, with developments such as HyTime in 1992 for linking and DSSSL in 1996 for formatted output. SGML was supposed to solve the publishing industry's problems, but in reality it suffered from poor tool support (implementation was hard and expensive), a lack of mainstream commercial interest, and heavy demands on computing resources. It did have some success in the niche publishing markets of scientific journals and reference publishing.

In 1993 HTML, an SGML application, appeared to fulfil the dream of information freely accessible over the internet. But HTML subverted many SGML principles: it was more concerned with how content was rendered than with what it was, and mark-up validity was compromised by web browsers.

In 1998 XML version 1.0 became a W3C recommendation, promising to address these shortcomings.

So far XML appears to be a lot of hype which has not yet been fulfilled: XML Schema will replace the DTD, but it still has some way to go; XML linking (XLink) is still under development; XSLT and related standards will be easier to use than DSSSL, but again they are not yet fixed. Whether XML is applicable to publishing depends on whether one believes the hype.

Publishing consists of: commission; production; manufacturing; distribution and sale. Production consists of: originate; edit; typeset; produce SGML. But this is an iterative process. In reality much of the problem with using SGML is a quality issue. The production workflow is still geared towards print, whereas today the electronic product is needed earlier.

Publishing should be seen as the exploitation of content to make money, with the content delivered to wherever it will make a profit. This changes the process to a content-centric mode and a one-way flow: acquire; manage; produce; products. This becomes the reason to use XML - acquisition means getting the data into the organisation as XML, the rest being XML content management. This content-centric approach no longer sees content as a slave to the process, and removes proof-reading from being part of the typesetting process.

Practical experiences with the adoption of XML in commercial publishing

Richard Kidd and Neil Hunter, Royal Society of Chemistry

This presentation described the practical experiences at RSC of developing a new web-based electronic journal. The earlier application was based on a conventional publishing model developed from paper-based editing, through PDF for print, to PDF with SGML headers on the web. This model showed a lack of flexibility in electronic delivery, electronic data of variable quality and no archive.

The requirements for the new web application included: publication on the web before print, article by article; a reduction in publication times; a reduction in costs; on-screen editing; one source for all outputs. The ideal process was seen as: author data captured as SGML; RSC edit the SGML; auto proof-reading; RSC correct SGML; RSC publish as HTML and PDF; SGML sent to typesetter for final formatting for print.

There was a question of where to start making the change: as a `big bang'; at the `capture' stage; at the `editing' stage? It was recognised that SGML DTD development, with its associated document analysis, tools evaluation, support and training, is expensive for the typesetters as well as for RSC, and that the data would be problematic because of complicated mathematics and tables. So it was decided to start at the end, after the final corrections. Then the DTD could be developed against real data, practical experience of SGML/XML could be gained, and production wouldn't be affected. Investigation of a data repository and an editing process could be done later; at first the file system was used as a repository and a pragmatic approach was taken to the tables, maths and bibliographic references. At this point XML arrived! It was seen to provide the parts of SGML that were actually needed.

The application uses Microsoft MSXML. Using it with IE5 allowed testing of the DTD, the XML data and proofs of concept. It is used to generate static HTML pages. The toolset includes XSLT, the DOM and a parser. ASP and JScript are used to pre-process documents via the DOM, with Unicode characters resolved to graphics/glyphs. The application is Unicode-ready but browser support is not yet available. This has proved to be an inexpensive, well documented and reliable toolset which is fully conformant with the XML standard.
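
As a rough illustration of this kind of pipeline (not RSC's actual stylesheet, and with invented element names such as article, title and para), an XSLT template for turning article XML into a static HTML page might look like this:

    <?xml version="1.0"?>
    <!-- Illustrative XSLT 1.0 stylesheet of the kind MSXML can apply to
         article XML to produce a static HTML page; element names invented. -->
    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="html"/>

      <!-- The article root becomes an HTML page -->
      <xsl:template match="/article">
        <html>
          <head><title><xsl:value-of select="title"/></title></head>
          <body>
            <h1><xsl:value-of select="title"/></h1>
            <xsl:apply-templates select="para"/>
          </body>
        </html>
      </xsl:template>

      <!-- Each paragraph becomes an HTML p element -->
      <xsl:template match="para">
        <p><xsl:apply-templates/></p>
      </xsl:template>
    </xsl:stylesheet>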

The next stage is to bring XML development forward into the other parts of the publishing process. A pilot project with one supplier is looking at using XML instead of SGML within the supplier workflow: data captured as XML; RSC edit XML; typesetter creates auto-proof; RSC correct XML; RSC creates HTML; final XML. They have looked at on-screen editing using Arbortext, and may look at SoftQuad XMetaL in future. The problem to be solved is the change from editing on paper to editing on screen, rather than the choice of tool. The next steps include: a rollout to the remaining suppliers; continuing training of editors and process improvement; and an aim for a full XML workflow by mid 2001.

Gaining control of the data should allow the next developments: creation of proofs in HTML or PDF using XSL:FO; all outputs from one source using XSLT; integration with a manuscript tracking system, with a distinction between a business data layer and a user interface layer; cross-publisher article linking, such as CrossRef, possibly with their own database for storing the results of look-ups.
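
For context, a proof generated via XSL:FO would be an XML document of roughly the following shape, which a formatting engine then renders to PDF; this is a generic minimal example, not RSC's actual output:

    <?xml version="1.0"?>
    <!-- Generic minimal XSL:FO document: one page master and one flow of text. -->
    <fo:root xmlns:fo="http://www.w3.org/1999/XSL/Format">
      <fo:layout-master-set>
        <fo:simple-page-master master-name="proof"
                               page-height="297mm" page-width="210mm"
                               margin="20mm">
          <fo:region-body/>
        </fo:simple-page-master>
      </fo:layout-master-set>
      <fo:page-sequence master-reference="proof">
        <fo:flow flow-name="xsl-region-body">
          <fo:block font-size="14pt" space-after="6pt">Article title</fo:block>
          <fo:block>Body text of the article proof goes here.</fo:block>
        </fo:flow>
      </fo:page-sequence>
    </fo:root>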

Future developments could include: SVG, MathML, CML; templates to simplify data capture; use of XSL:FO - one file for all outputs; improved Unicode support; improved search functionality; XML displayed direct to browser, or HTML generation on-the-fly, with customised views.

It seems that choosing to use XML has put them in a good position to react quickly to new changes. It has allowed RSC to reap the benefits of using open standards. The archive will serve future publishing needs. The continuous publication model for individual articles has saved 14 days in publication time and points to a future of virtual journals. Costs have been reduced. Now that the data is structured it is possible to concentrate on adding value.

Their conclusions: data is not trustworthy (when planning developments) unless it is `real'; a publisher knows their own data `inside out'; use of XML will give control over the information which is the business; and using XML has helped develop co-operation between suppliers and committed staff, provided expertise in DTD development, and provided industry support for XML standards.

NewsML - a revolution in news

Tom Thomson, Director, News 2 Web Programme, Reuters Ltd

NewsML is a metadata wrapper for news. News is a commodity to be transacted over the internet and Reuters need to maintain a competitive edge; it is important that they keep an edge of seconds over competitors. A news item can consist of several media: story; pictures; videos; graphics; etc. The development of the standard NewsML DTD as a wrapper for multimedia news items is a re-engineering of the news industry. Different customers can extract particular items from the wrapper, possibly selecting by language or display device as well as content. It also allows for adding value to the news or maximising the value offered.
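
To give a feel for the wrapper idea, the sketch below shows one news item packaging several renditions of a story; the element and attribute names are indicative of the NewsML style rather than quoted from the specification:

    <?xml version="1.0"?>
    <!-- Indicative sketch of a NewsML-style wrapper: one news item packaging
         text in two languages plus an associated picture and video, all
         referenced rather than embedded. Names are illustrative only. -->
    <NewsML>
      <NewsItem>
        <NewsComponent>
          <ContentItem Href="story-en.xml" MediaType="text"    Language="en"/>
          <ContentItem Href="story-de.xml" MediaType="text"    Language="de"/>
          <ContentItem Href="photo.jpg"    MediaType="picture"/>
          <ContentItem Href="clip.mpg"     MediaType="video"/>
        </NewsComponent>
      </NewsItem>
    </NewsML>

A customer system can then pull out just the items it wants - for example only the German text for a mobile device - leaving the rest in the pipe.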

The News 2 Web project is replacing computer systems on the production side with wrapped streams of text, pictures, etc. The same news may serve different markets: financial; retail; mobile; new media (web sites). Style sheets will also be offered, although some customers do their own rendering. The theory is `one pipe for all customers'.

Development has involved: looking at metadata - what to describe, standards, metadata dictionaries; internet standards - other flavours of XML, news alerts (these are retyped at present in 24 languages!); the workflow - how to do the wrapping, where and how to add metadata, how to pack the item.

Rollout is planned for 2001 with all Reuters' news delivered in this format by 2002. Personalisation is seen as the key to the future of news. The benefits to users are: threads, topics, aggregation, etc. Optimisation of bandwidth is another consideration. This may be by including pointers to multimedia items rather than the clips themselves, or by compression of the metadata.

Digital Rights Management with XML

Eamonn Neylon, Technology Director, The YRM Group

YRM are a small US-based rights transactions clearing house service and consultancy. Rights management is about information commerce. `In the digital world all transactions are rights transactions' (Sally Morris, ALPSP).

Digital rights management definitions: trusted exchange of digital content; management of digital rights and digital management of rights (in all formats); protection of content to ensure only allowed operations will be performed; the latest investment craze of venture capitalists in the US; a Pandora's box with fundamental consequences for the future of mankind. Types of rights (this is the boring bit!): statutory - legislation, fair use and moral rights; contractual - use within established limits; permissions - extending for one-time usage, e.g. copying; sub-rights - sale of a portion of copyright. Rights are a new form of revenue.

Negotiating permissions needs good quality metadata: need to be able to express both bibliographic and rights metadata; context of use is a factor; particular type of licence is constructed; to distribute monies collected. Currently copyright anarchy exists. Expressing granted rights: needs a vocabulary for specifying what is being bought / licensed; selling confers rights - need to define what is allowed; licensing can enforce specific use; need to know parties (not people) involved in transaction and content being licensed.

The rights languages currently proposed are XrML (the eXtensible rights Markup Language) and ODRL (the Open Digital Rights Language).

Comparison of XrML / ODRL: both cover usage rather than access rights; different governance models - XrML is licensed intellectual property, whereas ODRL seeks to be an open development; different levels of maturity - XrML is mature, ODRL is a proposal.
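
To show the sort of statement such a language expresses, here is a rough, generic sketch of a usage-rights agreement; the element names are invented for illustration and follow neither XrML nor ODRL syntax:

    <?xml version="1.0"?>
    <!-- Invented sketch of a usage-rights expression: which party may perform
         which operations on which asset, under what constraints. This follows
         neither XrML nor ODRL syntax; it only illustrates the idea. -->
    <rightsAgreement>
      <asset id="doi:10.1000/example-article"/>
      <party role="licensee">Example University Library</party>
      <permission>
        <display/>                  <!-- on-screen viewing allowed -->
        <print maxCopies="5"/>      <!-- limited printing allowed -->
        <!-- anything not listed, such as redistribution, is not granted -->
      </permission>
      <payment amount="25.00" currency="GBP"/>
    </rightsAgreement>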

An alternative approach is to use metadata: metadata used as resource description is information about things; metadata can be used as an `about' wrapper; it can provide a consistent view to serve a particular purpose - there can be multiple representations of intuitive models; the <indecs> project did much research into using metadata for exchanging content; other standards groups are looking at rights - MPEG, SDMI (Secure Digital Music Initiative), Dublin Core. It is necessary to identify the components: unique identifiers are needed to track and authenticate resources, parties, transactions; ISBN can be used at the product level, so can DOI which accommodates other identifier standards and is consistently available (CrossRef); need to be able to identify personas; need identifiers at different levels.

Technical implementations of digital rights management could cover: digital watermarking - but this needs legal measures to be enforced; access controls such as user identification, authentication or digital certificates; content wrappers to provide usage control; super-distribution - packaging content for sharing, with dynamic negotiation of rights. Consumer rights need to be considered: legislation and common practice; the Digital Millennium Copyright Act outlaws circumventing copyright protections; exceptions are access for fair use and research into cryptography; access controls and the first sale doctrine - is this applicable to digital works?

Conclusion: there is a substantial industry growing round protection and transaction of media assets, for instance the coding of out-of-print copyright works for e-books; the traditional publishing industry will end the race to develop digital rights management when publishing business models are supported; eventually rights management will become part of the Operating System, similar to TV broadcasting systems.

XML and e-commerce in book and serials publishing

Francis Cave, Chairman, SGML UK; Francis Cave Digital Publishing

The EPICS and ONIX projects are aiming to encode descriptions of published products for e-commerce and trading downstream in the supply chain, in the context of other e-commerce applications. These are projects / initiatives using XML. They use DOI as an identifier for intellectual property entities which is persistent, actionable (ie. it includes the technology for resolution) and interoperable. CrossRef is a genre application for learned journal articles; there are also applications, such as the BioImage database, outside the traditional publishing areas.

DOI (www.doi.org) provides kernel metadata - a minimal description of an identified entity - consisting of: the DOI itself, genre, a unique identifier such as ISBN, entity type, origination, primary agent and agent role. It needs an extension for genre metadata which focuses on rights metadata following the <indecs> model. In future, registration of metadata in XML will be mandatory, using a format based on ONIX. DOI and ONIX are aiming to conform to the <indecs> data model to describe digital rights management, using a vocabulary in the digital rights area.
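
As a rough illustration, a kernel record with those fields might be expressed in XML along the following lines; the element names and values are invented, not taken from the DOI metadata specification:

    <?xml version="1.0"?>
    <!-- Invented sketch of a DOI kernel metadata record using the fields
         listed above; element names and values are illustrative only. -->
    <doiKernel>
      <doi>10.1000/example</doi>                <!-- hypothetical DOI -->
      <genre>monograph</genre>
      <identifier type="ISBN">0-00-000000-0</identifier>
      <entityType>digital</entityType>
      <origination>original</origination>
      <primaryAgent>Example Publisher Ltd</primaryAgent>
      <agentRole>publisher</agentRole>
    </doiKernel>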

EPICS (developed by EDItEUR) is looking at book trade metadata and international e-commerce standards, to promote the use of EDI in publishing. EPICS has developed an abstract data dictionary or vocabulary. The pilot version uses an XML expression, with the DTD based on RDF. This will provide a vocabulary source for ONIX.

ONIX is being developed as an XML application, currently using an XML DTD, but with the intention of moving to an XML schema later. ONIX messages are valid XML documents. There was a requirement to produce something as quickly as possible, and to have a small DTD with few tags and a flat structure. It was decided to have one DTD but with two implementation levels. Level 1 has minimal composite elements and short numeric element names; level 2 will have more composites. ONIX descriptions and element groups cover such items as product numbers, forms, series details, audience, conference details, bibliographic data and rights (territorial, supplier and prices, sales promotion). Some metadata is grouped: for example there is a group of contributors (author, editor, translator, etc).
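
The sketch below gives a flavour of such a message - a product record with a contributor composite; the tag names and values are indicative of the ONIX style only and are not quoted from the ONIX DTD:

    <?xml version="1.0"?>
    <!-- Indicative sketch of an ONIX-style product record with a contributor
         composite; tag names and values are illustrative only. -->
    <Product>
      <RecordReference>example.publisher.0001</RecordReference>
      <ProductIdentifier>
        <IDType>ISBN</IDType>
        <IDValue>0-00-000000-0</IDValue>      <!-- hypothetical ISBN -->
      </ProductIdentifier>
      <Title>An Example Monograph</Title>
      <Contributor>
        <ContributorRole>Author</ContributorRole>
        <PersonName>A. N. Author</PersonName>
      </Contributor>
      <Contributor>
        <ContributorRole>Translator</ContributorRole>
        <PersonName>T. Translator</PersonName>
      </Contributor>
    </Product>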

When deciding on element names there were two camps within the project. One group preferred short, `dumb' element names, because there was concern about message length and because this is an international standard, so there shouldn't be an implicit semantic from an English name. The second group preferred mnemonic names because these would be easier for support systems and content creation. Thus the DTD contains two names for every element, implemented by a marked section switch. But this will present a problem when development moves on to using an XML schema.
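
A minimal sketch of the marked-section technique, with invented element and entity names rather than the real ONIX ones: switching the two parameter entities between INCLUDE and IGNORE selects which spelling of an element is declared.

    <!-- Sketch of a dual-naming DTD (external subset) using conditional
         sections; element and entity names are invented, not the ONIX ones. -->
    <!ENTITY % use.shortnames "INCLUDE">
    <!ENTITY % use.longnames  "IGNORE">

    <![%use.shortnames;[
      <!ELEMENT b011 (#PCDATA)>          <!-- short, `dumb' name -->
    ]]>
    <![%use.longnames;[
      <!ELEMENT TitleText (#PCDATA)>     <!-- mnemonic name -->
    ]]>

XML Schema has no equivalent of conditional sections, hence the problem noted above.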

The current status is: version 1.1 was issued in July / August 2000; version 1.2 was expected in November 2000; an XML schema, to enable verification of element content, is under development; extensions are in progress for e-books (December 2000), video (2001) and rights metadata. Serials publishing needs integrating into this model. There are some future XML/EDI initiatives, eg. using XML for trade with libraries and a draft DTD for library book ordering. Future direction is likely to be influenced by the Global Commerce Initiative (GCI) and ebXML (www.ebxml.org).

Publishing for profit on the internet

John Chelson, Managing Director, CSW Informatics Ltd.

SGML used to be a single source for publishing. Print used to make money. Publishers tried selling CDs but this wasn't very lucrative. Then they moved to HTML and PDF on the Web, but the Web has lots of free high quality content and it is difficult to protect the content you charge for. Now XML has arrived and publishers think they can make money using XML!

This talk was based on a case study of a European online portal for doctors.

Publishers want online books: to promote print sales; using XML online gives control over structure and content; it is possible to have a full audit trail of navigation and searching which gives feedback to publishers and thus targeted marketing; an application can be run from an XML database. The advantages of XML delivery are: richness of information mark-up; enhanced search capability; linking and metadata facilities; dynamic styles and personalised views; tracking and audit control; targeted advertising and sponsorship. The first four of these are also advantages to the user.

There are different ways of selling e-content, possibly using several revenue streams: subscription; e-commerce, including pay-per-view (micro charging); e-document sales both on and off line; book sales; advertising such as banners and targeted email; sponsorship; providing added value such as customised or linked content.

Subscription models can be individual, institutional, organisational or domain based. Accounts can be open, deposit or publication sets. The subscription model retains customers for a year, and hopefully for the following year too.

Pay-per-view means each element must be tracked for a user session with a cost attached to each element type or behaviour (eg. following a link), accumulating charges for the session. This would be charged to a customer account, or in the future micro-payments could be used. This model is not particularly successful. As soon as people are likely to clock up a significant charge they prefer the subscription model.

The personal publications model allows a user to select sections of a publication to add to their personal `issue' (like a shopping trolley). This would be available to subscribers only. This personal issue is available online at any time and updates automatically as the content is updated, with an alert to the subscriber of any changes. Personal publication possibly requires creation of PDF on-the-fly.

For targeted advertising the use of each element of information is tracked. This requires adverts or sponsors' messages to be linked to individual element types, possibly via metadata attributes, and tailored by user profile.

Other ways money is being made on the internet: business to consumer e-commerce with a market place created by supply-side business competition for customers; business to business e-commerce with supplier or aggregated marketplaces on the supply side; enterprise information portals (get people to the portal then sell them something).

There are current initiatives looking at a Global Trading Web which is a global, platform independent framework for trading, and UDDI (Universal Description, Discovery and Integration).

Questions and discussion - panel of speakers.

There was some discussion about the use of XML for interactive page-based make up and layout, the economics of on-demand publishing, and maintaining publisher branding on pages.

The panel were asked to say what each thought were the key opportunities / advantages of using XML. Answers included: integrity, flexibility, longevity of content, print time and parallel publishing, control, information requirements, speed.

The use of XML for branded portals and for content syndication was discussed.

The printer's perspective

David Lewis, Clowes Information

`Printer' really means the typesetter who converts the data to pages. The printer is seen as a Cinderella industry, but the book is still the main source of revenue for publishers. Printers are not Luddites, but they do have reason to be cynics! Typesetters were involved in the early days of structured document creation because they realised it would result in less work.

The loss of structure in a document happens at various points in the process. Authors tend to originate in Word with no added structure. New media publishing means using HTML, where all structure is lost. DTP made everyone an expert, but it was all about appearance, not document structure.

The idea of typesetting from publishing databases isn't new. But it has not been an ideal solution with difficulties in changing the template, and the fact that a relational model is not ideal for free text. An XML database may give the opportunity to do the job properly, rather than just emulating the old interfaces, with data turned into well-formed XML as a first step.

The problems of DTD creation, particularly for reference books where page layout is still important, were described, with some recommendations.

Changing to using XML has implications for legacy conversion, where re-keying or `smart' conversion may be required. Conversion from SGML is probably not too difficult, but shouldn't be attempted at a critical time. Authoring XML is most likely to be taken up if a Word look-alike approach is used.

To the printer, page make up is still important. It is more demanding for the printed page than for electronic delivery because there is a much higher expectation of quality on a printed page. XSL does not yet have printed page features (if it ever will). Publishers / editors still correct on printed pages.

The printer / typesetter still has a contribution to make to publishing. Publishers have a high turnover of staff, so that printers offer the only continuity. They are trusted by the production team so can best manage the transition from legacy to structured data. Because printers have the best knowledge of the data they are well placed to lead DTD design. Printers are probably already using new media operations themselves.

Maybe the approach to structuring and commissioning content needs changing. A design for the envisaged media should be done before the authors start writing, with authors having some involvement in the design process. Authors produce words, sentences and paragraphs; they should be commissioned to provide information, not a print article. The editor would approve the information. Then it would go to a designer and content editor for the particular medium. Currently publishers still want print and web display to look the same; ideally they should be different.

Other sources of income for printers will become important, such as putting the information in different media. Using XML mustn't be allowed to restrict creativity.

