Posts Tagged ‘DC-X Whitepaper’

DC-X: Managing image rights

Posted on March 1, 2010, 14:15h by Tim Strehle

Usage rights for images – and other types of assets like video – are a tricky, but important part of Digital Asset Management. Am I allowed to use this image for my publication, under which conditions, and will I have to pay for it? Are there any restrictions? A DAM system must help answer these questions.

As is the case whenever money is involved, details matter. Only a subset of my publications might be allowed to use the image, and only for a limited time. I may or may not have to pay again if I reuse the same image. Assets may have to be prohibited from reuse due to legal reasons. An exclusive deal may lock out everyone else from using the same image (or even variants – and probably only for a limited time, or only in a certain geographical region).

Capturing all these details as structured data is the goal of the PLUS Coalition’s Picture Licensing Universal System. It looks like a great standard for exchanging rights metadata, but we feel it is overkill to use internally within our DC-X DAM system: We need something that’s simple enough that checking usage rights for hundreds of assets has no noticeable impact on response times. (But we’re still planning to add support for PLUS metadata during image import and export.)

Here’s how DC-X manages usage rights metadata:

There’s a database table for rights metadata properties, with columns for a property name and a value, scope, publication, remark, and date from/to. A simple example: Property name=“Fee Required”, value=“1”, scope=“Online” means that usage on an online website incurs a fee (value=“0” would mean “no fee”).
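To make the table layout more tangible, here is a minimal Python sketch of one such row and a check whether it covers a given usage. This is not DC-X code: the class name, the `applies` method and the rule that an empty scope, publication or date means "no restriction" are assumptions made for illustration only.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class RightsProperty:
    """One row of the rights table; fields mirror the columns listed above."""
    name: str
    value: str
    scope: Optional[str] = None        # None: assumed to mean "any scope"
    publication: Optional[str] = None  # None: assumed to mean "any publication"
    remark: str = ""
    date_from: Optional[date] = None   # None: assumed to mean "no start date"
    date_to: Optional[date] = None     # None: assumed to mean "no end date"

    def applies(self, scope: str, publication: str, on: date) -> bool:
        """Return True if this row covers the given usage (hypothetical rule)."""
        if self.scope is not None and self.scope != scope:
            return False
        if self.publication is not None and self.publication != publication:
            return False
        if self.date_from is not None and on < self.date_from:
            return False
        if self.date_to is not None and on > self.date_to:
            return False
        return True


# The example from the text: online usage incurs a fee.
online_fee = RightsProperty(name="Fee Required", value="1", scope="Online")
```

With this sketch, `online_fee.applies("Online", "Any Paper", date(2010, 3, 1))` is true, while the same call with scope `"Print"` is false.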

Predefined property names are “Contract”, “Embargo”, “Exclusive Rights”, “External Syndication”, “Fee Required”, “Internal Syndication”, “Notice”, “Price Category”, “Purchased”, “Rights Agent”, “Rights Unclear”, “Singular Usage”, “Usage Permitted”. This list will certainly be expanded, and customers can define their own types.

Usually a contract exists with each provider, defining common rights for all images sent by them. To make this easier to handle, DC-X rights metadata properties can be bundled into “rights profiles”. We recommend creating one rights profile per provider (or multiple rights profiles if there are different conditions for subsets of their images). Example: A rights profile named “Reuters images” could combine the properties UsagePermitted=“1”, FeeRequired=“0”, ExternalSyndication=“0”, Notice=“Credit required” if your contract with Reuters allowed you to use all images sent by them with no additional per-image fee, but images must be credited and redistribution is not allowed. (If different parts of your organization have different contracts with Reuters, you could even use the “scope” or “publication” fields to limit properties to a certain part.) And if your contract changes, you simply update the rights profile; all images will immediately reflect the changes.
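The reason a profile change shows up on every image at once is that documents refer to the profile instead of copying its properties. A small Python sketch of that idea follows; the class names and the "later profile wins" merge rule are assumptions for illustration, not the actual DC-X data model.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RightsProfile:
    name: str
    properties: Dict[str, str] = field(default_factory=dict)


@dataclass
class Document:
    title: str
    profiles: List[RightsProfile] = field(default_factory=list)

    def rights(self) -> Dict[str, str]:
        """Merge the properties of all attached profiles (later ones win; an assumption)."""
        merged: Dict[str, str] = {}
        for profile in self.profiles:
            merged.update(profile.properties)
        return merged


reuters = RightsProfile("Reuters images", {
    "UsagePermitted": "1",
    "FeeRequired": "0",
    "ExternalSyndication": "0",
    "Notice": "Credit required",
})
photo = Document("Baghdad market", profiles=[reuters])

before = photo.rights()["FeeRequired"]
# The contract changes: edit the profile once, every attached document follows.
reuters.properties["FeeRequired"] = "1"
after = photo.rights()["FeeRequired"]
```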

Rights profiles must be attached to documents (multiple profiles per document are allowed). This can be done manually, or automatically during ingestion: At the moment, you can configure DC-X hotfolders to automatically attach a certain rights profile to all images coming in through it. In the future, it will also be possible to have DC-X determine the appropriate rights profile by looking at the document’s metadata (like the IPTC Credit or ByLine field).

If you need to override certain properties – like when a certain Reuters image must not be used anymore for legal reasons – you can also attach rights metadata properties directly to documents (without rights profiles). Here’s a screenshot from DC-X that shows an image with both a rights profile and a directly attached property:

[Screenshot: a DC-X image with both a rights profile and a directly attached property]

To visualize usage rights in search results, DC-X displays icons like the Euro sign or the globe in the screenshot above. They are called “flags” and represent rules being dynamically applied to each document. Example for the “Euro sign” flag definition: “Display if the rights property FeeRequired=1 exists.” (Flags can do much more; they can inspect other document metadata like IPTC fields or image file properties like size and colorspace.) Any number of flags can be defined by the customer.
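A flag can be thought of as an icon plus a rule that is evaluated per document. The following Python sketch shows that shape; it is only an illustration, and the rule behind the globe icon is a guess, since the text only defines the Euro sign.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# For this sketch a document is reduced to a dict of its rights properties.
Rights = Dict[str, str]


@dataclass(frozen=True)
class Flag:
    icon: str
    rule: Callable[[Rights], bool]


FLAGS: List[Flag] = [
    # "Display if the rights property FeeRequired=1 exists."
    Flag("euro", lambda rights: rights.get("FeeRequired") == "1"),
    # Hypothetical rule; what the globe icon stands for is an assumption.
    Flag("globe", lambda rights: rights.get("ExternalSyndication") == "1"),
]


def flags_for(rights: Rights) -> List[str]:
    """Evaluate every flag rule against one document and return matching icons."""
    return [flag.icon for flag in FLAGS if flag.rule(rights)]
```

Because the rules run at display time, changing a rights profile changes the icons without touching the documents.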

Rights properties can be used in the DC-X user interface to detect whether certain actions are allowed (e.g., the export to the online CMS can detect that you’re trying to export an image for which you do not have online usage rights). Finally, usage rights can be queried and even changed through the DC-X Web Service API.

Differences compared to DC5: DC5 had no notion of usage rights built in; rights handling was meant to be implemented during the installation and customization phase.

Atom (RFC 4287) entry or feed as the standard DC-X input format

Posted on September 11, 2009, 13:24h by Tim Strehle

A typical DC-X system receives data from lots of different sources: News agencies, editorial systems, files in hotfolders, e-mails, RSS feeds. So there is a lot of code that deals with parsing various data formats and inserting content, metadata and files into DC-X.

In DC4, all of that code was written in PHP and did everything at once: Reading input data, parsing and extracting metadata, creating a DC document object and inserting it into the database. We learnt later on that such tight coupling does cause some headaches.

With DC5, we broke this process down into several reusable steps and tried to move everything to XML: If the input format wasn’t XML, we’d run some PHP code to translate it into XML first. The input XML was then converted to the DC5 XML format using XSLT, and in a last step the DC5-formatted XML was imported into the database.

This proved to be a good decision, so we’re doing the same thing in DC-X (with some fun improvements, of course). Only the burden of converting various formats into our own remained with us.

But times are changing. We’re very happy that customers and partners are starting to ask, “in which format should we deliver data to DC-X?”. Offering a generic input format that others can code towards gives control to our customers (and takes a little work off our shoulders…).

DC-X has its own XML format that maps closely to its database structure, but it’s not in any way a standard format so you as an implementor would be left alone with the (maybe poor) documentation we’re providing, and a running DC-X installation as the only way to test your output.

Instead we’d like to rely on an existing standard, and we think the Atom Syndication Format (RFC 4287) is a good choice: It is XML-based, simple, well-specified, extensible, and widely implemented. You can use the RSS reader of your choice to test your output. If it’s a valid Atom feed and looks fine in your RSS reader, you know it’s going to import well into DC-X.

You can either make your data available as an Atom feed that DC-X can fetch over HTTP, or you can put a file containing the feed XML or a single entry into a DC-X hotfolder. (DC-X also supports the Atom Publishing Protocol, RFC 5023, for creating DC-X documents by sending the same format via an HTTP POST.) Image and other files to be imported are referenced with the standardized link rel=“enclosure” construct (except when you are using the Atom Publishing Protocol). Special DC-X metadata can be embedded using the DC-X XML namespace.

Here’s an example (an image file with some metadata):

 <?xml version="1.0" encoding="UTF-8"?>
 <entry xmlns="http://www.w3.org/2005/Atom">
   <!-- ID (optional) -->
   <id>my-doc-5p3svhdupvolejj7efw</id>
   <!-- Reference to the associated file (file or HTTP URL, optional) -->
   <link rel="enclosure" href="file://filename.jpg" type="image/jpeg"/>
   <!-- Creation date (optional) -->
   <updated>2009-05-06T09:39:37+02:00</updated>
   <!-- Author (optional) -->
   <author>
     <name>John Doe</name>
     <email>john.doe@example.com</email>
   </author>
   <!-- Title -->
   <title type="text">The remains of a car bomb are seen at the site
of a bomb attack in Baghdad</title>
   <!-- Text as XHTML, always embedded in a <div> element -->
   <content type="xhtml">
     <div xmlns="http://www.w3.org/1999/xhtml">
       <div>The remains of a car bomb are seen at the site of a bomb attack in
Baghdad May 6, 2009. A vehicle bomb killed 10 people and <b>wounded</b> 37 others
on Wednesday when it exploded in a wholesale vegetable market in southern
Baghdad, police said.  REUTERS/Ahmed Malik (IRAQ CONFLICT POLITICS)</div>
     </div>
   </content>
   <!-- Additional meta data in the native DC-X XML format (optional) -->
   <document xmlns="http://www.digicol.com/xmlns/dcx" version="1.0">
     <head>
       <Country>Iraq</Country>
       <Provider>REUTERS</Provider>
       <City>Baghdad</City>
       <Keywords>:rel:d:bm:GF2E5560KT301</Keywords>
       <Keywords>War</Keywords>
       <Person>Ahmed Malik</Person>
     </head>
   </document>
 </entry>
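If you want to check your own output before handing it to DC-X, any XML library will do. As an illustration, this Python sketch reads a shortened copy of the entry above with the standard library and pulls out the enclosure link and the DC-X fields. It shows how a consumer could read the format; it does not show how the DC-X importer is implemented.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "dcx": "http://www.digicol.com/xmlns/dcx",
}

# Shortened copy of the example entry.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>my-doc-5p3svhdupvolejj7efw</id>
  <link rel="enclosure" href="file://filename.jpg" type="image/jpeg"/>
  <updated>2009-05-06T09:39:37+02:00</updated>
  <title type="text">The remains of a car bomb are seen at the site</title>
  <document xmlns="http://www.digicol.com/xmlns/dcx" version="1.0">
    <head>
      <Provider>REUTERS</Provider>
      <Keywords>:rel:d:bm:GF2E5560KT301</Keywords>
      <Keywords>War</Keywords>
    </head>
  </document>
</entry>"""

entry = ET.fromstring(ENTRY)

# Files to import are the links whose rel attribute is "enclosure".
enclosures = [
    link.get("href")
    for link in entry.findall("atom:link", NS)
    if link.get("rel") == "enclosure"
]

# Collect the DC-X fields; a field may occur more than once (e.g. Keywords).
fields = {}
for element in entry.findall("dcx:document/dcx:head/*", NS):
    name = element.tag.split("}", 1)[1]  # strip the "{namespace}" prefix
    fields.setdefault(name, []).append(element.text)
```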

What do you think?

DC-X: The Topic Map

Posted on September 9, 2009, 13:14h by Tim Strehle

Thesauri and lists of keywords are stored by DC-X in its topic map – a set of database tables modeled after the XML Topic Maps (XTM) 1.0 standard. For an introduction to topic maps, see the wonderful article The TAO of Topic Maps by Steve Pepper.

So far we have implemented merely half of the XTM standard; we’ll look into supporting more of it when the need arises. But the core concepts are all there. – [By the way: Why not RDF? Because topic maps are a higher-level abstraction (RDF triples have less semantics built in) and seemed to provide more value "out of the box"…]

The benefits of treating thesaurus and list terms as topics in a topic map:

  • Built-in support for multiple names, which we’re using to store translations for terms: All lists and thesauri can now be multi-lingual.
  • Class/instance relationship between terms; the “City” list is itself a topic, “Hamburg” and “Oslo” are instances of the “City” topic. This way an unlimited number of lists or thesauri can co-exist. Terms can even belong to multiple lists.
  • Arbitrary relations between terms: A thesaurus hierarchy is modeled using associations like “broader/narrower” or “synonym/preferred term”. Geographic hierarchies can use “part/whole” associations.
  • External identifier URIs can be specified for any term, so metadata can be mapped to metadata of other software using RDF, or anything else that points to the same URI.
  • Custom metadata can be attached to any term. We’ll use this for thesaurus “scope notes”, geo coordinates for cities etc.
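To illustrate these concepts, here is a tiny in-memory topic map in Python. The class and field names are made up for this sketch and say nothing about the actual DC-X database tables; only the ideas (multiple names, class/instance, typed associations, external identifiers) come from the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Topic:
    topic_id: str
    names: Dict[str, str] = field(default_factory=dict)   # language code -> name
    instance_of: Set[str] = field(default_factory=set)    # ids of class topics
    identifiers: List[str] = field(default_factory=list)  # external URIs


# An association is (type, {role: topic id}); kept as simple tuples here.
Association = Tuple[str, Dict[str, str]]

topics = {
    "city": Topic("city", {"en": "City", "de": "Stadt"}),
    "country": Topic("country", {"en": "Country", "de": "Land"}),
    "hamburg": Topic("hamburg", {"en": "Hamburg", "de": "Hamburg"}, {"city"}),
    "germany": Topic("germany", {"en": "Germany", "de": "Deutschland"}, {"country"}),
}
associations: List[Association] = [
    ("part/whole", {"part": "hamburg", "whole": "germany"}),
]


def instances_of(class_id: str) -> List[str]:
    """Return the ids of all topics that are instances of the given class topic."""
    return sorted(t.topic_id for t in topics.values() if class_id in t.instance_of)


def wholes_of(part_id: str) -> List[str]:
    """Follow "part/whole" associations upwards from one topic."""
    return [roles["whole"] for kind, roles in associations
            if kind == "part/whole" and roles["part"] == part_id]
```

In this model the “City” list is simply `instances_of("city")`, and a geographic hierarchy is a chain of `wholes_of` lookups.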

We are already importing the (multi-lingual) IPTC subject codes thesaurus and CLDR language and country name lists into the DC-X topic map via the XTM XML format. Importing custom thesauri (in a few common text file formats) is also supported. A couple of DC-X fields are set up to auto-fill lists in the topic map as documents with new values come in. Lists and thesauri can be used for auto-completion during document editing, or for lookup in an “assistant dialog”.

In an upcoming DC-X release we will add a simple topic map browser and editor so that administrators can modify lists and thesauri, and we will be looking into automatically following “use/preferred term” relations so that the administrator can define values that are to be corrected automatically during document import.

Differences compared to DC5: Lists and thesauri are not stored as flat files in the file system anymore; they live in the database. They are available out of the box in DC-X with much less configuration overhead. Multiple languages are now supported. All kinds of relations between terms are now possible, not just simple hierarchies.

DC-X: Monitoring

Posted on July 30, 2009, 10:31h by Tim Strehle

There are lots of things that have to be working in order for a DC-X system to do its job. Even though DC-X is very stable, things can still go wrong (a hard disk is full, hardware fails, a process crashes) and you’ll want a monitoring system to tell you about it before users start complaining.

Nagios is the monitoring system we’re concentrating on, as it is free, popular, mature and well-documented. (But other monitoring software will probably work fine as well.)

The standard Nagios plug-ins let you check general server health as well as database, web server and fulltext search availability. They are used to verify that all DC-X command line processes are running. In addition, DC-X comes with its own plug-in (a command line tool) that can report data specific to DC-X: The number of failed import jobs, number of new documents in the last hour, free diskspace across a pool of storage devices and much more.
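For readers who haven’t written a Nagios plug-in: it is a command line program that prints one status line and reports the result through its exit code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The Python sketch below shows that convention for a “failed import jobs” check. The function names, thresholds and output wording are invented for illustration; the real DC-X plug-in may look quite different.

```python
import sys

# Exit codes defined by the Nagios plug-in conventions.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_failed_jobs(failed: int, warn: int, crit: int) -> tuple:
    """Map a failed-import-job count to a Nagios status code and status line."""
    if failed >= crit:
        status = CRITICAL
    elif failed >= warn:
        status = WARNING
    else:
        status = OK
    # Text before "|" is shown to the operator, text after it is performance data.
    line = (f"DCX JOBS {LABELS[status]} - {failed} failed import jobs"
            f" | failed={failed};{warn};{crit}")
    return status, line


def main(failed: int) -> None:
    # A real plug-in would obtain the count from DC-X; here it is passed in.
    status, line = check_failed_jobs(failed, warn=1, crit=10)
    print(line)
    sys.exit(status)
```

Calling `main(3)` would print a WARNING line and exit with code 1, which is all Nagios needs to raise an alert.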

We’re using Supervisor to start and stop DC-X command line processes (and automatically restart them after a crash). It’s a great tool that also reports the process status nicely.

You’ll want to see trends and statistics as well – how fast is diskspace filling up, how is the number of DC-X documents or workflow jobs evolving? For generic server information, we’re recommending something like collectd. DC-X trends are captured and graphed on our demo server using custom scripts calling rrdtool – this functionality is going to be packaged into the standard DC-X distribution so that every DC-X installation can benefit.

DC-X: Tagging

Posted on June 12, 2009, 07:08h by Tim Strehle

Tags are an important element of so-called “Web 2.0”. In addition to changing the metadata of the document itself, DC-X allows the user to add tags to the document – a great way of organizing documents for personal or departmental use (and you don’t need “update permissions” on a document to add tags to it).

Tags belong to either a user or a group of users: When you create a new tag, it’s your choice to create it as a personal (private) tag or to make it owned by one of the groups you belong to (meaning anyone in the group can see, apply, or change the tag).

For more specific permissions, tags can be shared with multiple other users or groups. When setting up sharing for a tag, you specify which of these actions will be allowed: Applying the tag, removing the tag from documents it’s already used for, renaming the tag, deleting the tag.

In DC-X, tags are more than simple strings: Each tag is a combination of a metadata field and a value. By default, your tag is supposed to be a simple keyword, but you can also specify that your tag represents a person, a city, or a geo location.

We’re also planning to let users make use of lists and controlled vocabulary when tagging, so that they can use your carefully-cultivated thesaurus terms or lists of names (with “autocomplete” while the user is typing for faster input).

While tags are kept in a “flat” structure and don’t have an inherent hierarchy, the DC-X user interface allows you to “drill down”: You’re starting with one tag, then you’re being offered all the other tags used in conjunction with the first one as a one-click filter. Click on a second tag, and you see the tags used in conjunction with the first two, and so on… This frees you from single-hierarchy structures (“have I put this under ToDo/ClientName or under ClientName/ToDo?”): You can start from any tag.
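The drill-down boils down to a set operation: find the documents carrying all selected tags, then offer every other tag found on them. Here is a minimal Python sketch with made-up sample data (not DC-X code):

```python
from typing import Dict, Set

# Hypothetical sample data: document id -> tags applied to it.
TAGGED: Dict[str, Set[str]] = {
    "doc1": {"ToDo", "ClientName"},
    "doc2": {"ToDo", "Invoice"},
    "doc3": {"ClientName", "Invoice", "ToDo"},
    "doc4": {"Archive"},
}


def drill_down(selected: Set[str]) -> Dict[str, int]:
    """Return the tags co-occurring with all selected tags, with document counts."""
    counts: Dict[str, int] = {}
    for tags in TAGGED.values():
        if selected <= tags:             # document carries every selected tag
            for tag in tags - selected:  # offer only tags not yet selected
                counts[tag] = counts.get(tag, 0) + 1
    return counts
```

Since `selected` is a set, it makes no difference whether you clicked “ToDo” or “ClientName” first.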

A powerful new feature is tag groups (internally called “combi tags”): You can group tags together, which makes for a great shortcut for applying multiple tags at once. Example: Create a tag group named “Roskilde Festival 2009”, set the sub-tags “Country: Denmark”, “City: Roskilde”, “Event: Roskilde Festival”, “Keywords: Music”, “Keywords: Concert”. Apply this tag group to all your Roskilde articles and photos. Now you don’t just have one-click access to them, but you’ll also be able to find them when clicking on the generic “Concert” tag.
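Here is the Roskilde example as a short Python sketch, with each tag written as a (field, value) pair. Whether DC-X expands the group into its sub-tags when it is applied or resolves it at search time isn’t stated here; the sketch simply assumes expansion.

```python
from typing import Dict, List, Set, Tuple

Tag = Tuple[str, str]  # (metadata field, value)

TAG_GROUPS: Dict[str, List[Tag]] = {
    "Roskilde Festival 2009": [
        ("Country", "Denmark"),
        ("City", "Roskilde"),
        ("Event", "Roskilde Festival"),
        ("Keywords", "Music"),
        ("Keywords", "Concert"),
    ],
}


def apply_tag_group(document_tags: Set[Tag], group: str) -> Set[Tag]:
    """Return the document's tags plus every sub-tag of the named group."""
    return document_tags | set(TAG_GROUPS[group])


article = apply_tag_group({("Keywords", "Review")}, "Roskilde Festival 2009")
```

After this, the article is found under the generic `("Keywords", "Concert")` tag as well.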

Differences compared to DC5: Instead of tags, DC5 had collections (hierarchical collections in the Picture Desk client) which had a similar purpose, but didn’t support metadata fields, tag groups or dynamic drill-down. Also, DC5 collections could not be used on a regular search form (“limit search results only to documents in collection X”), and when displaying a document you couldn’t see which collections it was in – DC-X tags can do both.

DC-X: Web Service API

Posted on April 22, 2009, 11:15h by Tim Strehle

DC-X offers a comprehensive web service API which is based on the Atom Publishing Protocol (RFC 5023) and OpenSearch 1.1 standards – basically XML over HTTP. This means two things:

  • Skipping the browser user interface, you can read and search for DC-X documents (and other information) in any RSS reader (including the RSS readers integrated into your web browser or e-mail client).
  • Developers can write custom software that acts as a “remote control” for DC-X, since almost all functionality the browser interface offers is also available through the API. There are endless possibilities: They could write an iPhone application, an integration with a content management system, a custom front-end…

Why not a SOAP web service? That’s a rather long story… The web service landscape is divided into the SOAP/WS-* and REST (Representational State Transfer) camps, and the Atom Publishing Protocol (AtomPub) belongs to the REST side. If you’re interested, read the excellent book “RESTful Web Services” for an in-depth comparison as well as REST philosophy and best practices. For us, a REST approach was easier to implement and more transparent.

Differences compared to DC5: DC5 had a custom, REST-inspired web service API. DC-X still offers limited support for that old API, but we recommend migrating to the new AtomPub API when moving from DC5 to DC-X. It is based on popular standards and offers many more features.

DC-X: Architecture and Scaling

Posted on April 15, 2009, 14:34h by Tim Strehle

[Diagram: DC-X Architecture]

DC-X is built on a solid foundation – well-established (and mostly open source) software. Here’s a list of the components that form the basis of DC-X:

  • The DC-X source code is written in the open source language PHP. Its web browser user interface and web service API are delivered by the open source Apache web server; batch processes are run via the PHP command line interface.
  • All data (except for files, see below) is held in an Oracle or MySQL database.
  • Imported files (images, videos, PDF etc.) are stored in the file system – which can be a simple local hard disk, a cluster file system or SAN, or a mounted network volume (NFS et al.).
  • Searches are performed by the open source Solr search engine (using its XML over HTTP API).
  • For better performance, document views are cached in memcached (also open source, which has a TCP/IP API).
  • User and group data and authentication utilize an external directory service that talks LDAP (like Active Directory or the open source OpenLDAP server).

In the diagram above, each component with one or more asterisks can run on its own server. Two asterisks mean that the component can be distributed across multiple servers. (Of course, you can also run all components together on a single server.)

Some notes on horizontal scaling, i.e. splitting components across multiple servers:

  • The Apache web server can run on multiple servers, with optional load balancing or round robin DNS as per customer requirements. This is supported by DC-X out of the box.
  • The command line batch processes (importers, workflows) can be split across multiple servers as well. Only one import process may run per hotfolder, but there is no limit on the number of workflow processes that can be started on one or multiple servers (since they’re synchronized through the database): This means that you can easily set up specialized “workflow servers”, keeping load off the servers delivering the user interface.
  • Oracle RAC works fine with DC-X, so that you can build a database cluster from multiple servers. Support for MySQL replication is not yet built into DC-X, but we’re planning to add it.
  • We’re leaving storage choices up to the customer, so you can work with your favourite SAN or cluster filesystem vendor as long as the device looks like a regular Unix file system to DC-X. We’re thinking about implementing external storage based on Amazon S3 or the Atom Publishing Protocol. Note that not all storage needs to be mounted on all other servers, so you can have a centralized database while still storing files in the location where they’re used.
  • Solr is accessed via HTTP, so it can run on its own hardware. Solr can replicate indexes, allowing for multiple search servers.
  • Memcached can run on multiple servers out of the box.

As you can see, there are lots of ways to scale DC-X!

Differences compared to DC5: Solr and memcached are new components. MySQL support, decentralized storage support and workflow parallelization have been added.

DC-X: Content import and the workflow engine

Posted on March 19, 2009, 15:43h by Tim Strehle

Content is imported (ingested/catalogued) into DC-X using one of these methods:

  • Users or automated processes are dropping files and folders into “hotfolders” monitored by DC-X.
  • Users are manually uploading files using the DC-X browser interface.
  • Users are creating new text documents in the web-based DC-X editor.
  • DC-X is fetching remote data via HTTP in the form of RSS or Atom feeds.
  • External software is pushing data into DC-X through its web service API.
  • Administrators are importing data into DC-X through its Unix command line tools.

The first two are (still) the most popular ways to get content into DC systems. Hotfolders are especially convenient; a lot of things can be configured for them (including field values that should automatically be added to all files arriving in the hotfolder, and how related files – like XMP sidecar files – can be found). There’s a standard way of reading field values from subfolder names; and information extraction from file names is possible, too.

DC-X has a tiny embedded workflow engine that allows administrators to configure how new content is to be handled during import. Here’s how this works:

Multiple workflow definitions can be set up (and DC-X comes pre-installed with the most common ones), for example the “workflow for importing media files”, a “workflow for importing an RSS feed” and a “workflow for importing news agency text articles”.

A workflow definition looks like a pipeline – it lists the steps to be performed during the workflow. Each step is a call to a piece of program code, with defined input and output parameters. This makes it possible to plug pre-defined functionality together as desired. (While a workflow is executed sequentially by default, it is possible to jump to specific steps and to call child workflows, allowing for more complex workflows.)

An example for an image file import workflow definition: The step “create medium-sized preview image” would call an image processing function with 800 pixels as the desired size, the next step “create thumbnail-sized preview image” would call the same function, but setting the size to 400 pixels. A third step could read IPTC, XMP and EXIF metadata, a fourth step would map that data into standard and custom fields using XSLT, with the last step finally importing the input file, the preview files and the data into the DC-X database and filesystem.
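The pipeline idea can be sketched in a few lines of Python: a workflow definition is a list of (label, function, parameters) entries, and the engine passes the result of one step on to the next. The step functions below are placeholders that only record what they would do; names such as `make_preview` and the context dictionary are assumptions, not the DC-X API (DC-X itself is written in PHP).

```python
from typing import Any, Callable, Dict, List, Tuple

Context = Dict[str, Any]
Step = Tuple[str, Callable[..., Context], Dict[str, Any]]


def make_preview(ctx: Context, name: str, size: int) -> Context:
    # Placeholder for real image processing: only records what would be created.
    previews = dict(ctx.get("previews", {}))
    previews[name] = size
    return {**ctx, "previews": previews}


def read_metadata(ctx: Context) -> Context:
    # Placeholder: a real step would read IPTC, XMP and EXIF from the file.
    return {**ctx, "metadata": {"source": ctx["file"]}}


IMAGE_IMPORT: List[Step] = [
    ("create medium-sized preview image", make_preview,
     {"name": "layout", "size": 800}),
    ("create thumbnail-sized preview image", make_preview,
     {"name": "thumbnail", "size": 400}),
    ("read embedded metadata", read_metadata, {}),
]


def run_workflow(steps: List[Step], ctx: Context) -> Context:
    """Run the steps in order, feeding each step's output into the next one."""
    for _label, func, params in steps:
        ctx = func(ctx, **params)
    return ctx


result = run_workflow(IMAGE_IMPORT, {"file": "filename.jpg"})
```

The same `make_preview` function is used twice with different parameters, just as in the example above; jumps and child workflows are left out of this sketch.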

When a new file is to be imported, that file is being moved into the DC-X filesystem, and a job record is created in the database. The process monitoring the hotfolder does nothing else: It does not execute the job, so the file is not yet visible to the user in DC-X. Instead, one or multiple “worker processes” running in the background are picking up jobs and doing the actual processing (this allows for parallel imports and load distribution among multiple servers).

The job record has quite a lot of metadata attached to it: Which file is to be acted upon (can also be multiple files or documents), which workflow definition to follow, when it was created, a priority value, whether a worker process already picked it up, whether it was processed successfully, and so on.
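To show how several worker processes can share one queue through the database, here is a small Python sketch using SQLite. The table and column names are invented, and the real DC-X schema (on Oracle or MySQL) will differ; the point is the guarded UPDATE that lets only one worker claim a job.

```python
import sqlite3
from typing import Optional

db = sqlite3.connect(":memory:")
db.execute("""
    CREATE TABLE job (
        id INTEGER PRIMARY KEY,
        workflow TEXT NOT NULL,
        file TEXT NOT NULL,
        priority INTEGER NOT NULL DEFAULT 0,
        worker TEXT,                       -- NULL until a worker picks the job up
        status TEXT NOT NULL DEFAULT 'new'
    )
""")


def enqueue(workflow: str, file: str, priority: int = 0) -> int:
    """What the hotfolder monitor does: record the job, nothing else."""
    with db:
        cur = db.execute(
            "INSERT INTO job (workflow, file, priority) VALUES (?, ?, ?)",
            (workflow, file, priority),
        )
    return cur.lastrowid


def claim_next(worker: str) -> Optional[int]:
    """Claim the highest-priority unclaimed job, or return None if there is none."""
    with db:
        row = db.execute(
            "SELECT id FROM job WHERE worker IS NULL "
            "ORDER BY priority DESC, id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        # "AND worker IS NULL" makes the claim fail if another worker was faster.
        cur = db.execute(
            "UPDATE job SET worker = ?, status = 'running' "
            "WHERE id = ? AND worker IS NULL",
            (worker, row[0]),
        )
        return row[0] if cur.rowcount == 1 else None


first = enqueue("import media files", "a.jpg")
urgent = enqueue("import media files", "b.jpg", priority=5)
```

A worker loop would call `claim_next` repeatedly, run the workflow named in the job and finally store success or failure in `status`.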

A lot of information regarding imports and workflows can be monitored in the DC-X administration interface: Which processes are running, how many jobs are in the queue, which errors occurred. Processes can be started and stopped, hotfolders added or reconfigured, workflow definitions changed.

Workflows are useful for more than just importing content: Since jobs can be assigned to users, mixed human/machine workflows are possible. Example: A user can trigger an export workflow which automatically prepares files in specific formats and then assigns the job to another user who is to approve the export. After approval, the worker processes will once again pick up the job, transfer the files to the export destination and mark the documents as exported.

Differences compared to DC5: Hotfolder and workflow configuration is now in the database and the admin interface, no longer in .ini files. The workflow engine is completely new, including job records and mixed human/machine workflows. Hotfolder monitoring and the actual import process have been separated. The admin interface has become much more powerful in this area.

DC-X: Publication data

Posted on March 13, 2009, 16:36h by Tim Strehle

An unlimited number of publications can be set up in DC-X, along with metadata related to the publication itself (name, short name, platform, aggregation type, publisher, ISSN, eISSN, hierarchical list of sections/subsections or web channels).

A document can be assigned any number of publication data records so that you can track how often it has been published (or is being planned for publication). Each record points to both a document and a publication, and carries the following metadata from the PRISM 2.0 standard:

  • Cover date, cover display date
  • Publication date
  • Edition
  • Special issue
  • Volume, Number
  • Section, Subsection
  • Web channel
  • Starting page, Ending page, Page range
  • Kill date
  • Identifier / DOI
  • URL

DC-X has a couple of additional fields:

  • Status: “planned”, “published” or other (customizable) states
  • Remark
  • Title
  • Revision (if multiple printed revisions of the same edition should be tracked)

There’s a standardized import format for getting PDF pages, articles, and media files into DC-X, along with accompanying publication metadata.

We’re still working on the user interface; it will be possible to search for and edit publication metadata, and to browse through a publication.

Differences compared to DC5: There was no ready-made publication data storage in DC5; this was subject to customizing. DC-X now has a standard way to do this which works out of the box and allows for much more metadata.

DC-X: Documents and Files

Posted on February 25, 2009, 13:27h by Tim Strehle

The “assets” (as in “Digital Asset Management”), the main entities of any DAM system, are called “documents” in DC-X: Database records containing text content along with structured metadata, and with zero or more attached files (stored in the filesystem, not in the database).

[Diagram: DC-X documents and files]

All content is stored in Unicode, encoded as UTF-8, meaning that almost every language and special character can be handled. Text content is represented as XHTML (which is HTML with a slightly stricter syntax to make it XML-compliant), thus supporting formatted, “rich” text.

Metadata values consist of a field name, (plain) text content and (optional) attributes. DC-X comes with a comprehensive list of common field names based on standards like IPTC IIM, EXIF, XMP, IPTC Photo Metadata, Dublin Core, NewsML, NITF and PRISM. Each installation can define an unlimited number of additional custom fields, along with a datatype (string, date, number etc.) and the number of occurrences (multiple or not).

Metadata fields can be configured to use a look-up list (stored in the topic map which we’ll write about in a forthcoming post). Values from look-up lists are referenced in the document, not copied into them, which allows you to re-label them globally just by editing the list. And the look-up list supports translations into multiple languages out of the box. DC-X will by default automatically add list values as documents are coming in with new values.

DC-X supports multi-language documents, so each text body or metadata value may have a language attribute.

Each document belongs to exactly one document pool, which may help in structuring assets. A special field is the unique hash which can be left blank, but will cause automatic rejection of duplicates if filled in. In addition, every document is being assigned a globally unique identifier.
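The duplicate check can be pictured like this: a document with an empty hash is always accepted, one with a hash that is already known is rejected. The Python sketch below keeps everything in memory; how you fill the hash is up to you, and using a SHA-256 digest of the file content is just one possibility assumed here, not something DC-X prescribes.

```python
import hashlib
from typing import Dict, Optional


class DocumentPool:
    """In-memory stand-in for a document store with duplicate rejection."""

    def __init__(self) -> None:
        self._by_hash: Dict[str, str] = {}
        self._count = 0

    def add(self, title: str, unique_hash: Optional[str] = None) -> bool:
        """Store a document; reject it if its non-empty hash is already known."""
        if unique_hash:
            if unique_hash in self._by_hash:
                return False
            self._by_hash[unique_hash] = title
        self._count += 1
        return True

    def __len__(self) -> int:
        return self._count


def content_hash(data: bytes) -> str:
    # One possible way to fill the field: a SHA-256 digest of the file content.
    return hashlib.sha256(data).hexdigest()
```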

Files are stored in the filesystem and represented through a database record pointing to their location and holding common metadata like the file format, size, image dimensions etc. The document points to zero or more of these file database records; these are the files “attached” to the document.

Here’s an example: When an image file is imported, a document record will be created with structured metadata read from EXIF, XMP and IPTC. Text content (an image caption or description) will be stored in the document body. The original file will be stored in the filesystem exactly as it came in – it will never be altered by DC-X. A file database record (with the property “type=original”) will point to its location and be linked to the document. Thumbnail- and layout-sized preview images will be generated, each with their own database record, again linked to the document (with “type=thumbnail” and “type=layout”, respectively).

An unlimited number of files can be attached to a document. Custom values for the “type” property can be defined. In addition to the “type” property, there are “variant” and “version” properties. File variants mean that separate branches of a file can be stored and still point to the same document, for example an alternate version of an image with a blacked-out number plate (along with its own preview images). When a file is being changed, it usually won’t be deleted, but be kept and marked as an outdated version so that it can be viewed or restored as needed.
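How “type”, “variant” and “version” play together is easiest to see in a small example. This Python sketch is illustrative only: the record fields beyond those three properties, the default variant name and the `store_file` helper are assumptions, not the DC-X schema.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FileRecord:
    path: str
    type: str                 # "original", "thumbnail", "layout" or a custom value
    variant: str = "default"  # hypothetical name for the main branch
    version: int = 1
    outdated: bool = False


def store_file(files: List[FileRecord], type_: str, variant: str,
               path: str) -> FileRecord:
    """Add a file; an existing file of the same type and variant is kept but outdated."""
    next_version = 1
    for record in files:
        if record.type == type_ and record.variant == variant and not record.outdated:
            record.outdated = True
            next_version = max(next_version, record.version + 1)
    new_record = FileRecord(path, type_, variant, next_version)
    files.append(new_record)
    return new_record


files = [
    FileRecord("car.jpg", "original"),
    FileRecord("car_layout.jpg", "layout"),
]
# A new variant is a separate branch: nothing existing is touched.
store_file(files, "original", "blacked-out plate", "car_masked.jpg")
# A changed preview becomes version 2; version 1 stays available as outdated.
store_file(files, "layout", "default", "car_layout_v2.jpg")
```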

Differences compared to DC5: Text content is now XHTML instead of plain text. The default list of field names is much more comprehensive. Fields now have a datatype. Field values pointing to look-up lists are new. Multi-language support has been added, as well as file variants and versions.