Industrial Strength Publishing

On its own, the Apache server is a bare-bones HTTP server engine. But native capabilities such as server side includes (SSI), add-ons such as PHP, or component systems such as Zope turn Apache into a publishing environment. We'll look at Salon.com's decision to build a full-featured publishing environment that provides workflow, styling, and scalability by leveraging mod_perl, XML, and Java technologies. The architecture of Salon.com's content management, distribution and delivery systems is the product of its engineers' years of experience with ZDNet, C/Net, and CNN's web publishing properties. We'll focus on such systems, their design and their implementation. Broadly, we'll walk through the technologies we examined as we narrowed down our choices and then take a look at the current state of the publishing system. Salon.com outputs a high daily volume of content through this system and, as we extend it, we expect to solve a number of broad publishing problems.
Ian Kallen
Manager of systems and software, Salon.com
This talk was presented 07/20/2000, 1:30pm to 3:00pm in Marriott San Carlos III at the O'Reilly Open Source Software Convention in Monterey, California.
Requirements: Edit versus Production
The fundamental publishing problem usually boils down to this:
[Illustration: Edit versus Production]

The process brings the products of these separate functions together.
Independent Workflows
From a higher level view, editorial data and formatting data can be broadly thought of as components. These different components have different people acting on them though, each with their own specialization of tasks.

Part of content management, for both editorial data and for formatting data, is the process of handing off data from one actor to another. In practice this isn't an actor per se but a role, a group of actors who intervene for specific functions in the life of a component's development.

Since the actions that the different groups take in the course of output development share little in common, it's advantageous to have workflow processes that are independent of each other.

Editorial Workflow
An editorial object might get handed off through a workflow like this:
Author
The writer who developed the story
Edit
The editor responsible for where this article will appear
Review
Editors responsible for fact checking and other follow-up
Copy
Editors responsible for style consistency, spelling and grammatical checks
Media
Integrates attribution and links for illustrations, graphs, audio and/or video components
Publish
Overall check of the preceding steps and live deployment
Somewhere along the way (perhaps at Edit or Copy), someone may also intervene to build relationships between stories. For instance, a piece about water in the Martian poles may also be enriched with links to stories about the spate of failed NASA Martian missions. So in addition to workflow, content management must facilitate building complex relationships (by category, by author, by keyword, etc.) between stories.
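As a rough illustration of relating stories by keyword, here is a minimal Python sketch. It is not our production code: the record fields and the function names `build_keyword_index` and `related` are assumptions made up for this example, and a real system would keep these relationships in the database.

```python
from collections import defaultdict

# Hypothetical story records; the field names are assumptions for illustration.
stories = [
    {"id": 1, "title": "Water at the Martian poles", "keywords": {"mars", "water"}},
    {"id": 2, "title": "Another failed Mars mission", "keywords": {"mars", "nasa"}},
    {"id": 3, "title": "Budget season in Washington", "keywords": {"politics"}},
]


def build_keyword_index(stories):
    """Map each keyword to the set of story ids that carry it."""
    index = defaultdict(set)
    for story in stories:
        for keyword in story["keywords"]:
            index[keyword].add(story["id"])
    return index


def related(story, index):
    """Return the sorted ids of other stories sharing at least one keyword."""
    ids = set()
    for keyword in story["keywords"]:
        ids |= index[keyword]
    ids.discard(story["id"])
    return sorted(ids)


index = build_keyword_index(stories)
print(related(stories[0], index))  # [2]
```

The same index shape would work for categories or authors; only the field being indexed changes.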
[Illustration: Editorial Workflow]
Production Workflow
Similar to the editorial group, the production department may have a workflow as well, but with entirely different processes:
Author
The producer who developed the illustration, audio or video piece
Palette
Producers who set the palettes or encoding characteristics
Production
Producers who assign metadata and catalog the work
Publish
Overall check of the preceding steps and live deployment
[Illustration: Production Workflow]
Wheels re-invented, over and over
Show me ten thousand different editorial organizations and I'll show you ten thousand different workflow requirements. The ones we just saw for editorial and production workflow examples are among the many possibilities. Thus a key additional requirement must be:
The tools must not define the workflows, the workflows must define the tools!
Many web businesses are operations in a constant state of change: new business deals, acquisitions or being acquired, new editorial directions, cross-media integrations and on and on. Workflow processes should be open and flexible to accommodate all of these contingencies.
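One way to keep the tools from dictating the workflow is to treat the workflow itself as data. The Python sketch below is illustrative only; the names `WORKFLOWS` and `next_desk` are hypothetical and the desk lists simply repeat the two example workflows shown earlier.

```python
# Hypothetical sketch: workflows are plain data, so changing a workflow
# means editing a list rather than rewriting the tool.
WORKFLOWS = {
    "editorial": ["Author", "Edit", "Review", "Copy", "Media", "Publish"],
    "production": ["Author", "Palette", "Production", "Publish"],
}


def next_desk(workflow, current):
    """Return the desk that follows `current`, or None at the end."""
    desks = WORKFLOWS[workflow]
    position = desks.index(current)  # raises ValueError for an unknown desk
    if position + 1 < len(desks):
        return desks[position + 1]
    return None


print(next_desk("editorial", "Copy"))      # Media
print(next_desk("production", "Publish"))  # None
```

Adding a third workflow, or reordering an existing one, touches only the dictionary; the hand-off logic stays the same.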

At least 90% of the web publishing shops operating now are using ad hoc workflows and software setups. Content management and publishing is not dominated by any company or technology despite the fact that commercial publishing tools have been on the market for years. Why haven't these tools succeeded?

Flexible Data I/O
Varied inputs
Likewise, the inputs may entail more than those conducted by the in-house editorial staff. Content infomediaries, wire services, content partnerships and other subscriptions mandate an API for getting editorial into the content management system. Perhaps specific components are best delivered by protocols/languages that the user's client is aware of.
  • Web forms
  • Applet text-editors
  • WebDAV
  • XML uploads
The possibilities are many, so design for accommodating them all.
Flexible Data I/O
Varied outputs
The target output may not be confined to a web site's main design. Cobrandable and syndication versions, PDA appropriate designs, WAP, PDF and numerous other output types might be mandated by the publishing business.

Supporting multiple outputs may add considerable complexity; designing for easy expansion and contraction of the output repertoire is therefore necessary.
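To make "easy expansion and contraction of the output repertoire" concrete, here is a small Python sketch in which each output type is a function registered under a name. The registry `RENDERERS`, the `renderer` decorator and the two sample formats are assumptions for illustration, not a description of any particular product.

```python
from html import escape

# Hypothetical registry: output name -> rendering function.
RENDERERS = {}


def renderer(name):
    """Decorator that registers a rendering function under `name`."""
    def register(func):
        RENDERERS[name] = func
        return func
    return register


@renderer("html")
def render_html(story):
    return "<h1>{}</h1>\n<p>{}</p>".format(escape(story["title"]), escape(story["body"]))


@renderer("text")
def render_text(story):
    return "{}\n\n{}".format(story["title"], story["body"])


def publish(story, outputs):
    """Render one story into every requested output type."""
    return {name: RENDERERS[name](story) for name in outputs}


story = {"title": "Hello", "body": "World"}
results = publish(story, ["html", "text"])
print(results["text"])
```

Adding a WAP or syndication version would mean registering one more function; dropping an output means removing one.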

Some rhetorical questions
Since we covered some of the key components of content management, independent workflows and complex story relationships, we have to look at the other end of publishing: delivery. Once we've defined our data and presentation, what do we have handled by our delivery engine?

Many systems in use for solving publishing problems suffer from low performance ceilings because the delivery system is going to a data store (an RDBMS, an object database or some other lookup process) to fetch and format. It's an accepted given that maintaining complex relationships between editorial objects mandates storage in such a system, but why make the HTTP request fulfillment performance hinge on retrieval? Doesn't it make more sense to calculate in advance that which can be so that the HTTP server can do what it does best, serve files?

Not all content components can be statically pre-calculated, but if 90% of a presentation only changes episodically, not on a per HTTP request basis, why should we re-calculate 100% of the presentation on a per HTTP request basis?

Baking versus frying
Balancing what processing happens at HTTP request time (the end user loads a page) versus what happens when the content management system outputs at publish time is the linchpin of a performant, scalable publishing system. We'll use this vocabulary:
Baking
This is the publish time commitment of data to a file (or a more efficient cache, if you have one). Editorial narrative, headlines, datelines and other pieces of editorial/business data that don't change can and should be pre-calculated.
Frying
This is the request time processing of data for the final presentation. Stylesheet assignment, session start/finish accounting, ad placements and other things that actually may change on a per request basis are handled by the HTTP delivery engine.
Obviously some applications (shopping carts, auctions, messaging, sports scores and stock quotes) are going to lean on the frying side. When a presentation consists mostly of volatile data, the amount of baking will be minimal and the site crosses the street from web content to web application. This is not necessarily going to require different technologies but does necessitate a shift in computational burden. However, most web content doesn't change with a frequency that requires a web application server (apologies for raining on the application server vendors' parade).
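The split between baking and frying can be sketched in a few lines of Python. This is a toy under stated assumptions: the `AD_SLOT` marker, the `bake` and `fry` function names and the file layout are all invented for the example, and a real delivery engine would do the fry step inside the HTTP server.

```python
import tempfile
from html import escape
from pathlib import Path

AD_SLOT = "<!--ad-slot-->"  # hypothetical marker left in the baked file


def bake(story, out_dir):
    """Publish time: commit the unchanging editorial data to a file, once."""
    path = Path(out_dir) / (story["slug"] + ".html")
    path.write_text(
        "<h1>{}</h1>\n<p>{}</p>\n{}\n".format(
            escape(story["headline"]), escape(story["body"]), AD_SLOT
        ),
        encoding="utf-8",
    )
    return path


def fry(path, ad_html):
    """Request time: read the baked file and fill in only what varies."""
    return path.read_text(encoding="utf-8").replace(AD_SLOT, ad_html)


with tempfile.TemporaryDirectory() as out_dir:
    story = {"slug": "mars-water", "headline": "Water on Mars", "body": "Ice at the poles."}
    baked = bake(story, out_dir)
    page = fry(baked, "<div>ad chosen for this request</div>")

print(page)
```

The headline and body are escaped and written exactly once; the only per-request work is a file read and one string replacement.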
The presentation needs to be flexible
The computational burden of frying varies greatly depending on how much processing is performed on a per request basis. For instance, a page that is built like this is pretty lightweight:
<HTML>
  <HEAD>
    <TITLE>
    Server Side Includes
    </TITLE>
  </HEAD>
  <BODY>
  <!--#include virtual="/navbars/header.html" -->
  Standard headers and footers are a piece of cake with SSI's
  <!--#include virtual="/navbars/footer.html" -->
  </BODY>
</HTML>
This case would use the mod_include semantics to build the final presentation with header and footer components. There's no computation going on beyond employing Apache's subrequest mechanism to resolve the request for virtual documents. However, the basic idea here, from which we'll benefit architecturally (even if we're using something more sophisticated than SSI's), is that the page is built of fry-time components.
Piece-specific components separated from general ones
Fry-time components are a crucial part of flexible presentations. If too much of the presentation is baked into the page and we want to, say in the preceding example, change the navigational links used in the footer, then we have to re-bake all of the pages!

The computational overhead incurred by including components pales in comparison to the burden of republishing. Does this conflict with our previous call to pre-calculate that which we can? No, it means we must be judicious in our use of bake-time and fry-time resources!

Defining separate components for as much as possible of what presentations share from page to page is a major page architecture decision that must be made by system architects and production management. If too much of the presentation is baked into the pages, the presentation is too inflexible. If too much of the presentation is calculated at fry time, the per request overhead grows, dragging down performance.

Components should not have to re-calculate the same result
If your components are calculating the same thing over and over again for the requests that come in, there's probably something wrong with the architecture!
<HTML>
  <HEAD><TITLE>PHP &amp; Databases</TITLE></HEAD>
  <BODY>
  <? require("$DOCUMENT_ROOT/navbars/header.html") ?>
  <?
  mysql_pconnect("localhost:3306","myuser","mypass");
  $rs=mysql_db_query("mydb","
        SELECT c.category_path, c.category_name 
        FROM story s, category c 
        WHERE s.story_id=$STORY_ID 
        AND c.category_id=s.category_id
      ");
  $row=mysql_fetch_row($rs);
  ?>
  <!-- link to the top page for this category -->
  <A HREF="<? echo $row[0] ?>"><? echo $row[1] ?></A><BR>
  <? require("$DOCUMENT_ROOT/navbars/footer.html") ?>

  </BODY>
</HTML>
In this case, the PHP code connects to the database (circa PHP3 API) on every request to calculate the URL and text for a link to the top level page of the category the current story is in. This is an example of what not to do!
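By way of contrast, here is a hedged Python sketch of the same lookup done once and then reused. It uses an in-memory SQLite database and `functools.lru_cache` purely as stand-ins; the table layout mirrors the query above, while the function name `category_link` and the `stats` counter are assumptions for the example.

```python
import sqlite3
from functools import lru_cache

# Stand-in database with the two tables the PHP example queries.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY,
                           category_path TEXT, category_name TEXT);
    CREATE TABLE story (story_id INTEGER PRIMARY KEY, category_id INTEGER);
    INSERT INTO category VALUES (1, '/tech/', 'Technology');
    INSERT INTO story VALUES (42, 1);
""")

stats = {"queries": 0}  # counts how often we really hit the database


@lru_cache(maxsize=1024)
def category_link(story_id):
    """Look up (path, name) once per story; later calls are served from cache."""
    stats["queries"] += 1
    return db.execute(
        "SELECT c.category_path, c.category_name "
        "FROM story s JOIN category c ON c.category_id = s.category_id "
        "WHERE s.story_id = ?",
        (story_id,),
    ).fetchone()


for _ in range(100):  # simulate 100 requests for the same story
    link = category_link(42)

print(link, stats["queries"])  # ('/tech/', 'Technology') 1
```

If a story is re-categorized, `category_link.cache_clear()` discards the cached results. Better still, a value like this can simply be baked into the page at publish time so that no lookup happens at request time at all.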
Components should not be stupid
Confining componentization to use of mod_include severely handicaps how smart our components can be. There's no complete programming language, no access to database connections and no caching. Its simplicity is its beauty and its curse; lightweight yes, but too much so.

A step up is of course PHP; it at least provides a programming framework and database connectivity.

If our content management system is to output ("bake") components that are to later build ("fry") the presentation, we at least must insist that they do what PHP does. Components that cannot calculate anything more substantial than time formats and pattern matches in environment variables (the limits of mod_include's capabilities) are going to be too stupid to do any heavy lifting with our content.

Desktop publishing is not web publishing!
Some people confuse "publishing tools" with "authoring tools" and this has led many web publishing shops (and tool vendors while we're at it!) down a blind alley.

Not to disparage everyone's favorite authoring applications, but when NetObjects Fusion or Microsoft's FrontPage works well for someone setting up a web site with 50-100 pages, they may be lulled into believing that they can scale usage of these tools to deal with the constant addition, updating and other maintenance tasks of a web publishing operation.

You probably wouldn't be here if this were working for you. Getting a design implemented and an initial setup running is trivial compared to the task of daily maintenance of a web site. Don't let this point get lost next time a web launch specification is being drawn up!

That said, an additional requirement for any publishing system is allowance for editing environments that users like and find familiar. Producers like to develop in Allaire's HomeSite and Macromedia's Dreamweaver; editors seem to prefer Microsoft Word. This can be a vexing problem.

Web based authoring environments
If you build your own home page on Geocities or one of the other free page building sites, you might be lulled into thinking that that is the answer to all of your needs. It's platform neutral, accessible from anywhere and changes are immediately reflected. But have you ever noticed how all of the pages that people put on those sites tend to look the same? Or at least have the same level of quality?

Those tools work to the extent that they do for page-at-a-time authoring but lack facilities for defining a motif (not in the X library sense) and applying templates based on that motif. They lack any facility for relating one page to another.

A minor step up is the Slashdot software and the slash-alikes that provide tools for serial content posting and multi-user posting. Along similar lines is the service provided by Blogger; weblogs have arisen as a medium unto themselves. But again, these tools don't provide any facilities for relating one published work to another; each piece of content is an island of its own.

Template driven content generation
We've rejected the desktop authoring tools and the "mom & pop" grade web based tools because we need templating and complex relationships between editorial objects.

The basic requirement for templating can be fulfilled by just about any technology that can perform "paint by numbers" variable substitution and component includes.

<HTML>
  <HEAD>
    <TITLE>
    [ TITLE GOES HERE ]
    </TITLE>
  </HEAD>
  <BODY>
  [ HEADER COMPONENT GOES HERE ]
  [ CONTENT GOES HERE ]
  [ FOOTER COMPONENT GOES HERE ]
  </BODY>
</HTML>
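The "paint by numbers" substitution above can be sketched with nothing more than Python's standard `string.Template`. This is only a minimal illustration of the idea; the placeholder names and the sample values are assumptions, and a real component system adds includes, logic and caching on top.

```python
from string import Template

# The skeleton from above, with the bracketed slots turned into placeholders.
PAGE = Template("""<HTML>
  <HEAD>
    <TITLE>$title</TITLE>
  </HEAD>
  <BODY>
  $header
  $content
  $footer
  </BODY>
</HTML>""")

html = PAGE.substitute(
    title="Industrial Strength Publishing",
    header="<!-- header component -->",
    content="<P>Story text goes here.</P>",
    footer="<!-- footer component -->",
)
print(html)
```

`substitute` raises `KeyError` if a slot is left unfilled, which is a useful safety net when a template and its data drift apart.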

For every component system out there, someone has built their own templating system with it. SSI, PHP, ASP, JSP, and so on. All are plausible candidates for a templating system (though we decided earlier that we don't want stupid components, so we rejected SSI's).

Template driven content generation
Merging data and formatting requires some kind of logic engine behind it to perform the variable substitution and component inclusion. That means a programming language, API's and engineering to implement it. Cold Fusion, ASP and numerous other plug-and-pray technologies for accomplishing this have been devised. We'll discuss popular technologies for use with and without Apache here.
  1. Non-Apache Technologies
  2. PHP
  3. Servlets & JSP
  4. Python & Zope
  5. mod_perl & ...
We can't possibly cover all of the possibilities here but hopefully we've looked at all of the important ones. ASP didn't even make the list since it ties you too closely to limited and proprietary Microsoft technologies. Though we'll make mention of an exception there as well.
Commercial and Non-Apache Options
There's a wide range of commercial and some open source products that claim to solve the publishing problem.
Vignette StoryServer
A very high priced box that generates cryptic URL's, StoryServer's process model is one plagued with problems. Everything is baked (stored in the StoryServer file cache) so a minor template tweak causes the content generation processes to go ballistic. StoryServer does support Apache integration but it's not maintained on the same tier as their Netscape/iPlanet support.
Interwoven Teamsite
Another very high priced box. Teamsite's highly regarded for its workflow but the flexibility for various outputs appears limited. We also have to question how well the overall solution will scale with the price points that the various pieces require.
Allaire ColdFusion
A proprietary language with a limited development model, limited platform choices and a limited community around it. Does this sound limiting?
ArsDigita Community System
Phil Greenspun has spoken extensively but unconvincingly of the virtues of the AOLServer. We don't agree that Tcl is the world's greatest language and Alex seems nice and everything but a good dog doesn't make a good publishing technology.
Apache Options
PHP
I've used PHP extensively off and on over the years. Its Perl-ish simplicity often brings a good deal of development speed and quick wins are nice. The wide array of databases supported (with persistent connections) and other extensions make it very appealing as well. However, we rejected PHP for our architecture because of some crucial design and language problems. The important ones were:
  1. Function duplication. Last I looked there were 6 different sort functions and none did the sorting that I needed.
  2. Database API's vary widely from database to database unlike Perl's DBI or Java's JDBC API.
  3. Difficulty with the object model. Defining classes and creating objects to package up data and methods was not a well understood or documented capability.
  4. Template code must be expressed top to bottom; performing calculations outside of the sequence of HTML parsing is not possible
  5. No caching: if the home page is requested 100 times per second, the code gets re-evaluated every time
For the first three issues, we turn our attention to Java and Perl solutions. For the latter two, we have to consider component systems that are implemented in those languages. I understand that the PHP community is aware of these items and perhaps has addressed them since I last made any substantial use of PHP (Fall 1999).
Apache Options
Servlets & JSP
The case for going the Java route gets more and more compelling. The servlet API has matured to a point now where it provides fairly rich programming constructs, the tools have matured to a point where performance and stability are admirable and the spec itself seems to have come of age along with many Java technologies. However, while the Java story seems to get steadily better, we still have some misgivings.
  1. Java is not open source. Sun still maintains reticent control over it, it's a technical Berlin Wall that must be torn down
  2. There is no CPAN equivalent for Java. Every Java shop is an island unto itself with no facility provided for community code reuse
  3. Development time can be lengthy. Java in general puts you in a box in which you must create objects and watch your data typing even to do the simplest things.
Despite these issues, Java clearly has a number of things going for it, the early maturity of its XML tools being among the most important. Java is not presently at the core of our technology but it may gain greater importance as we extend it.
Apache Options
Python & Zope
Clearly this is Python's killer app. The Zope development community and Digital Creations have been on a tear. Zope's content management looks like a complete set of tools as it organizes around folders, applications and users. Python itself has appeal in its object model, mature thread implementation and modest footprint. We rejected Zope for these reasons:
  1. We don't know Python (we could learn if we had to, but do we have to?) and we don't know many people who are Python programmers. Hiring developers adept with more popular languages is hard enough; hiring for Python skills seems like an additional challenge
  2. Extending and customizing Zope would require not just knowing Python but being highly skilled with it
  3. Zope development doesn't seem to focus on best-of-breed integration but on internally developed tools. There are lots of HTTP engines and SQL databases, why has the Zope community expended their effort inventing their own?
  4. Python has no CPAN
There may be good retorts to these issues but to us, they look like barriers we'd rather not contend with.
mod_perl & ...
One of the most striking things about mod_perl is how rich with possibilities it is. By letting you leverage all of the power of Perl, add on capabilities from the CPAN and gain full access to the Apache API, mod_perl is the ultimate foundation on which to build web applications.

There are some important and valid critiques of Perl:

Loose requirements for structure and code
Perl doesn't force you to instantiate objects nor express logic in a boxed in fashion. The old saying with Perl, TMTOWTDI, can often be restated as TMTTWTDI ("there's more than ten ways to do it!"). So if Perl wants to permit you enough rope to hang yourself, I don't fault the tool because someone managed to construct a noose with it!
Bloatware
Who uses formats anymore? How often do you use the various low level socket functions? There are a lot of capabilities rolled into Perl that, in these days when XS can provide what's needed on an as needed basis, are clearly anachronistic.
Threads
Frankly, I haven't needed them enough to care about how the Thread development has matured in recent Perl releases. I'm sure the people who do care will work that out! Nonetheless, Python and Java both offer mature thread implementations.
XSLT
The Java community were early adopters of XSLT and thus Saxon, XT and Xalan are all, AFAIK, fairly complete XSLT processor implementations. The equivalent in Perl has been sorely lacking.
mod_perl & ...
There are a number of component systems that are either built around mod_perl or widely used in the mod_perl community.
HTML::Embperl
An early development in the mod_perl community, HTML::Embperl expresses logic top to bottom embedded within HTML. While it's a very mature system, it seems to have been playing catch-up with regard to caching and its object model.
Apache::ASP
This implements request, response, session and other ASP objects using Perl. If the ASP API appeals to you and you don't want to use Microsoft or ChiliSoft technologies, Apache::ASP is the way to go.
HTML::Mason
Mason components can be HTML (or XML or FooML for that matter), Perl code or a combination of them. Components may generate output or they may be stand alone routines that accept data inputs and return data outputs. The object cache stores components as Perl code so evaluation from the cache is very fast.
AxKit
An XML & Sablotron (XSLT Processor) based system, it's pretty new, so I haven't looked at it closely yet!
There are links to these and other mod_perl technologies on the mod_perl site, http://perl.apache.org/.
Best of Breed Choice: mod_perl
Embedding a Perl interpreter into Apache allows one to harness a lot of power! By building on component systems, CPAN libraries and the Apache API, we clearly benefit from an expanded breadth of possibilities using mod_perl.

For instance, to log into our publishing system, you must access the SSL virtual host. It's running the Apache::AuthDBI module to look up users in the database. We have our own PerlHandler that issues a cookie with connection characteristics hashed against a secret key. A PerlAccessHandler then verifies that the user is permitted by checking the cookie values against the hashed data. Usernames and passwords never cross the wire in the clear, so users can access the publishing system across the public networks reasonably assured of security.
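The cookie check described above can be sketched as follows. This is not our handler code: it is a Python illustration that assumes the "connection characteristic" is the client IP address, assumes HMAC-SHA256 as the hashing scheme, and uses hypothetical names (`SECRET_KEY`, `issue_cookie`, `verify_cookie`) throughout.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # assumption: a server-side secret, never sent to clients


def _digest(username, client_ip):
    """Keyed hash over the values the cookie vouches for."""
    payload = "{}|{}".format(username, client_ip).encode("utf-8")
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def issue_cookie(username, client_ip):
    """Login time: hand back the values plus their keyed hash."""
    return "{}|{}|{}".format(username, client_ip, _digest(username, client_ip))


def verify_cookie(cookie, client_ip):
    """Access time: return the username if the cookie is authentic, else None."""
    try:
        username, cookie_ip, digest = cookie.split("|")
    except ValueError:  # wrong number of fields
        return None
    expected = _digest(username, cookie_ip)
    authentic = hmac.compare_digest(digest.encode("utf-8"), expected.encode("utf-8"))
    if authentic and cookie_ip == client_ip:
        return username
    return None


cookie = issue_cookie("idk", "10.0.0.1")
print(verify_cookie(cookie, "10.0.0.1"))  # idk
print(verify_cookie(cookie, "10.0.0.2"))  # None
```

Because the hash depends on the secret key, a client can read the cookie but cannot forge one for another user or another address.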

Session data is maintained with Apache::Session to prevent repeated SQL lookups as users access the system. Once we know who the user is, we can populate their session with some of their personal information, their group memberships and other vital data that we don't want to have to fetch from the database for every request.

Of course there's also a Logout handler to ditch the session and login cookies. All of this is accomplished with fairly simple code on top of CPAN modules and the Apache API.

Best of Breed Choice: HTML::Mason
Mason has garnered a considerable amount of favor in write-ups appearing in WebTechniques, The Perl Journal, and PerlMonth so don't just take my word for it. It's a component system originally devised to power the CMP TechWeb site, a leading technology publishing operation.

The dhandler and autohandler concepts are very powerful ways of expressing template application. For instance, a dhandler at the document tree root (you're actually concerned with the component root, coming...) can be a default formatting component when the original URI translated document or a higher level dhandler is absent.
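The fallback behavior of a dhandler can be modeled in a few lines of Python. This is a simplified model of the lookup, not Mason's implementation: the `resolve` function and the set-of-paths representation of the component root are assumptions for illustration.

```python
import posixpath


def resolve(uri, components):
    """Return the component that should handle `uri`, or None.

    `components` is the set of paths that exist under the component root.
    If the requested component is absent, look for a "dhandler" in the same
    directory, then in each parent directory up to the root.
    """
    if uri in components:
        return uri
    directory = posixpath.dirname(uri)
    while True:
        candidate = posixpath.join(directory, "dhandler")
        if candidate in components:
            return candidate
        if directory == "/":
            return None
        directory = posixpath.dirname(directory)


components = {"/dhandler", "/news/dhandler", "/news/index.html"}

print(resolve("/news/index.html", components))       # /news/index.html
print(resolve("/news/2000/story.html", components))  # /news/dhandler
print(resolve("/sports/scores.html", components))    # /dhandler
```

The nearest dhandler wins, so a section can override the site-wide default formatting simply by dropping its own dhandler into its directory.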

Component trees can be developed in parallel so that a sparse tree of components earlier in the component root path take precedence over those later in the path. We're currently planning on implementing developer-specific sparse trees so that developers can work in their own self-contained environment accessing a component root path specific to their access. Integrating this with CVS will make our engineering a truly multiuser development environment. This characteristic is unique to Mason (AFAIK) and one of the key advantages it maintains over other mod_perl component systems.
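The precedence rule for parallel component trees amounts to a search path. Here is a hedged Python sketch of that rule; the root names, the dictionary representation and the `find_component` function are invented for the example and say nothing about how Mason stores components internally.

```python
def find_component(path, roots):
    """Return (root_name, source) from the first root that defines `path`.

    `roots` is an ordered list of (root_name, {path: source}) pairs;
    earlier roots take precedence over later ones.
    """
    for root_name, components in roots:
        if path in components:
            return root_name, components[path]
    return None


roots = [
    # A developer's sparse tree holds only the components being changed.
    ("dev-idk", {"/navbars/footer": "experimental footer"}),
    # The shared tree holds everything else.
    ("shared", {"/navbars/footer": "live footer", "/navbars/header": "live header"}),
]

print(find_component("/navbars/footer", roots))  # ('dev-idk', 'experimental footer')
print(find_component("/navbars/header", roots))  # ('shared', 'live header')
```

A developer sees their own footer but everyone's header, without copying the whole tree.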

Best of Breed Choice: DBI
The DBI API provides facilities for working with an RDBMS in standard ways. Placeholders, cached SQL parsing and other features sweeten the deal further, providing robust performance and rapid development.
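For readers who haven't used placeholders, the idea looks like this. The sketch uses Python's standard `sqlite3` module as a stand-in for DBI and Oracle, so the API shown is Python's DB-API rather than DBI's; the table and values are made up for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE story (story_id INTEGER PRIMARY KEY, slug TEXT)")

# The statement text stays constant; only the bound values change, so the
# driver handles quoting and the same statement can be reused.
rows = [(1, "mars-water"), (2, "o'reilly-conference")]
db.executemany("INSERT INTO story (story_id, slug) VALUES (?, ?)", rows)

slug = db.execute("SELECT slug FROM story WHERE story_id = ?", (2,)).fetchone()[0]
print(slug)  # o'reilly-conference
```

Note that the apostrophe in the second slug needs no manual escaping, which is exactly the class of bug that string-built SQL invites.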

We're using Oracle 8.1.5 on Solaris which performs very reliably. DBD::Oracle, the driver that DBI uses to access the Oracle engine, is very fast and handles CLOB and other Oracle data types just fine. While we'd like a lighter budget for our RDBMS, we require integrity constraints and robustness that mandate a commercial database.

Since we're running DBI under mod_perl, our libraries that have to deal with Oracle directly (and in turn provide a higher level API to the tools and formatting components running under HTML::Mason) can take advantage of Apache::DBI's persistent database connections and connection pool maintenance.

In the future we may pursue forking our code to run under PostgreSQL. They seem to have the most complete SQL implementation of the non-commercial databases. If we get really adventurous, we may experiment with supporting MySQL even though that will require integrity enforcement within our Perl libraries.

The publishing technology roadmap
There are a number of technologies maturing that we want to integrate to extend how we publish further in the future.
XML
We already use XML extensively to store "data blobs" since we find that users need to be able to vary what meta-data they store without adapting the database schema. In the future, we intend to make the XML and RDBMS integration more pervasive and flexible.
XSLT
You don't need an XSLT processor if your native programming language (Perl, Java, Python, etc) can express all of the data structures and logic needed to apply formatting. However, perhaps we'll want to abstract out our API's so that they can work with not only mod_perl and our repertoire of Perl libraries but Java and Enterprise Java Beans, Python or some other language. XSLT opens up our "best of breed" pursuit to these other technologies.
What about Java?
We anticipate being able to leverage the things that Java is good at and integrating it into the publishing system.
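The XML "data blob" idea from the roadmap above, free-form metadata kept in a single column so the schema doesn't change, can be sketched with Python's standard `xml.etree.ElementTree`. The element names (`meta`, `item`) and the helper names are assumptions for illustration and not our actual blob format.

```python
import xml.etree.ElementTree as ET


def to_blob(metadata):
    """Serialize a flat dict of strings into one XML string for storage."""
    root = ET.Element("meta")
    for key, value in sorted(metadata.items()):
        item = ET.SubElement(root, "item", name=key)
        item.text = value
    return ET.tostring(root, encoding="unicode")


def from_blob(blob):
    """Parse a stored blob back into a dict."""
    return {item.get("name"): item.text or "" for item in ET.fromstring(blob)}


metadata = {"photo_credit": "NASA/JPL", "pull_quote": "Ice & dust"}
blob = to_blob(metadata)
print(blob)
print(from_blob(blob) == metadata)  # True
```

A desk that wants to track a new field just adds a key; the table holding the blob never needs an ALTER.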
The quick tour
We'll take a quick look around the system interface now.

The basic flow of defining a new story, editing story data, associating other data nodes such as writers with a story, desking and publishing a story is shown. Browsing and editing formatting components is also shown.

Access controls are applied by granting privileges to user groups. The user creation, password management and group assignments are all done via additional web interfaces. We're going to be implementing a user preference tracking system so that certain aspects of the interface will be customizable on a per-user basis.

Additional functionality for managing keywords and ad placements as well as other tools such as polls, redirect trackers and so forth are also accessible via the web user interfaces.

When you log into the system, you land on your desk. On the left are the navigation links to the workflow desks, defining a new story, editing writer data and other content management functions.
When defining a new story, the first step is to categorize it, specify its target publish date and give it an identifier for that category/day: the slug. This places the story with the minimum amount of info needed to assign it a URL.
Once a story is placed, we can start defining its metadata and narrative content.
When editing a story, the associated writer data can be picked by a search or a browsing interface.
Once a user is ready to hand off a story to someone else in the workflow, the desk that it should go to is selected; submitting saves it and makes it available for someone else browsing that desk to check out.
Eventually, the story gets to the Publish desk where it can be selected to go live. The story that we saw in the "Story Edit" window is now live.
Formatting components are implemented in HTML::Mason; they're used to bake the editorial data into output that will be served for further request-time processing.
Formatting components can consist of HTML, Perl or, as is usually the case, a mix of the two.
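Placing a story, as described in the tour above, needs only a category, a publish date and a slug before a URL can be assigned. The Python sketch below shows one way those three values could combine; the URL layout and the `story_url` function are assumptions made up for the example, not a statement of our actual URL scheme.

```python
from datetime import date


def story_url(category_path, publish_date, slug):
    """Build a URL from the three things known when a story is placed."""
    return "/{}/{}/{}/".format(
        category_path.strip("/"),
        publish_date.strftime("%Y/%m/%d"),
        slug,
    )


url = story_url("/tech/feature/", date(2000, 7, 20), "mason")
print(url)  # /tech/feature/2000/07/20/mason/
```

Because the URL depends only on placement data, it can be known (and linked to) long before the narrative content is written.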
End Notes
There are a lot of code and conceptual refinements coming to the system but I hope this has been an informative walk through the experiences and thinking that went into our technology decisions.

This presentation:
http://www.salon.com/contact/staff/idk/

This presentation is written in HTML::Mason. The slides are styled and navigation links built using an HTML::Mason autohandler. Screenshots come from The Gimp, acquired from Netscape Navigator 4.7 windows running under FreeBSD 3.4 and the Windowmaker window manager.

Me:
mailto:idk@salon.com


© 2000 Ian Kallen