On its own, the Apache server is a bare-bones HTTP server engine. But native capabilities such as server-side includes (SSI), add-ons such as PHP, or component systems such as Zope turn Apache into a publishing environment. We'll look at Salon.com's decision to build a full-featured publishing environment that provides workflow, styling, and scalability by leveraging mod_perl, XML, and Java technologies. The architecture of Salon.com's content management, distribution and delivery systems is the product of its engineers' years of experience with ZDNet, C/Net, and CNN's web publishing properties. We'll focus on such systems, their design and their implementation. The broad topics we'll cover:
- Requirements for Managing Content
- Requirements for Content Delivery
- Applicable technologies
- Apache Options
- Salon.com's current solution: Best of breed technologies
- The Publishing Technology Roadmap
We'll walk through the various technologies examined as we narrowed down our choices and then take a look at the current state of the publishing system. Salon.com outputs a high daily volume of content through this system and, as we extend it, we expect to solve a number of broad publishing problems.
Ian Kallen
Manager of systems and software, Salon.com
This talk was presented 07/20/2000, 1:30pm to 3:00pm in Marriott San Carlos III at the O'Reilly Open Source Software Convention in Monterey, California.
Requirements: Edit versus Production
The fundamental publishing problem usually boils down to this:
- Editors don't want to be concerned with how their data is displayed. Crafting the words and ideas that constitute their content, the editorial creators would like to assume that, whatever design was agreed upon for the output, they don't need to worry about tags and fonts and so forth.
- Producers don't need to know about editorial data. They have designs implemented in the appropriate mark-up language. If editorial data needs any special adaptations, they don't want to be concerned with the details of the editorial copy, just how the formatting adaptation should be implemented.
See illustration: Edit versus Production

The process brings the products of these separate functions together.
Independent Workflows
From a higher-level view, editorial data and formatting data can be broadly thought of as components. Different people act on these components, though, each with their own specialized tasks.
Part of content management, for both editorial data and formatting data, is the process of handing off data from one actor to another. In practice this isn't an actor per se but a role: a group of actors who intervene for specific functions in the life of a component's development.
Since the actions that the different groups take in the course of output development have little in common, it's advantageous to have workflow processes that are independent of each other.
Editorial Workflow
See illustration: Editorial Workflow
Production Workflow
See illustration: Production Workflow
Wheels re-invented, over and over
Show me ten thousand different editorial organizations and I'll show you ten thousand different workflow requirements. The editorial and production workflows we just saw are only two among many possibilities. Thus a key additional requirement must be:
The tools must not define the workflows; the workflows must define the tools!
Many web businesses are in a constant state of change: new business deals, acquisitions or being acquired, new editorial directions, cross-media integrations and on and on. Workflow processes should be open and flexible enough to accommodate all of these contingencies.
At least 90% of the web publishing shops operating now are using ad hoc workflows and software setups. Content management and publishing is not dominated by any company or technology despite the fact that commercial publishing tools have been on the market for years. Why haven't these tools succeeded?
Flexible Data I/O
Since we covered some of the key components of content management, independent workflows and complex story relationships, we have to look at the other end of publishing: delivery. Once we've defined our data and presentation, what do we want handled by our delivery engine?
Baking versus frying
Many systems used to solve publishing problems suffer from low performance ceilings because the delivery system goes to a data store (an RDBMS, an object database or some other lookup process) to fetch and format on every request. It's a given that maintaining complex relationships between editorial objects mandates storage in such a system, but why make HTTP request fulfillment hinge on retrieval? Doesn't it make more sense to calculate in advance whatever can be, so that the HTTP server can do what it does best: serve files?
Not all content components can be statically pre-calculated, but if 90% of a presentation only changes episodically, not on a per-request basis, why re-calculate 100% of the presentation on every HTTP request?
Balancing what processing happens at HTTP request time (the end user loads a page) versus what happens when the content management system outputs at publish time is the linchpin of a performant, scalable publishing system. We'll use this vocabulary:
- Baking
- This is the publish-time commitment of data to a file (or a more efficient cache, if you have one). Editorial narrative, headlines, datelines and other pieces of editorial or business data that don't change can and should be calculated in advance.
- Frying
- This is the request-time processing of data for the final presentation. Stylesheet assignment, session start/finish accounting, ad placements and other things that may actually change on a per-request basis are handled by the HTTP delivery engine.
Obviously some applications (shopping carts, auctions, messaging, sports scores and stock quotes) are going to lean toward the frying side. When a presentation consists mostly of volatile data, the amount of baking will be minimal and the site crosses the street from web content to web application. This doesn't necessarily require different technologies but it does necessitate a shift in computational burden. However, most web content doesn't change with a frequency that requires a web application server (apologies for raining on the application server vendors' parade).
The presentation needs to be flexible
The computational burden of frying varies greatly depending on how much
processing is performed on a per-request basis. For instance, a page
that is built like this is pretty lightweight:
<HTML>
<HEAD>
<TITLE>
Server Side Includes
</TITLE>
</HEAD>
<BODY>
<!--#include virtual="/navbars/header.html" -->
Standard headers and footers are a piece of cake with SSI's
<!--#include virtual="/navbars/footer.html" -->
</BODY>
</HTML>
This case uses mod_include semantics to build the final
presentation from header and footer components. There's no computation
going on beyond employing Apache's subrequest mechanism to resolve the
requests for virtual documents. However, the basic idea we'll benefit
from architecturally (even if we're using something more sophisticated
than SSI) is that the page is built of fry-time components.
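The split between publish-time and request-time work described above can be sketched in a few lines of Python. This is a minimal illustration, not Salon.com's implementation: the function names `bake` and `fry`, the in-memory `COMPONENTS` table and the sample markup are all assumptions made for the example; only the `#include virtual` marker comes from the SSI page shown earlier.

```python
import re

# Hypothetical fry-time components shared by every page (stand-ins for
# /navbars/header.html and /navbars/footer.html in the SSI example).
COMPONENTS = {
    "/navbars/header.html": "<div>Site header</div>",
    "/navbars/footer.html": "<div>Site footer</div>",
}

INCLUDE = re.compile(r'<!--#include virtual="([^"]+)" -->')


def bake(headline: str, body: str) -> str:
    """Publish time: commit the editorial data that will not change."""
    return (
        '<!--#include virtual="/navbars/header.html" -->\n'
        f"<h1>{headline}</h1>\n<p>{body}</p>\n"
        '<!--#include virtual="/navbars/footer.html" -->'
    )


def fry(baked_page: str) -> str:
    """Request time: resolve only the shared components."""
    return INCLUDE.sub(lambda m: COMPONENTS[m.group(1)], baked_page)


page = bake("Baking versus frying", "Pre-calculate what you can.")
print(fry(page))

# Changing the footer needs no re-bake; the next request picks it up.
COMPONENTS["/navbars/footer.html"] = "<div>New footer</div>"
print(fry(page))
```

The baked page is written once, while the header and footer are looked up on each call to `fry`, which is why a footer change shows up without republishing anything.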
Piece-specific components separated from general ones
Fry-time components are a crucial part of flexible presentations. If too much of the presentation is baked into the page and we want to change, say, the navigational links used in the footer of the preceding example, then we have to re-bake all of the pages!
The computational overhead incurred by including components pales in comparison to the burden of republishing. Does this conflict with our previous call to pre-calculate whatever we can? No, it means we must be judicious in our use of bake-time and fry-time resources!
Deciding which parts of the presentation are shared from page to page, and defining separate components for them, is a major page architecture problem that must be solved by system architects and production management. Bake too much of the presentation into the pages and the presentation is too inflexible. Calculate too much of it at fry time and the per-request overhead grows, dragging down performance.
Components should not have to re-calculate the same result
If your components are calculating the same thing over and over again
for the requests that come in, there's probably something wrong with
the architecture!
<HTML>
<HEAD><TITLE>PHP &amp; Databases</TITLE></HEAD>
<BODY>
<? require("$DOCUMENT_ROOT/navbars/header.html") ?>
<?
/* Anti-pattern: this connection and query run on every single request */
mysql_pconnect("localhost:3306","myuser","mypass");
/* Look up the path and name of the category the current story is in */
$rs=mysql_db_query("mydb","
SELECT c.category_path, c.category_name
FROM story s, category c
WHERE s.story_id=$STORY_ID
AND c.category_id=s.category_id
");
$row=mysql_fetch_row($rs);
?>
<!-- link to the top page for this category -->
<A HREF="<? echo $row[0] ?>"><? echo $row[1] ?></A><BR>
<? require("$DOCUMENT_ROOT/navbars/footer.html") ?>
</BODY>
</HTML>
In this case, the PHP code connects to the database (circa PHP3 API) on every
request to calculate the URL and text of a link to the top-level
page for the category the current story is in. This is an example of what
not to do!
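One way to avoid the per-request lookup above is to cache the result (or, better still, to compute it at bake time). The sketch below uses Python's standard `sqlite3` and `functools.lru_cache` purely for illustration. The `story` and `category` tables mirror the query in the PHP example, while the sample rows, the `queries_run` counter and the function name `category_link` are assumptions.

```python
import sqlite3
from functools import lru_cache

# In-memory stand-in for the story/category tables queried in the PHP example.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE category (category_id INTEGER PRIMARY KEY,
                           category_path TEXT, category_name TEXT);
    CREATE TABLE story (story_id INTEGER PRIMARY KEY, category_id INTEGER);
    INSERT INTO category VALUES (1, '/tech/', 'Technology');
    INSERT INTO story VALUES (42, 1);
""")

queries_run = 0  # counts real database hits, for demonstration only


@lru_cache(maxsize=1024)
def category_link(story_id: int) -> str:
    """Return the link to a story's top-level category page.

    The placeholder (?) keeps the story id out of the SQL text, and the
    cache means repeat requests for the same story skip the database.
    """
    global queries_run
    queries_run += 1
    path, name = db.execute(
        """SELECT c.category_path, c.category_name
             FROM story s JOIN category c ON c.category_id = s.category_id
            WHERE s.story_id = ?""",
        (story_id,),
    ).fetchone()
    return f'<A HREF="{path}">{name}</A><BR>'


for _ in range(100):  # simulate 100 requests for the same story
    link = category_link(42)

print(link, queries_run)
```

A request-time cache like this would need to be invalidated when a story is recategorized; baking the link at publish time sidesteps that problem entirely.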
Components should not be stupid
Confining componentization to mod_include severely handicaps how smart our components can be. There's no complete programming language, no access to database connections and no caching. Its simplicity is its beauty and its curse: lightweight, yes, but too much so.
A step up is, of course, PHP; it at least provides a programming framework and database connectivity.
If our content management system is to output ("bake") components that later build ("fry") the presentation, we must at least insist that they do what PHP does. Components that cannot calculate anything more substantial than time formats and pattern matches on environment variables (the limits of mod_include's capabilities) are going to be too stupid to do any heavy lifting with our content.
Desktop publishing is not web publishing!
Some people confuse "publishing tools" with "authoring tools" and this has led many web publishing shops (and tool vendors while we're at it!) down a blind alley.
Not to disparage everyone's favorite authoring applications, but when NetObjects Fusion or Microsoft's FrontPage works well for someone setting up a web site with 50-100 pages, they may be lulled into believing that they can scale these tools to deal with the constant addition, updating and other maintenance tasks of a web publishing operation.
You probably wouldn't be here if this were working for you. Getting a design implemented and an initial setup running is trivial compared to the task of daily maintenance of a web site. Don't let this point get lost next time a web launch specification is being drawn up!
That said, an additional requirement for any publishing system is allowance for editing environments that users like and find familiar. Producers like to develop in Allaire's HomeSite and Macromedia's Dreamweaver; editors seem to prefer Microsoft Word. This can be a vexing problem.
Web based authoring environments
If you build your own home page on Geocities or one of the other free page building sites, you might be lulled into thinking that this is the answer to all of your needs. It's platform neutral, accessible from anywhere and changes are immediately reflected. But have you ever noticed how all of the pages that people put on those sites tend to look the same? Or at least have the same level of quality?
Those tools work, to the extent that they do, for page-at-a-time authoring but lack facilities for defining a motif (not in the X library sense) and applying templates based on that motif. They lack any facility for relating one page to another.
A minor step up is the Slashdot software and the slash-alikes that provide tools for serial content posting and multi-user posting. Along similar lines is the service provided by Blogger; weblogs have arisen as a medium unto themselves. But again, these tools don't provide any facilities for relating one published work to another; each piece of content is an island of its own.
Template driven content generation
We've rejected the desktop authoring tools and the "mom & pop" grade web based tools because we need templating and complex relationships between editorial objects.
The basic requirement for templating can be fulfilled by just about any technology that can perform "paint by numbers" variable substitution and component includes.
<HTML>
<HEAD>
<TITLE>
[ TITLE GOES HERE ]
</TITLE>
</HEAD>
<BODY>
[ HEADER COMPONENT GOES HERE ]
[ CONTENT GOES HERE ]
[ FOOTER COMPONENT GOES HERE ]
</BODY>
</HTML>
For every component system out there, someone has built their own templating system with it: SSI, PHP, ASP, JSP, and so on. All are plausible candidates for a templating system (though we decided earlier that we don't want stupid components, so we rejected SSI).
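The "paint by numbers" substitution described above needs very little machinery. As a rough sketch, here it is with Python's standard `string.Template`; the slot names (`$title`, `$header`, `$content`, `$footer`) and the `render` helper are assumptions chosen to match the placeholder template, not part of any system discussed in this talk.

```python
from string import Template

# Hypothetical page template: $title and $content are filled per story,
# $header and $footer stand for included components.
PAGE = Template(
    "<HTML>\n<HEAD><TITLE>$title</TITLE></HEAD>\n"
    "<BODY>\n$header\n$content\n$footer\n</BODY>\n</HTML>"
)


def render(title: str, content: str, components: dict) -> str:
    """Fill the template slots with story data and shared components."""
    return PAGE.substitute(
        title=title,
        content=content,
        header=components["header"],
        footer=components["footer"],
    )


html = render(
    "Template driven content generation",
    "<P>Story text goes here.</P>",
    {"header": "<!-- header -->", "footer": "<!-- footer -->"},
)
print(html)
```

Substitution alone gives you templating but nothing smarter: no logic, no data access and no relationships between stories, which is why the choice of the engine behind the template matters.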
Merging data and formatting requires some kind of logic engine to perform the variable substitution and component inclusion. That means a programming language, APIs and engineering to implement it. ColdFusion, ASP and numerous other plug-and-pray technologies have been devised to accomplish this. We'll discuss popular technologies for use with and without Apache here.
- Non-Apache Technologies
- PHP
- Servlets & JSP
- Python & Zope
- mod_perl & ...
We can't possibly cover all of the possibilities here but hopefully we've looked at all of the important ones. ASP didn't even make the list since it ties you too closely to limited and proprietary Microsoft technologies, though we'll mention an exception there as well.
Commercial and Non-Apache Options
There's a wide range of commercial and some open source products that claim to solve the publishing problem.
- Vignette StoryServer
- A very high priced box that generates cryptic URLs. StoryServer's process model is plagued with problems. Everything is baked (stored in the StoryServer file cache) so a minor template tweak causes the content generation processes to go ballistic. StoryServer does support Apache integration but it's not maintained on the same tier as their Netscape/iPlanet support.
- Interwoven Teamsite
- Another very high priced box. Teamsite's highly regarded for its workflow but the flexibility for various outputs appears limited. We also have to question how well the overall solution will scale with the price points that the various pieces require.
- Allaire ColdFusion
- A proprietary language with a limited development model, limited platform choices and a limited community around it. Does this sound limiting?
- ArsDigita Community System
- Phil Greenspun has spoken extensively but unconvincingly of the virtues of AOLserver. We don't agree that Tcl is the world's greatest language, and Alex seems nice and everything, but a good dog doesn't make a good publishing technology.
Apache Options
- PHP
- I've used PHP extensively off and on over the years. Its Perl-ish simplicity often brings a good deal of development speed, and quick wins are nice. The wide array of databases supported (with persistent connections) and other extensions make it very appealing as well. However, we rejected PHP for our architecture because of some crucial design and language problems. The important ones were:
- Function duplication. Last I looked there were 6 different sort functions and none did the sorting that I needed.
- Database APIs vary widely from database to database, unlike Perl's DBI or Java's JDBC API.
- Difficulty with the object model. Defining classes and creating objects to package up data and methods was not a well understood or documented capability.
- Template code must be expressed top to bottom; having calculations performed outside of the sequence of HTML parsing is not possible.
- No caching. If the home page is requested 100 times per second, the code gets re-evaluated every time.
For the first three issues, we turn our attention to Java and Perl solutions. For the latter two, we have to consider component systems that are implemented in those languages. I understand that the PHP community is aware of these items and perhaps has addressed them since I last made any substantial use of PHP (Fall 1999).
- Servlets & JSP
- The case for going the Java route gets more and more compelling. The servlet API has matured to the point where it provides fairly rich programming constructs, the tools have matured to the point where performance and stability are admirable, and the spec itself seems to have come of age along with many Java technologies. However, while the Java story seems to get steadily better, we still have some misgivings.
- Java is not open source. Sun still maintains control over it; it's a technical Berlin Wall that must be torn down.
- There is no CPAN equivalent for Java. Every Java shop is an island unto itself, with no facility for community code reuse.
- Development time can be lengthy. Java in general puts you in a box in which you must create objects and watch your data typing even to do the simplest things.
Despite these issues, Java clearly has a number of things going for it, the early maturity of its XML tools being among the most important. Java is not presently at the core of our technology but it may gain greater importance as we extend it.
- Python & Zope
- Clearly this is Python's killer app. The Zope development community and Digital Creations have been on a tear. Zope's content management looks like a complete set of tools as it organizes around folders, applications and users. Python itself has appeal in its object model, mature thread implementation and modest footprint. We rejected Zope for these reasons:
- We don't know Python (we could learn if we had to, but do we have to?) and we don't know many people who are Python programmers. Hiring developers adept with more popular languages is hard enough; hiring for Python skills seems like an additional challenge.
- Extending and customizing Zope would require not just knowing Python but being highly skilled with it.
- Zope development doesn't seem to focus on best-of-breed integration but on internally developed tools. There are lots of HTTP engines and SQL databases; why has the Zope community expended its effort inventing its own?
- Python has no CPAN.
There may be good retorts to these issues but to us, they look like barriers we'd rather not contend with.
mod_perl & ...
One of the most striking things about mod_perl is how rich with possibilities it is. It lets you leverage all of the power of Perl, add capabilities from the CPAN and get full access to the Apache API; mod_perl is the ultimate foundation on which to build web applications.
There are some important and valid critiques of Perl:
- Loose requirements for structure and code
- Perl doesn't force you to instantiate objects or to express logic in a boxed-in fashion. The old saying with Perl, TMTOWTDI, can often be restated as TMTTWTDI ("there's more than ten ways to do it!"). So if Perl wants to permit you enough rope to hang yourself, I don't fault the tool because someone managed to construct a noose with it!
- Bloatware
- Who uses formats anymore? How often do you use the various low level socket functions? There are a lot of capabilities rolled into Perl that, in these days when XS can provide what's needed on an as-needed basis, are clearly anachronistic.
- Threads
- Frankly, I haven't needed them enough to care about how the Thread development has matured in recent Perl releases. I'm sure the people who do care will work that out! Nonetheless, Python and Java both offer mature thread implementations.
- XSLT
- The Java community were early adopters of XSLT and thus Saxon, XT and Xalan are all, AFAIK, fairly complete XSLT processor implementations. The equivalent in Perl has been sorely lacking.
There are a number of component systems that are either built around mod_perl or widely used in the mod_perl community:
- HTML::Embperl
- An early development in the mod_perl community, HTML::Embperl expresses logic top to bottom embedded within HTML. While it's a very mature system, it seems to have been playing catch-up with regard to caching and its object model.
- Apache::ASP
- This implements request, response, session and other ASP objects using Perl. If the ASP API appeals to you and you don't want to use Microsoft or ChiliSoft technologies, Apache::ASP is the way to go.
- HTML::Mason
- Mason components can be HTML (or XML or FooML for that matter), Perl code or a combination of them. Components may generate output or they may be stand-alone routines that accept data inputs and return data outputs. The object cache stores components as Perl code so evaluation from the cache is very fast.
- AxKit
- An XML and Sablotron (XSLT processor) based system. It's pretty new, so I haven't looked at it closely yet!
There are links to these and other mod_perl technologies on the mod_perl site, http://perl.apache.org/.
Best of Breed Choice: mod_perl
Embedding a Perl interpreter into Apache allows one to harness a lot of power! By building on component systems, CPAN libraries and the Apache API, we clearly benefit from an expanded breadth of possibilities using mod_perl.
For instance, to log into our publishing system, you must access the SSL virtual host. It runs the Apache::AuthDBI module to look up users in the database. We have our own PerlHandler that issues a cookie with connection characteristics hashed against a secret key. A PerlAccessHandler then verifies that the user is permitted by checking the cookie values against the hashed data. Usernames and passwords never cross the wire in the clear, so users can access the publishing system across public networks reasonably assured of security.
Session data is maintained with Apache::Session to prevent repeated SQL lookups as users access the system. Once we know who the user is, we can populate their session with some of their personal information, their group memberships and other vital data that we don't want to have to fetch from the database for every request.
Of course there's also a Logout handler to ditch the session and login cookies. All of this is accomplished with fairly simple code on top of CPAN modules and the Apache API.
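The login cookie scheme described above can be sketched as follows. This is an illustration in Python rather than the Perl handlers Salon.com runs. The cookie layout, the choice of HMAC-SHA256, the use of the client IP as the "connection characteristic" and the names `issue_cookie` and `verify_cookie` are all assumptions, since the talk only says that connection characteristics are hashed against a secret key.

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-real-secret"  # assumption: a server-side secret


def _sign(username: str, client_ip: str) -> str:
    """Hash the connection characteristics against the secret key."""
    message = f"{username}|{client_ip}".encode()
    return hmac.new(SECRET_KEY, message, hashlib.sha256).hexdigest()


def issue_cookie(username: str, client_ip: str) -> str:
    """What a login handler might send back after authentication."""
    return f"{username}|{_sign(username, client_ip)}"


def verify_cookie(cookie: str, client_ip: str) -> bool:
    """What an access handler might check on every later request."""
    username, _, signature = cookie.rpartition("|")
    expected = _sign(username, client_ip)
    # Constant-time comparison avoids leaking how much of the hash matched.
    return hmac.compare_digest(signature.encode(), expected.encode())


cookie = issue_cookie("editor1", "10.0.0.5")
print(verify_cookie(cookie, "10.0.0.5"))   # same connection
print(verify_cookie(cookie, "10.0.0.99"))  # cookie replayed from elsewhere
```

Because the signature depends on the secret key, a client cannot forge a cookie for another user, and the password itself is never stored in the cookie.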
Best of Breed Choice: HTML::Mason
Mason has garnered a considerable amount of favor in write-ups appearing in WebTechniques, The Perl Journal, and PerlMonth, so don't just take my word for it. It's a component system originally devised to power the CMP TechWeb site, a leading technology publishing operation.
The dhandler and autohandler concepts are very powerful ways of expressing template application. For instance, a dhandler at the document tree root (you're actually concerned with the component root, coming...) can be a default formatting component when the document the URI originally translates to, or a higher level dhandler, is absent.
Component trees can be developed in parallel so that a sparse tree of components earlier in the component root path takes precedence over those later in the path. We're currently planning on implementing developer-specific sparse trees so that developers can work in their own self-contained environment, accessing a component root path specific to their access. Integrating this with CVS will give our engineering a truly multiuser development environment. This characteristic is unique to Mason (AFAIK) and one of the key advantages it maintains over other mod_perl component systems.
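To make the lookup idea concrete, here is a small Python sketch of an ordered component root path with dhandler fallback. It illustrates the concept only and is not HTML::Mason's actual resolution algorithm: the dictionaries standing in for component trees, the sample paths and the functions `lookup` and `resolve` are all assumptions.

```python
import posixpath

# Each component root is modelled as {path: component source}.
# A developer's sparse tree comes first in the path, the shared tree last.
developer_tree = {"/news/dhandler": "dev: experimental story format"}
shared_tree = {
    "/dhandler": "shared: default story format",
    "/news/index.html": "shared: news front page",
}
COMPONENT_ROOTS = [developer_tree, shared_tree]


def lookup(path: str):
    """Return the component from the earliest root that has this path."""
    for root in COMPONENT_ROOTS:
        if path in root:
            return root[path]
    return None


def resolve(uri: str):
    """Find the component for a URI, falling back to the nearest dhandler."""
    component = lookup(uri)
    if component is not None:
        return component
    directory = posixpath.dirname(uri)
    while True:
        component = lookup(posixpath.join(directory, "dhandler"))
        # Stop once something is found or the top of the tree is reached.
        if component is not None or directory in ("/", ""):
            return component
        directory = posixpath.dirname(directory)


print(resolve("/news/index.html"))     # exact match in the shared tree
print(resolve("/news/2000/07/story"))  # nearest dhandler, developer's copy
print(resolve("/sports/story"))        # falls back to the root dhandler
```

The developer's tree only needs to contain the components being changed; everything else falls through to the shared tree.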
Best of Breed Choice: DBI
The DBI API provides facilities for working with an RDBMS in standard ways. Placeholders, cached SQL parsing and other features sweeten the deal further, providing robust performance and rapid development.
We're using Oracle 8.1.5 on Solaris, which performs very reliably. DBD::Oracle, the driver that DBI uses to access the Oracle engine, is very fast and handles CLOB and other Oracle data types just fine. While we'd like a lighter budget for our RDBMS, we require integrity constraints and robustness that mandate a commercial database.
Since we're running DBI under mod_perl, our libraries that deal with Oracle directly (and in turn provide a higher level API to the tools and formatting components running under HTML::Mason) can take advantage of Apache::DBI's persistent database connections and connection pool maintenance.
In the future we may pursue forking our code to run under PostgreSQL. It seems to have the most complete SQL implementation of the non-commercial databases. If we get really adventurous, we may experiment with supporting MySQL even though that will require integrity enforcement within our Perl libraries.
The publishing technology roadmap
There are a number of maturing technologies that we want to integrate to extend how we publish in the future.
- XML
- We already use XML extensively to store "data blobs" since we find that users need to be able to vary what meta-data they store without adapting the database schema. In the future, we intend to make the XML and RDBMS integration more pervasive and flexible.
- XSLT
- You don't need an XSLT processor if your native programming language (Perl, Java, Python, etc.) can express all of the data structures and logic needed to apply formatting. However, perhaps we'll want to abstract out our APIs so that they can work not only with mod_perl and our repertoire of Perl libraries but also with Java and Enterprise Java Beans, Python or some other language. XSLT opens up our "best of breed" pursuit to these other technologies.
- What about Java?
- We anticipate being able to leverage the things that Java is good at and integrate it into the publishing system.
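Storing variable metadata as an XML blob alongside fixed relational columns might look like the sketch below. The element names (`meta`, `field`), the table layout, the sample values and the use of SQLite are assumptions for illustration; the talk does not describe the actual schema, which lives in Oracle.

```python
import sqlite3
import xml.etree.ElementTree as ET


def to_blob(metadata: dict) -> str:
    """Serialize arbitrary key/value metadata as a small XML document."""
    root = ET.Element("meta")
    for name, value in metadata.items():
        field = ET.SubElement(root, "field", name=name)
        field.text = value
    return ET.tostring(root, encoding="unicode")


def from_blob(blob: str) -> dict:
    """Parse the XML blob back into a dictionary."""
    return {f.get("name"): f.text for f in ET.fromstring(blob).iter("field")}


db = sqlite3.connect(":memory:")
# Fixed, relational columns plus one free-form XML column.
db.execute("CREATE TABLE story (slug TEXT PRIMARY KEY, headline TEXT, meta TEXT)")
db.execute(
    "INSERT INTO story VALUES (?, ?, ?)",
    ("mason", "Publishing with Mason",
     to_blob({"photo_credit": "Staff", "pull_quote": "Bake what you can."})),
)

(blob,) = db.execute("SELECT meta FROM story WHERE slug = ?", ("mason",)).fetchone()
print(from_blob(blob))
```

Adding a new kind of metadata then means writing a new `field` element, not altering the table, at the cost of not being able to index or constrain those values in SQL.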
The quick tour
We'll take a quick look around the system interface now. The basic flow of defining a new story, editing story data, associating other data nodes such as writers with a story, desking and publishing a story is shown. Browsing and editing formatting components is also shown.
Access controls are applied by granting privileges to user groups. The user creation, password management and group assignments are all done via additional web interfaces. We're going to be implementing a user preference tracking system so that certain aspects of the interface will be customizable on a per-user basis.
Additional functionality for managing keywords and ad placements as well as other tools such as polls, redirect trackers and so forth are also accessible via the web user interfaces.
When you log into the system, you land on your desk. On the left are the navigation links to the workflow desks, defining a new story, editing writer data and other content management functions.
When defining a new story, the first step is to categorize it, specify its target publish date and give it an identifier for that category and day: the slug. This places the story with the minimum amount of info needed to assign it a URL.
Once a story is placed, we can start defining its metadata and narrative content.
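As a rough sketch of why category, publish date and slug are enough to assign a URL, consider the following Python. The URL layout shown (category path, then year/month/day, then slug) and the names `Placement` and `story_url` are hypothetical; the talk does not spell out how the system composes its URLs.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Placement:
    """The minimum data needed to place a story."""
    category_path: str   # e.g. "/tech/feature"
    publish_date: date   # target publish date
    slug: str            # identifier unique within the category and day


def story_url(p: Placement) -> str:
    """Build a URL from the placement (hypothetical layout)."""
    return f"{p.category_path.rstrip('/')}/{p.publish_date:%Y/%m/%d}/{p.slug}/"


placement = Placement("/tech/feature", date(2000, 7, 20), "publishing")
print(story_url(placement))
```

Because the URL depends only on the placement, it can be assigned before any metadata or narrative content exists.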
When editing a story, the associated writer data can be picked by a search or a browsing interface.
Once a user is ready to hand off a story to someone else in the workflow, the desk that it should go to is selected; submitting saves it and makes it available for someone else browsing that desk to check out.
Eventually, the story gets to the Publish desk where it can be selected to go live. The story that we saw in the "Story Edit" window is now live.
Formatting components are implemented in HTML::Mason; they're used to bake the editorial data into output that will be served for further request-time processing.
Formatting components can consist of HTML, Perl or, as is usually the case, a mix of the two.
End Notes
There are a lot of code and conceptual refinements coming to the system but I hope this has been an informative walk through the experiences and thinking that went into our technology decisions.
This presentation:
http://www.salon.com/contact/staff/idk/
This presentation is written in HTML::Mason. The slides are styled and navigation links built using an HTML::Mason autohandler. Screenshots come from The Gimp, acquired from Netscape Navigator 4.7 windows running under FreeBSD 3.4 and the Windowmaker window manager.
Technologies: