Content management and site search represent different software families, but they have nonetheless become increasingly related. Clearly, a good site search engine is essential to any high-quality Web publishing effort. Moreover, a CMS can significantly improve site searches by normalizing (and even automatically generating) Meta information such as keywords, page titles, content descriptions, structured asset descriptors, categories, and so forth. The result is improved site searches for your visitors, potentially improved Internet search engine rankings (Search Engine Optimization), and reduced cost and effort on your part to add these features to your site.
When it comes to Web site searches, two approaches dominate. The most common method is “spider”-based search, which scans your site and builds an index of your published content. This works well for static HTML sites, and is also the technique used by Internet search systems such as Google, AltaVista, and Lycos. The other approach -- “dynamic” search -- is used where a database delivers fluid content. In this case, the user’s search is turned into a database query and submitted to the database, and the results are reformatted in HTML.
With both approaches, a few basic technology alternatives have an important impact on search performance and results. Among these are full text vs. fielded searching, and whether or not the content is indexed. Because all CMSs, at their core, store your content in a database, it may seem that the dynamic approach would be the preferred method when a CMS is involved. In reality, there are major differences in how various CMSs handle site searches; these differences affect how well your search system works and how simple it is for you to manage.
Full Text vs. Fielded Searching
Full text searching treats your content like one big block of text. If you have records, pages, etc., the fields (e.g., title, author, body, date, categories) are considered together when searching. If any of the search query words appears in any of these fields, the whole document is returned. An example might be articles with an author field and a body field, where some of the articles were authored by Mark Twain (the name appears in the author field), and some of the articles are about Mark Twain (the name appears within the body). In this example, searching for “Mark Twain” would return all of the documents with the name in either field. Full text searching is useful when your content consists of a great deal of text: articles, press releases, product documentation, etc. It is also useful when you are looking for rare data and need to find any mention of the query words. Full text search is often referred to as “basic search.”
Fielded searching enables you to look for query words in specific fields. In the example above, you could limit the search to the author field, thus returning articles by Mark Twain, but not articles about him. In many systems, this is called the “advanced search,” and it is useful when you need to find a specific piece of information, or when your queries return a great many records.
Many search systems combine these approaches to allow both “basic” and “advanced” searching.
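The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: the sample records, the field names, and the two function names are assumptions, and for brevity the query is matched as a single phrase rather than word by word.

```python
# Hypothetical sample records; the field names are assumptions for illustration.
ARTICLES = [
    {"title": "Life on the Mississippi", "author": "Mark Twain",
     "body": "A memoir of steamboat days."},
    {"title": "American Humorists", "author": "Jane Doe",
     "body": "Mark Twain shaped American satire."},
    {"title": "Search Basics", "author": "John Roe",
     "body": "An introduction to site search."},
]

def full_text_search(records, query):
    """Basic search: return records where the query appears in any field."""
    needle = query.lower()
    return [r for r in records
            if any(needle in str(value).lower() for value in r.values())]

def fielded_search(records, field, query):
    """Advanced search: return records where the query appears in one named field."""
    needle = query.lower()
    return [r for r in records if needle in str(r.get(field, "")).lower()]

basic = full_text_search(ARTICLES, "Mark Twain")
advanced = fielded_search(ARTICLES, "author", "Mark Twain")
print([r["title"] for r in basic])     # articles by or about Mark Twain
print([r["title"] for r in advanced])  # only articles by Mark Twain
```

The basic search returns both the article written by Mark Twain and the one that merely mentions him; the fielded search on the author field returns only the former.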
Indexing for Performance
As your content grows, searching through all of it for each user query becomes a slow, resource-consuming process. To solve this problem, an index is created for the content. The index is usually built off-line, and contains a list of every word in all of your documents, along with the documents that contain each word. Because the index is sorted, a query word can be looked up quickly, and the matching documents are found rapidly. Indexing has two main drawbacks, however: the indexes must be kept current (adding new data can be slower, and indexing may lag behind insertion), and storage requirements increase, since indexes can often consume as much space as the original content. Even so, indexing is essential for reasonable performance on larger data sets, and given the low price of disk space, it remains very common.
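A word-to-documents index of this kind can be sketched as follows. This is a minimal illustration, not how any particular product works: the document names, the tokenizer, and the choice to require every query word (rather than any one of them) are assumptions, and a Python dictionary stands in for the sorted index file.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Split text into lowercase alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(documents):
    """Map each word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Return the ids of documents that contain every query word."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

# Hypothetical documents, keyed by file name.
DOCS = {
    "a.html": "Content management and site search",
    "b.html": "Site search with a spider",
    "c.html": "Managing press releases",
}
index = build_index(DOCS)
print(sorted(search(index, "site search")))
```

Each query now costs a few dictionary lookups instead of a scan of every document, which is the trade the index makes in exchange for extra storage and the need to rebuild when content changes.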
With indexed search systems, it’s important to know whether you can search for partial words. Many systems provide stemming, which removes plurals and other common word endings so that a query for “searches” also matches “search.” When a query returns few results, you may also want partial-word and wildcard matching, which some systems do not offer.
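Both ideas can be illustrated briefly. The stemmer below is a deliberately crude stand-in that handles only a few English plural endings (production systems use far more careful algorithms), and the trailing-wildcard lookup assumes the index vocabulary is available as a sorted list; all names here are hypothetical.

```python
import bisect

def naive_stem(word):
    """Strip a few common English plural endings. A crude stand-in for a real stemmer."""
    if word.endswith("ies") and len(word) > 4:
        return word[:-3] + "y"
    if word.endswith(("ches", "shes", "xes", "sses", "zes")):
        return word[:-2]
    if word.endswith("s") and not word.endswith("ss") and len(word) > 3:
        return word[:-1]
    return word

def prefix_matches(sorted_words, prefix):
    """Return all indexed words that start with prefix (a trailing-wildcard query)."""
    start = bisect.bisect_left(sorted_words, prefix)  # binary search on the sorted list
    matches = []
    for word in sorted_words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches

# Hypothetical index vocabulary.
VOCABULARY = sorted({"search", "searches", "searching", "spider", "site", "index"})
print(naive_stem("searches"), naive_stem("categories"))
print(prefix_matches(VOCABULARY, "sear"))
```

Because the vocabulary is sorted, all words sharing a prefix sit next to each other, so a trailing wildcard is cheap. A leading wildcard would not benefit from this ordering, which is one reason some systems leave wildcards out.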
Spider vs. Dynamic
Building upon these basic technologies, the two most common Web site search systems come into play:
Spider-based search systems work on your content remotely. They use the Web server to request a document, index it, and then put any links found in that document in a queue to be fetched and indexed in turn. In this manner, a search spider moves through all of the linked content in your site, building an index of words and URLs. The indexing is scheduled to run as often as necessary to keep up with changing content -- typically nightly. A separate query system allows a user to type in query words, searches the index, and returns a list of links to the pages that contain those words.
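The request, index, and queue cycle can be sketched with Python’s standard `html.parser` module. To keep the example self-contained, a small in-memory dictionary stands in for the Web server; the page contents, the `crawl` function, and the `PageParser` class are all hypothetical, and a real spider would fetch pages over HTTP and respect robots rules.

```python
import re
from collections import deque
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collect link targets and text from one HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_data(self, data):
        self.text.append(data)

# Hypothetical site keyed by URL path; it stands in for requests to a Web server.
SITE = {
    "/": '<a href="/news">News</a> <a href="/about">About</a> Welcome',
    "/news": '<a href="/">Home</a> Press release about search',
    "/about": "About our company",
    "/orphan": "Never linked, so never indexed",
}

def crawl(fetch, start):
    """Breadth-first crawl from start; return a word -> set of URLs index."""
    index, seen, queue = {}, {start}, deque([start])
    while queue:
        url = queue.popleft()
        page = fetch(url)
        if page is None:  # broken link: nothing to index
            continue
        parser = PageParser()
        parser.feed(page)
        parser.close()  # flush any text still buffered by the parser
        for word in re.findall(r"[a-z0-9]+", " ".join(parser.text).lower()):
            index.setdefault(word, set()).add(url)
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return index

index = crawl(SITE.get, "/")
print(sorted(index["search"]))
```

Note that the `/orphan` page never reaches the index because no other page links to it: a spider only sees what it can reach by following links.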
It’s important to note that spidered searches are not limited to full text content – the page can have fielded information, which can be separately indexed. The most common fields are the Meta title and Meta description fields, since both Internet search engines and local site search use them.
This is where a CMS can be a big help: with a CMS, you enter the information only once, and the CMS takes care of creating the pages and Meta tags automatically, since the latter often contain a subset of the former. Additional Meta content, such as keywords that do not appear in the document text or information about the author, can also be inserted into Meta tags automatically, and a CMS can easily generate the alt text for images so that image search engines will more readily find your documents.
Be aware that search engines limit the size of Meta fields and expect Meta information to be plain text. Here again, a CMS can be a great help, since the system can make sure the fields are compliant – high-end CMSs can even remove noise words so that the Meta description fits within a search engine’s limits. Some can even enhance the Meta description for search engine optimization by summarizing the document body for the description field.
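As a rough sketch of what such automation might look like, the following Python generates a title and Meta tags from a single content record. The 150-character limit and the noise-word list are assumed values chosen for illustration (actual limits vary by search engine), and the function names are hypothetical.

```python
import html

# Assumed values for illustration; real limits and noise-word lists vary.
MAX_DESCRIPTION_LENGTH = 150
NOISE_WORDS = {"a", "an", "the", "of", "and", "to", "in", "is", "for"}

def build_description(body, limit=MAX_DESCRIPTION_LENGTH):
    """Collapse whitespace, then drop noise words only if the text is too long."""
    words = body.split()
    text = " ".join(words)
    if len(text) > limit:
        words = [w for w in words if w.lower() not in NOISE_WORDS]
        text = " ".join(words)
    if len(text) > limit:
        # Cut at the last word boundary that fits within the limit.
        text = text[:limit + 1].rsplit(" ", 1)[0][:limit]
    return text

def meta_tags(title, body, keywords):
    """Render the title and Meta tags as escaped, plain-text HTML head elements."""
    description = build_description(body)
    return "\n".join([
        f"<title>{html.escape(title)}</title>",
        f'<meta name="description" content="{html.escape(description)}">',
        f'<meta name="keywords" content="{html.escape(", ".join(keywords))}">',
    ])

print(meta_tags("Tom & Jerry", "The history of the search engine.", ["search", "history"]))
```

Escaping the values keeps the Meta content valid plain text even when the source record contains characters such as `&` or quotation marks.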
Dynamic searches rely on having all of your content in a database. Typically, each document is a record, and the fields are separated out in the database. A query is executed against the database using SQL commands (most commonly), and the results are formatted and returned to the user. The advantage of a database search is that the data is always current – it becomes available as soon as it’s entered without a delay for indexing. Dynamic searches often make document access control easier since the access information is stored right along with the data. Databases also provide indexing, though at some performance and storage cost, so indexes are limited to the fields most often used.
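A dynamic search can be sketched with Python’s built-in `sqlite3` module. The table layout, the column names, and the helper functions below are assumptions made for illustration; a production system would use its own schema and database.

```python
import html
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, author TEXT, body TEXT)"
)
# Index only the field expected to be queried most often.
conn.execute("CREATE INDEX idx_documents_author ON documents (author)")

def add_document(title, author, body):
    with conn:  # commits the insert on success
        conn.execute(
            "INSERT INTO documents (title, author, body) VALUES (?, ?, ?)",
            (title, author, body),
        )

def search(term, field=None):
    """Fielded search on author when field == 'author'; otherwise scan title and body."""
    if field == "author":
        sql = "SELECT title FROM documents WHERE author = ? ORDER BY id"
        params = (term,)
    else:
        # LIKE with a leading wildcard cannot use an ordinary index, so this scans rows.
        sql = "SELECT title FROM documents WHERE title LIKE ? OR body LIKE ? ORDER BY id"
        params = (f"%{term}%", f"%{term}%")
    return [title for (title,) in conn.execute(sql, params)]

def render(titles):
    """Format the result titles as a minimal HTML list."""
    items = "".join(f"<li>{html.escape(t)}</li>" for t in titles)
    return f"<ul>{items}</ul>"

add_document("Roughing It", "Mark Twain", "Travels through the American West.")
add_document("American Humorists", "Jane Doe", "Mark Twain shaped American satire.")
print(render(search("Mark Twain", field="author")))
print(render(search("satire")))
```

The new rows are searchable immediately after they are inserted, with no separate indexing pass. The cost shows up in the unindexed `LIKE` query, which has to examine every row.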
Products like Oracle interMedia provide independent indexing and searching for databases as well as documents in one integrated system.
Using a CMS with a Search Engine
Given that a CMS stores all of your content in a database, you may find it surprising that many content management systems use spidered techniques to provide site search rather than deploying a dynamic search. An installed CMS that relies on user accounts to protect information may provide site search in a more dynamic fashion, but most CMSs work well with spidered search systems. By using a CMS with a spidered search system, the process of adding search to your site can be greatly streamlined – thanks to some of the techniques we’ve discussed, as in the following situations:
- If your site contains an abundance of specific content (articles, product descriptions, etc.), you often want to restrict your site search to those content items and exclude things like your navigation pages and privacy policy. A CMS can help with this situation: one way is to put a checkbox on each content entry form, so that you can decide page by page which pages should be indexed. A CMS can also generate a special index page for the search spider, containing links to just the pages you want it to visit.
- Or suppose you want to add controlled Meta tags to your site, such as the Dublin Core standard fields. By adding these fields, you greatly facilitate the use of your content as a data source for other systems. Adding these tags by hand is labor intensive, but a CMS can automatically add them along with the other Meta tags, making conformance to this standard completely painless.
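Both situations can be sketched together. In the Python below, the page records, the `include_in_search` flag (standing in for the checkbox on the entry form), and the function names are hypothetical. The Dublin Core output follows the common convention of `DC.`-prefixed Meta names with a schema link, and covers only three of the standard’s elements.

```python
import html

# Hypothetical CMS records; "include_in_search" stands in for the entry-form checkbox.
PAGES = [
    {"url": "/articles/site-search", "title": "Site Search & CMS",
     "creator": "Jane Doe", "date": "2004-01-15", "include_in_search": True},
    {"url": "/privacy", "title": "Privacy Policy",
     "creator": "Web Team", "date": "2003-06-01", "include_in_search": False},
]

def spider_index_page(pages):
    """Build an index page that links only to the pages flagged for search."""
    links = [
        f'<li><a href="{html.escape(p["url"])}">{html.escape(p["title"])}</a></li>'
        for p in pages if p["include_in_search"]
    ]
    return "<ul>\n" + "\n".join(links) + "\n</ul>"

def dublin_core_tags(page):
    """Render a few Dublin Core elements as Meta tags for one page."""
    fields = [
        ("DC.title", page["title"]),
        ("DC.creator", page["creator"]),
        ("DC.date", page["date"]),
    ]
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    lines += [
        f'<meta name="{name}" content="{html.escape(value)}">'
        for name, value in fields
    ]
    return "\n".join(lines)

print(spider_index_page(PAGES))
print(dublin_core_tags(PAGES[0]))
```

Pointing the spider at the generated index page, rather than at the home page, keeps the privacy policy and similar pages out of the search results without any hand editing.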
Summary
Adding search to your site involves a few key technology decisions, but the market offers a number of products that make implementation simpler. Once you add site search, providing quality results and enhancing Internet search engine rankings can require extensive manual effort. Fortunately, using a CMS not only streamlines your content creation and deployment, but also enables you to automate some of the Meta tagging chores for your content pages, yielding effortless improvements in your site’s knowledge base.
Carl Sutter is a Principal with CrownPeak Technology, in Marina Del Rey, California. CrownPeak offers Advantage CMS, an ASP-based Content Management Service.