Using the Sitemap Protocol
What is the Sitemap protocol? #
It's a tool for webmasters to help search engines crawl their website and make sure that all the relevant content gets indexed. It was started by Google, but it's now being developed jointly with Yahoo and MSN. The site http://sitemaps.org has all the information that anyone might need about it. Here is their description of the protocol:
Sitemaps are an easy way for webmasters to inform search engines about pages on their sites that are available for crawling. In its simplest form, a Sitemap is an XML file that lists URLs for a site along with additional metadata about each URL (when it was last updated, how often it usually changes, and how important it is, relative to other URLs in the site) so that search engines can more intelligently crawl the site.
Web crawlers usually discover pages from links within the site and from other sites. Sitemaps supplement this data to allow crawlers that support Sitemaps to pick up all URLs in the Sitemap and learn about those URLs using the associated metadata. Using the Sitemap protocol does not guarantee that web pages are included in search engines, but provides hints for web crawlers to do a better job of crawling your site.
Sitemap 0.90 is offered under the terms of the Attribution-ShareAlike Creative Commons License and has wide adoption, including support from Google, Yahoo!, and Microsoft.
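To make the description above more concrete, here is a minimal Python sketch that builds a Sitemap document by hand. It is only an illustration of the XML format and not how Liferay generates its sitemap; the function name `build_sitemap` and the example.com URLs are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the Sitemap 0.9 schema on sitemaps.org.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(entries):
    """Return a Sitemap XML string for a list of URL entries.

    Each entry is a dict with a required "loc" key and optional
    "lastmod", "changefreq" and "priority" keys (hypothetical input
    shape chosen for this sketch).
    """
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for entry in entries:
        url = ET.SubElement(urlset, "url")
        # Emit only the child elements that the entry actually provides.
        for tag in ("loc", "lastmod", "changefreq", "priority"):
            if tag in entry:
                ET.SubElement(url, tag).text = str(entry[tag])
    return ET.tostring(urlset, encoding="unicode")


xml_text = build_sitemap([
    {"loc": "http://www.example.com/", "changefreq": "daily", "priority": 0.8},
    {"loc": "http://www.example.com/about", "lastmod": "2007-01-15"},
])
print(xml_text)
```

Only `loc` is given for every URL here; the other elements are optional metadata, which is why the sketch skips them when they are missing.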
Does Liferay support the Sitemap protocol? #
Yes. Starting with Liferay 4.3 (not released at the time of writing, see LEP-1630 for details), Liferay has out-of-the-box support for automatically generating sitemap information and for notifying the main search engines about it.
For earlier versions, it's also possible to support the protocol with some custom programming or by using the tools provided by Google.
How do I use it? #
It's actually pretty simple because Liferay generates the sitemap XML automatically for all public websites. To try it out, open a public site of a community or a user and go to 'Page Settings'. Click on the root node of the tree and you'll see a tab that says 'Sitemap'. Going to that tab should show the following:
Clicking on one of the Search Engine links sends the sitemap to that search engine. Note that it's only necessary to do this once per site. The search engine crawler will automatically ask for the sitemap again every so often.
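For reference, the Sitemap protocol describes this kind of notification as a plain HTTP request to a search engine's ping address, with the URL-encoded sitemap location passed in a `sitemap` parameter. The sketch below only builds such a request URL; the endpoint `http://searchengine.example.com/ping` and the sitemap location are hypothetical placeholders, not the addresses Liferay actually uses.

```python
from urllib.parse import urlencode


def build_ping_url(ping_endpoint, sitemap_url):
    """Build a notification URL of the form <endpoint>?sitemap=<encoded URL>."""
    # urlencode percent-encodes the sitemap URL so that it can travel
    # safely as a query parameter value.
    return ping_endpoint + "?" + urlencode({"sitemap": sitemap_url})


# Both addresses are placeholders chosen for this example.
ping_url = build_ping_url(
    "http://searchengine.example.com/ping",
    "http://www.example.com/sitemap.xml",
)
print(ping_url)
```

Each search engine publishes its own notification address, so check its documentation before sending a real request.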
Clicking the 'Preview' link will allow you to see the generated XML. This is interesting if you want to know what is being sent to the search engines.
Customizing the sitemap #
The sitemap protocol allows for the following parameters for each page of the website:
- Change frequency: always, daily, weekly, etc.
- Priority: a number from 0.0 to 1.0 indicating the priority of the page relative to other pages of the website
- Last modification: the date the page was last modified (this is not currently supported in the automatic XML generated by Liferay)
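As a rough illustration of the first two parameters, the following Python sketch checks a change frequency and a priority against the values the protocol allows. The list of frequencies is the one published on sitemaps.org; the function name `validate_page_settings` is a hypothetical helper for this example and not part of Liferay.

```python
# Change frequency values allowed by the Sitemap protocol.
CHANGE_FREQUENCIES = (
    "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
)


def validate_page_settings(changefreq, priority):
    """Return a list of problems found in the two values (empty if valid)."""
    errors = []
    if changefreq not in CHANGE_FREQUENCIES:
        errors.append("unknown change frequency: %r" % (changefreq,))
    # Priority must lie between 0.0 and 1.0, both inclusive.
    if not 0.0 <= priority <= 1.0:
        errors.append("priority out of range: %r" % (priority,))
    return errors


print(validate_page_settings("weekly", 0.5))
print(validate_page_settings("sometimes", 1.5))
```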
Liferay lets you set the first two parameters through the 'Page Settings' tool as shown in the following screenshot:
Liferay additionally lets the administrator of the website select which pages of the website are to be included. In order for a page to be included the following conditions have to be met:
- It has to be of a layout type that can be included in the sitemap. The default configuration (in portal.properties) establishes that the types Portlet, Embedded and Article support it while URL and Link to Page do not (because including them would not make sense).
- It is not hidden
- It is configured to be included in the sitemap (the default)
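The three conditions above can be sketched as a simple filter. This is only an illustration of the rules and not Liferay's actual code: the `Page` class, its field names and the lowercase type names are all assumptions made for this example.

```python
from dataclasses import dataclass

# Layout types that may appear in the sitemap, mirroring the default
# configuration described above (the names are assumed for this sketch).
SITEMAP_LAYOUT_TYPES = {"portlet", "embedded", "article"}


@dataclass
class Page:
    """Hypothetical stand-in for a page of the website."""
    name: str
    layout_type: str
    hidden: bool = False
    include_in_sitemap: bool = True  # included by default


def is_in_sitemap(page):
    """Apply the three inclusion conditions to a single page."""
    return (
        page.layout_type in SITEMAP_LAYOUT_TYPES
        and not page.hidden
        and page.include_in_sitemap
    )


pages = [
    Page("Home", "portlet"),
    Page("External link", "url"),
    Page("Drafts", "portlet", hidden=True),
    Page("Internal", "article", include_in_sitemap=False),
]
included = [page.name for page in pages if is_in_sitemap(page)]
print(included)
```

In this example only the "Home" page passes all three checks; each of the other pages fails exactly one condition.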