Liferay remote publishing - Troubleshooting

During the last Liferay North America Symposium in Boston, I had the opportunity to attend Máté's interesting presentation (Best Practices for Using Staging in Liferay 6.2). I have always been fascinated by this complex Liferay feature, and I have spent hours "struggling" with it over the past years while helping companies implement and use it.

Remote publishing has improved a lot since its first implementation and is very robust and reliable in Liferay 6.2. However, this feature is so complex that there are many situations in which the process will fail or not complete as you would expect.

I'd like to share some of my past experiences to help you understand how remote publishing works and how to debug and fix some common issues. Please also refer to the Liferay documentation for a basic understanding of Staging Page Publication and its configuration.

This applies only to Liferay 6.2+. Remote publishing has been re-implemented for this version and works differently than in previous versions.

Understanding the remote publishing process

The remote publishing feature is based on Liferay's export/import functionality. There are several important steps:

  • 1. Connection: the staging server establishes a connection with the remote server in order to check the configuration.
  • 2. Export: the staging server exports the desired site and its content as an archive (.lar) to local storage (temp).
  • 3. Data transfer: the staging server transfers the archive to the remote server.
  • 4. Checksum: the remote server validates the archive's integrity.
  • 5. Validation: the remote server checks for missing references or invalid content.
  • 6. Import: the remote server proceeds with the import of the content. If it fails, the entire import is rolled back.
  • 7. Cleanup: both servers clean up their temporary files.

If you're lucky, everything will work as expected ;-)

This will most probably be the case when you test it for the first time with a small site and a few content items. You'll realize later that it gets trickier with complex web sites containing hundreds of pages and web contents.

Remote publishing fails with errors

If your publication fails, first check the error message. Sometimes it gives you a good hint about the issue (most likely a configuration problem or a missing reference).

If the error message says "Unexpected error" with strange details (FileNotFoundException, InvalidCheckSum, ...), proceed by identifying which step is failing and then follow the corresponding advice below.

Identify which step fails

In order to identify which step fails, you need access to both servers (staging and remote). The remote publishing process will not clearly indicate which step failed, but you can figure it out by observing the servers.

  • During the export (step 2), Liferay creates a temporary file for the archive on the application server of the staging environment. For instance, check the /temp folder in Tomcat to see whether a new file is being created. If so, you know that Liferay is proceeding with the export.
  • During the data transfer (step 3), Liferay sends the archive to the remote server by splitting it into 10MB pieces. The remote server receives them and stores them in its document library. If the size of the archive file in the temp folder (staging) is no longer increasing and you see the remote server starting to create new files in its document library (/data/document_library), the process is currently in the data transfer phase. Another way to identify the beginning of this step is to monitor the CPU of the staging server: the export uses a lot of CPU (to compress data into the zip file), but the data transfer doesn't.
  • Step 4 is extremely quick and you won't be able to distinguish it from the previous step. Generally, you will get an obvious message (InvalidChecksumException) if the process fails during this step.
  • Validation (step 5) is also difficult to identify clearly, but you should get a detailed error message about missing references when this step fails.
  • When the process starts importing the data, you'll notice that CPU usage increases on the remote server. You should also notice that the individual 10MB pieces have been removed and merged into one valid package in the document library (/data/document_library). This step can take several minutes to complete.
  • The last step never fails: if the import succeeds, the cleanup will succeed as well.
Steps 3 (data transfer) and 6 (import) are the most likely to fail. If you are not sure which step fails, it is probably one of these two.
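If you have shell access to the staging server, a quick way to tell the export and transfer phases apart is to watch whether the archive in the application server's temp folder is still growing. This is a minimal sketch; the Tomcat path and the short sampling interval are assumptions to adjust for your installation:

```shell
#!/bin/sh
# Watch the biggest file in the staging server's temp folder
# (hypothetical Tomcat path -- set CATALINA_HOME for your install).
DIR="${CATALINA_HOME:-/opt/liferay/tomcat}/temp"

# Size in bytes of the largest file (empty if the folder is missing).
largest() { ls -l "$1" 2>/dev/null | awk '$5 ~ /^[0-9]+$/ {print $5}' | sort -n | tail -1; }

size1=$(largest "$DIR"); size1="${size1:-0}"
sleep 2
size2=$(largest "$DIR"); size2="${size2:-0}"

if [ "$size2" -gt "$size1" ]; then
  phase="archive still growing: export (step 2) in progress"
else
  phase="archive size stable: export done or data transfer (step 3) started"
fi
echo "$phase"
```

Run it repeatedly during a publication; the moment the size stops growing is roughly when the data transfer begins.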

Error during 1. Connection

If you get an error during the connection (after a few seconds), check one of these:

  • Make sure that both servers have been properly configured (tunneling.servlet.shared.secret, axis.servlet.hosts.allowed, ...). Check Liferay Documentation.
  • If you're using a web proxy server (Apache HTTPD, BigIP F5, NetScaler, ...), make sure that it preserves the original host in the proxied request, or the request will be rejected by the remote server due to an invalid IP address. For instance, use the "ProxyPreserveHost On" directive in Apache.
  • If you cannot configure the proxy server (see the previous point), consider changing the "axis.servlet.hosts.allowed" property of the remote server so that it matches the IP address of the proxy server (rather than the staging server). WARNING: by doing this, you're allowing anyone going through the proxy to access Liferay's remote API. This is insecure.
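The two properties mentioned above live in portal-ext.properties. A minimal sketch with placeholder values (in 6.2 the shared secret is hex-encoded, and it must be identical on both servers):

```properties
# portal-ext.properties -- illustrative placeholder values only
# Same hex-encoded secret on BOTH the staging and the remote server:
tunneling.servlet.shared.secret=31323334353637383930313233343536

# On the remote server: IP addresses allowed to call the remote API
# (the staging server's IP, or the proxy's IP if you cannot preserve the host):
axis.servlet.hosts.allowed=127.0.0.1,192.168.0.10,SERVER_IP
```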
An easy way to check connectivity to the remote server is to access the /api/axis URL (with wget or a browser). If you're not authorized, you'll get a clear message containing the requester's IP address. This will help you understand what the remote server actually receives.
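For instance, a quick probe from the staging server (host and port are placeholders):

```shell
#!/bin/sh
# Probe the remote server's /api/axis endpoint (hypothetical host/port).
REMOTE="${REMOTE:-remote.example.com:8080}"

code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "http://$REMOTE/api/axis" 2>/dev/null || true)
code="${code:-000}"

case "$code" in
  200) echo "reachable and authorized" ;;
  403) echo "reachable, but rejected: check axis.servlet.hosts.allowed" ;;
  000) echo "no connection at all: check firewalls, proxies and the URL" ;;
  *)   echo "got HTTP $code" ;;
esac
```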

Error during 2. Export

Export should not fail. If it really does, I recommend:

  • Check that the server hasn't run out of disk space ;-)
  • Try to export the site (from the Control Panel) to see whether the export itself is really the problem
  • Try installing the latest patches (Liferay EE). Remote publishing is frequently improved by Liferay.
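A couple of quick checks on the staging server (the Tomcat path is an assumption, adjust it to your installation):

```shell
#!/bin/sh
# Free disk space where the .lar archive gets written (hypothetical path).
TEMP_DIR="${CATALINA_HOME:-/opt/liferay/tomcat}/temp"

df -h "$TEMP_DIR" 2>/dev/null || df -h /tmp    # fall back to /tmp if the path differs

# Any archive left over from a previous (failed) export?
ls -lh "$TEMP_DIR"/*.lar 2>/dev/null || echo "no .lar archive in $TEMP_DIR"
```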

Error during 3. Data Transfer

This is a tricky one because you'll get strange errors and it's very difficult to debug. Try to find a good system administrator in the company who can monitor connections on the network (Wireshark or similar). Common issues that I have faced:

  • Check timeouts and any other rules on proxies, servers, switches, application servers and anti-virus software. If the process always fails after X minutes, there is a good chance that something on your network, between the two servers, cuts the connection.
  • Check the size of the archive in the temp folder of the staging environment. If the file is bigger than 10MB, it will be sent to the remote server in multiple pieces. Test with less data to see whether the problem is related to that.
  • Good luck!

Error during 4. Checksum

This step should technically never fail. If it does, consider one of these recommendations:

  • Try to publish again. Maybe one of the transferred files got corrupted during the transfer.
  • If you're using a cluster, check whether the transferred pieces (when the archive is bigger than 10MB) all end up on the same node. They should, but if they don't, Liferay will end up with 50% of the files on one server and 50% on the other (assuming your cluster has 2 servers). The checksum will then always fail.
  • As with the data transfer step, check the size of the archive in the temp folder of the staging environment: if it is bigger than 10MB, it is sent in multiple pieces. Test with less data to see whether the problem is related to that.
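To get an intuition for what steps 3 and 4 do with a large archive, here is an illustrative sketch (not Liferay's actual wire protocol): split a file into 10MB pieces, reassemble them and verify the checksum, which is exactly what must go wrong for an InvalidChecksumException to appear.

```shell
#!/bin/sh
# Illustration only: split a fake 25MB "archive" into 10MB pieces,
# reassemble them and compare checksums, mimicking steps 3-4.
workdir=$(mktemp -d)

dd if=/dev/urandom of="$workdir/site.lar" bs=1M count=25 2>/dev/null
orig_sum=$(cksum "$workdir/site.lar" | awk '{print $1}')

split -b 10M "$workdir/site.lar" "$workdir/piece_"   # "transfer" in 10MB pieces
cat "$workdir"/piece_* > "$workdir/reassembled.lar"  # remote side merges the pieces

new_sum=$(cksum "$workdir/reassembled.lar" | awk '{print $1}')
if [ "$orig_sum" = "$new_sum" ]; then result="OK"; else result="MISMATCH"; fi
echo "checksum $result"

rm -rf "$workdir"
```

If a piece goes missing or gets corrupted (for instance, because the pieces landed on different cluster nodes), the reassembled file's checksum no longer matches and the publication fails at step 4.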

Error during 5. Validation

An error during validation will provide more information in the "remote publishing" interface (history):

  • Try to identify the missing reference and understand why Liferay is complaining about it.
  • If your site is using global references (structures, templates, categories, etc.), make sure to publish the /global site first. The Global site can be published from the Control Panel (Sites).
  • If your site doesn't have any external reference, try publishing the entire content. In the "remote publishing" options, choose "All Content".
If none of the above solutions works, you can export the site from the Control Panel and inspect the content of the archive (zip file). This might help you understand what is missing.
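Since a .lar file is a plain zip archive, you can list its entries and peek at the manifest, which enumerates the exported entities. The path below is a placeholder, and scanning the manifest is only a rough heuristic:

```shell
#!/bin/sh
# Inspect an exported archive (hypothetical path; a .lar is a zip file).
LAR="${LAR:-/tmp/site.lar}"

# List the entries (groups, portlets, structures, templates, ...):
unzip -l "$LAR" 2>/dev/null || echo "archive not found: $LAR"

# The manifest describes the exported entities and their references;
# searching it for a suspicious UUID or article id can reveal what's missing:
unzip -p "$LAR" manifest.xml 2>/dev/null | head -n 50 || true
```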

Error during 6. Import

It's very difficult to cover all possible situations during the import step because it strongly depends on the content in your site. Common issues are:

  • If you get errors about duplicated content, try to publish the entire site.
  • Disable "Version history" to reduce the size of your publication and see whether that makes any difference.
  • Check logs on the remote server for more detailed information (stacktraces).
    • If you see OutOfMemoryError, increase the memory available to your application server (-Xmx)
    • If you see GenericJDBCException, check your JDBC connection pool. It might happen that all connections have been used and none released; you might need to restart your application server to free them. Adjust your JDBC settings according to your needs.
  • If you have the feeling that the import is really slow, monitor your infrastructure (CPU, I/O) and give more resources to your virtual machine or server. If the import process takes too long (typically 30 minutes or more), your staging server will get a timeout and will not clean up properly, even though the import still continues and completes (see the next section).
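For the OutOfMemory case, the heap is typically raised in Tomcat's setenv.sh on the remote server. The sizes below are illustrative only; tune the heap for your own data volume:

```shell
# tomcat/bin/setenv.sh (remote server) -- illustrative sizes, not recommendations
CATALINA_OPTS="$CATALINA_OPTS -Xms2g -Xmx4g"
export CATALINA_OPTS
```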

Error during 7. Cleanup

The cleanup step never fails, but remote publishing doesn't end properly if the process takes too long. The staging server keeps waiting for the remote server to finish the import. After some time, the staging server gets a timeout and its connection is reset. When this happens, the staging server cleans up on its side and sets the status of the process to "failed". However, the import continues on the other end (remote server) and might succeed. You'll end up with a wrong status and an incorrectly set "last publication date". To avoid this situation:

  • Try to optimize your infrastructure (CPU, memory, I/O, ...)
  • Configure timeouts (proxy server, application server, switches, etc.) to suit your needs. But don't set the timeout values too high!
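If Apache HTTPD sits between the two servers, the relevant directives look like this (600 seconds is an illustrative value, not a recommendation):

```apache
# httpd.conf / virtual host fronting the remote server -- illustrative values
ProxyPreserveHost On
# Give a long-running import enough time, but do not make it unlimited:
ProxyTimeout 600
```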

If you get into a situation where the publication technically succeeded but was reported as a failure because of a timeout, start another publication with the date range set to "Last 12 hours". This smaller publication should execute faster, succeed and update the "last publication date" correctly. Your environment will then be ready for further publications.

Remote publishing, best practices

  • Disable version history by default (journal.publish.version.history.by.default=false). On production, you won't need all the versions, only the latest approved ones.
  • When using the Asset Publisher with staging, don't choose a scope other than "Current site" or "Global". Your Asset Publisher won't work anymore after publication because the group IDs are different in the two environments. There are strategies to make it work but they are too complicated to explain here :-)
  • Enable and experiment with remote publishing as soon as possible. Make sure to test the process with archives bigger than 10MB before assuming that everything works perfectly.
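The first of these best practices is a one-line change in portal-ext.properties on the staging server:

```properties
# portal-ext.properties (staging server)
# Publish only the latest approved version of each web content article:
journal.publish.version.history.by.default=false
```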

Still doesn't work?

If remote publishing won't work in your environment, consider trying one of these:

  • If you're using a cluster, try to disable all but one node.
  • If a proxy server is installed in front of your application server, try publishing directly to the application server (opening firewalls if necessary).
  • If you're using an SSL connection for the remote publishing process, try without it (plain HTTP).
  • Try moving your remote server to the same network.
  • Try moving your remote server to the same machine.
The experiments above are not permanent solutions, but they will help you identify the issue.

--

Share your experience with me!

Comments
Hi Sven,
Kudos to you, this is a damn useful blog :-)

And this is so nicely covered and explained.

Looking forward to more details (maybe some guidance steps) on the case you mentioned :-)
--------
When using asset publisher, don't choose a scope which is different than "Current site" or "Global" with staging. Your asset publisher won't work after publication any more because the group ids are different on the two environments. There are strategies to make it work but they are too complicated to be explained here :-)
--------
Thanks Gaurav!

Regarding the asset publisher, I didn't include the instructions there because I know that Liferay will probably fix this "issue" in the next version. If you still want to do it in 6.2, here is the trick:
1) Create as many sites as you think you'll need for the next X years.
2) Duplicate the environment (copy the DB + /data) in order to create your staging environment.
3) Now you have Staging and Production sites with the same group IDs. Your Asset Publisher will still work on production after publication because the group IDs will be the same.
4) When you run out of sites, you'll have to apply the same technique again: duplicate production and re-create your staging environment. This is annoying, so better to create enough sites in advance. Hopefully Liferay will change the Asset Publisher exporter so that IDs are properly adapted during export/import.