Troubleshooting Liferay from Source

Many months later, it's time to continue the series on Liferay from Source.

For this segment, we'll start with the following question: why would you have wanted to build from source in the first place?

This particular series entry assumes that maybe you want to fix a bug that affects you. Or maybe you feel like there is a feature that is missing from an existing module, but you want to contribute that change so that it gets incorporated into the Liferay upstream repository as something that Liferay maintains instead of being a custom module that you maintain. In both cases, your ultimate goal is to have your contribution benefit the larger community instead of being isolated to your own Liferay installation.

In this example, I'm going to walk you through the process of troubleshooting a bug that exists in the 7.0.3 GA4 code base but seemingly not in the current master code base. It's an example of how Liferay internally handles the two diverging branches. Along the way, you'll be introduced to Liferay's internal processes for handling bugs, and ultimately you'll find different techniques that our internal developers use for troubleshooting issues. Finally, you'll see how to deploy your changes on top of what you've already built from source.

Step 1: Encounter the Problem

First, let's talk about the problem I want to solve (which in this case happens to be a bug), which is logged in LPS-74173. Essentially, the fix for LPS-64906 was incomplete. If you disable scheduler via portal-ext.properties by setting scheduler.enabled=false, if you navigate to the forms portlet by selecting Liferay > Content > Forms, you are unable to view the Forms portlet due to an UnsupportedOperationException.

java.lang.UnsupportedOperationException
    at com.liferay.portal.workflow.WorkflowDefinitionManagerProxyBean.getActiveWorkflowDefinitions(WorkflowDefinitionManagerProxyBean.java:60)

If you'd like to try this yourself, you will probably use a specific tag where the bug definitely exists, which is the tag corresponding to 7.0.3 GA4, aptly named 7.0.3-ga4. In case you are new to Git and do not yet know how to pull down a specific tag without pulling down every tag (and Liferay can accumulate lots of them), you can do it with these commands from the folder where you cloned the repository:

git fetch --no-tags git@github.com:liferay/liferay-portal.git tags/7.0.3-ga4:refs/tags/7.0.3-ga4
git checkout 7.0.3-ga4

Note that you can either build this tag from source, or you can simply use the existing release bundle. Using existing bundles is generally preferred if you simply want to use Liferay, but Liferay's decision to place everything inside of .lpkg files (an additional layer of zip files) and requiring marketplace overrides makes debugging issues and fixing them much more difficult, as you may be required to restart the server after every change for the overrides to be recognized.

Therefore, it's recommended to build from source if your end goal is to fix an issue.

Step 2: Reproduce in Master, Part 1

As part of Liferay's validation process, we try to determine if the issue is still present in the master branch. This is because in most (but not all) cases, Liferay's red tape dictates that you should try to make sure that the fix is applied to the unstable branch (master) before making it available in the stable version (in this case, 7.0.x).

Master itself doesn't have any useful tags, so if you'd like to try this yourself, you will probably use a specific hash where the bug definitely exists, because it might be fixed in the actual master by the time you view this blog post. In this case, it's known to exist at 325c62f9b7709c5c96ef484c9b050199500a41eb (which is where Liferay was when I started this blog entry), which you can checkout with the following command:

git checkout 325c62f9b7709c5c96ef484c9b050199500a41eb

After you've built Liferay from source, start up Tomcat in debug mode. If you navigate to the Forms portlet with scheduler disabled, you'll find that the issue doesn't appear.

Digging into the code, an easy conclusion to draw is that the issue does not occur in master.

You might draw this conclusion because as a side-effect of some refactoring in LPS-66761, the code throwing the exception is no longer directly included in the portlet class of the Forms portlet. However, the code still exists in a different class: WorkflowDefinitionDataProvider.

Because the code still exists, you now deal with another layer of Liferay red tape. Before you can bring the supposed fix back to a release branch, you have to prove to yourself (and to the reviewer) that this code change actually fixed the issue instead of masking the issue.

One approach is to perform a git bisect, and this is if you're fairly sure that the fix works. In my case, I'm pretty skeptical (the code throwing the stack trace hasn't been touched), so I will want to dig into where the code moved to see if a similar bug is still present.

The trick of course is figuring how to get to the new code.

Step 3: Reproduce in Master, Part 2

Because it's an OSGi component, it's possible the error goes away due to the component no longer meeting its requirements when scheduler is disabled (which would make our job really easy). So, let's ask ourselves this question: is an instance of this class still registered as an OSGi component?

Let's find out by asking Gogo shell. The following code snippet asks OSGi about all the components that claimed that they provided the DDMDataProvider service (the second argument that is currently null is a filter), and then prints out the class names. We're checking to see if our WorkflowDefinitionsDataProvider appears in the list.

each [($.context getServiceReferences "com.liferay.dynamic.data.mapping.data.provider.DDMDataProvider" null)] { ($.context service $it) class }

Sure enough, it shows up. This means that in theory, this code can be called, because our component is available.

Step 4: Reproduce in Master, Part 3

The next part is to figure out how to actually get the code to be called. There are a lot of different ways to do this (dig up the call stack), but since in this case the code simply moved, an easier test is to see if simply visiting the original page hits a breakpoint we set in our code. We achieve this by attaching a remote debugger, adding a breakpoint to the getData method in WorkflowDefinitionDataProvider, and visiting the Forms portlet.

image

Looks like our breakpoint was hit! Let's step over it and see what happens.

image

Liferay is swallowing the exception because WARN level logging is not enabled on the class DDMDataProviderInvokerImpl. Let's go ahead and enable it and see if we reproduce the issue. You reach the logging interface by choosing Control Panel > Configuration > Server Administration, choosing Log Levels, choosing Add Category, and adding com.liferay.dynamic.data.mapping.data.provider.internal at WARN level.

After visiting the Forms portlet again, we see the same error as we saw in 7.0.3-ga4, though the portlet still renders because the exception is swallowed.

java.lang.UnsupportedOperationException
    at com.liferay.portal.workflow.WorkflowDefinitionManagerProxyBean.getActiveWorkflowDefinitions(WorkflowDefinitionManagerProxyBean.java:60)

Detour: Workflow Implementation Details

Now that we've encountered the bug, let's take a step back and understand why this exception was raised.

Workflow is a weird beast where a dummy implementation is always provided by default, which is WorkflowDefinitionManagerProxyBean. Therefore, if the code ever calls any methods on this default implementation, it will throw an exception. The only way for these exceptions to go away is for a different, non-default implementation to be provided. In the case of Liferay, this non-default implementation is known as Kaleo.

Looking at the code block that is throwing the exception, the call to _workflowEngineManager succeeds, but the call to _workflowDefinitionManager fails. This suggests that the non-dummy implementation is available for WorkflowEngineManager, but it is not available for WorkflowDefinitionManager. Let's use Gogo shell to confirm.

each [($.context getServiceReferences "com.liferay.portal.kernel.workflow.WorkflowDefinitionManager" null)] { ($.context service $it) class }
each [($.context getServiceReferences "com.liferay.portal.kernel.workflow.WorkflowEngineManager" null)] { ($.context service $it) class }

This code confirms that there is only a proxy implementation for WorkflowDefinitionManager, and both a proxy and a non-proxy implementation for WorkflowEngineManager. So the question is, why did the non-proxy implementation for WorkflowDefinitionManager (WorkflowDefinitionManagerImpl) not get registered to OSGi? The most likely answer is that its dependencies were not satisfied.

We can take a look at each of its non-optional @Reference annotations to see if any exist.

$.context getServiceReferences "com.liferay.portal.workflow.kaleo.service.KaleoDefinitionLocalService" null
$.context getServiceReferences "com.liferay.portal.workflow.kaleo.KaleoWorkflowModelConverter" null
$.context getServiceReferences "com.liferay.portal.kernel.workflow.comparator.WorkflowComparatorFactory" null

Running these commands, we find that KaleoDefinitionLocalService and KaleoWorkflowModelConverter have not been satisfied. Interesting, KaleoDefinitionLocalService has the naming format for a Liferay service builder service (it ends with the words LocalService), and a quick investigation inside of the source code in an IDE reveals that it is a service builder service provided by the portal-workflow-kaleo-service, made available in a bundle with the symbolic name com.liferay.portal.workflow.kaleo.service.

Detour: Wiring Spring into OSGi

If you've been Liferay for awhile, you know that service builder services are not created and managed by OSGi, but rather they are created and managed by Spring. However, a lot of Liferay stuff uses it as though OSGi manages it. You might wonder, how does that actually work?

Liferay is coded so that a bundle waits for a variety of things to be registered to OSGi, and then it has code that will register its Spring managed dependencies as OSGi components.

Of course, the whole registration process has dependencies, which are managed in a build time generated file OSGI-INF/context/context.dependencies for each bundle. These are then assumbled into a placeholder component and then registered with the Apache Felix dependency manager.

If for any reason any of these components have unsatisfied dependencies (one common example being the Release component for the module, like in LPS-66607), the bundle will be active, but none of the Spring-managed components (such as message listener configurations) will be made available as OSGi components. As you can imagine, this means that all the OSGi components that depend on those Spring-managed components being registered will also fail to resolve.

Naive Debugging

One way to troubleshoot this issue is to construct a Gosh script that basically does what we've been doing in all the examples above. It will read in OSGI-INF/context/context.dependencies from the bundle and check each entry to see if there are any service references that match it, similar to what we have been manually doing with the previous commands. If something exists which satisfies the reference, you get a check mark. If not, you get a warning exclamation point and the text will be in red.

In other words, if there is at least one warning exclamation point (or one line in red) in the output of this script, none of the Spring managed dependencies for this bundle will be registered as OSGi components!

Note that it's not valid to paste the following directly into a shell (every line break is treated as a command, which will not work given that some closures span multiple lines).

math = ($.context bundle 0) loadclass "java.lang.Math"
stringutil = ($.context bundle 0) loadclass "com.liferay.portal.kernel.util.StringUtil"

id_map = [""=0]
each [($.context bundles)] { $id_map put ($it symbolicname) ($it bundleid) }

checkref = {
    pos = $args indexOf " "
    has_filter = $pos equals ($math abs $pos)
    service = if { $has_filter } { $args substring 0 $pos } { $it }
    filter = if { $has_filter } { ($args substring ($service length)) trim }

    refs = $.context servicereferences $service $filter
    color = if { $refs } { "\u001B[0;0m" } { "\u001B[1;31m" }
    satisfied = if { $refs } { "[✔]" } { "[!]" }
    echo "$color" "$satisfied" "$args" "\u001B[0;0m"
}

check = {
    bundle_id = if { (($args class) name) equals "java.lang.String" } { $id_map get $args } { $args }

    bundle = $.context bundle $bundle_id
    url = $bundle entry "OSGI-INF/context/context.dependencies"
    dependencies = []
    if { $url } { $stringutil readlines ($url openstream) $dependencies }

    echo ''
    echo "$bundle"
    echo ''
    each $dependencies $checkref
}

# List the bundle ids or symbolic names that you wish to check here

check "com.liferay.portal.workflow.kaleo.service"

# Blank line echo to suppress output

echo ''

Copy it into a text file, which I'll assume will be named check_context_dependencies.gosh and stored in /tmp. If you're diagnosing something other than com.liferay.portal.workflow.kaleo.service (or you're checking multiple services), update the check lines accordingly.

You can run this script by using the gosh command provided by Gogo shell. You will need to specify the file path using a file URI. In this example, the local file is located in /tmp/check_context_dependencies.gosh, which means we'd use the following command to execute it:

gosh file:///tmp/check_context_dependencies.gosh

We'll get the following as script output.

[✔] com.liferay.counter.kernel.service.CounterLocalService
[✔] com.liferay.portal.kernel.dao.orm.EntityCache
[✔] com.liferay.portal.kernel.dao.orm.FinderCache
[✔] com.liferay.portal.kernel.model.Release (&(release.bundle.symbolic.name=com.liferay.portal.workflow.kaleo.service)(release.schema.version=1.3.4))
[✔] com.liferay.portal.kernel.scheduler.SchedulerEngineHelper
[!] com.liferay.portal.kernel.scheduler.TriggerFactory
[✔] com.liferay.portal.kernel.service.ClassNameLocalService
[✔] com.liferay.portal.kernel.service.ClassNameService
[✔] com.liferay.portal.kernel.service.PersistedModelLocalServiceRegistry
[✔] com.liferay.portal.kernel.service.ResourceLocalService
[✔] com.liferay.portal.kernel.service.RoleLocalService
[✔] com.liferay.portal.kernel.service.UserLocalService
[✔] com.liferay.portal.kernel.service.UserService
[✔] com.liferay.portal.kernel.service.persistence.ClassNamePersistence
[✔] com.liferay.portal.kernel.service.persistence.CompanyProvider
[✔] com.liferay.portal.kernel.service.persistence.RolePersistence
[✔] com.liferay.portal.kernel.service.persistence.UserPersistence
[✔] com.liferay.portal.workflow.kaleo.runtime.calendar.DueDateCalculator

Interesting, so it looks like in order for all of our Spring managed dependencies to be added as OSGi components, we must first make sure that com.liferay.portal.kernel.scheduler.TriggerFactory is also available. If we look at the package name which includes the word "scheduler", we can draw the conclusion that TriggerFactory is not available because we disabled scheduler, and so nothing exists in our environment to satisfy this dependency.

Smarter Debugging

While the approach above is what I did when I was troubleshooting the issue, I was later told that there is a built-in way to do this, because a closer look at the source code for ModuleApplicationContextExtender that I linked above shows that we're using Apache Felix's dependency manager to manage Liferay's Spring dependencies.

The documentation provides shortened variants on the command options (such as b) that are not available in the version of Apache Felix shipped with Liferay, but the longer command variant works, giving us this for troubleshooting our issue:

dm bundleIds "com.liferay.portal.workflow.kaleo.service"

If you aren't sure which bundle it is that is failing to work, then you can express your frustration with the acronym for a very popular expletive, and the command will report all services that are currently failing to resolve, even though some OSGi component is asking for them to be resolved.

dm wtf

The former command asking for bundleIds as a parameter has to be parsed visually in order to understand which component is failing, but the latter command that only provides wtf only lists the ones that are failing. In both cases, we see that com.liferay.portal.kernel.scheduler.TriggerFactory and many service builder services are missing, just as was discovered in our naive analysis.

Step 5: Choose Your Battle

If we track down why TriggerFactory is needed by com.liferay.portal.workflow.kaleo.service, we find that KaleoTimerInstanceTokenLocalServiceImpl wants to use it in order to add scheduled jobs related to workflow timers.

Knowing this, let's look back at the facts so far. As we were troubleshooting the issue, we made the following observation:

Looking at the code block that is throwing the exception, the call to _workflowEngineManager succeeds, but the call to _workflowDefinitionManager fails.

With this in mind, and with our discovery for TriggerFactory in mind, you might consider the following question: given that scheduler is disabled, is the bug that the call to _workflowEngineManager succeeds, or is the bug that the call to _workflowDefinitionManager fails?

You could argue it both ways, depending on whether you agree with the following statement: workflow timers should be treated a required part of the Kaleo workflow implementation, and that discussion would probably happen with the component team and the project owner team for the Kaleo workflow component. Whether you're an internal user or an external user, you could start that conversation by assuming one stance, submitting your changes via a pull request to the component team, and then using the pull request for the conversation.

However, as a contributor, you might not be very deeply invested in the correct answer to this question, and so the thought of starting this conversation doesn't appeal to you. After all, you simply wish to fix a bug that affects other people. Understanding the nuances of how workflow should work sounds like an unnecessary use of your time.

What can you do instead? Well, you'd re-frame the question.

Instead of asking whether timers should work, you could answer the more specific question that applies to our current bug: if the only thing we're using _workflowEngineManager for is to determine whether or not workflow is available, and we know that it's giving us a misleading answer, and we can achieve what we want by changing the way we're pulling in the _workflowDefinitionManager reference, should we keep the logic that uses the misleading _workflowEngineManager manager reference?

This one has a more obvious answer: no.

Step 6: Test Your Changes, Part 1

In this case, we can update WorkflowDefinitionsDataProvider to simply wait for non-proxy implementations of WorkflowDefinitionManager, which we will assume simply provides the property proxy.bean=false.

@Reference(
    cardinality = ReferenceCardinality.OPTIONAL,
    policyOption = ReferencePolicyOption.GREEDY,
    target = "(proxy.bean=false)")
private WorkflowDefinitionManager _workflowDefinitionManager;

For ReferenceCardinality, because we don't need one to function (we're using it for a null check), and we only expect one to be provided, "The reference is optional and unary" is what best describes the cardinality for what we want, so we choose OPTIONAL.

For ReferencePolicyOption, because we don't expect it to be provided immediately, and we want to function properly should one be deployed, "When a new target service for a reference becomes available, ... bind the new target service" is what best describes the policy option for what we want, and so we choose GREEDY.

Now, how to deploy the change? Navigate to the folder containing our change, and call the gradlew command located at the root of the portal source.

cd modules/apps/forms-and-workflow/dynamic-data-mapping/dynamic-data-mapping-data-provider-instance
../../../../../gradlew deploy

After deploying the change, visit the Forms page. Looks like our issue has been resolved. Let's make sure that our component is still available.

each [($.context getServiceReferences "com.liferay.dynamic.data.mapping.data.provider.DDMDataProvider" null)] { ($.context service $it) class }

We still see our WorkflowDefinitionsDataProvider available, so everything checks out.

Step 7: Test Your Changes, Part 2

Up until this point you have everything you need to run the change locally. The last step is to make your change available to others, possibly submitting it as a bugfix. To do that, the next thing to confirm is whether you made the change in the correct repository.

In this example, we made the change to the liferay-portal repository. However, this isn't always the correct place to create a fix.

So, how do you know if something has moved to a subrepository? This was briefly touched on in the previous blog entry in this series.

What's in the master branch of liferay-portal is actually not the latest code. Liferay has started moving things into subrepositories, which you can see from the hundreds of strangely named repositories that have popped up under the Liferay GitHub account.

However, a lot of these repositories are just placeholders. These placeholders are in what's called "push" mode, where code from the liferay-portal repository is pushed to the subrepository. However, a handful of them (five at the time of this writing) are actually active where they're in what's called "pull" mode, where code is pulled from the subrepository into the liferay-portal repository on-demand. You know the difference by looking at the .gitrepo file in each subrepository and checking the line describing the mode.

So in practice, find the file you're modifying, and then see if any parent folder has a .gitrepo file. If it does, check inside of that file for a line that says mode = pull. In this case, the .gitrepo located in dynamic-data-mapping, which is the parent folder for the dynamic-data-mapping-data-provider-instance module that we deployed, does say that our mode is pull.

Therefore, we'll need to clone the com-liferay-dynamic-data-mapping repository, we would update to the latest master branch of this repository, we would deploy the module to confirm that the issue still exists, and we would then apply our change to make sure that it fixes the issue.

Step 8: Submit Your Changes

After finding right repository, the next step is to make sure you've followed the Contributors Guide located on the GitHub repository.

There is a vague step in the guide, which is finding the core maintainer to review your changes so that it can actually be included in Liferay itself.

While Liferay has started a CODEOWNERS file, it's still only in the evaluation phase and so the file only has a single name in it. Given that, the easiest way is to leverage the pre-existing people who are dedicated to creating fixes and routing them to the correct people: Liferay's support department. A few people who work in Liferay's support department hang out in the Liferay Community Slack, so it should be easy to convince someone to take a look at your proposed changes and at least begin the conversation.

Once that conversation starts, the next step after that is a lot of patience while you wait for your changes to be included in the product. As the saying goes, you can lead a Liferay core maintainer to a pull request, but you can't make them merge.

Blogs
Great post! Very valuable to see all the details involved with diagnosing and fixing an issue spelled out.
Hello Mr Chau,

Thank you very much for this post, I really appreciate it.

P/s: Em la nguoi Viet, moi bat dau lam liferay nen nho a chi giao giup e voi a. Sorry anh em dang o cong ty khach hang nen may ko co unikey.

Thanks and best regards,
Hung Tran
Thanks for the article.

Regarding the liferay dxp sources, do you know where they are available?

They don't seem there in the customer portal..
For each Service Pack release, there should be a "Source" option under the "For Developers" section of the dropdown.

https://web.liferay.com/en/group/customer/dxp/downloads/digital-enterprise

For each DE release (better known as fix packs), there will be a version of the fix pack that includes "Source Code", with a folder structure that should help you understand what the changed files are.

https://web.liferay.com/en/group/customer/dxp/downloads/digital-enterprise/fix-packs

When you download the fix pack source, you use the Liferay patching tool to apply it to the source code for any service pack that preceded it.

https://customer.liferay.com/documentation/7.0/deploy/-/official_documentation/deployment/patching-tool-advanced-configuration#configuration-properties