Introducing Portal Sandbox: Improve Portal Resiliency

General Blogs May 8, 2014 By Shuyang Zhou Staff

We're excited to introduce the Portlet Sandbox plugin, which can greatly improve portal resiliency and stability. It does so by isolating high-traffic or resource-hungry portlets, preventing unstable portlets from crashing the portal JVM (Java Virtual Machine). Memory leaks or other stability issues in a single portlet or a group of portlets no longer take down the entire portal; instead, those services go offline for a short period of time and the system automatically recovers by restarting or disabling the offending components. With the Portlet Sandbox you improve your portal's resiliency against things like misbehaving custom portlets. Enterprises with multiple development teams deploying to a single instance can reduce their impact on one another by isolating their portlets in their own sandboxes.


The Portlet Sandbox is conceptually similar to WSRP. WSRP allows us to treat portlets as enterprise services and access them via web services. However, WSRP has excessive overhead due to request marshalling, single sign-on, and a variety of other factors. The Portlet Sandbox instead runs isolated portlets in separate JVMs on the local machine, keeping communication at the lowest level possible to reduce overhead. The communication between the MPP (Master Portal Process) and PSC (Portlet Sandbox Container) processes is implemented by a private RPC (Remote Procedure Call) framework over pipes (POSIX mkfifo) or sockets (Windows). It uses a private binary protocol, so there is no SOAP overhead as with WSRP. Since all JVM processes run on the same machine, static resources (JavaScript, CSS, etc.) are served directly by the MPP; only portlet access is isolated by the Portlet Sandbox. By breaking the portal into multiple JVM processes, each JVM can have a smaller heap, which eases GC (Garbage Collector) overhead and improves system memory utilization.


In a cluster setup, each cluster node can be independently configured for the Portlet Sandbox. The MPP and PSCs on the same machine are logically considered a single traditional cluster node that interacts with the other cluster nodes. When a request is dispatched to a cluster node, the MPP takes over first and then delegates to the proper PSCs based on the configuration. The combination of clustering and the Portlet Sandbox helps the system scale out (across multiple machines) and scale up (on high-end machines).


Would you like to try it out? The Portlet Sandbox plugin is available today in the Liferay Marketplace. It is free for Liferay Portal Enterprise Subscribers. Documentation for the plugin is also available now.


If you’re not yet using Liferay, check out our free 30-day trial or request a quote and see if the Liferay Portal Enterprise Subscription is right for you.


Will my hashing cache keys get a conflict?

Company Blogs April 26, 2011 By Shuyang Zhou Staff

The short answer to the title question is YES. The longer version is: after LPS-16789 you don't really need to worry about it, because the probability of seeing an actual conflict is negligible for most people in most cases.

If your system requires 100% safety in every case, stop reading this blog and go to LPS-16744; that is what you need.

However, if you are willing to take a little risk (it is very, very low; keep reading and you will see the actual numbers) by hashing your cache keys before storing them, the reward is a performance boost from reduced old generation GC.

Let's clarify some concepts first:

  1. No hashing algorithm is strong enough to prevent every conflict from an unlimited input source. The reason is that there is no one-to-one mapping from an unlimited space to a fixed-size space.
  2. No system is 100% safe. Are you sure your colo will never have a power failure? Or that your hardware will never have a "heart attack"? (Even the Earth cannot live forever.)

Your system will eventually halt, either for an exceptional reason or for scheduled maintenance. So if the hashing conflict probability is lower than those unavoidable situations, I suggest you just let it go; the performance payback will be well worth the sacrifice.

Before I start the math for calculating the hashing conflict probability, I have to make an assumption.

Just as no hashing algorithm is strong enough to prevent every conflict, no hashing algorithm can map input in an absolutely even manner. Liferay's HashCodeCacheKeyGenerator uses the following formula (the exact same algorithm that java.lang.String.hashCode() uses):

Hash(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Note: s is a String of length n.
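For reference, here is a minimal sketch of that polynomial hash accumulated in a long, which is how a 64-bit key hash (discussed below) can be produced. This is just an illustration, not the actual HashCodeCacheKeyGenerator source:

// Horner's rule evaluation of s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
// accumulated in a long so the result space is 2^64
public static long hash(String s) {
    long h = 0;

    for (int i = 0; i < s.length(); i++) {
        h = 31 * h + s.charAt(i);
    }

    return h;
}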

It is certainly not absolutely even; analyzing its flatness is beyond my limited mathematics knowledge, so please pardon me for treating it as an even algorithm from now on, for the sake of simplifying the analysis.

With the above assumption, the problem is now much clearer. Let me describe the same problem in a different way:

Throwing n balls into b buckets (n <= b), what is the probability that we end up with no bucket holding more than 1 ball?

While I was trying to remind myself how to do permutations and combinations, my wife surprisingly gave me the following answer within 1 minute. (To be honest, I was a little annoyed by this. I hate to admit she can do math better than me, but I am glad I married a smart woman.)

P = (b/b) * ((b-1)/b) * ((b-2)/b) * ... * ((b-n+1)/b)
    = (b * (b-1) * (b-2) * ... * (b-n+1))/b^n
    = b!/(b^n*(b-n)!)

Then the conflict probability is:

Pc = 1 - P = 1 - b!/(b^n*(b-n)!)

Now let's put numbers into this formula. We use a long (64-bit) to hold the hash result, so we have 2^64 usable hash codes, which means b = 2^64.
The n is your cache size. Although you could configure your cache to support a super large size, unless you are using an off-heap cache provider, I can promise you will most likely see an OOM before a hashing conflict appears.

A reasonable in-heap cache size in the real world is about 10,000 ~ 100,000 entries.
If you are using a cache larger than this, maybe you should consider breaking it into smaller regions and perhaps even moving them out of the JVM heap; regardless of hash conflicts, a huge cache region is slow to look up and insert into.
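As a sanity check, the standard birthday-problem approximation Pc ≈ n(n-1)/(2b) reproduces the numbers in the table below. This is a small illustrative snippet, not Liferay code:

public class WarmupConflictEstimate {

    // b = 2^64 buckets, since the hash is stored in a 64-bit long
    private static final double BUCKETS = Math.pow(2, 64);

    // Birthday-problem approximation of Pc = 1 - b!/(b^n*(b-n)!) for n << b
    public static double warmupConflictProbability(long cacheSize) {
        return cacheSize * (cacheSize - 1) / (2 * BUCKETS);
    }

    public static void main(String[] args) {
        System.out.printf("10,000 entries : %.3e%n",
            warmupConflictProbability(10000));  // ~2.710e-12
        System.out.printf("100,000 entries: %.3e%n",
            warmupConflictProbability(100000)); // ~2.710e-10
    }
}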

Cache warmup hashing conflict probability:

  Cache size         Conflict probability
  10,000 entries     2.710e-12
  100,000 entries    2.710e-10

Please be aware that the numbers above are the conflict probabilities when you insert 10,000 or 100,000 entries into an empty cache. I will call this the cache warmup hashing conflict rate.

Once the cache is properly populated, its size stays roughly constant; new entries come in and old entries are evicted. The cache has entered its swapping phase.
Now let's analyze the hashing conflict probability for a swapping cache.

This is actually quite easy: you have n balls inside b buckets (n <= b) and no bucket has more than 1 ball. Now you throw a new ball toward those buckets. What is the probability that the new ball ends up in a bucket that already contains a ball? Obviously it is n/b.

The difficult question is how often you throw a new ball (insert a new entry into the cache). For any individual cache the actual insert rate is normally very low, because caches are designed for reading, not writing. If you have a cache that is mainly written to rather than read from, you'd better reconsider your design. In my experience this number should be under 50 ips (inserts per second). This is an empirical number; I cannot prove it. If you have a way to measure the actual insert rate of your cache, please replace it with your own number.

Note: you may see a high cache miss rate in the Ehcache JMX report for certain caches (sometimes 100+ per second). In Ehcache, a miss means fetching from the DB and then caching the result, which equals an insert. But this is not quite the insert rate I am talking about here. In my assumption, the input source is unlimited, meaning it never generates the same result twice; every new entry is different from all previous ones. This kind of thing does not exist in the real world: every input source repeats itself to some degree. For example, you may find that the user cache has a very high swapping rate at peak load, maybe 300+ ips (meaning that within one second, 300+ users log in and 300+ users log out of your system). But that is fine: no matter how many times the same user logs in, it only counts once, because the cache key for that user is the same, and inserting the same key multiple times does not increase the conflict rate at all. Only different keys can cause conflicts.

Now let's see the probability of keeping your system running for a whole year without maintenance (not even clearing the cache from Liferay's control panel).

There are 365 * 24 * 60 * 60 = 31,536,000 seconds in a year. Let's say your cache insert rate is 50 ips; then in one year you will insert 31,536,000 * 50 = 1,576,800,000 times.

For each insert you are taking a risk of n/2^64 of seeing a conflict, where n is your cache size, so we have:
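Again as an illustrative check (not Liferay code), multiplying the per-insert risk n/2^64 by a year's worth of inserts reproduces the table below:

public class SwappingConflictEstimate {

    private static final double BUCKETS = Math.pow(2, 64);

    // Per-insert risk is n/2^64; over a year of inserts the total risk is
    // approximately the sum of the per-insert risks
    public static double yearlyConflictProbability(
        long cacheSize, long insertsPerSecond) {

        long insertsPerYear = 365L * 24 * 60 * 60 * insertsPerSecond;

        return insertsPerYear * (cacheSize / BUCKETS);
    }

    public static void main(String[] args) {
        System.out.printf("10,000 entries : %.3e%n",
            yearlyConflictProbability(10000, 50));  // ~8.548e-7
        System.out.printf("100,000 entries: %.3e%n",
            yearlyConflictProbability(100000, 50)); // ~8.548e-6
    }
}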

Cache swapping hashing conflict probability:

  Cache size         Conflict probability
  10,000 entries     8.548e-7
  100,000 entries    8.548e-6

So in summary: if, within one year, the probability of your server being out of service (whether from a system failure or scheduled maintenance) is higher than 2.710e-10 + 8.548e-6 ≈ 8.548e-6 (this is almost guaranteed to be true), I suggest you hash your cache keys before storing them. The probability of seeing a conflict is very, very low, and even if it actually happens you can recover by clearing the cache in the control panel.

Let me emphasize this again: if you do need a 100% safe cache key, see LPS-16744; that is the solution.

Direct JSP Servlet

Company Blogs December 21, 2010 By Shuyang Zhou Staff

A lot of Liferay TagLibs include JSP files to render their content.

Is there any difference between the JSPs used by TagLibs and normal ones?
Functionally, they are the same thing.
But from a performance point of view, there is a huge difference.

Whenever we do a JSP include, there are two major performance-related steps:

  • 1) JSP servlet lookup.

Code example:
RequestDispatcher requestDispatcher = servletContext.getRequestDispatcher(path);

ServletContext.getRequestDispatcher(String path) is a heavy app-server API call, which needs to look up the app server's JSP servlet resource.

  • 2) Filter stack setup and traversal

Code example:
requestDispatcher.include(request, response);

Whether you want it or not, the invocation will set up the filter stack (by matching the request URL against regex patterns) and traverse all the filters in the stack one by one.

For a normal JSP include, there is nothing wrong with these two steps. But for TagLibs, things are a little different, and this little difference can bring us a huge performance improvement.

For 1), in the general case the path parameter varies a lot, so there is a large candidate pool.
But for TagLibs the candidate pool is relatively small, which means we can cache the mapping from path to the underlying JSP servlet. This saves the lookup time significantly.

For 2), in the general case filters are absolutely needed.
But for all Liferay TagLibs, filters are guaranteed not to be needed. So setting up and traversing the filters is a huge performance waste; we should call the JSP servlet directly without bothering with the filters.

In LPS-13776, I made an improvement based on this analysis.
I added a cache that holds the path-to-servlet mapping, so the lookup only happens the first time a JSP is included; after that, all lookups hit the cache.
By using the DirectRequestDispatcher, all include calls are sent directly to the JSP servlet without any filter processing.
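To make the caching half of this concrete, here is a minimal, hypothetical sketch of the idea (not Liferay's actual DirectRequestDispatcher code). It only shows caching the dispatcher lookup; the DirectRequestDispatcher additionally invokes the JSP servlet without the filter chain:

import java.util.concurrent.ConcurrentHashMap;

import javax.servlet.RequestDispatcher;
import javax.servlet.ServletContext;

// Hypothetical sketch: cache the path -> RequestDispatcher lookup so the heavy
// ServletContext.getRequestDispatcher() call only happens once per JSP path
public class CachingDispatcherLookup {

    private final ServletContext _servletContext;
    private final ConcurrentHashMap<String, RequestDispatcher> _cache =
        new ConcurrentHashMap<String, RequestDispatcher>();

    public CachingDispatcherLookup(ServletContext servletContext) {
        _servletContext = servletContext;
    }

    public RequestDispatcher getRequestDispatcher(String path) {
        RequestDispatcher requestDispatcher = _cache.get(path);

        if (requestDispatcher == null) {
            // Expensive app-server lookup, done only on the first include
            requestDispatcher = _servletContext.getRequestDispatcher(path);

            _cache.putIfAbsent(path, requestDispatcher);
        }

        return requestDispatcher;
    }
}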

You can turn on this feature by setting:
direct.servlet.context.enabled=true
By default this is on.

One drawback of the mapping cache is losing the ability to reload JSPs dynamically at runtime, which is not acceptable for developers. To overcome this, I added a timestamp to the mapped JSP servlet; whenever a request for a JSP servlet comes in, it compares the JSP file's timestamp to the JSP servlet's timestamp. If the external JSP file is newer, it does a reload. This reloading is only needed at development time; on a production server it is a pure waste.
You can turn off this feature by setting:
direct.servlet.context.reload=false
By default this is on.

New AOP Mechanism

Company Blogs July 7, 2010 By Shuyang Zhou Staff

Liferay makes heavy use of Spring AOP. Spring AOP does an amazing job of making your life easier and your code cleaner.

So why am I creating something new if the existing one is good enough? Spring takes care of the general use cases, but Liferay's use case is a little special, which gives me a chance to improve on it.

Let's first see how Spring does general AOP.
Basically, AOP is a wrapper on top of the real bean, which gives you a chance to do extra processing before (or after, or on throwing, or finally) the real logic. To make this more powerful, a wrapper can wrap another wrapper, which means multiple layers of AOP are supported.

Your extra processing goes inside an advice class, and the advice is bound to certain point-cuts at runtime. The following picture demonstrates this.


Besides your own logic inside the advices, which parts contribute the most overhead? I mean the overhead generated purely by the framework itself.
There are two parts:

  1. The AOP wrapper's creation and invocation.
  2. The runtime wiring between advices and point-cuts.


For 1, each AOP wrapper is a JDK dynamic proxy object (or a CGLIB proxy), and as you all know, creating proxy objects is fairly heavy. The generated code also introduces a few more stack calls, so the more AOP wrappers you have, the deeper the stack gets. If you have ever seen a Liferay thread dump, you know what I am talking about.
For 2, the wiring between advices and point-cuts uses regular-expression-like pattern matching, which is a complex calculation. So the more point-cuts you have and the more complex the patterns are, the slower your code is.

Now you can see that to improve performance, we should create as few AOP wrapper objects as possible and use as few point-cuts as possible.
These rules cannot be applied to Spring itself, since it is a general-purpose framework and cannot limit how it is used.

But for Liferay, things are different. The key point is that the AOP advices are mainly for service beans.

First, we can limit the point-cut to just service beans; if an advice only cares about a few service beans, it can do a secondary match by bean name or annotation. Either way is cheaper than a regex.

Second, since each advice takes care of its own secondary matching, all AOP wrappers are simply against service beans. So why bother creating a new wrapper for each advice? All advices can share the same wrapper, created by the first invoked advice, and then all other advices are invoked as a chain. The following picture demonstrates this.


So this reduces Na (the number of advices) AOP wrappers to 1, and Np (the number of point-cuts) point-cut matches to 1 (we still need a few secondary matches, but they are much cheaper).

Since the advices are now chained up as a linked list, it becomes possible to modify the AOP structure easily at runtime: just insert or remove elements from the list. This way you can take out unused advices completely (not just disable them; even disabled, they still cause stack calls) for performance, or apply new advices without restarting the server (of course you have to take care of thread safety yourself). A minimal sketch of this chaining idea follows.
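This is a hypothetical sketch of the "one wrapper, chained advices" idea, not Liferay's actual implementation: one dynamic proxy per service bean, with the advices kept in a mutable list so they can be added or removed at runtime.

import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Method;
import java.lang.reflect.Proxy;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical advice contract: do extra work, then call proceed() to reach
// the next advice in the chain or the real bean
interface ChainableAdvice {
    Object invoke(ChainedInvocation invocation) throws Throwable;
}

class ChainedInvocation {

    private final Object _target;
    private final Method _method;
    private final Object[] _args;
    private final List<ChainableAdvice> _advices;
    private int _index;

    ChainedInvocation(
        Object target, Method method, Object[] args,
        List<ChainableAdvice> advices) {

        _target = target;
        _method = method;
        _args = args;
        _advices = advices;
    }

    // Call the next advice in the chain, or the real bean when the chain ends
    public Object proceed() throws Throwable {
        if (_index < _advices.size()) {
            return _advices.get(_index++).invoke(this);
        }

        try {
            return _method.invoke(_target, _args);
        }
        catch (InvocationTargetException ite) {
            throw ite.getCause();
        }
    }
}

class ChainedAdviceHandler implements InvocationHandler {

    private final Object _target;
    private final List<ChainableAdvice> _advices =
        new CopyOnWriteArrayList<ChainableAdvice>();

    ChainedAdviceHandler(Object target) {
        _target = target;
    }

    // Runtime insertion/removal is just a list operation, no new proxy needed
    public void addAdvice(ChainableAdvice advice) {
        _advices.add(advice);
    }

    public void removeAdvice(ChainableAdvice advice) {
        _advices.remove(advice);
    }

    public Object invoke(Object proxy, Method method, Object[] args)
        throws Throwable {

        return new ChainedInvocation(_target, method, args, _advices).proceed();
    }

    // One proxy (wrapper) per service bean, shared by every advice
    @SuppressWarnings("unchecked")
    public <T> T createProxy(Class<T> serviceInterface) {
        return (T)Proxy.newProxyInstance(
            serviceInterface.getClassLoader(),
            new Class<?>[] {serviceInterface}, this);
    }
}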

So for a complex system with a lot of AOP usage (like Liferay), this can improve performance significantly.

For more detailed information, please see LPS-9793 and LPS-9795.

Embed StringBundler into App-server

Company Blogs July 6, 2010 By Shuyang Zhou Staff

If you have read my previous blog, you must already know StringBundler.

StringBundler can improve your String processing performance a lot, but that is still not enough.

Now answer this question: in a web application, which part creates the most String garbage?

Normally it is not your Java code, but the template engine that generates the page content, most likely JSPs. JSPs are compiled into Servlets before serving requests, so a JSP's performance depends on the JSP compiler's optimization skills. No JSP compiler knows about StringBundler (it is Liferay internal code), so they must use StringBuffer, StringBuilder or some similar home-made mechanism. So can we improve JSPs' String processing performance? It seems almost impossible without modifying the JSP compiler's code. But thanks to the well-designed, extensible JSP framework, it is actually doable!

Let's skip the explanation and implementation details and look directly at the optimization result.

By adding the following line to your code (you can add it anywhere, but for better performance I suggest adding it to the startup process):

JspFactorySwapper.swap();

you can improve String processing performance significantly. How much, exactly? In our MessageBoards benchmark test case, this one line reduced old generation collection by 66%.

Now let me explain what this simple line does to provide that improvement.

If you are familiar with the JSP standard, you know PageContext. It is the center of the bottleneck; it has two performance problems.

  1. It has a JspWriter, and the JSP standard does not provide a standard way to configure that writer's buffer; most JSP compilers make it a fixed size, normally 8K.
  2. It is responsible for creating BodyContent, and most app servers' BodyContentImpl uses a StringBuilder-like approach to collect page data.


For 1, if you have read my IO performance blog, you know this buffer actually hurts performance by introducing unnecessary data copying, so we should disable it.

For 2, if we can use StringBundler to collect the data, we can avoid a lot of intermediate data buffers.

The JSP standard provides a way to hook into its internals without modifying any app-server code: we can install our own JspFactory instance with:

JspFactory.setDefaultFactory(newJspFactory);

And in our JspFactoryImpl, we wrap the actual PageContextImpl to disable the writer buffer, and wrap the BodyContentImpl with a StringBundler-backed wrapper.
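As an illustration of the hook only (Liferay's actual JspFactoryImpl does more, including the StringBundler-backed BodyContent), a delegating JspFactory might look like the sketch below, assuming a JSP 2.1 container:

import javax.servlet.Servlet;
import javax.servlet.ServletContext;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;
import javax.servlet.jsp.JspApplicationContext;
import javax.servlet.jsp.JspEngineInfo;
import javax.servlet.jsp.JspFactory;
import javax.servlet.jsp.JspWriter;
import javax.servlet.jsp.PageContext;

// Hypothetical delegating factory: pass everything through to the container's
// factory, but request an unbuffered JspWriter to skip the extra data copy
public class UnbufferedJspFactory extends JspFactory {

    private final JspFactory _delegate;

    public UnbufferedJspFactory(JspFactory delegate) {
        _delegate = delegate;
    }

    @Override
    public PageContext getPageContext(
        Servlet servlet, ServletRequest request, ServletResponse response,
        String errorPageURL, boolean needsSession, int buffer,
        boolean autoflush) {

        // Override whatever buffer size the compiled JSP asked for
        return _delegate.getPageContext(
            servlet, request, response, errorPageURL, needsSession,
            JspWriter.NO_BUFFER, autoflush);
    }

    @Override
    public void releasePageContext(PageContext pageContext) {
        _delegate.releasePageContext(pageContext);
    }

    @Override
    public JspEngineInfo getEngineInfo() {
        return _delegate.getEngineInfo();
    }

    @Override
    public JspApplicationContext getJspApplicationContext(
        ServletContext servletContext) {

        return _delegate.getJspApplicationContext(servletContext);
    }
}

Such a wrapper would be installed at startup, for example with JspFactory.setDefaultFactory(new UnbufferedJspFactory(JspFactory.getDefaultFactory())).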

The whole process follows the JSP standard; no reflection or code modification is required, so in theory it should work on all app servers, as long as they also follow the JSP standard. In practice it works on all app servers except WebLogic. WebLogic's JSP compiler does not follow the program-to-interface rule: the compiled servlet code assumes everything is their internal implementation and downcasts everywhere. So I am sorry to say we have to skip this optimization on WebLogic.

For more implementation detail, please see LPS-9099.

New Ehcache Replication Mechanism

Company Blogs July 5, 2010 By Shuyang Zhou Staff

2010's benchmarking and optimization effort is reaching its end, so it is time to write something about what we did in the last few months.

Today we will talk about an EE-only feature: a new Ehcache cluster replication mechanism. Since this is only available in the EE version of the portal, I won't go into implementation details too much, just some general concepts about what the problem is and how we fixed it.

Ehcache supports clustering; by default it uses an RMI replication mechanism, which forms a point-to-point communication graph. As you can guess, this kind of structure cannot scale to large clusters with many nodes, because each node has to send the same event to the N-1 other nodes. When N is too big, this becomes a network traffic disaster.



To make things even worse, Ehcache creates a replication thread for each cache entity. In a large system like Liferay Portal, it is very easy to have more than 100 cache entities, which means 100+ cache replication threads. Threads are expensive because they consume resources (memory and CPU), yet these threads are most likely sleeping more than 99% of the time, since they only work when a cache entity needs to talk to remote peers. Leaving aside thread heap memory (which is application dependent), just consider the stack memory footprint of those 100+ threads. By default on most platforms the thread stack size is 2MB, which means 200+MB; if you include heap memory, this number may even reach 500MB (and this is just for one node!). Even though memory chips are cheap today, we still should not waste them, and massive numbers of threads also cause frequent context-switch overhead.

We need to fix both the 1-to-(N-1) network communication and the massive-threads bottleneck.

Liferay Portal has a facility called ClusterLink, which is basically an abstract communication channel; the default implementation uses JGroups' UDP multicast to communicate.

By using ClusterLink we can fix the 1-to-(N-1) network communication easily.
To reduce the number of replication threads, we provide a small group of dispatching threads dedicated to delivering cache cluster events to remote peers. Since all cache entities' cluster events now flow through one place on their way to the network, we also get a chance to coalesce them: if two modifications to the same cached object are close enough in time, we only need to notify remote peers once, saving some network traffic.



(Newer versions of Ehcache support a JGroups replicator, which can also fix the 1-to-(N-1) network communication, but it cannot fix the massive-threads problem and cannot coalesce events.)

EE customers who are interested in this feature can contact our support engineers for more detailed information.

Master Your ThreadLocals

Company Blogs January 22, 2010 By Shuyang Zhou Staff

ThreadLocal is not a "silver bullet" for concurrency issues; in fact, some concurrency best practices discourage its use.

But sometimes it is really needed, or it can significantly simplify your design, so we have to face it. Since it is very easy to misuse, we have to find a way to prevent it from causing trouble. So today we are not talking about when and how to use ThreadLocal, but about how, when you do use it, you can make sure it won't cause serious trouble.

The most serious mistake you can make with ThreadLocal is forgetting to reset it. Say you use a ThreadLocal to cache user authentication: user A logs into your system through a service handled by worker thread 1, and you cache the authentication info in a ThreadLocal for performance. But after worker thread 1 finishes the request for user A, you forget to reset the ThreadLocal (clear the cache). Now user B hits your system without logging in, and just happens to be served by worker thread 1. Worker thread 1 simply checks its cache for authentication and considers user B to be user A. You can imagine what happens next.

So the intuitive solution is to reset the ThreadLocal after each request. The difficult part is that a worker thread may have several ThreadLocal objects created all around your application; how can you reset them all easily? You need a registry of all ThreadLocal variables for each thread. Be aware: the registry itself also has to be a ThreadLocal object, so that when a thread resets all the ThreadLocal variables in the registry, it only resets its own, not those of other threads. Once you have a registry like this, you can simply reset the whole registry after each request, usually in a filter.

Another question that should come to mind is how to register a ThreadLocal variable with the registry. Of course you could add a registering line after each ThreadLocal set() call, but that is ugly and suffers from the same problem: what if you forget to add that registering call? The solution is to create a subclass of ThreadLocal and override the set() and initialValue() methods so that whenever they are called, the ThreadLocal registers itself with the registry. This way the whole registering and resetting process is transparent to the programmer; all you need to do is use our subclass instead of the original ThreadLocal class.

Here are the ThreadLocal subclass and the ThreadLocalRegistry:

public class AutoResetThreadLocal<T> extends InitialThreadLocal<T> {
    public AutoResetThreadLocal() {
        this(null);
    }
    public AutoResetThreadLocal(T initialValue) {
        super(initialValue);
    }
    public void set(T value) {
        ThreadLocalRegistry.registerThreadLocal(this);
        super.set(value);
    }
    protected T initialValue() {
        ThreadLocalRegistry.registerThreadLocal(this);
        return super.initialValue();
    }
}

public class ThreadLocalRegistry {
    public static ThreadLocal<?>[] captureSnapshot() {
        Set<ThreadLocal<?>> threadLocalSet = _threadLocalSet.get();
        return threadLocalSet.toArray(
            new ThreadLocal<?>[threadLocalSet.size()]);
    }
    public static void registerThreadLocal(ThreadLocal<?> threadLocal) {
        Set<ThreadLocal<?>> threadLocalSet = _threadLocalSet.get();
        threadLocalSet.add(threadLocal);
    }
    public static void resetThreadLocals() {
        Set<ThreadLocal<?>> threadLocalSet = _threadLocalSet.get();
        for (ThreadLocal<?> threadLocal : threadLocalSet) {
            threadLocal.remove();
        }
    }
    private static ThreadLocal<Set<ThreadLocal<?>>> _threadLocalSet =
        new InitialThreadLocal<Set<ThreadLocal<?>>>(
            new HashSet<ThreadLocal<?>>());
}

 Also a graph to demonstrate the registering and resetting:

Here is some advice:

  • Don't forget to reset your ThreadLocals, no matter how you use them.
  • When your ThreadLocal's valid period is limited to request processing (or some other kind of bounded period), use AutoResetThreadLocal and ThreadLocalRegistry to simplify your code. (The fewer lines you write, the fewer chances you have to make a mistake.)
  • Be aware: you still need to call ThreadLocalRegistry.resetThreadLocals() somewhere, usually in a filter, as in the sketch below.
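For example, a minimal, hypothetical servlet filter that performs the reset (the filter name and mapping are up to you):

import java.io.IOException;

import javax.servlet.Filter;
import javax.servlet.FilterChain;
import javax.servlet.FilterConfig;
import javax.servlet.ServletException;
import javax.servlet.ServletRequest;
import javax.servlet.ServletResponse;

// Hypothetical filter: reset every registered ThreadLocal after each request,
// so a pooled worker thread never leaks state into the next request it serves
public class ThreadLocalResetFilter implements Filter {

    public void init(FilterConfig filterConfig) {
    }

    public void doFilter(
            ServletRequest request, ServletResponse response, FilterChain chain)
        throws IOException, ServletException {

        try {
            chain.doFilter(request, response);
        }
        finally {
            // Runs even if the request failed, which is exactly the point
            ThreadLocalRegistry.resetThreadLocals();
        }
    }

    public void destroy() {
    }
}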

IO performance

Company Blogs December 22, 2009 By Shuyang Zhou Staff

IO is very important in almost all types of applications, because IO operations can very easily become a bottleneck.

In the Java world there are two groups of IO classes, traditional IO (TIO) and New IO (NIO), plus a coming enhancement to NIO, NIO2.
NIO (and NIO2) are targeted at improving performance for certain cases and providing better OS-level IO integration, but they cannot replace TIO; there are plenty of places where TIO is your only option.
Today we will talk about TIO's performance.

There are two major types of IO bottleneck:

  1. Wrong IO buffer usage
  2. Overkill synchronized protection

We all know a buffer can improve IO performance, but not everyone knows how to use buffers correctly. At the end of this blog I will list some best-practice advice.

Part I: Wrong IO buffer usage
There are two popular misuses and one misunderstanding:

a) Adding a buffer to in-memory IO classes (misuse)

b) Adding an explicit buffer on top of the Buffered version IO classes (misuse)

c) The relationship between the Buffered version IO classes and using an explicit buffer (misunderstanding)

For a), this is ridiculous! A buffer is supposed to group many IO device accesses into one access; in-memory IO classes (like ByteArrayInput/OutputStream) never touch any IO device, so there is no reason to buffer them.

For b), this is redundant! You only need one level of buffering; more than one level only introduces more stack calls and garbage creation.

For c), this needs more explanation.

They both try to achieve the same goal, but in different ways, and that gives them different performance!
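To make c) concrete, here are the two styles being compared, as a small illustrative sketch (not the actual benchmark code): the first reads byte by byte through a BufferedInputStream, the second pulls chunks into an explicit byte[] from the raw stream.

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

public class ReadStyles {

    // Style 1: Buffered version IO class, reading one byte at a time
    public static long countBuffered(String fileName) throws IOException {
        InputStream in = new BufferedInputStream(new FileInputStream(fileName));

        try {
            long count = 0;

            while (in.read() != -1) {
                count++;
            }

            return count;
        }
        finally {
            in.close();
        }
    }

    // Style 2: plain FileInputStream plus an explicit byte[] buffer
    public static long countExplicitBuffer(String fileName) throws IOException {
        InputStream in = new FileInputStream(fileName);

        try {
            byte[] buffer = new byte[8192];

            long count = 0;
            int length;

            while ((length = in.read(buffer)) != -1) {
                count += length;
            }

            return count;
        }
        finally {
            in.close();
        }
    }
}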

We did a test for this, comparing reading and writing files via the Buffered version IO classes versus an explicit buffer.
The following performance test results show you how big the difference is:

Read: (all numbers are taken after warmup, each sample time is for 10 reads, time unit: ms)

  File size                      1K   10K   100K   1M    10M    100M    1G
  BufferedInputStream            0    1     5      53    549    5492    56002
  With explicit byte[] buffer    0    0     1      10    113    1126    11448

Write: (all numbers are taken after warmup, each sample time is for 10 writes, time unit: ms)

  File size                      1K   10K   100K   1M    10M    100M    1G
  BufferedOutputStream           0    1     5      45    472    4793    48794
  With explicit byte[] buffer    0    1     1      10    124    1300    13138


Why is there such a huge performance difference? There are two reasons:

  • The Buffered version IO causes more stack calls (thanks to the decorator pattern)
  • All Buffered version IO classes are thread-safe, which means a lot of synchronized protection (more on this in Part II)

Now you know the explicit buffer has better performance; try to use it whenever possible. There are, however, two cases where you still need the Buffered version IO:

  • You are working with a third-party library that takes IO input parameters and uses them in a streaming way, not with an explicit buffer. To improve performance you have to pass in a Buffered version IO object.
  • If you are lazy, you may prefer the Buffered version IO classes, since they keep your code shorter.

Part II: Overkill synchronized protection

I mean the JDK's io package. I don't really like that code, because it is all thread-safe, which means a lot of synchronized protection. If I need thread safety, I prefer to do the protection myself, so I never add overkill synchronized protection. But the JDK's io package gives me no choice.

As long as you use JDK IO code, you are paying for a lot of synchronized protection; even when you are 100% sure you are in a single-threaded context, you cannot bypass it. You may wonder whether this is really a serious problem; the JVM should be able to handle uncontended locks quickly, right? Apparently it cannot do it well enough; see the performance test results.

I recreated a batch of IO classes following the JDK io package's Javadoc; none of my classes do any synchronization. All tests are done in a single thread, so don't worry about thread safety.
We did a test for this, comparing reading/writing in-memory data with the original JDK IO classes and our unsync version IO classes. The reason for using in-memory data is to magnify the performance impact of synchronization.

Read:

 

Write:

The write curve is not as smooth as the read curve, because the internal growing byte[] causes a lot of GC (similar to the problem we saw with SB).

Ok, now you see how heavy the synchronized protection is. We have a lot of IO usage within a single method call, which is guaranteed to be single-threaded. We also have a lot of IO usage where, even though references to the IO objects escape the method scope, we can reason that they are only accessed by a single thread. For cases like these, feel free to use the unsync version IO classes under com.liferay.portal.kernel.io.unsync.
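For example, assuming the class names in that package mirror their JDK counterparts (e.g. UnsyncByteArrayOutputStream mirroring java.io.ByteArrayOutputStream), a single-threaded helper could simply swap classes:

import java.io.IOException;

import com.liferay.portal.kernel.io.unsync.UnsyncByteArrayOutputStream;

public class SerializeInMemory {

    public static byte[] toBytes(String payload) throws IOException {
        // The stream never escapes this method, so single-thread use is
        // guaranteed and the unsynchronized version is safe
        UnsyncByteArrayOutputStream out = new UnsyncByteArrayOutputStream();

        out.write(payload.getBytes("UTF-8"));

        return out.toByteArray();
    }
}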

For more detail about com.liferay.portal.kernel.io.unsync, see issues.liferay.com/browse/LPS-6649

My final advice:
1) Use an explicit buffer rather than the Buffered version IO whenever possible.
2) Use the Buffered version IO with third-party libraries, or when you are lazy.
3) Use the unsync version IO classes from com.liferay.portal.kernel.io.unsync whenever you are sure you are in a single-threaded context, or when you are adding the synchronization protection yourself.

 

String Performance

Company Blogs December 16, 2009 By Shuyang Zhou Staff

String in Java is special because of its immutability.
Immutable Strings are a cornerstone of Java security and thread safety; without them, Java would be brittle.

But immutability comes with a price: whenever you "modify" a String, you are actually creating a new one, and in most cases the old one becomes garbage. Thanks to Java's automatic garbage collection, programmers don't have to worry about those String garbage objects too much. But if you ignore them completely, or even abuse the String API, your program will suffer from excessive GC.

Over the JDK's history, some efforts have been made to reduce (or avoid) String garbage creation. JDK 1.0 added StringBuffer and JDK 1.5 added StringBuilder; they are the same except that StringBuilder is not thread-safe. Most String concatenation happens inside a method call, under a single-threaded context, so there is no need to synchronize. The JDK's advice is: whenever you need to concatenate Strings, use StringBuffer or StringBuilder, and whenever you are in a single-threaded context, use StringBuilder rather than StringBuffer. Following this advice improves performance compared to using String.concat() directly in most cases, but real-world cases can be more complex, and this advice cannot give you the best performance. Today let's talk about String concatenation performance in depth, so you can understand it completely.

First, let's refute a rumor: some people say SB (StringBuffer and StringBuilder) is always better than String.concat(). This is wrong! Sometimes String.concat() can beat SB. We will prove this with an example.

Goal:
    Connect two Strings,
    String a = "abcdefghijklmnopq"; //length=17
    String b = "abcdefghijklmnopqr"; //length=18

Explanation:
    We are going to analyze garbage creation for different concatenation approaches. In our discussion we ignore the input parameters (Strings a and b), even if they become garbage, since they are not created by our code. We only count a String's internal char[]; a String's other state is so small that we can simply ignore it.

Solution 1:
    Use String.concat()
Code:
    String result = a.concat(b);
This is very simple; let's look at the JDK String source code to see what actually happens.

String source code:

public String concat(String str) {
    int otherLen = str.length();
    if (otherLen == 0) {
        return this;
    }
    char buf[] = new char[count + otherLen];
    getChars(0, count, buf, 0);
    str.getChars(0, otherLen, buf, count);
    return new String(0, count + otherLen, buf);
}

String(int offset, int count, char value[]) {
    this.value = value;
    this.offset = offset;
    this.count = count;
}

    This piece of code creates a new char[] whose length equals a.length() + b.length(), copies a's and b's content into the new char[], and finally creates a new String from that char[]. Pay attention to the constructor: it only has package accessibility and uses the passed-in char[] directly as the String's internal char[], without any defensive copy. This constructor has to be package-visible; otherwise a user could use it to break String's immutability (by modifying the char[] after using it to create a String). The JDK's code guarantees that no one modifies a char[] passed to this constructor.

    In this whole process we do not create any garbage (as we said, a and b are parameters not created by us, so we don't count them). So we are good.

Solution 2:
    Use SB.append(); we'll use StringBuilder as the demo, and the same applies to StringBuffer.
Code:
    String result = new StringBuilder().append(a).append(b).toString();

This code looks more complex than String.concat(), but what about the performance? Let's analyze it in 4 steps: new StringBuilder(), append(a), append(b) and toString().
First, new StringBuilder().
See StringBuilder source code:
public StringBuilder() {
    super(16);
}
AbstractStringBuilder(int capacity) {
    value = new char[capacity];
}
We create a char[] of size 16; no garbage created so far.

Second, append(a).
See source code:
public StringBuilder append(String str) {
    super.append(str);
    return this;
}
public AbstractStringBuilder append(String str) {
    if (str == null) str = "null";
    int len = str.length();
    if (len == 0) return this;
    int newCount = count + len;
    if (newCount > value.length)
        expandCapacity(newCount);
    str.getChars(0, len, value, count);
    count = newCount;
    return this;
}
void expandCapacity(int minimumCapacity) {
    int newCapacity = (value.length + 1) * 2;
    if (newCapacity < 0) {
        newCapacity = Integer.MAX_VALUE;
    } else if (minimumCapacity > newCapacity) {
        newCapacity = minimumCapacity;
    }
    value = Arrays.copyOf(value, newCapacity);
}
    This code ensures capacity first, which creates a new char[] of size 34 and turns the old size-16 char[] into garbage. Checkpoint 1: we create our 1st garbage char[], size 16.

Third, append(b).
Same logic: ensure capacity first, which creates a new char[] of size 70 and turns the old size-34 char[] into garbage. Checkpoint 2: we create our 2nd garbage char[], size 34.

Finally, toString().

See source code:
public String toString() {
    // Create a copy, don't share the array
    return new String(value, 0, count);
}
public String(char value[], int offset, int count) {
    if (offset < 0) {
        throw new StringIndexOutOfBoundsException(offset);
    }
    if (count < 0) {
        throw new StringIndexOutOfBoundsException(count);
    }
    // Note: offset or count might be near -1>>>1.
    if (offset > value.length - count) {
        throw new StringIndexOutOfBoundsException(offset + count);
    }
    this.offset = 0;
    this.count = count;
    this.value = Arrays.copyOfRange(value, offset, offset+count);
}
    Pay attention to this String constructor: it is public, so it has to make a defensive copy, otherwise a user could break String's immutability. But it turns the builder's final char[] into our 3rd garbage object, whose size is 70.

So in total we create 3 garbage objects, with a total size of 16+34+70=120 chars. Java uses UTF-16, so that means 240 bytes!

One thing can make SB better; change the code to:
String result = new StringBuilder(a.length() + b.length()).append(a).append(b).toString();
Calculate it yourself: we create only one wasted object, of size 17 + 18 = 35. Still not great, is it?

Compared to String.concat(), SB creates a lot of garbage (anything bigger than 0 compared to 0 is infinitely more!), and as you can see SB involves many more stack calls than String.concat().
With further analysis (do it yourself) you will find that when concatenating fewer than 4 Strings, String.concat() is much better than SB.

Ok, so when we are concatenating more than 3 Strings, we should simply use SB, right?
Not exactly!

SB has an inherent problem: it uses a growing internal char[] to append new Strings. Whenever you append a new String and SB reaches its capacity, it grows; afterwards it has a bigger char[] and the old one becomes garbage. If we could tell SB exactly how long the final result would be, it would save a lot of growth garbage, but that is not easy to predict!

Compared to predicting the final String's length, predicting the number of Strings you are going to concatenate is much easier. So we can cache the Strings we want to concatenate, and then at the last moment (when toString() is called) calculate the final String's length exactly and use that length to create the SB that does the concatenation; this saves a lot of garbage. Even when we cannot predict how many Strings we are going to concatenate, we can use a growing String[] to cache them. Since the String[] is much smaller than the underlying char[] (most Strings contain more than 1 char in real-world cases), growing a String[] is much cheaper than growing a char[]. This is exactly how our StringBundler works.

public StringBundler() {
    _array = new String[_DEFAULT_ARRAY_CAPACITY]; // _DEFAULT_ARRAY_CAPACITY = 16
}
public StringBundler(int arrayCapacity) {
    if (arrayCapacity <= 0) {
        throw new IllegalArgumentException();
    }
    _array = new String[arrayCapacity];
}
You can create a StringBundler with the default array capacity, which is 16, or give it the required array capacity.
Whenever you call append(), you are not actually appending; the String is only cached in the array.
public StringBundler append(String s) {
    if (s == null) {
        s = StringPool.NULL;
    }
    if (_arrayIndex >= _array.length) {
        expandCapacity();
    }
    _array[_arrayIndex++] = s;
    return this;
}
If you reach the capacity, the internal String[] grows.
protected void expandCapacity() {
    String[] newArray = new String[_array.length << 1];
    System.arraycopy(_array, 0, newArray, 0, _array.length);
    _array = newArray;
}
Expanding the String[] is much cheaper than expanding a char[], because the String[] is smaller and grows less often than the char[] would.
When you finish all your appending, call toString() to get the final result.
public String toString() {
    if (_arrayIndex == 0) {
        return StringPool.BLANK;
    }
    String s = null;
    if (_arrayIndex <= 3) {
        s = _array[0];
        for (int i = 1; i < _arrayIndex; i++) {
            s = s.concat(_array[i]);
        }
    }
    else {
        int length = 0;
        for (int i = 0; i < _arrayIndex; i++) {
            length += _array[i].length();
        }
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < _arrayIndex; i++) {
            sb.append(_array[i]);
        }
        s = sb.toString();
    }
    return s;
}

If the number of Strings is less than 4, it uses String.concat() to concatenate them; otherwise it calculates the final result's length first, creates a StringBuilder with that length, and uses it to concatenate the Strings.

I suggest using String.concat() directly when you are sure you only need to concatenate fewer than 4 Strings; even though StringBundler can do this for you, why bother introducing the extra String[] and stack calls? For the common case, a typical StringBundler usage looks like the sketch below.
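A small usage sketch (assuming the com.liferay.portal.kernel.util package for StringBundler and using only the methods shown above):

import com.liferay.portal.kernel.util.StringBundler;

public class StringBundlerUsage {

    public static void main(String[] args) {
        // We know up front that 4 pieces will be appended, so pass 4 as the
        // array capacity and the internal String[] never has to grow
        StringBundler sb = new StringBundler(4);

        sb.append("Hello");
        sb.append(", ");
        sb.append("Liferay");
        sb.append("!");

        System.out.println(sb.toString()); // Hello, Liferay!
    }
}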

For more detail about StringBundler see support.liferay.com/browse/LPS-6072.

Ok, enough explanation; it is time to see the benchmark results. The numbers tell you how much performance we can improve by using StringBundler!

We compare the performance of String.concat(), StringBuffer, StringBuilder, StringBundler with the default initial capacity, and StringBundler with an explicit initial capacity.
The comparison includes two parts:

  1. Time consumed to get the same amount of work done.
  2. Garbage created to get the same amount of work done.

In the test all Strings have length() == 17; we concatenate from 72 down to 2 Strings, and for each count we run 100K times.
For 1) we only take results between 40 and 2, because JVM warm-up may affect the results.
For 2) we take the whole range, since JVM warm-up won't affect the total amount of garbage generated.

BTW, we use the following JVM parameters to generate the GC log:
-XX:+UseSerialGC -Xloggc:gc.log -XX:+PrintGCDetails

We use SerialGC to eliminate the influence of multiple processors.

The following picture shows the time consumption comparison:

From the picture you can see:

  1. String.concat() is the best when concatenating 2 or 3 Strings.
  2. StringBundler is better than SB in general.
  3. StringBuilder is better than StringBuffer (because it saves a lot of unneeded synchronization).

For 3) we may discuss this more deeply in later blogs; there are a lot of similar cases in our code and in the JDK where synchronization is not necessary (at least in some cases), like the JDK's io package. If we bypass that unneeded synchronization, we can improve performance.

We analyzed the GC log (a GC log cannot give a 100% accurate garbage number, but it shows the trend):

  String.concat()                              229,858,963K
  StringBuffer                                  34,608,271K
  StringBuilder                                 34,608,144K
  StringBundler with default init-capacity      21,214,863K
  StringBundler with explicit init-capacity     19,562,434K

 

From these statistics you can see that StringBundler does save you a lot of String garbage.

My final advice is:

  1. Use String.concat() when you concatenate only 2 or 3 Strings.
  2. Use StringBuilder/StringBuffer when you concatenate more than 3 Strings and you can accurately predict the final result's length.
  3. Use StringBundler when you concatenate more than 3 Strings but cannot predict the final result's length.
  4. Give StringBundler an explicit initial capacity when you can predict it.

If you are lazy, just use StringBundler; it is the best choice for most cases, and even in the other cases where it is not the best choice, it still performs well enough.

Blocking Cache

Company Blogs December 7, 2009 By Shuyang Zhou Staff

I have just come back from the 2009 retreat; it was really fun. A lot of old faces and new faces, a lot of fun talks and genius ideas. I am glad I was there; I had a good time.

At the retreat a few people told me they read my blogs, which surprised me a lot; I had only written two, and that was almost one year ago, so I feel sorry about that. I used to plan to write a series on performance tuning; I still want to, but may not have enough time. I do, however, want people to know how much effort Liferay has spent on performance. Liferay is fast and getting better and better. I will write about the things we have done that improved performance; people may already know about them, but may not know the details of how we made them happen, so I will explain some of those details (so you will trust that we really did improve it). Today's topic: "Blocking Cache".

We use caching (Ehcache) a lot in our code to improve IO performance (database access), and in most cases it works perfectly. But there is one special case where we may still waste some IO performance: concurrent hits on an absent cache entry.

We need two conditions for this to happen:
1) More than one thread hits the cache with the same key.
2) These threads hit the cache at the same time.
    (This is not a strong condition; it does not have to be exactly the same time. If the threads' cache-miss-and-populate windows overlap, it will happen.)

When this happens we waste some IO performance, because multiple threads are doing the same job that only needs to be done once. So why not ask just one thread to do it, and let the others wait for the result to be populated into the cache by the "unlucky" working thread? This is where Ehcache's BlockingCache comes to help: when multiple threads hit the cache with the same key within an overlapping window, only the first one to arrive gets a cache miss and passes through; it hits the database and populates the cache, while the other threads block at the cache until the cached element is ready. This seems like a good fix, but it can cause an even more serious problem: deadlock!

Because our cache system is complex, it is very common to look up multiple cache elements in one request, which means we may have multiple cache misses in one request (one thread). When multiple threads look up multiple cache elements in different orders, BlockingCache can easily cause a deadlock.

Say we have two BlockingCaches, 1 and 2, and two threads, A and B. A and B both need an element from 1 and from 2, with the same keys, but A needs 1 first and then 2, while B needs 2 first and then 1.
Thread A locks BlockingCache 1 first, then tries to lock BlockingCache 2; but right before it does, thread B locks BlockingCache 2. So thread A has to wait for thread B to release BlockingCache 2, which will never happen because thread B cannot lock BlockingCache 1 (thread A locked it). Deadlock!

This is a very classic deadlock case, and there are two traditional ways to fix it:
1) Use a global lock to protect all caches.
    This kills all cache concurrency; we will never use it.
2) Force all threads to access the caches in the same order.
    This won't hurt concurrency, but it adds very complex logic to all cache-accessing code; too difficult to implement.

The traditional ways cannot help us!
The JVM does not support deadlock recovery the way database servers do, because there are no transactions in Java code execution (that would be too heavy). But in this special case we don't need deadlock recovery: if we can make sure each thread locks no more than one BlockingCache, we are deadlock free. We do have to make a compromise here, because there is no way to achieve this without changing the caching strategy. We change the strategy so that when a thread tries to lock its second BlockingCache, it first gives up the BlockingCache it already holds.

This looks like a big performance loss, but actually it is not. Think about a very common case: we have two threads, one with a complex BlockingCache requirement, meaning it has to lock a lot of BlockingCaches to finish its job, and another whose requirement is quite simple, maybe locking only 1 or 2 BlockingCaches; they just happen to need the same key in one cache. If we want to save as much IO as possible, we should make one of them wait for the other, but if we happen to make the simple thread wait, it may take a long time to get its result, maybe even longer than hitting the database directly, because the complex thread really does need a long time to finish. We would save overall system overhead but ruin that thread's response time. The compromise is that we pay some overhead for responsiveness, and we get deadlock freedom along the way; we kill two birds with one stone!

We implemented this in BlockingPortalCache; see LPS-5468.

BlockingPortalCache uses a ThreadLocal to record each thread's last lock; when a thread tries to lock a new one, it has to give up its last one, so all threads waiting on that first lock (if any) can move on, either hitting the database or locking their next BlockingCache. A minimal sketch of the idea follows.
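This is a hypothetical sketch of that rule only, not the actual BlockingPortalCache code (which also uses CompeteLatch and ties the release to the cache put):

import java.util.concurrent.locks.Lock;

// Hypothetical policy: before a thread blocks on a new per-key lock, it
// releases the lock it still holds from its previous cache miss, so no thread
// ever holds more than one blocking-cache lock at a time
public class SingleLockPerThreadPolicy {

    private static final ThreadLocal<Lock> _lastLock = new ThreadLocal<Lock>();

    public static void acquire(Lock lock) {
        Lock lastLock = _lastLock.get();

        if (lastLock != null) {
            // Give up the previously held lock so threads waiting on it can
            // move on (hit the database or lock their next cache)
            lastLock.unlock();
        }

        lock.lock();

        _lastLock.set(lock);
    }

    public static void release(Lock lock) {
        lock.unlock();

        if (_lastLock.get() == lock) {
            _lastLock.remove();
        }
    }
}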

BlockingPortalCache also uses CompeteLatch to improve concurrent performance on multi-core machines (4+ cores); see LPS-3744. CompeteLatch scales very well on multi-core compared to the synchronized keyword, and it is recyclable (if used properly), unlike CountDownLatch.

So now you can safely turn on BlockingPortalCache for performance, with no need to worry about deadlock.

Liferay JVM Tuning

Company Blogs March 24, 2009 By Shuyang Zhou Staff

When people say JVM tuning, most of the time they mean tuning GC. Before we start tuning the JVM for Liferay Portal, let's cover some basic concepts of GC.


As we all know, the JVM handles useless memory blocks automatically, which relieves us from manually freeing useless memory blocks, the number-one burden for all C and C++ programmers. With help from GC our life is much easier; we don't need to worry about memory leaks in every single line of our code. But this does not mean you can ignore memory entirely.

So let me ask you a very common question: Manual Memory Management vs. Automatic Memory Management, which one is better?

The answer changes depending on how you define "better", and who is making the definition.
Let's look at the first part: how do you define better?
If easy is better, then of course Automatic Memory Management is better. You don't need to do any explicit memory management; the JVM takes care of everything.
If flexible is better, then of course Manual Memory Management is better. Everything is under your control.

For the second part, who is making the definition?
If you are a C or C++ guru, of course Manual Memory Management is better.
If you are not (and most people reading this blog fall into this case :), Automatic Memory Management is better.

If you are not very good at manual memory management, it is very easy to make the following mistakes:

Dangling references: It is possible to deallocate the space used by an object to which some other object still has a reference. If the object with that (dangling) reference tries to access the original object, but the space has been reallocated to a new object, the result is unpredictable and not what was intended.

Memory leak: These leaks occur when memory is allocated and no longer referenced but is not released. For example, if you intend to free the space utilized by a linked list but you make the mistake of just deallocating the first element of the list, the remaining list elements are no longer referenced but they go out of the program's reach and can neither be used nor recovered. If enough leaks occur, they can keep consuming memory until all available memory is exhausted.

Another big problem for memory management is memory fragmentation. With manual memory management it is very difficult to fix; with automatic memory management it becomes very easy.
Suppose you have a 4MB heap and you allocate three objects of 1MB, 1MB and 1MB, so your heap looks like this:

Then you free the second object, so your heap looks like this:

Finally you need to allocate a 4th object of 1.5MB. At this point you have 2MB of free heap space, but you can't fit a 1.5MB object into it.

Memory fragmentation can cause false out-of-memory errors, and it can also make memory allocation very slow, because you need a linked list to record all the free blocks and a search through it for every allocation request.

Ok, that is enough. For more information about JVM memory management you can read the
memorymanagement_whitepaper.
Let's start our real topic: tuning the JVM for Liferay Portal.

1) Check your portal server's CPU and memory. Our portal server has two 4-core CPUs, which means 8 cores for the JVM, and we have 8GB of physical memory.
Because we have multiple cores, we should try to use multiple threads to speed up GC.
In Sun JDK 5 there are 4 built-in GC types: the Serial Collector, the Parallel Collector, the Parallel Compacting Collector and the Concurrent Mark-Sweep (CMS) Collector.
For our server we choose the Concurrent Mark-Sweep (CMS) Collector, because we have 8 cores and we need short GC pause times (for more detail, please read the
memorymanagement_whitepaper). Use -XX:+UseConcMarkSweepGC to turn on this option.

2) Set Xms and Xmx to the same value. Because our server is dedicated to Liferay Portal, once you know the suitable heap size for the whole JVM there is no reason to define a range and let the JVM grow the heap within it; we should set the heap size directly to the suitable value. Use -Xms2048m -Xmx2048m to set the heap size (I will explain why 2048 later).

3) Set the young generation size (don't know what the young generation is? Again, read the
memorymanagement_whitepaper). For the same reason as above, we fix the young generation size directly to the suitable value. Experimentally we make the young generation about 1/3 of the whole heap. Use -XX:NewSize=700m -XX:MaxNewSize=700m to set the young generation size.

4) Set PermSize. The permanent space is used to store Class objects, so the size depends on how many classes you have. For our case 128MB is big enough. If you get an OutOfMemoryError complaining that there is no more space in PermGen, you should increase this value. Use -XX:MaxPermSize=128m to set the PermSize.

5) Set the SurvivorRatio of the young generation. There is no particular rule for this; the value comes from observation (with VisualVM). You should try to make the survivor size bigger than the peak memory usage. Here we set the ratio to 20. Use -XX:SurvivorRatio=20 to set the ratio.



Ok, now let's talk about why we set the heap size to 2048MB. We are using a 64-bit JVM, so we could support a much larger heap. The rule for choosing a heap size is that a good fit is best; never make it too big. GC algorithms tell us that as the heap size increases linearly, GC time grows much faster than linearly, so we should set the heap size just big enough to meet our needs. In our case, through testing with VisualVM, we chose 2048MB. From the GC times of the young and old generations you can tell whether a heap size is suitable: generally the average young generation GC time should be around 20ms, and the old generation GC time around 200ms~400ms.

The final startup parameters should look like this:
JAVA_OPTS="$JAVA_OPTS -XX:NewSize=700m -XX:MaxNewSize=700m -Xms2048m -Xmx2048m -XX:MaxPermSize=128m -XX:+UseConcMarkSweepGC -XX:SurvivorRatio=10"

For more info about jvm startup parameters, please read
jvm-options-list

Liferay Performance Tuning

Company Blogs March 16, 2009 By Shuyang Zhou Staff

I have been in LA for more than a month for new employee training. I graduated on Jan 15th, 2009, then started my first job at Liferay.

I like this company and everyone here; I am really having a good time in LA (staying at my big boss's home. Thanks Brian and Caris!). My coworkers are very nice, and my bosses are kind and full of knowledge. I can learn a lot from them.

I am very happy to have gotten this job; here I can realize my dream.


In this blog I will start to talk about my first project, Liferay Performance Tuning. I hope this can help all my coworkers, and people who are interested in Liferay Portal, to understand the performance work. Trust me, Liferay can run very fast! Let's start to speed it up together.

PS. Most of the information here is for developers, though some of it is also helpful for end users. All the improvements mentioned here will sooner or later be included in a new release.

Preparation:

Before we start to tune our portal, we need a tool to tell us how good or bad the performance is. We can call it a stress test tool or a benchmark tool (technically they are different, but here we just use them as a measuring standard). I won't leak the actual result numbers here, because we will formally publish a benchmark for our portal soon. So if you want to know the numbers, you will have to wait.

 

We are using The Grinder as our stress test tool. We define a series of test scenarios, code them as Jython test scripts, and run them on a batch of load injection machines. All the load is injected into a portal server that has a connection to a database server. The Grinder reports how well the servers handle the injected load; from that we can find the bottleneck and kill it.

So we work in iterations: create a test scenario, set up the Jython script, run the stress test, find the bottleneck, kill the bottleneck.

Hardware environment:

    Portal server:
    CPU:               two 4-core CPUs
    Memory:            8GB
    Hard drive:        10,000 rpm
    Network interface: 1Gb/s

    Database server:
    CPU:               two 4-core CPUs
    Memory:            8GB
    Hard drive:        10,000 rpm
    Network interface: 1Gb/s

    Network: 1Gb/s

Software environment:
    OS:       CentOS
    JDK:      Sun JDK 5 update 17
    Portal:   Liferay Portal 5.1.x
    Database: MySQL 5.1

 

Our first test scenario: Login
    In this scenario we hit the homepage, post to login, go to a private page (this page has a portlet that needs 1 second to render its view, to simulate complex backend logic), sleep 5s (to simulate real user delay), go to another private page (like the first one, with a 1s-delay portlet), sleep 5s, go to a final private page (again with a 1s-delay portlet), sleep 10s, and finally log out.

    The whole iteration takes about 25s. We create a batch of client threads to run this scenario again and again; The Grinder reports the time used for each part of the scenario.

Set up the Jython script:
    We use The Grinder's proxy recorder to record the user actions, then modify the recorded code to fit our scenario. We also create some SQL scripts to set up the user accounts for login.

Run the stress test:
    We use one client machine to run The Grinder console and another client machine to run The Grinder agent with several hundred worker threads. All the machines are connected to a 1 Gb/s network.

Ok, that is it for today. Next time we will start by tuning the JVM, then move on to finding the bottleneck for our first test scenario. Hopefully we can dig it out and kill it.
