Hi Alexey,
Thanks so much for all your help. We believe the problem is solved: there was
a network misconfiguration that, once corrected, stopped the event from
occurring. Now we would like to understand just how that misconfiguration
caused the event in the first place.
(HP-UX does suffer from the Linux selector spin problem, but it was not
directly related to the CPU spikes, so we are not pursuing the workaround
yet; when we enabled it, the channel close caused long delays that stopped
all activity. Whether or not that is a symptom of the other problem, I don't
know.)
Our contacts at HP looked at some system information while the spike was
occurring and said that one of the Grizzly worker threads was going "out of
control" on socket reads, reading a massive amount of data and subsequently
chewing up CPU and thrashing in GC. I've looked at the code, and if I
understand it correctly, Grizzly reads all of the data on the socket, stashes
it in the input buffer (a byte buffer), and then starts to parse the input.
If it determines that the input is not a valid HTTP header, an exception is
thrown. (Unfortunately we had Grizzly logging turned off, so we never saw
that exception, if that is indeed what happened here.)
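
To check my understanding, the flow I think I'm seeing is roughly this (a
sketch of my reading, not the actual Grizzly source; parseHttpRequest() is
just a stand-in for the real parser):

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;

    public class ReadFlowSketch {
        // My reading of the flow: drain the socket into the input buffer
        // first, then parse whatever arrived.
        static void onRead(SocketChannel channel, ByteBuffer inputBuffer)
                throws IOException {
            int bytesRead = channel.read(inputBuffer); // reads all available data
            if (bytesRead > 0) {
                inputBuffer.flip();
                parseHttpRequest(inputBuffer);         // throws on non-HTTP input
            }
        }

        // Stand-in for Grizzly's real parsing, assumed to throw if the
        // bytes do not form a valid HTTP header.
        static void parseHttpRequest(ByteBuffer input) throws IOException {
            throw new IOException("not a valid HTTP request header");
        }
    }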
Apparently the spike happened because, throughout the day, Grizzly was
getting hammered with lots of data that was not HTTP. It was almost like a
denial-of-service attack, though we don't know how re-routing the network
traffic "fixed" it. As far as I can tell, all we get when the exception is
logged is the stack trace. How difficult would it be to also capture the
sender's IP address, the amount of data that was read, and perhaps a hex
dump of the first 100 bytes, in the log?
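
What I have in mind is something along these lines (just a sketch for
discussion; the helper name and where it would hook in are made up, only the
NIO and logging calls are standard):

    import java.nio.ByteBuffer;
    import java.nio.channels.SocketChannel;
    import java.util.logging.Logger;

    public class BadInputLogger {
        // Hypothetical helper: would be called from wherever the
        // "invalid HTTP" exception is caught, with the channel and the
        // input buffer that was just filled from the socket.
        static void logBadInput(Logger log, SocketChannel channel,
                                ByteBuffer input, int bytesRead) {
            ByteBuffer dump = input.duplicate();
            dump.flip();                               // expose what was read
            int n = Math.min(dump.remaining(), 100);   // first 100 bytes only
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < n; i++) {
                hex.append(String.format("%02x ", dump.get()));
            }
            log.warning("Invalid HTTP input from "
                    + channel.socket().getRemoteSocketAddress()
                    + " (" + bytesRead + " bytes read): " + hex);
        }
    }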
Again, thanks for all of your support. Grizzly rocks!
ddooley wrote:
>
> Hi Alexey,
>
> We tried to implement the selector spin workaround by adding "hp-ux" to
> the isLinux property. The results
> were indeed different...but unfortunately not better. Below is a snippet
> of the thread dump that shows a deadlock condition when attempting to
> create the new selector. As you will see, Thread 5 is deadlocked on
> Worker Thread 4 (object id 6e0e83d8).
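>
> For reference, the change amounted to something like this (a sketch of
> our local patch, not the official code; the real check lives in
> Grizzly's Controller class):
>
>     // Our local patch, roughly: make the os.name check that gates the
>     // selector-spin workaround accept HP-UX as well as Linux.
>     String osName = System.getProperty("os.name").toLowerCase();
>     boolean isLinux = osName.startsWith("linux")
>             || osName.startsWith("hp-ux");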
>
> This test was run with the 1.9.18e code base, as our management is not
> comfortable with running the 1.9.19 snapshot code on a live client's
> system.
>
> We are very confused by this: none of our tests in the QA lab on similar
> machines reproduce the problem (although, to be honest, the client's
> system is somewhat underpowered, with only two CPUs and 14 GB of memory).
>
> Again, just to refresh memories: without the above code change, the
> client does appear to suffer from the selector spin problem (CPU spikes
> and GC thrashing). Usually, once it occurs, it repeats almost like
> clockwork every hour. If we run with the selector spin workaround
> enabled, then instead of the spikes we get the deadlock condition at
> exactly the point where the CPU spikes would otherwise occur.
>
> I'm inclined to believe that Grizzly is not the problem here, at least
> not directly. I wish I understood NIO better. My suspicion is that there
> is a horrible bug in HP's VM (or OS) that only manifests itself on a
> system of this configuration. It's as if the system is not closing the
> socket in a timely manner. From what the client tells us, the condition
> can last up to ten minutes before finally clearing up (when not using
> the selector spin workaround).
>
> At this point we would like to try executePendingIOUsingSelectorThread,
> but management is not willing to put the 1.9.19 snapshot code on a live
> client's system. How difficult would it be to get the sources that
> changed between 1.9.18e and the 1.9.19 snapshot, so that we could
> back-port just your original "executePendingIOUsingSelectorThread"
> suggestion?
>
> Thanks again for all your help!
>
> --Dennis
>
>
> Full thread dump [Wed Dec 23 10:28:33 EST 2009] (Java HotSpot(TM) Server VM 1.5.0.18 jinteg:11.04.09-01:30 PA2.0 (aCC_AP) mixed mode):
>
> "Grizzly-30000-WorkerThread(4)" daemon prio=10 tid=0099c8e0 nid=22 lwp_id=3164606 runnable [53cee000..53cee778]
>     at sun.nio.ch.FileDispatcher.preClose0(Native Method)
>     at sun.nio.ch.SocketDispatcher.preClose(SocketDispatcher.java:41)
>     at sun.nio.ch.SocketChannelImpl.implCloseSelectableChannel(SocketChannelImpl.java:708)
>     - locked <6e0e83d8> (a java.lang.Object)
>     at java.nio.channels.spi.AbstractSelectableChannel.implCloseChannel(AbstractSelectableChannel.java:201)
>     at java.nio.channels.spi.AbstractInterruptibleChannel.close(AbstractInterruptibleChannel.java:97)
>     - locked <6e0e83b0> (a java.lang.Object)
>     at sun.nio.ch.SocketAdaptor.close(SocketAdaptor.java:352)
>     at com.sun.grizzly.TCPSelectorHandler.closeChannel(TCPSelectorHandler.java:1356)
>     at com.sun.grizzly.BaseSelectionKeyHandler.doAfterKeyCancel(BaseSelectionKeyHandler.java:229)
>     at com.sun.grizzly.BaseSelectionKeyHandler.cancel(BaseSelectionKeyHandler.java:216)
>     at com.sun.grizzly.http.SelectorThreadKeyHandler.cancel(SelectorThreadKeyHandler.java:80)
>     at com.sun.grizzly.filter.ReadFilter.postExecute(ReadFilter.java:287)
>     at com.sun.grizzly.DefaultProtocolChain.postExecuteProtocolFilter(DefaultProtocolChain.java:164)
>     at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:103)
>     at com.sun.grizzly.DefaultProtocolChain.execute(DefaultProtocolChain.java:88)
>     at com.sun.grizzly.http.HttpProtocolChain.execute(HttpProtocolChain.java:76)
>     at com.sun.grizzly.ProtocolChainContextTask.doCall(ProtocolChainContextTask.java:53)
>     at com.sun.grizzly.SelectionKeyContextTask.call(SelectionKeyContextTask.java:57)
>     at com.sun.grizzly.ContextTask.run(ContextTask.java:69)
>     at com.sun.grizzly.util.AbstractThreadPool$Worker.doWork(AbstractThreadPool.java:330)
>     at com.sun.grizzly.util.AbstractThreadPool$Worker.run(AbstractThreadPool.java:309)
>     at java.lang.Thread.run(Thread.java:595)
>
> <other worker threads follow in wait states...>
>
> "Thread-5" daemon prio=10 tid=01660018 nid=16 lwp_id=3164591 waiting for
> monitor entry [53ff4000..53ff4678]
> at sun.nio.ch.SocketChannelImpl.kill(SocketChannelImpl.java:741)
> - waiting to lock <6e0e83d8> (a java.lang.Object)
> at
> sun.nio.ch.AbstractPollSelectorImpl.implClose(AbstractPollSelectorImpl.java:71)
> at sun.nio.ch.SelectorImpl.implCloseSelector(SelectorImpl.java:96)
> - locked <6ea10360> (a sun.nio.ch.Util$1)
> - locked <6ea10350> (a java.util.Collections$UnmodifiableSet)
> - locked <6ea10158> (a sun.nio.ch.PollSelectorImpl)
> at
> java.nio.channels.spi.AbstractSelector.close(AbstractSelector.java:91)
> at
> com.sun.grizzly.SelectorHandlerRunner.switchToNewSelector(SelectorHandlerRunner.java:519)
> at
> com.sun.grizzly.TCPSelectorHandler.workaroundSelectorSpin(TCPSelectorHandler.java:1520)
> - locked <58bee570> (a java.lang.Object)
> at
> com.sun.grizzly.SelectorHandlerRunner.doSelect(SelectorHandlerRunner.java:202)
> at
> com.sun.grizzly.SelectorHandlerRunner.run(SelectorHandlerRunner.java:130)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
> at java.lang.Thread.run(Thread.java:595)
>
>
> Oleksiy Stashok wrote:
>>
>> Hi Dennis,
>>
>> Yes, it looks like it could be the spin issue, but it should definitely
>> be checked on the original configuration where you observed the problem.
>> If that does not help, please try my original suggestion with the
>> "executePendingIOUsingSelectorThread" property.
>>
>> Thanks.
>>
>> WBR,
>> Alexey.
>>
>>
>> On Dec 18, 2009, at 17:22, ddooley wrote:
>>
>>>
>>> Hi Alexey,
>>>
>>> After treating HP-UX like Linux in the isLinux check in Controller.java
>>> (in the 1.9.19 snapshot code base), our QA engineer ran a test that
>>> made a total of 85 requests from 10 different threads (monitored by the
>>> profiler).
>>>
>>> This resulted in 12146 calls to getSpinRate(), 69 calls to
>>> resetSpinCounter(), and 20 calls to workaroundSelectorSpin() over the
>>> course of 55 seconds.
>>>
>>> Based on these results, in your opinion, would this qualify as the
>>> Linux selector spin issue, or would this be considered normal?
>>> Unfortunately for us, we don't have another system to run the test on;
>>> our application only runs on HP-UX.
>>>
>>> If there is any more detail you would like, please let me know. This
>>> test was run on a system that does not exhibit the CPU spikes that
>>> prompted the original posting. We are setting up another test on a
>>> system more like the client's that is experiencing the CPU spikes, to
>>> see if the results are any different.
>>>
>>> Thanks for your help. I really appreciate it.
>>>
>>> --Dennis
>>>
>>>
>>> Oleksiy Stashok wrote:
>>>>
>>>> Hi Dennis,
>>>>
>>>>> the System.getProperty("os.name") is "HP-UX". Our contact at HP states
>>>>> that the selector spin test that fails (see link below) is a Sun bug
>>>>> that will be fixed in 1.7, which makes me believe all the more that
>>>>> the issues are related. I wonder if HP-UX and Linux share the same Sun
>>>>> code base?
>>>>>
>>>>> Just for fun, I have modified the Grizzly Controller class to treat
>>>>> "HP-UX" the same as "linux", and we will be running some tests. I'll
>>>>> let you know the results.
>>>> Great, that would have been my next suggestion. :)
>>>> Please let me know the result.
>>>>
>>>> WBR,
>>>> Alexey.
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>>
>>
>>
>>
>>
>>
>
>