[gf-users] Re: _at_Schedule job being killed

From: Will Hartung <willh_at_mirthcorp.com>
Date: Fri, 12 Jun 2015 15:50:17 -0700

On Fri, Jun 12, 2015 at 2:08 PM, Steven Siebert <smsiebe_at_gmail.com> wrote:
> I only ask this because I didn't see you address it in your post, and your
> stack traces doesn't hint towards it: what does your job do?...does it
> directly (ie using a JPA entity manager) or indirectly (ie by a container
> managed EJB method) access Postgres?

It absolutely uses Postgres.

When everything is happy, it does involve, all told, raw JDBC, JPA, and JMS.

Looking closely, though, in this case, we don't get that far.

What I did see was, inevitably, an I/O error to Postgres, but not in
this thread.

[#|2015-06-12T12:30:00.007-0700|WARNING|glassfish3.1.2|com.mirth.ngshare.util.ConfigLock|_ThreadID=536;_ThreadName=Thread-2;|Unknown
exception trying to capture configuration lock for null|#]

That's our code whining there. The exception should have been logged,
but the logger statement was backward, so we eat that exception. Note
we're on thread 536

[#|2015-06-12T12:30:00.008-0700|INFO|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.gjc.spi|_ThreadID=536;_ThreadName=Thread-2;|RAR7115:
Unable to set ClientInfo for connection|#]

[#|2015-06-12T12:30:00.009-0700|SEVERE|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.enterprise.resource|_ThreadID=536;_ThreadName=Thread-2;|RAR5031:System
Exception
javax.resource.spi.LocalTransactionException: This connection has been closed.
    at com.sun.gjc.spi.LocalTransactionImpl.commit(LocalTransactionImpl.java:112)
    at com.sun.enterprise.resource.ConnectorXAResource.commit(ConnectorXAResource.java:124)
    at com.sun.enterprise.transaction.JavaEETransactionImpl.commit(JavaEETransactionImpl.java:501)
    at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.commit(JavaEETransactionManagerSimplified.java:855)
    at com.sun.ejb.containers.BaseContainer.completeNewTx(BaseContainer.java:5136)
    at com.sun.ejb.containers.BaseContainer.postInvokeTx(BaseContainer.java:4901)
    at com.sun.ejb.containers.BaseContainer.postInvoke(BaseContainer.java:2045)
    at com.sun.ejb.containers.BaseContainer.postInvoke(BaseContainer.java:1994)
    at com.sun.ejb.containers.BaseContainer.callEJBTimeout(BaseContainer.java:4088)
    at com.sun.ejb.containers.EJBTimerService.deliverTimeout(EJBTimerService.java:1832)
    at com.sun.ejb.containers.EJBTimerService.access$100(EJBTimerService.java:108)
    at com.sun.ejb.containers.EJBTimerService$TaskExpiredWork.run(EJBTimerService.java:2646)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.postgresql.util.PSQLException: This connection has been closed.
    at org.postgresql.jdbc2.AbstractJdbc2Connection.checkClosed(AbstractJdbc2Connection.java:822)
    at org.postgresql.jdbc2.AbstractJdbc2Connection.commit(AbstractJdbc2Connection.java:810)
    at com.sun.gjc.spi.LocalTransactionImpl.commit(LocalTransactionImpl.java:106)
    ... 16 more
|#]

[#|2015-06-12T12:30:00.010-0700|WARNING|glassfish3.1.2|javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=536;_ThreadName=Thread-2;|EJB5184:A
system exception occurred during an invocation on EJB
ReliableMessageJobBean, method: public void
com.mirth.ngshare.ejb.ReliableMessageJobBean.queuePendingMessages()|#]

[#|2015-06-12T12:30:00.010-0700|WARNING|glassfish3.1.2|javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=536;_ThreadName=Thread-2;|javax.ejb.EJBException:
Unable to complete container-managed transaction.
    at com.sun.ejb.containers.BaseContainer.completeNewTx(BaseContainer.java:5147)
    at com.sun.ejb.containers.BaseContainer.postInvokeTx(BaseContainer.java:4901)
    at com.sun.ejb.containers.BaseContainer.postInvoke(BaseContainer.java:2045)
    at com.sun.ejb.containers.BaseContainer.postInvoke(BaseContainer.java:1994)
    at com.sun.ejb.containers.BaseContainer.callEJBTimeout(BaseContainer.java:4088)
    at com.sun.ejb.containers.EJBTimerService.deliverTimeout(EJBTimerService.java:1832)
    at com.sun.ejb.containers.EJBTimerService.access$100(EJBTimerService.java:108)
    at com.sun.ejb.containers.EJBTimerService$TaskExpiredWork.run(EJBTimerService.java:2646)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: javax.transaction.SystemException
    at com.sun.enterprise.transaction.JavaEETransactionImpl.commit(JavaEETransactionImpl.java:518)
    at com.sun.enterprise.transaction.JavaEETransactionManagerSimplified.commit(JavaEETransactionManagerSimplified.java:855)
    at com.sun.ejb.containers.BaseContainer.completeNewTx(BaseContainer.java:5136)
    ... 12 more
Caused by: javax.transaction.xa.XAException:
javax.resource.spi.LocalTransactionException: This connection has been
closed.
    at com.sun.enterprise.resource.ConnectorXAResource.handleResourceException(ConnectorXAResource.java:115)
    at com.sun.enterprise.resource.ConnectorXAResource.commit(ConnectorXAResource.java:126)
    at com.sun.enterprise.transaction.JavaEETransactionImpl.commit(JavaEETransactionImpl.java:501)
    ... 14 more
|#]

So, that's all she wrote for Thread 536.

In the code, we have classic JDBC code, effectively:

try {
    con = ds.getConnection();
    ...
} finally {
        try {
            if (con != null) {
                con.close();
            }
        } catch (SQLException se) {
            //ignored
        }
}

The getConnection is failing, con should be null, so we shouldn't be
trying to close it.

But notably, we're shouldn't be leaking any exceptions that the
scheduler code should see.

All of the other JPA and JMS fun is within that try, but got skipped
when getConnection failed.

On a separate thread, immediately after, we do see the Postgres error
we were quite likely getting:

[#|2015-06-12T12:30:02.089-0700|INFO|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.gjc.spi|_ThreadID=379;_ThreadName=Thread-2;|RAR5106
: AutoCommit based validation detected invalid connection. Set
resource-adapter log-level to FINE for exception stack trace|#]

[#|2015-06-12T12:30:02.093-0700|INFO|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.gjc.spi|_ThreadID=379;_ThreadName=Thread-2;|RAR7115:
Unable to set ClientInfo for connection|#]

[#|2015-06-12T12:30:02.094-0700|INFO|glassfish3.1.2|org.eclipse.persistence.session.file:/opt/glassfish3/glassfish/domains/domain1/applications/SharePortal-ear-1.0-SNAPSHOT/lib/SharePortal-lib-1.0-SNAPSHOT.jar_portalPU|_ThreadID=379;_ThreadName=Thread-2;|Communication
failure detected when attempting to perform read query outside of a
transaction. Attempting to retry query. Error was: Exception
[EclipseLink-4002] (Eclipse Persistence Services -
2.3.2.v20111125-r10461):
org.eclipse.persistence.exceptions.DatabaseException
Internal Exception: org.postgresql.util.PSQLException: An I/O error
occured while sending to the backend.
Error Code: 0
Call: SELECT share_net_client_key, client_site_key_guid, is_connected,
is_new_client, last_connect_date, last_disconnect_date,
latest_remote_ip, latest_remote_port, share_net_ip, share_site_org_key
FROM share_net_client
Query: ReadAllQuery(referenceClass=ShareNetClient sql="SELECT
share_net_client_key, client_site_key_guid, is_connected,
is_new_client, last_connect_date, last_disconnect_date,
latest_remote_ip, latest_remote_port, share_net_ip, share_site_org_key
FROM share_net_client").|#]

[#|2015-06-12T12:30:02.094-0700|INFO|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.gjc.spi|_ThreadID=379;_ThreadName=Thread-2;|RAR5106
: AutoCommit based validation detected invalid connection. Set
resource-adapter log-level to FINE for exception stack trace|#]

[#|2015-06-12T12:30:02.096-0700|INFO|glassfish3.1.2|javax.enterprise.resource.resourceadapter.com.sun.gjc.spi|_ThreadID=379;_ThreadName=Thread-2;|RAR7115:
Unable to set ClientInfo for connection|#]

[#|2015-06-12T12:30:02.097-0700|INFO|glassfish3.1.2|org.eclipse.persistence.session.file:/opt/glassfish3/glassfish/domains/domain1/applications/SharePortal-ear-1.0-SNAPSHOT/lib/SharePortal-lib-1.0-SNAPSHOT.jar_portalPU|_ThreadID=379;_ThreadName=Thread-2;|Communication
failure detected when attempting to perform read query outside of a
transaction. Attempting to retry query. Error was: Exception
[EclipseLink-4002] (Eclipse Persistence Services -
2.3.2.v20111125-r10461):
org.eclipse.persistence.exceptions.DatabaseException

That's a completely different thread, unrelated to this processing.
But it does show that Postgres "went away" for the time being.
(Obviously some kind of infrastructure problem I need to look in to).

5 seconds later, we see the job being retried.

[#|2015-06-12T12:30:05.014-0700|WARNING|glassfish3.1.2|com.mirth.ngshare.util.ConfigLock|_ThreadID=569;_ThreadName=Thread-2;|Unknown
exception trying to capture configuration lock for null|#]

and since Postgres hasn't recovered in 5 seconds, more fireworks.

Here we see the timer code giving up on me completely:

[#|2015-06-12T12:30:05.016-0700|INFO|glassfish3.1.2|javax.enterprise.system.container.ejb.com.sun.ejb.containers|_ThreadID=569;_ThreadName=Thread-2;|EJB5119:Expunging
timer ['16@@1434056166645@@server@@domain1' 'TimedObject =
ReliableMessageJobBean' 'Application = SharePortal-ear-1.0-SNAPSHOT'
'BEING_DELIVERED' 'PERIODIC' 'Container ID = 93982301290954762' 'Thu
Jun 11 17:11:00 PDT 2015' '0' '0,15,30,45 # * # * # * # * # * # * #
null # null # null # true # queuePendingMessages # 0' ] after [2]
failed deliveries|#][

I guess it's set for 2 retries, not 3.

>
> First thought that crossed my mind was that your job is, through a container
> managed method, interacting with the database and (since you mentioned it's
> not using an XA connector itself) some other enterprise resource connector
> with an XA connector setup (perhaps something like JMS) - the PSQLException
> thrown during commit from a closed connection, being caught by the container
> and when the commit gets to the XAResource fails commit because of the
> previous commit failure, rolling back the tx and throwing a
> LocalTransactionException. The EJB timer should try to execute this 3 times
> by default (IIRC) and if getting an exception all 3 times kill that job.

I was thinking similarly, but like I said, it didn't make sense to me
that EJB Timer would XA with the EJB code. And if it tried, wouldn't
it complain since Postgres isn't XA?

Since I never got the connection (it failed outright), would that have
even been enlisted into, well, any transaction?

Now, to be fair, an EntityManager, while not used in this specific
case, IS allocated and injected. So perhaps it's trying to clean that
up, I don't know how lazy the EM is with it's connections.

> The closed PG connection could, for instance, result from a long-running job
> with a low connection timeout: Your EJB injects the entity manager on
> PostConstruct, does a query...does some long-running operation on the
> results, attempts to do something else with the db/commit after the
> configured connection timeout, and gets no love. The JDBC pool will only
> check the connection (if configured to do so) prior to injection -- it
> doesn't do anything fancy like provide a proxied connection to handle
> connection failures that may occur between the time it is retrieved from the
> pool and the time it is used by that code...though there are of course some
> things you can do both with JDBC config as well as the postgres proprietary
> driver config...

Another bit of insight: The bean in a Singleton, so my EM is injected
only once. But connection management should orbit the transaction
lifecycle, not the bean lifecycle (I would think). But it could have
easily tried to fetch a connection on transaction start (which
apparenlty did NOT fail if it did), and then try to close it on the
way out, even though it was unused.

So, getting down to brass tacks.

Is there any way to make my @Schedule bean more robust? While having
Postgres vanish off the net is not a good thing, ideally we should be
able to absorb that with a lot of complaining transactions until the
connectivity is restored. Unfortunate, but we can weather those.
Losing the @Schedule timer, though, requires we restart. Which is kind
of icky.

We already have a monitoring job. I'd truly like to not have to fall
back on EJB Timer and manage all of this myself. At least in Java 6 we
can have timer jobs that are not persistent (what a pain that was to
manage before -- start the container, check if jobs exist, submit them
if they don't, yada yada....just awful).

Can I boost the retry count? lengthen the retry delay? (is the 5s I
saw between attempts just an internal delay, or was that something
else)? I can, for example, lookup the EntityManager via JNDI each time
rather than inject it if I need to. I always strive to keep my job
methods from leaking, well, anything -- I catch Throwable and log it,
for example.

Thank you again for the quick response.

Regards,

Will Hartung
(willh_at_mirthcorp.com)

-- 
This message, and any documents attached hereto, may contain confidential 
or proprietary information intended only for the use of the addressee(s) 
named above or may contain information that is legally  privileged. If you 
are not the intended addressee, or the person responsible for delivering it 
to the intended addressee, you are hereby notified that reading, 
disseminating, distributing or copying this message is strictly prohibited. 
If you have received this message by mistake, please immediately notify us 
by replying to the message and delete the original message and any copies 
immediately thereafter.  Thank you for your cooperation.