[Please read part 2 here.]
Our OC4J instance would restart itself every once in a while - no culprits found looking through unix' history. My next thought was to check the crontab but my user doesn't have access to that. Weighing bureaucracy (soliciting the unix admins to check it or give me access) versus pursuing other means of inquiry: of course I chose the latter!
From the opmnlog file:
13/02/25 10:48:15 [ons-connect] Passive connection 16438,xxx.xxx.xxx.xxx,6233 missing connect
13/02/25 11:45:28 [ons-connect] Passive connection 1667c, xxx.xxx.xxx.xxx,6233 missing connect
13/02/25 12:06:18 [ons-connect] Passive connection 16751, xxx.xxx.xxx.xxx,6233 missing connect
13/02/25 14:22:19 [libopmnoc4j] OC4J ping attempt timed out
13/02/25 14:22:19 [libopmnoc4j] Process Ping Failed: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:23:09 [libopmnoc4j] OC4J ping attempt timed out
13/02/25 14:23:09 [libopmnoc4j] Process Ping Failed: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:23:59 [libopmnoc4j] OC4J ping attempt timed out
13/02/25 14:23:59 [libopmnoc4j] Process Ping Failed: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:23:59 [libopmnoc4j] Process Unreachable: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:23:59 [pm-process] Restarting Process: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:28:59 [pm-process] Stopping Process: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:28:59 [libopmnoc4j] Forcefully Terminating Process: default_group~oc4j_esbas~default_group~1 (1786977551:2786)
13/02/25 14:29:01 [pm-process] Starting Process: default_group~oc4j_esbas~default_group~1 (1786977551:0)
13/02/25 14:30:03 [pm-process] Process Alive: default_group~oc4j_esbas~default_group~1 (1786977551:9531)
Classic Ping-Echo tactic to satisfy quality attributes related to availability in system architectures.
Anything could be causing those ping timeouts, but at least now I know I don't have to deal with the sysadmins. Neat.
In the instance's message log I see a few instances of the following occurring at around the same timeframe:
<MSG_TEXT>Failed to commit transaction</MSG_TEXT>
<SUPPL_DETAIL><![CDATA[oracle.tip.esb.server.common.exceptions.BusinessEventRetriableException: Failed to commit transaction; transaction status is "6"
Caused by: javax.transaction.RollbackException: Timed out
... 6 more
This message eventually turned out to be a dead end. My immediate hypothesis is that the server is under load and thus transactions are starting to hang and being rolled back, as well as being unable to respond to OPMN's ping requests. Threads can get stuck due to many different causes, such as waiting for some I/O to complete, or entering a loop without exit condition, or competing for shared resources with other threads, etc. To find out what exactly was happening, I first need to identify the components involved in this transaction. And this is where I stumbled on my first roadblock. How do I trace this message back to a component. It is obvious that the CORRELATION_DATAelement is somehow pointing me in some direction, but how do I interpret its UNIQUE_ID? I have yet to find a clear description of this element's contents. From Oracle:
Using ODL, diagnostic messages are written to log files using XML format and each message includes a HEADER element containing information about the message, optionally a CORRELATION_DATA element containing information to assist in correlating messages across components, and a PAYLOAD element containing the message text including optional arguments and associated values.
... and that's it. My hypothesis is that the UNIQUE_IDelement's text value is a string of values delimited by colons, and each value serves as a unique identifier for every instance of any components involved in the handling of this message or operation. So far I have not been able to prove this. I haven't conclusively linked any of those values to any BPEL or ESB instance ID. If anyone out there can point to a clear description of the CORRELATION_DATAelement and what those values above would mean, it would be very welcome!
Next thought: monitor GC performance. It is possible that the garbage collecting mechanism is occasionally interfering with the OC4J instance's ability to respond to OPMN's ping requests. If, say, a full garbage collection is taking place (which can potentially last up to tens of seconds) while OPMN's ping arrives, then yes, a timeout can definitely happen.
Which is why I have added these flags to the process' java-optionsin opmn.xml:
-XX:-PrintGCDetails -XX:-PrintGCTimeStamps -Xloggc:gc.log
Now it's just a matter of waiting for the next event...