|
|
|
|||||
|
IceGrid::ActivationTimedOut
Hi,
I'm having some problems with IceGrid. I have written a program to ask IceGrid to start and stop servers. All servers are configured with activation=manual in the node. Occasionally I see a server activation time out. At the moment I'm just testing with super-simple dummy servers, so I'm pretty certain there's no problem in the server itself. The servers always start instantly when I run them manually. The symptom is that when my program calls startServer(), it hangs for 15sec (=IceGrid.Node.WaitTime), then throws a ServerStartException with reason="The server activation timed out". The state of the server (as reported by getServerState()) then shifts to ActivationTimedOut. Subsequent calls to startServer() return immediately with the same exception. When I look at the output of the server, it seems that it did start normally. So my questions are: 1) What are the criteria for successful server activation? Ie what is IceGrid looking for that it's not seeing? The documentation for IceGrid seems a little thin on this kind of stuff. 2) What can my program do to recover from this? Ie what can I do when the server state is ActivationTimedOut? Thanks very much, Alex
__________________
Alex Brooks Australian Centre for Field Robotics University of Sydney www.cas.edu.au project: orca-robotics.sf.net |
|
||||||
|
Hi,
IceGrid waits for all the server object adapters with the attribute "server-lifetime" set to true (the default) to be activated to change the state of the server to "Active". If these object adapters are not activated within the timeout period, IceGrid considers that the server activation timed out. Does your server have registered adapters which are not activated or only activated after some time? If that's the case, you should set the server-lifetime attribute to "false" in the deployment descriptor if you don't want IceGrid to wait for the activation of these adapters. See here for more information on this attribute. Cheers, Benoit. |
|
|||||
|
Hi Benoit,
OK, thanks. This explains what IceGrid is looking for, but I don't understand why it's seeing a problem. My servers have exactly one adapter, which they activate almost immediately. I have trace statements either side of the call to 'activate()', so I'm sure that this is being executed without an exception being thrown. I don't see anything interesting in the node's output, with IceGrid.Node.Trace.Activator=2. Do you have any ideas what else I might look for? Alex P.S. Also, I have one request for IceGrid Exceptions: Would it make sense to output the 'reason' field in the streaming operator? For several exception types I just want to print a meaningful error message, then have the code take the same action. For eg IceGrid::ServerStartException, the streaming operator simply outputs "ServerStartException". However, more information is available in the 'reason' field of ServerStartException. The same is true for other exception types, eg. NodeUnreachableException. To get a meaningful error message I have to catch each exception type explicitly and print the 'reason' field manually.]
__________________
Alex Brooks Australian Centre for Field Robotics University of Sydney www.cas.edu.au project: orca-robotics.sf.net |
|
|||||
|
Sorry for the confusion... I must have gotten confused with all the server starting and stopping and been looking at the wrong startServer instance (I have to run things for about half an hour on average to reproduce this fault): the server does _not_ activate successfully.
The server hangs on the activate() call. Then IceGrid does the correct thing and terminates the failing-to-activate server. So the question is why the server hangs on activate(). I have two nodes: xray and yankee. The xray node is colocated with the registry. Everything is running locally, in order can rule out the network as a problem. I have seen the problem on both xray and yankee. I set the trace levels you suggested, and see (long paths removed): Code:
[ 07/02/07 23:58:12.406 Server: changed server `xray.hibrick' state to `Activating' ] [ 07/02/07 23:58:12.406 Activator: activating server `xray.hibrick' path = xxxxx/TestSystem/orca2/bin/brick pwd = xxxxx/test uid/gid = 1000/1000 args = xxxxx/TestSystem/orca2/bin/brick --Ice.Config=xxxxx/nodedb/servers/xray.hibrick/config/config ] [ 07/02/07 23:58:12.441 Server: changed server `xray.hibrick' state to `WaitForActivation' ] orca: Ice version: 3.2.0 libOrcaIce version: 2.3.0+ orca: Brick: Loaded command line properties orca: Brick: Loaded component properties from 'xxxxx/test/nodedb/servers/xray.hibrick/config/config' orca: Brick: Loaded global properties from '/home/users/a.brooks/.orcarc' orca: Brick: Loaded factory default properties [ 07/02/07 23:58:12.459 Locator: registered server `yankee.lowbrick' process proxy: `8E2EC594-2668-4420-9E1E-8BFC36A93AC2 -t:tcp -h 172.17.7.39 -p 45210' ] [ 07/02/07 23:58:27.414 Server: changed server `xray.hibrick' state to `ActivationTimeout' ] [ 07/02/07 23:58:27.414 Server: server `xray.hibrick' activation timed out ] [ 07/02/07 23:58:27.414 Adapter: server `xray.hibrick' adapter `xray/hibrick' activation failed: server activation timed out ] ... [ 07/02/07 23:58:29.481 Server: changed server `xray.hibrick' state to `Deactivating' ] [ 07/02/07 23:58:29.481 Server: changed server `xray.hibrick' state to `DeactivatingWaitForProcess' ] [ 07/02/07 23:58:44.484 Activator: sent SIGKILL to server `xray.hibrick' (pid = 15050) ] [ 07/02/07 23:58:44.486 Activator: detected termination of server `xray.hibrick' signal = SIGKILL ] [ 07/02/07 23:58:44.486 Adapter: server `xray.hibrick' adapter `xray/hibrick' deactivated ] [ 07/02/07 23:58:44.487 Server: changed server `xray.hibrick' state to `Inactive' ] I think I can rule out configuration issues like name mismatches since it works 99% of the time. For comparison, here's what I see for a normal server activation: Code:
[ 07/02/07 23:58:10.233 Server: changed server `xray.midbrick' state to `Activating' ] [ 07/02/07 23:58:10.233 Activator: activating server `xray.midbrick' path = xxxxx/test/nodedb/distrib/TestSystem/orca2/bin/brick pwd = xxxxx/test uid/gid = 1000/1000 args = xxxxx/test/nodedb/distrib/TestSystem/orca2/bin/brick --Ice.Config=xxxxx/test/nodedb/servers/xray.midbrick/config/config ] orca: Ice version: 3.2.0 libOrcaIce version: 2.3.0+ orca: Brick: Loaded command line properties orca: Brick: Loaded component properties from 'xxxxx/test/nodedb/servers/xray.midbrick/config/config' orca: Brick: Loaded global properties from '/home/users/a.brooks/.orcarc' orca: Brick: Loaded factory default properties [ 07/02/07 23:58:10.279 Server: changed server `xray.midbrick' state to `WaitForActivation' ] [ 07/02/07 23:58:10.294 Server: changed server `xray.midbrick' state to `Active' ] [ 07/02/07 23:58:10.294 Adapter: server `xray.midbrick' adapter `xray/midbrick' activated: dummy -t:tcp -h 172.17.7.39 -p 37855 ] [ 07/02/07 23:58:10.300 Locator: registered adapter `xray/midbrick' endpoints: `dummy -t:tcp -h 172.17.7.39 -p 37855' ] [ 07/02/07 23:58:10.303 Locator: registered server `xray.midbrick' process proxy: `99B8A2DD-D9CE-4FA8-A91E-D10142943AA8 -t:tcp -h 172.17.7.39 -p 37855' ] Thanks, Alex
__________________
Alex Brooks Australian Centre for Field Robotics University of Sydney www.cas.edu.au project: orca-robotics.sf.net |
|
||||||
|
From your traces:
Code:
[ 07/02/07 23:58:12.441 Server: changed server `xray.hibrick' state to `WaitForActivation' ] ... [ 07/02/07 23:58:12.459 Locator: registered server `yankee.lowbrick' process proxy: `8E2EC594-2668-4420-9E1E-8BFC36A93AC2 -t:tcp -h 172.17.7.39 -p 45210' ] I'm afraid, I don't see why the object adapter activation would hang. The only possible cause would be a network issue. For example, the server could try to connect to an unreachable address to register the object adapter endpoints with the locator registry (setting Ice.Trace.Network=2 in the server configuration file will show if that's the case). If you're sure there's no network issues, the best way to look into this would be to reproduce the hang and:
Or, to make it simple, just get the strack traces of the server, registry and node when you get the hang I'll be happy to look at these traces to see if I can figure out what's wrong.Also, could you remind us which platform/OS you're using? And do you use the IceGrid registry replication? Thanks, Cheers, Benoit. |
|
|||||
|
After an uphill battle including working around a gdb bug, I've characterised this much better:
I don't think it's an IceGrid-related problem. I can reproduce the problem by starting my server from a set of scripts which just 'manually' start and stop the server repeatedly (ie not through a node). It's just so infrequent that I'd never noticed it until I started stress-testing my IceGrid-based server starter/stopper. The problem seems to occur when my server tries to simultaneously call activate() and IceGrid::RegistryPrx::checkedCast(), from two different threads. Some kind of deadlock occurs occasionally, which causes the program to hang forever (note: I have small values set for Ice.Override.Timeout and Ice.Override.ConnectTimeout). I've tried to reproduce the problem by modifying one of the Ice demo programs, but no luck yet. I'm not sure if this is because there's something fundamentally different about my server or just because the timings are particular. I'll keep walking my program towards the demo code and vice-versa, but at the same time I wonder if you have any ideas about what might be going on? I think it should be OK to make these two calls simultaneously, shouldn't it? The stack-trace of my program is attached below. And to answer your questions: - I'm running Linux (debian etch) on i386 hardware. - I'm not using any registry replication. Thanks again, Alex Code:
(gdb) thread 1
[Switching to thread 1 (Thread -1492038784 (LWP 4204))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa711f5d6 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/i686/cmov/libpthread.so.0
#2 0xa721e14d in pthread_cond_wait () from /lib/i686/cmov/libc.so.6
#3 0xa7e48053 in IceUtil::Cond::waitImpl<IceUtil::RecMutex> (this=0x806c104, mutex=@0x806c134)
at ../../include/IceUtil/Cond.h:203
#4 0xa7e48123 in IceUtil::Monitor<IceUtil::RecMutex>::wait (this=0x806c104)
at ../../include/IceUtil/Monitor.h:152
#5 0xa7e457ff in IceInternal::ObjectAdapterFactory::waitForShutdown (this=0x806c0f8)
at ObjectAdapterFactory.cpp:67
#6 0xa7dbb368 in Ice::CommunicatorI::waitForShutdown (this=0x806a550) at CommunicatorI.cpp:119
#7 0xa7725e6b in orcaice::Application::run (this=0xaf971a24, argc=2, argv=0xaf971ad4)
at /home/users/a.brooks/orca2/src/libs/orcaice/application.cpp:208
#8 0xa773a2db in orcaice::IceApplication::main (this=0xaf971a24, argc=2, argv=0xaf971ad4,
initData=@0xaf9718d4) at /home/users/a.brooks/orca2/src/libs/orcaice/iceapplication.cpp:369
#9 0xa77270a2 in orcaice::Application::main (this=0xaf971a24, argc=2, argv=0xaf971ad4)
at /home/users/a.brooks/orca2/src/libs/orcaice/application.cpp:149
#10 0x0804c2ba in main (argc=2, argv=0xaf971ad4)
at /home/users/a.brooks/orca2/src/tests/brick/main.cpp:23
(gdb) thread 2
[Switching to thread 2 (Thread -1525617776 (LWP 4209))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa7121dbe in __lll_mutex_lock_wait () from /lib/i686/cmov/libpthread.so.0
#2 0xa711dad4 in _L_mutex_lock_414 () from /lib/i686/cmov/libpthread.so.0
#3 0xa711d632 in pthread_mutex_lock () from /lib/i686/cmov/libpthread.so.0
#4 0xa721e2b6 in pthread_mutex_lock () from /lib/i686/cmov/libc.so.6
#5 0xa7c6c192 in IceUtil::RecMutex::lock (this=0x806c134) at RecMutex.cpp:243
#6 0xa7e46f94 in IceUtil::Monitor<IceUtil::RecMutex>::lock (this=0x806c104)
at ../../include/IceUtil/Monitor.h:89
#7 0xa7e46fdd in LockT (this=0xa510dc00, mutex=@0x806c104) at ../../include/IceUtil/Lock.h:49
#8 0xa7e44bc6 in IceInternal::ObjectAdapterFactory::findObjectAdapter (this=0x806c0f8,
proxy=@0xa510dc64) at ObjectAdapterFactory.cpp:172
#9 0xa7e7a2e0 in IceProxy::Ice::Object::__getDelegate (this=0x806c244) at Proxy.cpp:1052
#10 0xa7e354c8 in IceProxy::Ice::Locator::getRegistry (this=0x806c240, __ctx=0x0)
at Locator.cpp:1100
#11 0xa7e2c325 in IceProxy::Ice::Locator::getRegistry (this=0x806c240)
at ../../include/Ice/Locator.h:466
#12 0xa7e240d7 in IceInternal::LocatorInfo::getLocatorRegistry (this=0x806cd58)
at LocatorInfo.cpp:296
#13 0xa7e49373 in Ice::ObjectAdapterI::updateLocatorRegistry (this=0x806c568,
locatorInfo=@0xa510dfbc, proxy=@0xa510dfcc, registerProcess=false) at ObjectAdapterI.cpp:1113
#14 0xa7e4d86d in Ice::ObjectAdapterI::activate (this=0x806c568) at ObjectAdapterI.cpp:113
#15 0xa772ac34 in orcaice::Component::activate (this=0xaf9719dc)
at /home/users/a.brooks/orca2/src/libs/orcaice/component.cpp:209
#16 0xa7735a70 in orcaice::Context::activate (this=0x806d010)
at /home/users/a.brooks/orca2/src/libs/orcaice/context.cpp:42
#17 0x0804e237 in brick::Handler::run (this=0x806cfd0)
at /home/users/a.brooks/orca2/src/tests/brick/handler.cpp:47
#18 0xa7c6e203 in startHook (arg=0x806cfd0) at Thread.cpp:357
#19 0xa711b31b in start_thread () from /lib/i686/cmov/libpthread.so.0
#20 0xa72118ee in clone () from /lib/i686/cmov/libc.so.6
(gdb) thread 3
[Switching to thread 3 (Thread -1517225072 (LWP 4208))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa7121dbe in __lll_mutex_lock_wait () from /lib/i686/cmov/libpthread.so.0
#2 0xa711dabc in _L_mutex_lock_88 () from /lib/i686/cmov/libpthread.so.0
#3 0xa711d4ad in pthread_mutex_lock () from /lib/i686/cmov/libpthread.so.0
#4 0xa721e2b6 in pthread_mutex_lock () from /lib/i686/cmov/libc.so.6
#5 0xa7dc96d4 in IceUtil::Mutex::lock (this=0x806c250) at ../../include/IceUtil/Mutex.h:305
#6 0xa7dde5fd in LockT (this=0xa590ea8c, mutex=@0x806c250) at ../../include/IceUtil/Lock.h:49
#7 0xa7e7a21a in IceProxy::Ice::Object::__getDelegate (this=0x806c244) at Proxy.cpp:1042
#8 0xa7e35934 in IceProxy::Ice::Locator::findObjectById (this=0x806c240, id=@0xa4800504, __ctx=0x0)
at Locator.cpp:1050
#9 0xa7e2b1d2 in IceProxy::Ice::Locator::findObjectById (this=0x806c240, id=@0xa4800504)
at ../../include/Ice/Locator.h:436
#10 0xa7e26447 in IceInternal::LocatorInfo::getEndpoints (this=0x806cd58, ref=@0xa590ee6c, ttl=-1,
cached=@0xa590ee5f) at LocatorInfo.cpp:351
#11 0xa7e4dec0 in Ice::ObjectAdapterI::isLocal (this=0x806c568, proxy=@0xa590ef74)
at ObjectAdapterI.cpp:596
#12 0xa7e44c39 in IceInternal::ObjectAdapterFactory::findObjectAdapter (this=0x806c0f8,
proxy=@0xa590ef74) at ObjectAdapterFactory.cpp:183
#13 0xa7e7a2e0 in IceProxy::Ice::Object::__getDelegate (this=0xa4800548) at Proxy.cpp:1052
#14 0xa7e7b63f in IceProxy::Ice::Object::ice_isA (this=0xa4800548, typeId=@0xa7685768, context=0x0)
at Proxy.cpp:139
#15 0xa7e7f6f8 in IceProxy::Ice::Object::ice_isA (this=0xa4800548, typeId=@0xa7685768)
at ../../include/Ice/Proxy.h:96
#16 0xa7746192 in IceInternal::checkedCastImpl<IceInternal::ProxyHandle<IceProxy::IceGrid::Registry> > (b=@0xa590f1e4, context=0x0) at /opt/Ice-3.2/include/Ice/Proxy.h:436
#17 0xa7746298 in IceInternal::checkedCastHelper<IceProxy::IceGrid::Registry, IceProxy::Ice::Object>
(b=@0xa590f1e4, ctx=0x0) at /opt/Ice-3.2/include/Ice/ProxyHandle.h:91
#18 0xa77462db in IceInternal::ProxyHandle<IceProxy::IceGrid::Registry>::checkedCast<IceProxy::Ice::Object> (r=@0xa590f1e4) at /opt/Ice-3.2/include/Ice/ProxyHandle.h:244
#19 0xa7743838 in orcaice::ComponentThread::tryRegisterHome (this=0x806d730)
at /home/users/a.brooks/orca2/src/libs/orcaice/detail/componentthread.cpp:84
#20 0xa77442c5 in orcaice::ComponentThread::run (this=0x806d730)
at /home/users/a.brooks/orca2/src/libs/orcaice/detail/componentthread.cpp:51
#21 0xa7c6e203 in startHook (arg=0x806d730) at Thread.cpp:357
#22 0xa711b31b in start_thread () from /lib/i686/cmov/libpthread.so.0
#23 0xa72118ee in clone () from /lib/i686/cmov/libc.so.6
(gdb) thread 4
[Switching to thread 4 (Thread -1508832368 (LWP 4207))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa7211f56 in epoll_wait () from /lib/i686/cmov/libc.so.6
#2 0xa7eb9447 in IceInternal::ThreadPool::run (this=0x806ca30) at ThreadPool.cpp:462
#3 0xa7ebb058 in IceInternal::ThreadPool::EventHandlerThread::run (this=0x806c7a8)
at ThreadPool.cpp:1066
#4 0xa7c6e203 in startHook (arg=0x806c7a8) at Thread.cpp:357
#5 0xa711b31b in start_thread () from /lib/i686/cmov/libpthread.so.0
#6 0xa72118ee in clone () from /lib/i686/cmov/libc.so.6
(gdb) thread 5
[Switching to thread 5 (Thread -1500439664 (LWP 4206))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa711f85c in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/i686/cmov/libpthread.so.0
#2 0xa721e194 in pthread_cond_timedwait () from /lib/i686/cmov/libc.so.6
#3 0xa7de09b5 in IceUtil::Cond::timedWaitImpl<IceUtil::Mutex> (this=0x806c44c, mutex=@0x806c47c,
timeout=@0x806c49c) at ../../include/IceUtil/Cond.h:224
#4 0xa7de0aaa in IceUtil::Monitor<IceUtil::Mutex>::timedWait (this=0x806c44c, timeout=@0x806c49c)
at ../../include/IceUtil/Monitor.h:180
#5 0xa7de0bfc in IceInternal::ConnectionMonitor::run (this=0x806c428) at ConnectionMonitor.cpp:78
#6 0xa7c6e203 in startHook (arg=0x806c428) at Thread.cpp:357
#7 0xa711b31b in start_thread () from /lib/i686/cmov/libpthread.so.0
#8 0xa72118ee in clone () from /lib/i686/cmov/libc.so.6
(gdb) thread 6
[Switching to thread 6 (Thread -1492046960 (LWP 4205))]#0 0xffffe410 in __kernel_vsyscall ()
(gdb) bt
#0 0xffffe410 in __kernel_vsyscall ()
#1 0xa716b35e in ?? () from /lib/i686/cmov/libc.so.6
#2 0xa716b3ff in sigwait () from /lib/i686/cmov/libc.so.6
#3 0xa7c5276f in sigwaitThread () at CtrlCHandler.cpp:124
#4 0xa711b31b in start_thread () from /lib/i686/cmov/libpthread.so.0
#5 0xa72118ee in clone () from /lib/i686/cmov/libc.so.6
(gdb) thread 7
Thread ID 7 not known.
(gdb)
__________________
Alex Brooks Australian Centre for Field Robotics University of Sydney www.cas.edu.au project: orca-robotics.sf.net |
|
||||||
|
Hi Alex,
I'm afraid you found a deadlock in the Ice adapter activation code . We'll look into fixing this.You should be able to work-around the problem by explicitly disabling collocation optimization for the locator proxy. If you use Ice 3.2.0, you can disable collocation optimization for the locator proxy by setting the following property: You can also disable it with the following code (you should do this before creating the object adapter): Code:
communicator->setDefaultLocator(Ice::LocatorPrx::uncheckedCast(communicator->getDefaultLocator()->ice_collocationOptimized(false))); Benoit. |
|
|||||
|
Hi Benoit,
Great, I'm relieved that I can stop tracking this thing down! I used the second of your two suggestions, and my test system has been running for a few hours now without any failures. Thanks for all the help, Alex
__________________
Alex Brooks Australian Centre for Field Robotics University of Sydney www.cas.edu.au project: orca-robotics.sf.net |
![]() |
| Currently Active Users Viewing This Thread: 1 (0 members and 1 guests) | |
| Thread Tools | |
| Display Modes | Rate This Thread |
|
|