Code:
Ice: /opt/Ice-3.3.0
OS1: Mac 10.4 Darwin mac 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
OS2: Linux necromancer 2.6.23-gentoo-r8 #2 SMP Tue Feb 19 19:27:38 GMT 2008 x86_64 Intel(R) Xeon(R) CPU X5355 @ 2.66GHz GenuineIntel GNU/Linux

While writing a restart script for our IceGrid installation, I periodically got the following errors:

Code:
icegridnode: failure occurred in daemon: service caught unhandled Ice exception: MapI.cpp:1211: Freeze::DatabaseException: Db::put: Operation not permitted

or

Code:
icegridnode: failure occurred in daemon: service caught unhandled Ice exception: SharedDbEnv.cpp:546: Freeze::DatabaseException: DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery

The grid is a single node with a collocated registry and several servers. After calling "icegridadmin -e 'node shutdown master'", a call is made to "icegridnode --deploy app.xml", causing the failure.
It wasn't straightforward to come up with an exact reproducing test case, because of a rather large Java process which takes significant time to shut down.
To simulate that, I had to add ThreadControl::sleep() calls and run "icegridnode --deploy" several times in the background, which we obviously are not doing in production. Even then, the following script does not always fail with Freeze errors, but it does so about 20% of the time.
Code:
set -e
set -x

##
## Cleanup from previous
##
killall icegridnode && {
    sleep 3
    killall -9 icegridnode || echo good
} || echo None found

## rm -rf var

#
# Using the pid for the app doesn't work since
# failed calls overwrite the pid before they are
# successful.
#
# test -e var/test.pid && kill -9 `cat var/test.pid` || echo Already stopped
# test -e var/app.pid && kill -9 `cat var/app.pid` || echo Already stopped

##
## Create test files
##
mkdir -p var etc lib

cat<<EOF > etc/app.xml
<icegrid>
  <application name="test">
    <node name="test">
      <server id="test" exe="lib/test" activation="always">
        <adapter name="adapter" register-process="true" endpoints="tcp"/>
      </server>
    </node>
  </application>
</icegrid>
EOF

cat<<EOF > etc/app.cfg
IceGrid.Registry.Client.Endpoints=tcp -h 127.0.0.1 -p 4061
IceGrid.Registry.Server.Endpoints=tcp -h 127.0.0.1
IceGrid.Registry.Internal.Endpoints=tcp -h 127.0.0.1
## IceGrid.Registry.SessionManager.Endpoints=tcp -h 127.0.0.1
IceGrid.Registry.Data=var
IceGrid.Registry.DynamicRegistration=0
IceGrid.Registry.PermissionsVerifier=IceGrid/NullPermissionsVerifier
IceGrid.Registry.AdminPermissionsVerifier=IceGrid/NullPermissionsVerifier
IceGrid.Node.CollocateRegistry=1
IceGrid.Node.Endpoints=tcp -h 127.0.0.1
IceGrid.Node.Name=test
IceGrid.Node.Data=var
Ice.StdOut=var/out
Ice.StdErr=var/err
EOF

cat<<EOF > etc/internal.cfg
Ice.Default.Locator=IceGrid/Locator:tcp -h 127.0.0.1 -p 4061
IceGridAdmin.Username=root
IceGridAdmin.Password=password
EOF

cat<<EOF > lib/test.cpp
#include <fstream>
#include <iostream>
#include <Ice/Ice.h>
#include <IceUtil/Time.h>
#include <IceUtil/Thread.h>

using namespace std;

int main(int argc, char* argv[])
{
    ofstream out;
    out.open("var/test.pid");
    out << getpid();
    out.close();

    int status = 0;
    Ice::CommunicatorPtr ic;
    ic = Ice::initialize(argc, argv);
    Ice::ObjectAdapterPtr adapter = ic->createObjectAdapter("adapter");
    ic->waitForShutdown();

    IceUtil::ThreadControl self;
    self.sleep(IceUtil::Time::seconds(6));
    ic->destroy();
    cout << "finished" << endl;
    return 0;
}
EOF

CXXFLAGS="-I/opt/Ice-3.3.0/include" LDFLAGS="-L/opt/Ice-3.3.0/lib -lIce -lIceUtil" make lib/test

start(){
    icegridnode --nochdir --daemon --pidfile var/app.pid --Ice.Config=etc/internal.cfg,etc/app.cfg --deploy etc/app.xml
}

test(){
    echo -n "."
    RESULT=`start 2>&1 || true`
    echo $RESULT | grep Freeze
}

start
icegridadmin --Ice.Config=etc/internal.cfg -e "node shutdown test"

# Causes the same problem:
# FIRST=`cat var/app.pid`
# kill $FIRST&

set +x
test& test& test& test& test& test& test& test
wait

Even if such concurrent usage cannot be supported through some form of locking, it would be helpful if someone could explain why, in the above script, multiple calls seem to "recover" (i.e. calling test.sh after a failing call to test.sh may succeed), while in my production use, once the database is corrupt, I'm forced to "rm -rf var" before icegridnode will succeed.
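For what it's worth, on the script side the concurrent starts could be serialized with something like the following. This is only a rough sketch: with_lock is a hypothetical helper of my own (nothing provided by Ice), and it relies on mkdir being atomic, so only one caller at a time can create the lock directory.

```shell
# Hypothetical helper (not part of Ice): run a command while holding a
# directory-based lock. Usage: with_lock LOCKDIR TRIES COMMAND [ARGS...]
# mkdir either creates the directory or fails, so it acts as an atomic
# test-and-set. Gives up after TRIES failed attempts, about one second apart.
with_lock() {
    lockdir=$1
    tries=$2
    shift 2
    while ! mkdir "$lockdir" 2>/dev/null; do
        tries=$((tries - 1))
        if [ "$tries" -le 0 ]; then
            echo "could not acquire $lockdir" >&2
            return 1
        fi
        sleep 1
    done
    # Run the command, remember its exit status, then release the lock.
    status=0
    "$@" || status=$?
    rmdir "$lockdir"
    return $status
}
```

In the script above this would be used as "with_lock var/start.lock 30 start" in place of the bare "start" call. Note that it only serializes the start commands themselves; it does not wait for a node that is still shutting down, so I would not expect it to cover every case I am seeing.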
Hope all of that can be of some use, but let me know if you need more information.
~Josh
P.S. As mentioned in the script, something I encountered while testing this was that the pidfile for icegridnode is overwritten even if the previous icegridnode is still active. I'm not sure if this is the intended behavior.
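Because of that, a restart script probably has to remember the old pid itself before asking the node to shut down, and then wait for that process to disappear before starting a new one. A minimal sketch of what I mean, assuming the pid is read from var/app.pid before the shutdown is requested (wait_for_exit is a hypothetical helper name, not an Ice tool):

```shell
# Hypothetical helper: wait until the process with the given pid is gone.
# Usage: wait_for_exit PID SECONDS
# Returns 0 once "kill -0" can no longer signal the process, or 1 if it is
# still alive after roughly SECONDS seconds. Caveat: "kill -0" also fails
# for a process owned by another user, which would be treated as "gone".
wait_for_exit() {
    pid=$1
    remaining=$2
    while kill -0 "$pid" 2>/dev/null; do
        if [ "$remaining" -le 0 ]; then
            return 1
        fi
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

The restart would then read the old pid with OLD=`cat var/app.pid`, run the "node shutdown" command, and only call start once "wait_for_exit $OLD 60" has returned successfully.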
