Thread: Sporadic Freeze errors on concurrent icegridnode access

  1. #1
    joshmoore, Registered User
    Name: Josh Moore
    Organization: Glencoe Software, Inc.
    Project: OMERO, http://trac.openmicroscopy.org.uk/omero
    Join Date: Feb 2007
    Location: Germany
    Posts: 115

    Code:
    Ice: /opt/Ice-3.3.0
    OS1: Mac 10.4 Darwin mac 8.11.1 Darwin Kernel Version 8.11.1: Wed Oct 10 18:23:28 PDT 2007; root:xnu-792.25.20~1/RELEASE_I386 i386 i386
    OS2: Linux necromancer 2.6.23-gentoo-r8 #2 SMP Tue Feb 19 19:27:38 GMT 2008 x86_64 Intel(R) Xeon(R) CPU X5355 @ 2.66GHz GenuineIntel GNU/Linux
    While writing a restart script for our IceGrid installation, I periodically got the following errors:
    Code:
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    MapI.cpp:1211: Freeze::DatabaseException:
    Db::put: Operation not permitted
    or
    Code:
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY:
    Fatal error, run database recovery
    The grid is a single node with a collocated registry and several servers. After calling "icegridadmin -e 'node shutdown master'", a call is made to "icegridnode --deploy app.xml" causing the failure.

    It wasn't straightforward to come up with an exact reproducing test case, because of a rather large Java process which takes significant time to shut down.

    To simulate that, I had to add ThreadControl::sleep() calls and run "icegridnode --deploy" several times in the background, which we obviously are not doing in production. Even then, the following script will not always fail with Freeze errors, but it does so about 20% of the time.

    Code:
    set -e
    set -x
    
    ##
    ## Cleanup from previous
    ##
    killall icegridnode && {
      sleep 3
      killall -9 icegridnode || echo good
    } || echo None found
    
    ## rm -rf var
    
    #
    # Using the pid for the app doesn't work since
    # failed calls overwrite the pid before they are
    # successful.
    #
    # test -e var/test.pid && kill -9 `cat var/test.pid` || echo Already stopped
    # test -e var/app.pid && kill -9 `cat var/app.pid` || echo Already stopped
    
    ##
    ## Create test files
    ##
    mkdir -p var etc lib
    cat<<EOF > etc/app.xml
    <icegrid>
      <application name="test">
        <node name="test">
         <server id="test" exe="lib/test" activation="always">
            <adapter name="adapter" register-process="true" endpoints="tcp"/>
          </server>
        </node>
      </application>
    </icegrid>
    EOF
    cat<<EOF > etc/app.cfg
    IceGrid.Registry.Client.Endpoints=tcp -h 127.0.0.1 -p 4061
    IceGrid.Registry.Server.Endpoints=tcp -h 127.0.0.1
    IceGrid.Registry.Internal.Endpoints=tcp -h 127.0.0.1
    ## IceGrid.Registry.SessionManager.Endpoints=tcp -h 127.0.0.1
    IceGrid.Registry.Data=var
    IceGrid.Registry.DynamicRegistration=0
    IceGrid.Registry.PermissionsVerifier=IceGrid/NullPermissionsVerifier
    IceGrid.Registry.AdminPermissionsVerifier=IceGrid/NullPermissionsVerifier
    IceGrid.Node.CollocateRegistry=1
    IceGrid.Node.Endpoints=tcp -h 127.0.0.1
    IceGrid.Node.Name=test
    IceGrid.Node.Data=var
    Ice.StdOut=var/out
    Ice.StdErr=var/err
    EOF
    
    cat<<EOF > etc/internal.cfg
    Ice.Default.Locator=IceGrid/Locator:tcp -h 127.0.0.1 -p 4061
    IceGridAdmin.Username=root
    IceGridAdmin.Password=password
    EOF
    
    cat<<EOF > lib/test.cpp
    #include <fstream>
    #include <iostream>
    #include <Ice/Ice.h>
    #include <IceUtil/Time.h>
    #include <IceUtil/Thread.h>
    #include <unistd.h> // for getpid()
    using namespace std;
    int main(int argc, char* argv[]) {
        ofstream out;
        out.open("var/test.pid");
        out << getpid();
        out.close();
        int status = 0;
        Ice::CommunicatorPtr ic;
        ic = Ice::initialize(argc, argv);
        Ice::ObjectAdapterPtr adapter = ic->createObjectAdapter("adapter");
        ic->waitForShutdown();
        IceUtil::ThreadControl self;
        self.sleep(IceUtil::Time::seconds(6));
        ic->destroy();
        cout << "finished" << endl;
        return 0;
    }
    EOF
    CXXFLAGS="-I/opt/Ice-3.3.0/include" LDFLAGS="-L/opt/Ice-3.3.0/lib -lIce -lIceUtil" make lib/test
    
    start(){
      icegridnode --nochdir --daemon --pidfile var/app.pid --Ice.Config=etc/internal.cfg,etc/app.cfg --deploy etc/app.xml
    }
    
    test(){
      echo -n "."
      RESULT=`start 2>&1 || true`
      # grep exits non-zero when nothing matches; keep "set -e" from aborting before "wait"
      echo "$RESULT" | grep Freeze || true
    }
    
    start
    icegridadmin --Ice.Config=etc/internal.cfg -e "node shutdown test"
    # Causes the same problem:
    # FIRST=`cat var/app.pid`
    # kill $FIRST&
    set +x
    test&
    test&
    test&
    test&
    test&
    test&
    test&
    test
    wait
    Even if such concurrent usage cannot be supported through some form of locking, it would be helpful if someone could explain why in the above script multiple calls seem to "recover" -- i.e. calling test.sh after a failing call to test.sh may succeed -- while in my production use, once the database is corrupt, I'm forced to "rm -rf var" before icegridnode will succeed.

    Hope all of that can be of some use, but let me know if you need more information.
    ~Josh

    P.S. As mentioned in the script, something I encountered while testing this was that the pidfile for icegridnode is overwritten even if the previous icegridnode is still active. I'm not sure if this is the intended behavior.

  2. #2
    joshmoore, Registered User
    I've just seen another way to reproduce this, but don't have it isolated yet:
    Code:
    josh@mac:~/code/omero.git/dist$ bin/omero admin stop 
    icegridadmin: could not contact the default locator:
    Network.cpp:1218: Ice::ConnectionRefusedException:
    connection refused: Connection refused
    Was the server already stopped?
    Waiting on shutdown. Use CTRL-C to exit
    icegridadmin: could not contact the default locator:
    Network.cpp:1218: Ice::ConnectionRefusedException:
    connection refused: Connection refused
    
    josh@mac:~/code/omero.git/dist$ pstree | grep node
     | | |   \--- 09742 josh grep node
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    
    josh@mac:~/code/omero.git/dist$ bin/omero admin start
    No descriptor given. Using etc/grid/default.xml
    icegridnode: failure occurred in daemon:
    service caught unhandled Ice exception:
    SharedDbEnv.cpp:546: Freeze::DatabaseException:
    DbEnv::open: DB_RUNRECOVERY: Fatal error, run database recovery
    Stop calls "icegridadmin -e 'node shutdown master'" followed by "icegridadmin -e 'node ping master'". Start calls "icegridnode --deploy" followed by another ping. Since there are no icegridnode processes after the stop, this would imply that concurrent access is not strictly required.

  3. #3
    benoit, ZeroC Staff
    Name: Benoit Foucher
    Organization: ZeroC, Inc.
    Project: Ice
    Join Date: Feb 2003
    Location: Rennes, France
    Posts: 2,196
    Hi Josh,

    You should make sure that the previous IceGrid registry/node isn't running anymore before starting a new one. Otherwise, as you discovered, this might corrupt the database environment. So your scripts should wait for the process to be gone before trying to start a new IceGrid registry/node.
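A minimal Python sketch of the "wait for the process to be gone" step, assuming a POSIX system and that the script already knows the old node's PID (for example, read from the pidfile before anything overwrites it). The helper names `pid_is_running` and `wait_for_exit` are hypothetical, not part of Ice.

```python
import errno
import os
import time


def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 sends nothing; it only checks existence/permission
    except OSError as exc:
        if exc.errno == errno.ESRCH:   # no such process
            return False
        if exc.errno == errno.EPERM:   # exists, but owned by another user
            return True
        raise
    return True


def wait_for_exit(pid, timeout=30.0, poll_interval=0.5):
    """Poll until the process is gone; return False if it is still alive at the timeout."""
    deadline = time.monotonic() + timeout
    while pid_is_running(pid):
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_interval)
    return True
```

A restart script would call `wait_for_exit(old_pid)` after the shutdown request and only launch a new icegridnode when it returns True. Note that a zombie process that has not yet been reaped still counts as running for this check.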

    While the IceGrid registry has a check to prevent this, it's not 100% foolproof: the check just verifies that no other registry is running by trying to connect to the client endpoint. If you start multiple registries concurrently against the same database directory, this check will likely fail to detect the others. Similarly, the check doesn't work if the previous registry is in the middle of shutting down. I believe this explains why you sometimes see the Freeze error and sometimes don't.
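To illustrate why a connect-based check is racy, here is a rough Python sketch of that kind of probe. It is not the registry's actual implementation; the function name is hypothetical, and the default host and port are simply taken from the `IceGrid.Registry.Client.Endpoints` line in the test configuration above.

```python
import socket


def endpoint_accepts_connections(host="127.0.0.1", port=4061, timeout=1.0):
    """Return True if something is currently accepting TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, unreachable: nothing is listening right now.
        return False
```

A probe like this returns False both when no registry exists and when one is still starting up or has already closed its endpoint during shutdown while holding the database environment open, which is the window described above.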

    As for the pidfile, that's the intended behavior: the purpose of --pidfile is just to write the PID of the process to the given file, not to prevent multiple runs of the same service.

    Cheers,
    Benoit.

  4. #4
    joshmoore, Registered User
    Benoit,

    Understood. My attempts to prevent db corruption, however, were also not sufficient, and users are now seeing this in the wild. I assume my only recourse will be to implement file-based locking within our python scripts.

    If possible, please consider this an RFE to have the locks pushed down into the Ice executables. Failing that, it would be good to know where else (storm, freeze, etc.) I will need to include locking.

    Best wishes,
    ~Josh.

  5. #5
    joshmoore, Registered User
    Another update Benoit,

    the OMERO user with the initial problem has come back again looking for a workaround for his corrupted registry dbs. See this thread. Of course, this is an extreme case involving power outages, but I'd certainly like to be able to keep our users from having to manually delete var/registry.

    Other than checking for a non-zero exit code and reading the log file, can you suggest any ways of working around this?
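For reference, the exit-code-plus-log check mentioned here could look roughly like the following Python sketch. It assumes the start script captures icegridnode's exit code and its stderr or log text; the function name and the returned labels are hypothetical, and the marker strings are the ones that appear in the error output quoted earlier in this thread.

```python
def classify_start_failure(returncode, output):
    """Classify one icegridnode start attempt from its exit code and captured output."""
    if returncode == 0:
        return "ok"
    if "DB_RUNRECOVERY" in output:
        # Berkeley DB is asking for recovery to be run on the environment.
        return "needs-recovery"
    if "Freeze::DatabaseException" in output:
        return "freeze-error"
    return "other-failure"
```

A wrapper could use the "needs-recovery" result to print a targeted hint instead of the raw exception text.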

    Optimal would be some form of locking which prevents this from happening. Or a self-healing step, with warning perhaps. If that required a flag "--recreate", that'd be fine.** Another option might be a way to verify the registry db before calling icegridnode: --verify, or should I just try to open the db with a small application?

    Any advice would be very welcome.
    Cheers,
    ~Josh

    (** Speaking of which, we use collocated registries along with the "--deploy" flag if that's possibly related.)

  6. #6
    benoit, ZeroC Staff
    Hi Josh,

    Fixing this is on our TODO list. In the meantime, the only option I can think of to prevent this from happening would be for your script to use a file lock. Your script could, for example, open and lock a file. If the file doesn't exist already, your script would start the IceGrid node, write the node PID into this file, and then close and unlock the file. If the file already exists, the script would first check whether the process with the PID from the file is still running: if it isn't, the script would start the IceGrid node again; if it is, the script would print an error message.
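The procedure described above could be sketched in Python roughly as follows, assuming a POSIX system (it relies on `fcntl.flock`). The names `start_guarded` and `start_node` are hypothetical: `start_node` stands for whatever callable launches icegridnode and returns the PID of the running node, which with `--daemon` would probably have to be read back from the `--pidfile` rather than taken from the launching process.

```python
import errno
import fcntl
import os


def pid_is_running(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 only performs the existence/permission check
    except OSError as exc:
        if exc.errno == errno.ESRCH:
            return False
        if exc.errno == errno.EPERM:
            return True
        raise
    return True


def start_guarded(lock_path, start_node):
    """Start the node only if the PID recorded in lock_path is not alive.

    start_node is a hypothetical callable that launches icegridnode
    and returns the PID of the running node.
    """
    fd = os.open(lock_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as lock_file:
        # Blocks until we hold the exclusive lock; released when the file is closed.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        recorded = lock_file.read().strip()
        if recorded.isdigit() and pid_is_running(int(recorded)):
            raise RuntimeError("icegridnode already running with PID %s" % recorded)
        pid = start_node()
        # Replace any stale PID with the new one before releasing the lock.
        lock_file.seek(0)
        lock_file.truncate()
        lock_file.write(str(pid))
        lock_file.flush()
        return pid
```

Because the lock is held across both the liveness check and the start, two concurrent invocations of the script cannot both decide that no node is running. This guards only against scripts that use the same lock file, not against someone launching icegridnode by hand.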

    Cheers,
    Benoit.

  7. #7
    joshmoore, Registered User
    Hi Benoit,

    for the initial issue, of course you're right. To prevent user error via a restart mechanism, a lock would work. (For the moment, I've just prevented quick restarts).

    This latest issue, however, wouldn't be fixed by locking, I don't think, since the corruption is caused by power loss. There may not be much the icegrid{node,registry} executables can do to recover, but I was hoping it was something your planned fix could take into account.

    Best wishes,
    ~Josh.

  8. #8
    benoit, ZeroC Staff
    The IceGrid registry already automatically recovers the database environment on startup. This should be sufficient most of the time to recover from errors such as ungraceful shutdown of the IceGrid registry.

    If the registry still fails to start after such a "standard" recovery, this indicates a more serious issue (file system corruption, disk failure, etc.) and fatal database recovery is needed. This might also occur if two registries are started simultaneously (this can be prevented using file locking, as mentioned in my earlier posts). In any case, such a recovery will most likely require a backup of the registry database environment log files and possibly the database files.

    If you prefer to do the recovery manually before starting the icegridregistry you can use the db_recover utility from BerkeleyDB, for example:
    Code:
        $ db_recover -h db/registry
    You can perform catastrophic/fatal recovery by using the "-c" option with the db_recover utility or by starting the IceGrid registry with the --Freeze.DbEnv.Registry.DbRecoverFatal option. In any case, such a fatal recovery will most likely require some intervention on the part of the user to restore some of the database environment files from a backup.
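If the recovery is driven from a script, the two db_recover invocations above could be wrapped roughly like this in Python. This is only a sketch: the function names are hypothetical, `db_recover` is assumed to be on the PATH, and it should be run only while no registry or node has the environment open.

```python
import subprocess


def db_recover_command(env_dir, catastrophic=False, executable="db_recover"):
    """Build the db_recover argument list for a Berkeley DB environment directory."""
    cmd = [executable]
    if catastrophic:
        cmd.append("-c")      # catastrophic (fatal) recovery instead of normal recovery
    cmd += ["-h", env_dir]    # -h selects the database environment home directory
    return cmd


def run_db_recover(env_dir, catastrophic=False):
    """Run db_recover and return its exit code (0 on success)."""
    return subprocess.call(db_recover_command(env_dir, catastrophic))
```

For example, `run_db_recover("db/registry")` corresponds to the command shown above, and passing `catastrophic=True` adds the "-c" option.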

    For more information on Berkeley DB recovery procedures see this link. For information on how to backup and restore a Freeze database environment, check out this section in the Ice manual.

    Cheers,
    Benoit.

  9. #9
    joshmoore, Registered User
    Thanks, Benoit!

    That gives me several things to try.

    Would there be any negative results of using "Freeze.DbEnv.Registry.DbRecoverFatal" along with "icegridnode --deploy" in a collocated-registry/node?

    ~Josh.
