Troubleshooting

Mesos

 

At scheduler startup, if reconciliation is enabled in the configuration (mesos/reconcile), the framework tries to reconcile the jobs marked as running in GoDocker with the Mesos tasks, so that task statuses are brought back in sync.

 

If the framework has been deleted in the Mesos master, you can force the scheduler to use a new framework ID by deleting the stored one.

In redis:

del god:mesos:frameworkId
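
The same reset can be scripted. The following is a minimal sketch, not part of GoDocker itself: the function name is hypothetical, and `client` is assumed to be any object exposing redis-py style `get` and `delete` methods (for example a `redis.Redis` instance pointing at the GoDocker Redis server). Only the key name comes from the command above.

```python
# Key name taken from the redis command above.
FRAMEWORK_ID_KEY = "god:mesos:frameworkId"


def reset_framework_id(client):
    """Delete the stored Mesos framework ID and return its previous value.

    `client` is assumed to offer redis-py style `get(key)` and `delete(key)`.
    Returning the old value lets the operator log what was removed.
    """
    previous = client.get(FRAMEWORK_ID_KEY)  # None if the key does not exist
    client.delete(FRAMEWORK_ID_KEY)
    return previous
```

Keeping a trace of the previous ID can help when checking afterwards which framework the Mesos master still lists.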

 

If a job is terminated in Mesos but still shown as running in GoDocker (a status update sync issue), you can force the job status update in GoDocker.

In redis:

set god:mesos:over:TASK_ID 7

Replace TASK_ID with the identifier of the job. A value of 7 marks the job as failed, a value of 2 marks it as OK.
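
When several jobs need the same override, the command can be wrapped in a small helper. This is a sketch under stated assumptions: the function and constant names are hypothetical, `client` is assumed to be any object with a redis-py style `set(key, value)` method, and the helper deliberately accepts only the two values documented here (7 and 2).

```python
# Status codes documented above: 7 marks the job as failed, 2 marks it as OK.
FAILED = 7
OK = 2


def override_key(task_id):
    """Build the Redis key used to force the status of one job."""
    return "god:mesos:over:%s" % task_id


def force_job_status(client, task_ids, status):
    """Write the status override for each job identifier in `task_ids`.

    `client` is assumed to offer a redis-py style `set(key, value)` method.
    Returns the list of keys that were written.
    """
    if status not in (FAILED, OK):
        raise ValueError("status must be 7 (failed) or 2 (OK)")
    keys = []
    for task_id in task_ids:
        key = override_key(task_id)
        client.set(key, status)
        keys.append(key)
    return keys
```

For example, `force_job_status(client, [12, 13], FAILED)` would write `god:mesos:over:12` and `god:mesos:over:13` with the value 7.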

 

If Mesos and GoDocker are completely out of sync for any reason, and there are too many tasks to fix one by one with the trick above, proceed as follows.

Switch GoDocker to maintenance mode and wait for the running jobs to complete. Once all jobs are over on the Mesos side, any job still shown as running in GoDocker can be deleted.

Stop the scheduler and watcher processes.

In mongodb:

 

db.jobs.remove({'status.primary': 'running'}) // Deletes all jobs in running status

Then, in redis, clear the running jobs queue used by the watchers/executors:

del god:jobs:running
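
The two cleanup commands above can also be run from one script, once the scheduler and watchers are stopped. The sketch below is an illustration only: the function name is hypothetical, `jobs_collection` is assumed to be a PyMongo collection object for `jobs` (the database name and connection settings are not given here and must come from your GoDocker configuration), and `redis_client` is assumed to offer a redis-py style `delete` method.

```python
# Filter and key taken from the mongodb and redis commands above.
RUNNING_FILTER = {"status.primary": "running"}
RUNNING_QUEUE_KEY = "god:jobs:running"


def purge_running_jobs(jobs_collection, redis_client):
    """Delete all jobs in running status, then clear the running jobs queue.

    `jobs_collection` is assumed to offer PyMongo's `delete_many(filter)`,
    `redis_client` a redis-py style `delete(key)`.
    Returns (number of deleted jobs, number of deleted Redis keys).
    """
    result = jobs_collection.delete_many(RUNNING_FILTER)
    removed_keys = redis_client.delete(RUNNING_QUEUE_KEY)
    return result.deleted_count, removed_keys
```

Printing the returned counts gives a quick check that the number of deleted jobs matches what GoDocker was still showing as running.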

 

If the Mesos executor fails to kill a job, log in to the Mesos slave where the job runs and stop its Docker container manually (XYZ stands for the container):

docker stop XYZ