Hi all,

I'm running into an issue where ClearML tasks executed by services workers on a self-hosted server are automatically terminating.

The message says "Process terminated by user", even though we are not aborting tasks through the UI. For example, from the docker logs for clearml-agent-services:

 Starting Task Execution:
 
 Process terminated by user
 clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_5rih9irv.tmp'

The error seems almost random: sometimes tasks run properly, sometimes they run partially, and sometimes they self-terminate almost instantly.
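To see how often the abort happens and what precedes it, the container logs can be scanned for the message quoted above. The following is only a rough sketch: the helper name `abort_contexts` and the constant `ABORT_MARKER` are made up for illustration, and the sample lines are the log excerpt from this post.

```python
# Hypothetical helper: find each "Process terminated by user" line in a log
# and return it together with a few lines of surrounding context.
ABORT_MARKER = "Process terminated by user"  # message quoted in the post


def abort_contexts(lines, before=5, after=1):
    """Return one list of log lines per abort message found."""
    lines = [line.rstrip("\n") for line in lines]
    hits = [i for i, line in enumerate(lines) if ABORT_MARKER in line]
    return [lines[max(0, i - before): i + after + 1] for i in hits]


# Sample input: the excerpt shown above in this post.
sample = [
    "Starting Task Execution:",
    "",
    "Process terminated by user",
    "clearml_agent: ERROR: [Errno 2] No such file or directory: "
    "'/tmp/.clearmlagent_1_5rih9irv.tmp'",
]

for number, context in enumerate(abort_contexts(sample, before=2), start=1):
    print(f"--- abort #{number} ---")
    print("\n".join(context))
```

For real use, the output of `docker logs clearml-agent-services` (both stdout and stderr) would be saved to a file and its lines passed in instead of `sample`.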

I've just upgraded the server to:

WebApp: 1.14.1-448 • Server: 1.14.1-448 • API: 2.28

Also tried:

allegroai/clearml-agent-services:latest (1.1.1)
allegroai/clearml-agent-services:services-1.3.0-77 (1.6.1)

 (1.7.0)

But still facing the same issue.

Has anybody experienced issues with this lately?

  				
Posted one year ago by ZealousCoyote89

Answers 11

Just user abort by the looks of things:

  				
Posted one year ago by ZealousCoyote89

To me it looks as if somebody were going into the UI and hitting abort on the task, but that's definitely not the case.

  				
Posted one year ago by ZealousCoyote89

Hi CostlyOstrich36

We've got quite a bit of sensitive info in the logs - I'll see what I can grab

  				
Posted one year ago by ZealousCoyote89

Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)

  				
Posted one year ago by ZealousCoyote89

Hi ZealousCoyote89, can you please add the full log?

  				
Posted one year ago by CostlyOstrich36

Thanks SuccessfulKoala55. Yeah, I found that allegroai/clearml-agent-services:latest was running clearml-agent==1.1.1. I tried plugging various other images into docker-compose.yml and restarting to see if clearml-agent==1.6.1 or clearml-agent==1.7.0 would fix the issue, but no luck unfortunately 😕
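Since the image tag and the agent version inside it differ, it can help to confirm which clearml-agent version a given container actually has installed. A minimal sketch using only the standard library (the function name `installed_version` is hypothetical), meant to be run with the Python interpreter inside the container:

```python
from importlib import metadata


def installed_version(package="clearml-agent"):
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("clearml-agent:", installed_version() or "not installed")
```

This only reports what pip has installed in that environment; it does not say whether the agent process was started before or after an upgrade.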

  				
Posted one year ago by ZealousCoyote89

Hi ZealousCoyote89! Do you have any info under STATUS REASON? See the screenshot for an example:
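If the UI is not handy, the same information can probably be read through the SDK. This is a sketch under assumptions: the helper names are made up, and it assumes the `status`, `status_reason` and `status_message` fields are exposed on `task.data` in your SDK version (missing fields come back as None).

```python
def describe_status(task):
    """Collect the status fields from a ClearML Task-like object."""
    data = task.data
    return {
        "status": getattr(data, "status", None),
        "status_reason": getattr(data, "status_reason", None),
        "status_message": getattr(data, "status_message", None),
    }


def fetch_and_describe(task_id):
    """Look up a task by ID on the configured server and describe its status."""
    # Imported lazily so describe_status() can be used without clearml installed.
    from clearml import Task

    return describe_status(Task.get_task(task_id=task_id))
```

Calling `fetch_and_describe("<task-id>")` with the ID of one of the aborted tasks needs a working clearml.conf pointing at the self-hosted server.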

  				
Posted one year ago by SmugDolphin23

Hi ZealousCoyote89, make sure you update the agent inside the services docker, as this image is probably running a very old version.

  				
Posted one year ago by SuccessfulKoala55

This should be the full log, cleaned.

  				
Posted one year ago by ZealousCoyote89

Any time I run the agent locally via:

clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground

It works without fail, so I've tried removing the clearml mount from agent-services in docker-compose.yml:

      CLEARML_WORKER_ID: "clearml-services"
      # CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # - /opt/clearml/agent:/root/.clearml

I know there are some downsides to doing this, but it seems to prevent the "Process terminated by user" issue I was seeing. As I said, the issue appeared randomly, so this could just be a coincidence.

Maybe some of the cached files could have been leading to the issue?

  				
Posted one year ago by ZealousCoyote89

Hi ZealousCoyote89, I must admit I've not seen this behavior occurring randomly before, but I don't think the cache can be the cause.

  				
Posted one year ago by SuccessfulKoala55
