Answered
Hi all, I'm running into an issue where ClearML tasks being executed by services workers on a self-hosted server are automatically terminating. The message says "Process terminated by user", despite us not aborting tasks through the UI.

Hi all,

I'm running into an issue where ClearML tasks being executed by services workers on a self-hosted server are automatically terminating.

The message says "Process terminated by user", despite us not aborting tasks through the UI. E.g. (following the docker logs for clearml-agent-services):

 Starting Task Execution:
 
 Process terminated by user
 clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_5rih9irv.tmp'
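
For context, I'm just following the container logs with something along the lines of:

 docker logs -f clearml-agent-services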

The error almost seems random: sometimes tasks will run properly, run partially, or self-terminate almost instantly.

I've just upgraded the server to:

WebApp: 1.14.1-448 • Server: 1.14.1-448 • API: 2.28

Also tried:

allegroai/clearml-agent-services:latest (1.1.1)
allegroai/clearml-agent-services:services-1.3.0-77 (1.6.1)

 (1.7.0)

But still facing the same issue.
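
For reference, I'm swapping images by editing the image tag for the services agent in docker-compose.yml and recreating the container, roughly like this (assuming the default /opt/clearml install path and the agent-services service name):

   agent-services:
     image: allegroai/clearml-agent-services:services-1.3.0-77

 docker compose -f /opt/clearml/docker-compose.yml up -d agent-services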

Has anybody experienced issues with this lately?

  
  
Posted 10 months ago

Answers 11


Thanks @<1523701087100473344:profile|SuccessfulKoala55> - Yeah, I found that allegroai/clearml-agent-services:latest was running clearml-agent==1.1.1. Tried plugging various other images into docker-compose.yml and restarting to see if clearml-agent==1.6.1 or clearml-agent==1.7.0 would fix the issue, but no luck unfortunately 😕
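
In case it's useful to anyone else, a quick way to double-check which agent version an image actually ships is to exec into the running container (assuming it's named clearml-agent-services as in the default compose file):

 docker exec clearml-agent-services clearml-agent --version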

  
  
Posted 10 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89>! Do you have any info under STATUS REASON? See the screenshot for an example:
[screenshot]

  
  
Posted 10 months ago

Just user abort by the looks of things:
[screenshot]

  
  
Posted 10 months ago

This should be the full log, cleaned up.

  
  
Posted 10 months ago

Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)

  
  
Posted 10 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89> , can you please add the full log?

  
  
Posted 10 months ago

To me it looks as if somebody were going into the UI and hitting abort on the task, but that's definitely not the case.

  
  
Posted 10 months ago

Any time I run the agent locally via:

clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground

It works without fail, so I've tried removing the clearml mount from agent-services in docker-compose.yml:

      CLEARML_WORKER_ID: "clearml-services"
      # CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # - /opt/clearml/agent:/root/.clearml

I know there are some downsides to doing this, but it seems to prevent the "Process terminated by user" issue I was seeing. Like I said, the issue appeared randomly, so this could just be a coincidence.

Maybe some of the cached files could have been leading to the issue?
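
If it does turn out to be cache-related, one thing that might be worth trying instead of dropping the mount entirely is stopping the services agent and clearing the mapped host cache directory (assuming the default /opt/clearml/agent:/root/.clearml mapping):

 docker stop clearml-agent-services
 sudo rm -rf /opt/clearml/agent/*
 docker start clearml-agent-services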

  
  
Posted 10 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89>, I must admit I haven't seen this behavior occurring randomly before, but I don't think the cache can be the cause.

  
  
Posted 10 months ago

Hi @<1523701070390366208:profile|CostlyOstrich36>

We've got quite a bit of sensitive info in the logs - I'll see what I can grab

  
  
Posted 10 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89> , make sure you update the agent inside the services docker, as this image is probably running a very old version
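
For example, if your docker-compose.yml exposes the CLEARML_AGENT_UPDATE_VERSION environment variable for the agent-services service, you can pin a newer agent there (the value is passed to pip, so a pip-style version specifier should work):

       CLEARML_AGENT_UPDATE_VERSION: "==1.7.0"

Or, as a quick test, upgrade the agent inside the running container:

 docker exec clearml-agent-services pip install -U clearml-agent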

  
  
Posted 10 months ago