ClearML tasks executed by services workers on a self-hosted server are automatically terminating with "Process terminated by user", despite us not aborting tasks through the UI

Hi all,

I'm running into an issue where ClearML tasks being executed by services workers on a self-hosted server are automatically terminating.

The message says "Process terminated by user", despite us not aborting tasks through the UI, e.g. following the docker logs for clearml-agent-services:

 Starting Task Execution:
 
 Process terminated by user
 clearml_agent: ERROR: [Errno 2] No such file or directory: '/tmp/.clearmlagent_1_5rih9irv.tmp'

The error seems almost random: sometimes tasks run properly, sometimes they run partially or self-terminate almost instantly.

I've just upgraded the server to:

WebApp: 1.14.1-448 • Server: 1.14.1-448 • API: 2.28

Also tried:

allegroai/clearml-agent-services:latest (clearml-agent 1.1.1)
allegroai/clearml-agent-services:services-1.3.0-77 (clearml-agent 1.6.1)
Another image running clearml-agent 1.7.0

But still facing the same issue.

Has anybody experienced issues with this lately?

  
  
Posted 9 months ago

Answers 11


Hi @<1534706830800850944:profile|ZealousCoyote89>, I must admit I've not seen this behavior occur randomly before, but I don't think the cache can be the cause

  
  
Posted 9 months ago

Any time I run the agent locally via:

clearml-agent daemon --queue services --services-mode --cpu-only --docker --foreground

It works without fail, so I've tried removing the clearml mount from agent-services in docker-compose.yml:

      CLEARML_WORKER_ID: "clearml-services"
      # CLEARML_AGENT_DOCKER_HOST_MOUNT: "/opt/clearml/agent:/root/.clearml"
      SHUTDOWN_IF_NO_ACCESS_KEY: 1
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      # - /opt/clearml/agent:/root/.clearml

I know there are some downsides to doing this, but it seems to prevent the "Process terminated by user" issue I was seeing. Like I said, the issue appeared randomly, so this could just be a coincidence.

Maybe some of the cached files were contributing to the issue?
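
One way to test the stale-cache theory without permanently dropping the mount (a sketch only; it assumes the default host path /opt/clearml/agent from the compose snippet above and that the compose service is named agent-services) would be to stop the service, clear the mapped cache directory on the host, and bring it back up:

    # Assumes the agent cache is host-mounted at /opt/clearml/agent (see the
    # commented-out volume above); adjust path/service name to your deployment.
    docker compose stop agent-services
    sudo rm -rf /opt/clearml/agent/*
    docker compose up -d agent-services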

  
  
Posted 9 months ago

Thanks @<1523701087100473344:profile|SuccessfulKoala55> - yeah, I found that allegroai/clearml-agent-services:latest was running clearml-agent==1.1.1. I tried plugging various other images into docker-compose.yml and restarting to see if clearml-agent==1.6.1 or clearml-agent==1.7.0 would fix the issue, but no luck unfortunately 😕
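
For reference, the swap is just the image tag under the agent-services service in docker-compose.yml (a rough sketch using the tags listed above; the service name may differ in your compose file):

      agent-services:
        # Change the tag to test a different bundled clearml-agent version,
        # then recreate the container: docker compose up -d agent-services
        image: allegroai/clearml-agent-services:services-1.3.0-77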

  
  
Posted 9 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89> , make sure you update the agent inside the services docker, as this image is probably running a very old version
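
One way to do that (a sketch, assuming pip is available inside the services image, which it normally is) is to build a small derived image that upgrades the agent package and point docker-compose.yml at it instead of the stock tag:

    # Dockerfile.agent-services -- hypothetical derived image
    FROM allegroai/clearml-agent-services:latest
    # Upgrade the bundled agent; pin whichever version you want to test
    RUN pip install --upgrade "clearml-agent==1.7.0"

Build it with docker build -t my-clearml-agent-services -f Dockerfile.agent-services . and reference that tag from docker-compose.yml.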

  
  
Posted 9 months ago

To me it looks as if somebody were going into the UI and hitting abort on the task, but that's definitely not the case

  
  
Posted 9 months ago

Just user abort by the looks of things:
[screenshot]

  
  
Posted 9 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89>! Do you have any info under STATUS REASON? See the screenshot for an example:
[screenshot]
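
If the UI doesn't show anything useful there, the same fields can also be pulled straight from the API (a hedged example using the tasks.get_by_id endpoint; the host, port and credentials below are placeholders for your own deployment):

    # Replace <task-id> and the key/secret pair; 8008 is the default apiserver port
    curl -s -u "$CLEARML_API_ACCESS_KEY:$CLEARML_API_SECRET_KEY" \
         -H "Content-Type: application/json" \
         -d '{"task": "<task-id>"}' \
         http://localhost:8008/tasks.get_by_id
    # status_reason and status_message appear under data.task in the JSON response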

  
  
Posted 9 months ago

This should be the full log, cleaned

  
  
Posted 9 months ago

Does this help at all? (I can go a lil further back, just scanning through for any potential sensitive info!)

  
  
Posted 9 months ago

Hi @<1523701070390366208:profile|CostlyOstrich36>

We've got quite a bit of sensitive info in the logs - I'll see what I can grab

  
  
Posted 9 months ago

Hi @<1534706830800850944:profile|ZealousCoyote89> , can you please add the full log?

  
  
Posted 9 months ago