Actually the agent will use the default values for the agent section if you have a clearml.conf file - what do you get if you run the agent like that?
Sure, no problem - I'll go think of something that will help us get this error more easily next time 😄
Task.init() will know if you're running locally (and so a new task should be created) or remotely (in which case the current task is the one being executed remotely)
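Something like this, for example (the project and task names are just placeholders):

    from clearml import Task

    # Locally this call registers a new task on the server; when an agent
    # executes the same script remotely, Task.init() attaches to the
    # already-created task instead of creating a new one
    task = Task.init(project_name="examples", task_name="local vs remote demo")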
I cannot ssh into the machine
That's very strange - since the server runs in docker, I don't see how it can cause the EC2 instance to be unavailable - can you check the EC2 panel to see what might be the problem?
Yeah, but I wouldn't recommend doing it 🙂
RBAC is something you have in the paid versions 🙂
@<1523701760676335616:profile|EnviousPanda91> this issue is not related to the ssh port forwarding - it looks like you can get to the server, but the issue is that some of the server components are not starting up properly. From the apiserver log you attached it seems like the apiserver component is not able to connect to the elasticsearch database
you can hack it with something like:
    Task._get_default_session().send_request("users", "get_all", json={"id": [<user-id>]})
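For example, a rough sketch (this goes through an internal session object, so it might change between SDK versions; the user id is a placeholder, and I'm assuming the call returns a requests-style response):

    from clearml import Task

    user_id = "<user-id>"  # placeholder - put the actual user id here

    # _get_default_session() is internal SDK plumbing, not a public API
    session = Task._get_default_session()
    response = session.send_request("users", "get_all", json={"id": [user_id]})
    print(response.json())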
Hi JitteryCoyote63,
This behavior is actually a result of a cleanup service running inside the Trains Server, called the non-responsive tasks watchdog. This service is meant to clean up any dangling tasks/experiments that were forgotten in an invalid or running state and did not report for a long time (for example, when you run development code and simply abort it in your debugger).
The non-responsive timeout (after which such experiments are deemed non-responsive) is currently set t...
btw - what do you mean by "So I could not stop allegro."? Can't you do docker-compose down?
Can you perhaps share a screenshot?
Well, this script setting is only for docker mode; in the "default" mode you're basically setting up the agent machine yourself, so you can set the correct value in the pip.conf file once, when you set up the agent
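For reference, a minimal pip.conf sketch (the index URL is just a placeholder for whatever internal mirror or extra settings you need):

    # ~/.pip/pip.conf (or /etc/pip.conf) on the agent machine
    [global]
    index-url = https://pypi.internal.example.com/simple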
I'm not sure this is supported in the Google machine spec
By the way, output_uri is also documented as part of the Task.init() docstring
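For example (the bucket path is a placeholder):

    from clearml import Task

    # output_uri sets the default upload destination for models and artifacts
    task = Task.init(
        project_name="examples",
        task_name="training run",
        output_uri="s3://my-bucket/clearml",
    )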
Well, I'm not sure, but this error is related to a null value sent as the task's container field (which should be perfectly legal, of course)
Hi VivaciousPenguin66 , this looks like an internal error indeed...
Probably the apiserver component and the fileserver component...
Would you recommend doing both then? :-)
You will need to if you want the SDK to be able to actually access this storage - one is to let the SDK know which is the default storage, the other is to provide details on how to access it
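Roughly, both settings live in clearml.conf, along these lines (just a sketch for S3 - the bucket name and credentials are placeholders, and the exact keys depend on the storage type you use):

    sdk {
        development {
            # the default storage the SDK uploads models/artifacts to
            default_output_uri: "s3://my-bucket/clearml"
        }
        aws {
            s3 {
                # how the SDK accesses that storage
                credentials: [
                    {
                        bucket: "my-bucket"
                        key: "YOUR_ACCESS_KEY"
                        secret: "YOUR_SECRET_KEY"
                    }
                ]
            }
        }
    }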
Well, actually it's used by both ClearML SDK and agent since both start with credentials but generate a token as soon as possible (more secure and faster)
But from what you're saying, it seems like the agent simply cannot communicate with the server, and what you see is just the agent waiting indefinitely
Yeah, the server can run anywhere 🙂
Can you share the complete task log?
Hi GiganticMole91 , how did you set up your clearml.conf file?