Since it fails on the first machine (clearml-server), I try to run it on another, on-prem machine (also used as an agent)
CostlyOstrich36 Were you able to reproduce it? That’s rather annoying 😅
I want the clearml-agent/instance to stop right after the experiment/training is “paused” (experiment marked as stopped + artifacts saved)
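Something like this rough sketch is what I have in mind (purely illustrative; the task id, the status strings, and the shutdown call are all my own assumptions):
` import subprocess
import time

from clearml import Task

TASK_ID = "abc123"  # hypothetical id of the experiment being watched

# Poll the task status; once it is no longer running, power the machine down.
while True:
    status = Task.get_task(task_id=TASK_ID).get_status()
    if status in ("stopped", "completed", "failed"):  # assumed status values
        subprocess.run(["sudo", "shutdown", "-h", "now"])
        break
    time.sleep(60) `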
what about the stacktrace of the error: Error: Can not start new instance, An error occurred (InvalidParameterValue) when calling the RunInstances operation: Invalid availability zone: [eu-west-2]?
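For reference, eu-west-2 is a region name, while RunInstances expects a zone such as eu-west-2a. A minimal boto3 sketch of what that call expects (AMI id and instance type are placeholders):
` import boto3

ec2 = boto3.client("ec2", region_name="eu-west-2")

# RunInstances wants a zone (eu-west-2a/b/c), not the bare region name eu-west-2.
ec2.run_instances(
    ImageId="ami-00000000000000000",   # placeholder AMI
    InstanceType="t3.medium",          # placeholder instance type
    MinCount=1,
    MaxCount=1,
    Placement={"AvailabilityZone": "eu-west-2a"},
) `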
But I can do:
` $ python
>>> import torch
>>> torch.cuda.is_available()
True
>>> torch.backends.cudnn.version()
8005 `
To help you debug this: in the /dashboard endpoint, all projects were still there, but empty (no experiments inside). No experiments were archived either.
Still failing with the same error 😞
Hi CostlyOstrich36 , one more observation: it looks like when I don’t open the experiment in the webUI before it finishes, I get all the logs correctly. It is when I open the experiment in the webUI while it is running that I don’t see all the logs.
So it looks like there is a caching effect: the logs are retrieved only once, when I open the experiment for the first time, and not (or only rarely) afterwards. Is that possible?
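To rule the UI out, I can compare against what the API server itself returns. A rough sketch, assuming the events.get_task_log endpoint behaves as I expect (address, credentials, and task id below are all placeholders):
` import requests

API = "http://localhost:8008"          # placeholder apiserver address
AUTH = ("ACCESS_KEY", "SECRET_KEY")    # placeholder credentials

# Ask the apiserver directly for the console log of a task, bypassing the webUI cache.
resp = requests.post(
    f"{API}/events.get_task_log",
    json={"task": "abc123"},           # placeholder task id
    auth=AUTH,
)
print(len(resp.json()["data"]["events"]))  # assumed response layout `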
Yes! Not a strong use case though; rather, I wanted to ask if it was supported somehow
but if the task is now running on an agent, isn’t that a possible source of conflict? I would expect that after calling Task.enqueue(exit=True), the local task is closed and no processes related to it are running
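To illustrate what I’d expect, a minimal sketch using execute_remotely (assuming it is the call that hands the task over to the queue and exits the local process):
` from clearml import Task

task = Task.init(project_name="test", task_name="remote-run")

# Hand the task over to the "default" queue; with exit_process=True I would
# expect the local process to terminate here, leaving nothing running locally.
task.execute_remotely(queue_name="default", exit_process=True)

# Anything below should only ever run on the agent, never locally.
print("running on the agent") `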
Thanks a lot AgitatedDove14 !
mmmmh I just restarted the experiment and it seems to work now. I am not sure why that happened. From this SO post it could be related to the size of the repo. Might be a good idea to clone with --depth 1 in the agents?
Or more generally, try to catch this error and retry a few times?
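Something like this is what I mean by catching the error and retrying with a shallow clone (pure sketch; the repo URL is a placeholder):
` import subprocess
import time

REPO = "https://github.com/org/repo.git"  # placeholder repo URL

# Retry the clone a few times; --depth 1 keeps the transfer small.
for attempt in range(3):
    result = subprocess.run(["git", "clone", "--depth", "1", REPO])
    if result.returncode == 0:
        break
    time.sleep(5 * (attempt + 1))  # simple backoff before retrying `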
Hoo I found:
` user@trains-agent-1: ps -ax
5199 ?  Sl  29:25 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached
6096 ?  Sl  30:04 python3 -m trains_agent --config-file ~/trains.conf daemon --queue default --log-level DEBUG --detached `
Will from clearml import Task raise an error if no clearml.conf exists? Or only when features that actually require the server to be defined (such as Task.init) are called?
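A quick way I can check this myself (just a sketch to separate the import from the first call that actually touches the server):
` # Separate the import itself from the first call that needs the server.
try:
    from clearml import Task
    print("import ok")  # expectation: the import alone should not need clearml.conf
    Task.init(project_name="test", task_name="conf-check")  # server is needed here
except Exception as exc:
    print("failed:", exc) `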
Yes, it did spin two instances for the same task
Thanks for your inputs, I will try that! For completeness, here is how I retrieve the parameters:
` from trains import Task

# Child task: read its own parameters, then pull the artifact from the parent task
task = Task.init("test", "test")
parent_task = Task.get_task(task.parent)
task.get_logger().report_text(task.get_parameters())
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get() `
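And on the parent side, the idea is roughly this (a sketch only; the template task id and artifact name are placeholders, and I’m assuming clone/set_parameters/enqueue behave as documented):
` from trains import Task

# Hypothetical orchestrating side of the snippet above: the parent uploads the
# artifact, then clones a child template and passes the artifact name along.
parent = Task.init("test", "parent")
parent.upload_artifact("my_dataframe", artifact_object={"a": 1})

child = Task.clone(source_task="<child-template-task-id>", parent=parent.id)
child.set_parameters({"General/artifact_name": "my_dataframe"})
Task.enqueue(child, queue_name="default") `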
the first problem I had, which didn’t give useful info, was that docker was not installed on the agent machine x)
Well, as long as you’re using a single node, it should indeed alleviate the shard disk size limit, but I’m not sure ES will handle that too well. In any case, you can’t change that for existing indices; you can modify the mapping template and reindex the existing index (you’ll need to index into another name, delete the original, and create an alias to the original name, as the new index can’t be renamed...)
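In ES terms, that reindex-and-alias dance would look roughly like this (the endpoint and index names are placeholders, not the actual ClearML index names):
` import requests

ES = "http://localhost:9200"  # placeholder Elasticsearch endpoint

# 1. Reindex the existing index into a new one created from the updated mapping template.
requests.post(f"{ES}/_reindex", json={
    "source": {"index": "events-old"},   # placeholder index names
    "dest": {"index": "events-new"},
}).raise_for_status()

# 2. Delete the original index (it cannot simply be renamed).
requests.delete(f"{ES}/events-old").raise_for_status()

# 3. Point an alias with the original name at the new index.
requests.post(f"{ES}/_aliases", json={
    "actions": [{"add": {"index": "events-new", "alias": "events-old"}}]
}).raise_for_status() `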
Ok thanks!
the instances take so much time to start, like 5 mins
meaning the REST API returns nothing, is that correct?
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
This is no coincidence: any data versioning tool you will find is somehow close to how git works (DVC, etc.), since they all aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightforward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
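For example, this is the kind of flow I mean, as far as I understand the Dataset API (names and paths are placeholders):
` from clearml import Dataset

# Register a dataset (placeholder names/paths).
ds = Dataset.create(dataset_name="my-dataset", dataset_project="datasets")
ds.add_files("data/")
ds.upload()
ds.finalize()

# Retrieve it later, from anywhere, with a single call.
local_path = Dataset.get(dataset_name="my-dataset", dataset_project="datasets").get_local_copy() `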
Nevermind, i was able to make it work, but no idea how
Yes, it works now! Yay!
thanks for your help!