Reputation
Badges 1
981 × Eureka!how would it interact with the clearml-server api service? would it be completely transparent?
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
I need to set sdk.development.report_use_subprocess = false I need to call task.set_initial_iteration(0)
I see that I have several volumes:
` $ docker volume ls
DRIVER VOLUME NAME
local 5b0bfe5ab1a3d645bd635b2fb6f2aefd2b657d566019343c8305959903996c67
local 43b60287d60db798dc9d1defe1d7d861334c9c8299aefad6da2f20db278cfc5b
local 1406d50aa65ab55d323500d1fb23f19adfc6e721261ab6103a59d20e82146099
local 7367a215bd42a4e888e5d88ce708bf74aedc48a6e9417c72a19739cb80f25e6d
local 7413c39f5e4b6568304832d9d2e925ebdbf47ad31ad22d77830d3618af79237b
local a55cb71edff48c2138a5da9d8d1e26df3b...
Alright, experiment finished properly (all models uploaded). I will restart it to check again, but seems like the bug was introduced after that
I am using an old version of the aws autoscaler, so the instance has the following user data executed:echo "{clearml_conf}" >>/root/clearml.conf ... python -m clearml_agent --config-file '/root/clearml.conf' daemon --detached --queue '{queue}' --docker --cpu-only
(Btw the instance listed in the console has no name, it it normal?)
But I am not sure it will connect the parameters properly, I will check now
I have 11.0 installed but on another machine with 11.0 installed as well, trains downloads torch for cuda 10.1, I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
I managed to do it by using logger.report_scalar, thanks!
I tried removing type=str but I got same problem 😕
Hi DilapidatedDucks58 , I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as:
Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it) Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directy: I simply add a tag to the task resume , and check at runtime if this tag exists, if yes, I fetch the latest checkpoint of the task) Use https://clea...
Not sure about that, I think you guys solved it with your PipelineController implementation. I would need to test it before giving any feedback 🙂
I did change the replica setting on the same index yes, I reverted it back from 1 to 0 afterwards
Here is the minimal reproducable example.
Run test_task_a.py - It will register a dummy artifact, create a new task, set a parameter in that task and enqueue it test_task_b will try to retrieve parameter from parent task and fail
Hi SuccessfulKoala55 , super that’s what I was looking for
ok, now I actually remember why I used _update_requirements instead of add_requirements: The first overwrites all the other, the later only add to the already detected packages. Since my deps are listed in the dependencies of my setup.py, I don't want clearml to list the dependencies of the current environment
Will the from clearml import Task raise an error if no clearml.conf exists? Or only when actual features requiring to define the server (such as Task.init ) will be called
Yes, it did spin two instances for the same task
Thanks for your inputs, I will try that! For completion, here is how I retrieve the parameters:
` from trains import Task
task = Task.init("test", "test")
parent_task = Task.get_task(task.parent)
task.get_logger().report_text(task.get_parameters())
artifact_name = task.get_parameter("General/artifact_name")
artifact = parent_task.artifacts[artifact_name].get() `
the first problem I had, that didn’t gave useful infos, was that docker was not installed in the agent machine x)
the instances takes so much time to start, like 5 mins
meaning the RestAPI returns nothing, is that correct
Yes exactly, this is the response from the api server when I try to scroll down on the console to get more logs
This is no coincidence - Any data versioning tool you will find are somehow close to how git works (dvc, etc.) since they aim to solve a similar problem. In the end, datasets are just files.
Where clearml-data stands out imo is the straightfoward CLI combined with the Pythonic API that allows you to register/retrieve datasets very easily
Nevermind, i was able to make it work, but no idea how
thanks for your help!
Hi TimelyPenguin76 , I guess it tries to spin them down a second time, hence the double print