Is there a typo in your message? I don't see the difference between what I wrote and what you suggested: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Although `task.data.last_iteration` is correct when resuming, there is still this doubling effect when logging metrics after resuming.
Woohoo! Thanks!
Ok, I have a very different problem now. I did the following to restart the ES cluster: `docker-compose down`, then `docker-compose up -d`. And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, which was always the case so far.
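For reference, a *named* volume declared in `docker-compose.yml` survives `docker-compose down` (as long as `-v` is not passed), whereas anonymous volumes can be recreated on the next `up`. A minimal sketch, with hypothetical service name and image tag:

```yaml
version: "3"
services:
  elasticsearch:
    image: docker.elastic.co/elasticsearch/elasticsearch:7.6.2  # hypothetical tag
    volumes:
      # named volume -> reused across down/up cycles
      - es_data:/usr/share/elasticsearch/data
volumes:
  es_data:  # hypothetical name; omit this block and docker may create a fresh anonymous volume
```

`docker volume ls` before and after the restart can confirm whether a new volume was created.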
I carry this code from older versions of trains to be honest, I don't remember precisely why I did that
The cloning is done in another task, which has the argv parameters I want the cloned task to inherit from
So I cannot ssh anymore to the agent after starting clearml-session on it
I have a mental model of the clearml-agent as a module to spin my code somewhere, and the python version running my code should not depend on the python version running the clearml-agent (especially for experiments running in containers)
There it is: https://github.com/allegroai/clearml/issues/493
The jump in the loss when resuming at iteration 31 is probably another issue -> for now I can conclude that:
- I need to set `sdk.development.report_use_subprocess = false`
- I need to call `task.set_initial_iteration(0)`
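For context, the first of these conclusions corresponds to an entry in `clearml.conf`; a sketch of the relevant fragment, following the file's standard section nesting:

```
sdk {
    development {
        # report metrics from the main process instead of a forked reporter subprocess
        report_use_subprocess: false
    }
}
```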
I also tried `task.set_initial_iteration(-task.data.last_iteration)`, hoping it would counteract the bug, but it didn't work
AgitatedDove14 I do continue an aborted Task, yes - so I shouldn't even need to call task.set_initial_iteration, interesting! Do you have any idea what could be the reason for the behavior I am observing? I am trying to find ways to debug it
Now I'm curious, what did you end up doing ?
In my repo I maintain a bash script to set up a separate python env. Then in my task I spawn a subprocess without passing the env variables, so that the subprocess properly picks up the separate python env
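A minimal sketch of that pattern: spawn the child with a scrubbed environment so it resolves its own interpreter rather than inheriting the agent's. The interpreter and the inline command here stand in for the separate env's python and the real entry-point script:

```python
import os
import subprocess
import sys

# Keep only what the child actually needs; everything else (e.g. the
# agent's VIRTUAL_ENV / PYTHONPATH) is deliberately dropped.
clean_env = {
    "PATH": os.environ.get("PATH", "/usr/bin:/bin"),
    "HOME": os.environ.get("HOME", "/tmp"),
}

# In the real setup this would be the separate env's interpreter running
# train.py; sys.executable and an inline command keep the sketch runnable.
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.version_info[0])"],
    env=clean_env,
    capture_output=True,
    text=True,
)
print(proc.stdout.strip())
```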
Yes, I would like to update all references to the old bucket, unfortunately… I think I'll simply delete the old S3 bucket, wait for its name to become available again, recreate it on the other AWS account, and move the data there. This way I don't have to mess with clearml-data - I am afraid to do something wrong and lose data
Ha nice, makes perfect sense thanks AgitatedDove14 !
So probably only the main process (rank=0) should attach the ClearMLLogger?
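A sketch of that gating, assuming the distributed launcher exports a `RANK` variable per worker (as `torch.distributed.run` does); the project/task names are placeholders:

```python
import os

# Only the main process (rank 0) should attach the experiment logger;
# other ranks skip it so metrics are not reported multiple times.
rank = int(os.environ.get("RANK", "0"))

if rank == 0:
    # Hypothetical attach point, commented out to keep the sketch self-contained:
    # from clearml import Task
    # task = Task.init(project_name="my-project", task_name="ddp-run")
    attached = True
else:
    attached = False

print(attached)
```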
v0.17.5rc2
It could be, yes, but the difference between now and last_report_time doesn't match the difference I observe
Why is it required in the case where boto3 can figure them out by itself within the EC2 instance?
Alright, I had a look at the /tmp/.trains_agent_daemon_outabcdef.txt logs - not many insights there. For the moment, I simply started a new trains-agent daemon in services mode and I will wait to see what happens.
Mmmh, probably yes. I can't say for sure (because I don't remember precisely when I upgraded to 0.17), but it looks like that
basically:
```python
from trains import Task

task = Task.init("test", "test", "controller")
task.upload_artifact("test-artifact", dict(foo="bar"))
cloned_task = Task.clone(task, name="test", parent=task.task_id)
cloned_task.data.script.entry_point = "test_task_b.py"
cloned_task._update_script(cloned_task.data.script)
cloned_task.set_parameters(**{"artifact_name": "test-artifact"})
Task.enqueue(cloned_task, queue_name="default")
```
the latest version, but I think it's normal: I set `TRAINS_WORKER_ID = "trains-agent":$DYNAMIC_INSTANCE_ID`, where DYNAMIC_INSTANCE_ID is the ID of the machine
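A hypothetical sketch of how such a worker id could be built: on EC2 the instance id is available from the metadata service, with a hostname fallback elsewhere so the id stays unique per machine:

```shell
# Derive a unique worker id per machine (metadata endpoint only resolves on EC2).
DYNAMIC_INSTANCE_ID=$(curl -s --max-time 1 http://169.254.169.254/latest/meta-data/instance-id 2>/dev/null || hostname 2>/dev/null || echo unknown)
export TRAINS_WORKER_ID="trains-agent:${DYNAMIC_INSTANCE_ID}"
echo "$TRAINS_WORKER_ID"
```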
This is the mapping of the faulty index:
```json
{
  "events-plot-d1bd92a3b039400cbafc60a7a5b1e52b_new" : {
    "mappings" : {
      "dynamic" : "strict",
      "properties" : {
        "@timestamp" : {
          "type" : "date"
        },
        "iter" : {
          "type" : "long"
        },
        "metric" : {
          "type" : "keyword"
        },
        "plot_data" : {
          "type" : "binary"
        },
        "plot_len" : {
          "type" : "long"
        },
        "plot_str" : {
          ...
```
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log call is fired and the response doesn't contain any log (but it should)
I just moved one experiment to another project; after moving it I am taken to the new project, where the layout is then reset