the instances take so much time to start, like 5 mins
I edited aws_auto_scaler.py; actually I think it’s just a typo, I just need to double the brackets
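For context, a minimal sketch of what "doubling the brackets" fixes, assuming the autoscaler builds its instance startup script with Python's str.format(): literal braces in the template have to be escaped as {{ and }}, otherwise .format() fails on them (the queue name and shell snippet below are made up):

# hypothetical startup-script template passed through str.format()
template = "echo 'agent for queue {queue}' && if [ -z \"${{HOME}}\" ]; then export HOME=/root; fi"
print(template.format(queue="aws4gpu"))
# -> echo 'agent for queue aws4gpu' && if [ -z "${HOME}" ]; then export HOME=/root; fi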
Interestingly, I do see the 100GB volume in the AWS console:
I did that recently - what are you trying to do exactly?
ok, what is your problem then?
I would probably leave it to the ClearML team to answer you; I am not using the UI app and for me it worked just fine with different regions. Maybe check the permissions of the key/secret?
Could you please share the stacktrace?
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 😄
It seems that around here, a Task that is created using init remotely in the main process gets its output_uri parameter ignored
even if I explicitly use previous_task.output_uri = "s3://my_bucket", it is ignored and still saves the json file locally
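For reference, a minimal sketch of the setup being described (project, task and bucket names are made up), using the regular Task.init / output_uri API:

from clearml import Task

# output_uri passed at init time; when the task is created remotely in the
# main process this value appears to be ignored
task = Task.init(project_name="my_project", task_name="my_task",
                 output_uri="s3://my_bucket")

# setting it explicitly afterwards does not help either: the json file is
# still written to the local filesystem instead of the bucket
task.output_uri = "s3://my_bucket"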
I killed both trains-agents and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So probably the bug appears when an error occurs while setting up a task: it cannot go back to the main task. I would need to do some tests to validate that hypothesis though
The task with id a445e40b53c5417da1a6489aad616fee is not aborted and is still running
So the controller task finished and now only the second trains-agent services-mode process is showing up as registered. So this is definitely something linked to switching back to the main process.
I will try to isolate the bug, if I can, I will open an issue in trains-agent 🙂
The weird thing is that the second experiment started immediately, correctly in a docker container, but failed with User aborted: stopping task (3) at some point (while installing the packages). The error message is surprising since I did not do anything. And then all following experiments are queued to the services queue and stuck there.
On the cloned experiment, which by default is created in draft mode, you can change the commit to point to either a specific commit or the latest commit of the branch
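As a minimal sketch of that workflow with the SDK (task id, names and queue are made up): while the clone is still a draft, its repository/branch/commit can be edited, e.g. from the experiment's Execution tab, before it is enqueued:

from clearml import Task

original = Task.get_task(task_id="<original_task_id>")
# the clone is created in draft mode, so its execution details
# (repository, branch, commit) can still be changed before it runs
cloned = Task.clone(source_task=original, name="same experiment, new commit")
Task.enqueue(cloned, queue_name="default")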
I’ll definitely check that out! 🤩
Hi CumbersomeCormorant74, yes, this is almost the scenario: I have a dozen projects. In one of them, I have ~20 archived experiments in different states (draft, failed, aborted, completed). I went to this archive, selected all of them and deleted them using the bulk delete operation. I got several failed-delete popups. So I tried again with smaller batches (like 5 experiments at a time) to identify which experiments were causing the error. I could delete most of them. At some point, all ...
I guess I’ll get used to it 😄
selecting multiple lines still works, you need to shift + click on the checkbox
DeterminedCrab71 Please check this screen recording
It broke holding shift to select multiple experiments btw
Restarting the server (docker-compose down then docker-compose up) solved the problem 😌 All experiments are back
and this works. However, without the trick from UnevenDolphin73, the following won’t work (Task.current_task() returns None):

if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()

from clearml import Task
Task.init()
AgitatedDove14, my “uncommitted changes” ends with...

if __name__ == "__main__":
    task = clearml.Task.get_task(clearml.config.get_remote_task_id())
    task.connect(config)
    run()

from clearml import Task
Task.init()