After some investigation, I think it could come from the way you catch errors when checking the creds in trains.conf: when I passed the AWS creds using env vars, another error popped up: https://github.com/boto/botocore/issues/2187 , linked to boto3
Hi DilapidatedDucks58, I did that already, but I am reusing the same experiment instead of merging two experiments. Step 4 can be seen as (rough sketch of this below):
- Update the experiment status to stopped (if it is failed, you won’t be able to re-enqueue it)
- Set a parameter of that task to point to the latest checkpoint and load it (you can also infer it directly: I simply add a resume tag to the task, and check at runtime if this tag exists; if yes, I fetch the latest checkpoint of the task)
- Use https://clea...
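Roughly, this is how the resume-by-tag pattern could look with the clearml SDK — a minimal sketch only, assuming the tag is literally named "resume" and the latest checkpoint is the last output model registered on the task (ids, names and queue are placeholders):

```python
# Minimal sketch of the resume-by-tag pattern described above.
# Assumptions: the tag is named "resume" and the "latest checkpoint" is the
# last output model registered on the task; ids/names are placeholders.
from clearml import Task

# Controller side: stop the finished task, tag it, and re-enqueue it
prev = Task.get_task(task_id="<task_id>")
prev.mark_stopped()              # a failed task cannot be re-enqueued
prev.add_tags(["resume"])
Task.enqueue(prev, queue_name="default")

# Training side: at runtime, resume from the latest checkpoint if tagged
task = Task.current_task()
if task and "resume" in (task.get_tags() or []):
    outputs = task.models["output"]
    if outputs:
        checkpoint_path = outputs[-1].get_local_copy()
        # e.g. model.load_state_dict(torch.load(checkpoint_path))
```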
We would be super happy to be able to document experiments (a new tab in the experiments UI) with a markdown editor!
AnxiousSeal95 The main reason for me to not use clearml-serving triton is the lack of documentation tbh 😄 I am not sure how to make my pytorch model run there
Sure 🙂 Opened https://github.com/allegroai/clearml/issues/568
CostlyOstrich36 yes, when I scroll up, a new events.get_task_log is fired and the response doesn’t contain any log (but it should)
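For reference, this is roughly how one could hit the same endpoint outside the UI to check whether older batches really come back empty — just a sketch; the task id and batch size are placeholders and the exact parameter choice is my assumption:

```python
# Sketch: call events.get_task_log directly and see how many log events
# come back when paging toward older entries (task id is a placeholder).
from clearml.backend_api.session.client import APIClient

client = APIClient()
res = client.events.get_task_log(
    task="<task_id>",
    batch_size=100,
    navigate_earlier=True,  # same direction as scrolling up in the UI
)
print(len(res.events), "events returned")
```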
trains-agent daemon --gpus 0 --queue default &
trains-agent daemon --gpus 1 --queue default &
This one doesn’t have _to_dict
unfortunately
Thanks! (Maybe could be added to the docs ?) 🙂
Will it freeze/crash/break/stop the ongoing experiments?
What is the latest RC of clearml-agent? 1.5.2rc0?
You mean it will resolve by itself in the coming days, or should I do something? Or is there nothing to do and it will stay this way?
did you try with another availability zone?
So I need a way to merge the small configuration files into the bigger one
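To make that concrete, a minimal sketch of the kind of merge I have in mind — assuming the small files are plain YAML (file names are placeholders; this is not tied to any particular config library):

```python
# Sketch: deep-merge several small YAML config files into one big config
# (file names and the YAML format are assumptions for illustration).
import yaml


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`, later files winning."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


config = {}
for path in ["base.yaml", "model.yaml", "data.yaml"]:  # placeholder files
    with open(path) as f:
        config = deep_merge(config, yaml.safe_load(f) or {})
```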
I will let the team answer you on that one 🙂
I will try with that and keep you updated
There is no way to filter on long types? I can’t believe it
For the moment this is what I would be inclined to believe
Ok, I have a very different problem now. I did the following to restart the ES cluster:
docker-compose down
docker-compose up -d
And now the cluster is empty. I think docker simply created a new volume instead of reusing the previous one, as it had always done so far.
Interestingly, I do see the 100GB volume in the AWS console:
It seems that around here, a Task that is created using init remotely in the main process gets its output_uri parameter ignored
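For reference, a minimal way to check it — the project/task names and the destination URI below are placeholders, not the actual values:

```python
# Sketch: check whether output_uri survives when the task runs remotely
# under an agent (names and the S3 destination are placeholders).
from clearml import Task

task = Task.init(
    project_name="debug",
    task_name="output_uri check",
    output_uri="s3://my-bucket/artifacts",
)
print("effective output_uri:", task.output_uri)
```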
line 13 is empty 🤔
Now it starts, I’ll see if this solves the issue
wow if this works that’s amazing