I killed both trains-agents and restarted one to have a clean start. This way it correctly spins up docker containers for services tasks. So the bug probably appears when something fails while setting up a task and the agent cannot go back to the main task. I would need to run some tests to validate that hypothesis though
I specified torch @
https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl and it didn't detect the link; it tried to install the latest version instead: 1.6.0
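For reference, the spec I used follows the standard pip direct-reference syntax (nothing ClearML-specific), i.e. a line like:
torch @ https://download.pytorch.org/whl/cu100/torch-1.3.1%2Bcu100-cp36-cp36m-linux_x86_64.whl
so I would expect that exact wheel to be installed rather than the latest release.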
it also happens without hitting F5 after some time (~hours)
But that was too complicated; I found an easier approach
TimelyPenguin76 That sounds amazing! Will there be a fallback mechanism as well? p3.2xlarge instances are often in short supply; it would be nice to define one resource requirement as first choice (e.g. p3.2xlarge) -> if not available -> use another resource requirement (e.g. g4dn)
Thanks AgitatedDove14 !
What would be the exact content of NVIDIA_VISIBLE_DEVICES if I run the following command?
trains-agent daemon --gpus 0,1 --queue default &
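If it helps, my plan was to just print it from inside a task running on that agent to see what it actually contains (rough sketch, assuming a Python task):
import os
# expecting something like "0,1" if the agent maps --gpus 0,1 straight into the environment
print(os.environ.get("NVIDIA_VISIBLE_DEVICES"))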
I also discovered https://h2oai.github.io/wave/ last week; it would be awesome to be able to deploy it in the same manner
Hi NonchalantHedgehong19, thanks for the hint! What should be the content of the requirements file then? Can I specify my local package inside? How?
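To make the question concrete, I was imagining something like this in the requirements file (plain pip syntax; my_local_package is just a placeholder, and I don't know whether the agent resolves local paths like this):
# regular pinned dependencies
numpy==1.19.2
# hypothetical local package installed from a relative path
-e ./my_local_package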
with what I shared above, I now get:
docker: Error response from daemon: network 'host' not found.
I just moved one experiment to another project; after moving it I am taken to the new project, where the layout is then reset
did you try with another availability zone?
Thanks for sharing the issue UnevenDolphin73 , I’ll comment on it!
I’m not too fond of many user configurations, it’s confusing.
100% agree, nevertheless, how much is too many? Currently, there are only two settings in the user preferences category, so one more wouldn’t hurt?
however, clearml is open source; nothing stops you from adding the code and sending a PR
I’d be super happy to contribute yes! Nevertheless, I am not sure where to start: clearml-server repo? clearml-web repo?
Here is what happens with polling_interval_time_min=1 when I add one task to the queue: the instance takes ~5 mins to start and connect. During this timeframe, the autoscaler starts two new instances, then spins them down. So it acts as if max_spin_up_time_min=10 is not taken into account
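For completeness, the relevant excerpt of my autoscaler settings is roughly this (values only, the exact file layout may differ on your side):
polling_interval_time_min: 1
max_spin_up_time_min: 10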
btw CostlyOstrich36, in Profile > Version I can see: 1.1.1-135 • 1.1.1 • 2.14. What do these numbers correspond to?
But you might want to double check
Awesome, thanks WackyRabbit7 , AgitatedDove14 !
Seems like it just went unresponsive at some point
They are, but this doesn't work - I guess it's because temporary IAM credentials have an extra token that should be passed as well, but there is no such option in the web UI, right?
SuccessfulKoala55 I was able to make it work with use_credentials_chain: true
in the clearml.conf and the following patch: https://github.com/allegroai/clearml/pull/478
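For anyone hitting the same thing, the setting itself is just one line; I added it roughly like this in clearml.conf (I believe it goes under the S3 section, but double-check the exact location for your version):
sdk {
  aws {
    s3 {
      # use the default boto3 credentials chain (env vars / profile / IAM role + session token) instead of explicit keys
      use_credentials_chain: true
    }
  }
}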