AgitatedDove14 Is it fixed with trains-server 0.15.1?
Is there any channel where we can see when new self-hosted server versions are published?
See my answer in the issue - I am not using docker
This is consistent: each time I send a new task to the default queue, if trains-agent-1 has only one task running (the long one), it will pick another one. If I add one more experiment to the queue at that point (trains-agent-1 running two experiments at the same time), that experiment will stay in the queue (trains-agent-2 and trains-agent-3 will not pick it because they are also running experiments)
(I am not part of the awesome ClearML team, just a happy user 🙂)
AppetizingMouse58 btw I had to delete the old logs index before creating the alias, otherwise ES won't let me create an alias with the same name as an existing index
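In case it helps anyone else, it was roughly equivalent to this (sketched with Python requests; the host, index and alias names here are placeholders, not the real ones):
```python
import requests

ES = "http://localhost:9200"  # placeholder host

# the alias cannot share a name with an existing index, so the old index has to go first
requests.delete(f"{ES}/events-log")

# then an alias with that name can point at the new index
requests.post(
    f"{ES}/_aliases",
    json={"actions": [{"add": {"index": "events-log-2024", "alias": "events-log"}}]},
)
```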
AgitatedDove14 I see that the default is sample_frequency_per_sec=2., but in the UI I see that there isn't such resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30.?
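To make the question concrete, this is roughly what I imagine is happening (pure pseudocode on my side, the sampling helper is made up):
```python
import random
import time


def read_machine_stats():
    # stand-in for whatever the monitor actually samples (CPU/GPU/RAM usage)
    return random.random()


def monitor(sample_frequency_per_sec=2.0, report_frequency_sec=30.0):
    # my guess: readings are taken at sample_frequency_per_sec, accumulated,
    # and only their average is reported every report_frequency_sec, which
    # would explain seeing roughly one point per ~30 secs in the UI
    samples, last_report = [], time.time()
    while True:
        samples.append(read_machine_stats())
        time.sleep(1.0 / sample_frequency_per_sec)
        if time.time() - last_report >= report_frequency_sec:
            print(sum(samples) / len(samples))  # stand-in for the actual scalar report
            samples, last_report = [], time.time()
```
Is that roughly how the two settings interact?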
Hi SuccessfulKoala55, it's not really wrong, rather I don't understand it: the docker image with the args after it
Just found it, yeah, very cool! Thanks!
So the wheel that was working for me was this one: [torch-1.11.0+cu115-cp38-cp38-linux_x86_64.whl](https://download.pytorch.org/whl/cu115/torch-1.11.0%2Bcu115-cp38-cp38-linux_x86_64.whl)
but then why do I have to do task.connect_configuration(read_yaml(conf_path))._to_dict() ?
Why not simply task.connect_configuration(read_yaml(conf_path))?
I mean what is the benefit of returning ProxyDictPostWrite instead of a dict?
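To be concrete, this is the pattern I mean (read_yaml, conf_path and the project/task names are placeholders on my side):
```python
import yaml
from clearml import Task


def read_yaml(path):
    # placeholder helper: load a YAML file into a plain dict
    with open(path) as f:
        return yaml.safe_load(f)


task = Task.init(project_name="examples", task_name="config-demo")
conf_path = "config.yaml"

# what I currently do: connect the config, then turn the returned
# ProxyDictPostWrite back into a plain dict
config = task.connect_configuration(read_yaml(conf_path))._to_dict()

# what I would have expected to be enough:
# config = task.connect_configuration(read_yaml(conf_path))
```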
I see what I described in https://allegroai-trains.slack.com/archives/CTK20V944/p1598522409118300?thread_ts=1598521225.117200&cid=CTK20V944 :
randomly, one of the two experiments is shown for that agent
AgitatedDove14 In theory, yes, there is no downside; in practice, running an app inside docker inside a VM might introduce slowdowns. I guess it's on me to check whether this slowdown is negligible or not
No space, I will add and test 🙂
yes, the only thing I changed is:
```
install_requires=[
    ...
    "my-dep @ git+
]
```
to:
```
install_requires=[
    ...
    "git+ "
]
```
yes, because it won't install the local package whose setup.py has the problem in its install_requires described in my previous message
my agents are all on 0.16 and I install trains 0.16rc2 in each Task being executed by the agent
AgitatedDove14 I was able to redirect the logger by doing so:
```python
import logging

from clearml import Task

clearml_logger = Task.current_task().get_logger().report_text
early_stopping = EarlyStopping(...)
# route the handler's own log messages to the ClearML console log
early_stopping.logger.debug = clearml_logger
early_stopping.logger.info = clearml_logger
early_stopping.logger.setLevel(logging.DEBUG)
```
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
Ok to be fair I get the same curve even when I remove clearml from the snippet, not sure why
correct, you could also use `Task.create`, which creates a Task but does not do any automagic.
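A minimal sketch of what I mean (the project/task names are placeholders):
```python
from clearml import Task

# Task.create only registers a new task on the server; unlike Task.init
# it does not attach to the running script or auto-log anything
task = Task.create(project_name="examples", task_name="manually created task")

# compare with the usual automagic entry point:
# task = Task.init(project_name="examples", task_name="auto-logged task")
```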
Yes, I haven't used it so far because I didn't know what to expect, since the doc states:
"Create a new, non-reproducible Task (experiment). This is called a sub-task."
Thanks! Corrected both, now it's building
Alright, so the steps would be:
```
trains-agent build --docker nvidia/cuda --id myTaskId --target base_env_services
```
That would create a base docker image `base_env_services` for me. Then how should I ensure that trains-agent uses that base image for the services queue? My guess is:
```
trains-agent daemon --services-mode --detached --queue services --create-queue --docker base_env_services --cpu-only
```
Would that work?