I mean, inside a parent project, do not show the [parent] project if there is nothing inside it
and then call task.connect_configuration probably
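Something like this minimal sketch is what I have in mind (the project/task names and the params dict are just placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="config demo")  # placeholder names

my_config = {"batch_size": 32, "lr": 0.001}  # hypothetical configuration dict
# connect_configuration logs the dict under the task's CONFIGURATION tab
# and returns the (possibly overridden) values when the task runs remotely
my_config = task.connect_configuration(my_config, name="my config")
```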
What is this cleanup service? Where is it available?
I added the pass_hashed field and restarted the server, but I still get the same problem
So the problem comes when I do my_task.output_uri = "s3://my-bucket" : trains checks in the background whether it has access to this bucket, and it is not able to find/read the creds
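To be concrete, this is roughly what I do (the bucket name is a placeholder); my understanding is that the creds are expected to come from the sdk.aws.s3 section of trains.conf / clearml.conf:
```
from clearml import Task

task = Task.init(project_name="examples", task_name="s3 output")  # placeholder names

# As soon as output_uri points to s3, trains verifies in the background that it
# can write to the bucket, using the credentials configured under sdk.aws.s3
# in ~/trains.conf (or ~/clearml.conf)
task.output_uri = "s3://my-bucket"
```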
Hi AgitatedDove14 , sorry, somehow this message got lost
The clearml version is the latest at the time, 1.7.1. Yes, I always see "model uploaded completed" for such stuck tasks. I am using python 3.8.10
This is what I get when I am connected and when I am logged out (after clearing cache/cookies)
AgitatedDove14 According to the dependency order you shared, the original message of this thread isn't solved: the agent I mentioned used the output from nvcc (2) before checking the nvidia driver version (1)
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because it thinks there are already enough agents available while in reality the agent is down
AgitatedDove14 Yes, that might work; the first one (with conda) might work as well. I will give it a try, thanks!
They indeed do auto-rotate when you limit the size of the logs
Hi AgitatedDove14 , coming back after a few experiments this morning:
Indeed torch 1.3.1 does not support this cuda version; I tried with 1.7.0 and it worked, BUT trains was not able to pick the right wheel when I updated the torch requirement from 1.3.1 to 1.7.0: it downloaded the wheel for cuda version 101, even though in the experiment log the agent correctly reported the cuda version (111). I then replaced torch==1.7.0 with the direct https link to the torch wheel for cuda 110, and that worked (I also tried specifyin...
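For reference, the requirement line I ended up with looked roughly like this (the exact wheel filename/URL is only an illustration for cp38/linux, taken from the PyTorch download index, and may differ for your platform):
```
# instead of: torch==1.7.0
https://download.pytorch.org/whl/cu110/torch-1.7.0%2Bcu110-cp38-cp38-linux_x86_64.whl
```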
Ok yes, I get it. This info is also available at the very beginning of the logs, where the agent logs the full docker run command; is this docker_cmd a shorter version of it?
Thanks for clarifying! Maybe this could be made explicit in the agent logs of the experiments, with something like the following?
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
Now I'm curious, what did you end up doing?
In my repo I maintain a bash script that sets up a separate python env. Then in my task I spawn a subprocess without passing the env variables, so that the subprocess properly picks up the separate python env.
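Roughly like this (the script name and interpreter path are specific to my repo, shown here only as placeholders):
```
import subprocess

# Build the separate python env (placeholder script name)
subprocess.run(["bash", "setup_env.sh"], check=True)

# Launch the job with a minimal environment instead of inheriting os.environ,
# so the subprocess resolves the interpreter from the separate env
subprocess.run(
    ["separate_env/bin/python", "worker.py"],
    env={"PATH": "/usr/bin:/bin"},  # deliberately not forwarding the parent env vars
    check=True,
)
```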
The only thing that changed is the new auth.fixed_users.pass_hashed field, which I don't have in my config file
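For context, this is roughly the fixed-users section I would expect to end up with (usernames/hashes are placeholders, and I'm not sure about the exact layout, which is part of my question):
```
auth {
  fixed_users {
    enabled: true
    pass_hashed: true   # the new field; the passwords below would then be hashes
    users: [
      { username: "jane", password: "<password-hash>", name: "Jane" }
    ]
  }
}
```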
Ping CostlyOstrich36 AgitatedDove14 SuccessfulKoala55 , just making sure this wasn't missed
Thanks TimelyPenguin76 and AgitatedDove14 ! I would like to delete artifacts/models related to the old archived experiments, but they are stored on s3. Would that be possible?
Both ^^, I already adapted the code for GCP and I was planning to adapt to Azure now
Hi AnxiousSeal95 , I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml, or h2o as a standalone if ClearML won't support deploying apps (which is totally fine, no offense there)
well I still see some ES errors in the logs
```
clearml-apiserver | [2021-07-07 14:02:17,009] [9] [ERROR] [clearml.service_repo] Returned 500 for events.add_batch in 65750ms, msg=General data error: err=('500 document(s) failed to index.', [{'index': {'_index': 'events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b', '_type': '_doc', '_id': 'c2068648d2fe5da975665985f44c20b6', 'status':..., extra_info=[events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b][0] primary shard is not...
```
Here is what happens with polling_interval_time_min=1 when I add one task to the queue. The instance takes ~5 mins to start and connect. During this timeframe, the autoscaler starts two new instances, then spins them down. So it acts as if max_spin_up_time_min=10 is not taken into account
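Just to make sure we are talking about the same knobs, these are the two hyper-params I am setting (sketched here as a plain dict; the rest of my autoscaler config is unchanged):
```
# Only these two values were changed; the names are the ones from my autoscaler config
hyper_params = {
    "polling_interval_time_min": 1,   # poll the queues every minute
    "max_spin_up_time_min": 10,       # how long to wait for a new instance to come up
}
```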
AgitatedDove14 Same problem with clearml==1.1.5rc2 . I also tried with backend==gloo , still the same problem
I am sorry to give info that is not very precise, but it's the best I can do. Is this bug happening only to me?
AgitatedDove14 I finally solved it: the problem was that --network='host' should be --network=host
Guys, the experiments I had running didn't fail, they just waited and reconnected. This is crazy cool
Some context: I am trying to log an HTML file and I would like it to be easily accessible for preview
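Concretely, what I am trying is something like this (file name and task names are placeholders); I would expect the HTML to show up with a preview in the DEBUG SAMPLES section:
```
from clearml import Task, Logger

task = Task.init(project_name="examples", task_name="html report")  # placeholder names

# Attach a local HTML file to the task; report_media uploads it and
# should make it browsable from the web UI
Logger.current_logger().report_media(
    title="report",
    series="summary",
    iteration=0,
    local_path="report.html",
)
```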
Usually one or two tags. Indeed, task ids are not so convenient, but only because they are not displayed on the page, so I have to go back to another page to check the ID of each experiment. Maybe just showing the ID of each experiment on the SCALARS page would already be great, wdyt?