
Alright, the experiment finished properly (all models uploaded). I will restart it to check again, but it seems like the bug was introduced after that
Never mind, the nvidia-smi command fails in that instance; the problem lies somewhere else
The task is created using Task.clone(), yes
So I need this merging of small configuration files to build the bigger one
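Roughly what I mean, as a sketch (YAML and these file names are just assumptions for illustration):
```
import yaml


def deep_merge(base: dict, override: dict) -> dict:
    """Recursively merge `override` into `base`; override wins on conflicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


# Build the big configuration from several small files (hypothetical names)
config = {}
for path in ["model.yaml", "data.yaml", "training.yaml"]:
    with open(path) as f:
        config = deep_merge(config, yaml.safe_load(f) or {})
```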
Thanks for the hack! The use case is the following: I have a controller that creates training/validation/testing tasks by cloning (so that the parent task id is properly set to the controller). Otherwise I could simply create these tasks with Task.init, but then I would need to set the parent task manually for each one of them, probably with a similar hack, right?
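To make the use case concrete, a rough sketch of what the controller does (project/task names and the queue are made up; only the clone/parent part is the point):
```
from clearml import Task

# The controller task itself
controller = Task.init(project_name="my_project", task_name="controller")

# A template task to clone for each split (hypothetical name)
template = Task.get_task(project_name="my_project", task_name="train_template")

for split in ("training", "validation", "testing"):
    child = Task.clone(
        source_task=template,
        name=f"{split} task",
        parent=controller.id,  # so the parent task id points back to the controller
    )
    Task.enqueue(child, queue_name="default")
```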
SuccessfulKoala55 I was able to recreate the indices in the new ES cluster. I specified number_of_shards: 4 for the events-log-d1bd92a3b039400cbafc60a7a5b1e52b index. I then copied the documents from the old ES using the _reindex API. The index is 7.5 GB on one shard.
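For the record, the calls were roughly these (hosts are placeholders; I used the plain REST API, the requests version here is just for illustration):
```
import requests

NEW_ES = "http://localhost:9200"   # placeholder for the new cluster
OLD_ES = "http://old-es:9200"      # placeholder for the old cluster
INDEX = "events-log-d1bd92a3b039400cbafc60a7a5b1e52b"

# Create the index on the new cluster with 4 primary shards
# (and no replicas, as mentioned further down)
requests.put(
    f"{NEW_ES}/{INDEX}",
    json={"settings": {"number_of_shards": 4, "number_of_replicas": 0}},
).raise_for_status()

# Copy the documents over with the _reindex API, reading from the old cluster.
# The old host has to be whitelisted via reindex.remote.whitelist on the new cluster.
requests.post(
    f"{NEW_ES}/_reindex?wait_for_completion=false",
    json={
        "source": {"remote": {"host": OLD_ES}, "index": INDEX},
        "dest": {"index": INDEX},
    },
).raise_for_status()
```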
Now I see that this index on the new ES cluster is ~19.4 GB 🤔 The index is divided into the 4 shards, but each shard is between 4.7 GB and 5 GB!
I was expecting to have the same index size as in the previous e...
Hi AnxiousSeal95 , I hope you had nice holidays! Thanks for the update! I discovered h2o when looking for ways to deploy dashboards with apps like streamlit. Most likely I will use either streamlit deployed through clearml or h2o as standalone if ClearML won't support deploying apps (which is totally fine, no offense there 🙂 )
it actually looks like I don’t need such a high number of files opened at the same time
AgitatedDove14 awesome! By "include it all" do you mean a wizard for Azure and GCP?
From https://discuss.pytorch.org/t/please-help-me-understand-installation-for-cuda-on-linux/14217/4 it looks like my assumption is correct: there is no need for cudatoolkit to be installed, since the wheels already contain all the CUDA/cuDNN libraries required by torch
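For anyone who wants to check this on their side, the bundled versions can be inspected directly, with no cudatoolkit package involved:
```
import torch

# These come from the libraries bundled inside the torch wheel itself
print(torch.__version__)               # e.g. a "+cuXXX" build tag on pip wheels
print(torch.version.cuda)              # CUDA version the wheel was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version
print(torch.cuda.is_available())       # True as long as the NVIDIA driver is present
```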
and this works. However, without the trick from UnevenDolphin73, the following won’t work (Task.current_task() returns None):
if __name__ == "__main__":
    task = Task.current_task()
    task.connect(config)
    run()
from clearml import Task
Task.init()
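For comparison, a minimal arrangement where Task.current_task() does return the task (not necessarily the same as the trick above) is to make sure Task.init() has already run in the same process before querying it:
```
from clearml import Task

config = {"batch_size": 32}  # placeholder for the real config dict


def run():
    pass  # placeholder for the actual entry point


if __name__ == "__main__":
    Task.init(project_name="examples", task_name="demo")  # hypothetical names
    task = Task.current_task()  # no longer None once Task.init() has run in this process
    task.connect(config)
    run()
```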
AgitatedDove14 WOW, thanks a lot! I will dig into that 🚀
I see 3 agents in the "Workers" tab
PS: in the new env, I’ve set num_replicas: 0, so I’m only talking about primary shards…
What happens is a different error, but it was so weird that I thought it was related to the installed version
Disclaimer: I didn't check that this reproduces the bug, but those are all the components that should reproduce it: a for loop creating figures and clearml logging them
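Something along these lines (whether the figures were reported explicitly or picked up automatically is an assumption on my side, as are the names):
```
import matplotlib.pyplot as plt
from clearml import Task

task = Task.init(project_name="debug", task_name="figure loop repro")  # hypothetical names

for i in range(10):
    fig, ax = plt.subplots()
    ax.plot(range(i + 1))
    # Report each figure from inside the loop
    task.get_logger().report_matplotlib_figure(
        title="loop figures", series="run", figure=fig, iteration=i
    )
    plt.close(fig)
```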
Yes, not sure it is connected either, actually - to make it work, I had to disable both venv caching and set use_system_packages to off, so that it reinstalls the full env. I remember that we discussed this problem already, but I don't remember what the outcome was; I was never able to make it update the private dependencies based on the version. But this is most likely a problem from pip, which is not clever enough to parse the tag as a semantic version and check whether the installed package ma...
As you can see: more hard waiting (an initial sleep), and then, before each apt action, making sure there is no lock
AppetizingMouse58 btw I had to delete the old logs index before creating the alias, otherwise ES won’t let me create an alias with the same name as an existing index
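In case it helps anyone else, the sequence was roughly this (the host and index names are placeholders, not the real ones):
```
import requests

ES = "http://localhost:9200"          # placeholder for the ES host
OLD_INDEX = "old-logs-index"          # placeholder: the name the alias should take over
NEW_INDEX = "new-logs-index"          # placeholder: the index the alias should point to

# Delete the old index first; otherwise the alias creation below fails
# because an index with that exact name still exists
requests.delete(f"{ES}/{OLD_INDEX}").raise_for_status()

# Then create the alias with the old name, pointing at the new index
requests.post(
    f"{ES}/_aliases",
    json={"actions": [{"add": {"index": NEW_INDEX, "alias": OLD_INDEX}}]},
).raise_for_status()
```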
That's why I suspected trains was installing a different version than the one I expected