Hmm that is odd, could it be you are changing the sys.path ?
(What I'm assuming is happening is that it detects the packages in the PYTHONPATH and for some reason the order is different so it finds the "system" package before the "venv" package, hence the incorrect version)
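If you want to check, a quick sketch (numpy is just an example, use the package with the wrong version):
```python
import sys
import numpy  # example package, replace with the one showing the wrong version

print(sys.path)           # import search order, the venv paths should come first
print(numpy.__file__)     # which copy actually got imported
print(numpy.__version__)  # and the version it resolved to
```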
Seems correct.
I'm assuming something is wrong with the key/secret quoting ?!
Could you generate another one and test it ?
(you can have multiple key/secret pairs on the same user)
What about output_uri?
If you are using StorageManager directly, output_uri is not relevant
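For example, a minimal sketch of uploading directly with StorageManager (the local path and bucket are placeholders):
```python
from clearml import StorageManager

# Upload a local file straight to the storage of your choice, no output_uri involved
remote_url = StorageManager.upload_file(
    local_file="model.pkl",                        # placeholder local path
    remote_url="s3://my-bucket/models/model.pkl",  # placeholder destination
)
print(remote_url)
```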
Done 🙂
Basically try with the latest RC 🙂
pip install trains==0.15.2rc0
Hi ConvolutedSealion94
Yes 🙂
Task.set_random_seed(123)  # disable setting random number generators by passing None
task = Task.init(...)
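In case it helps, a minimal runnable sketch (project/task names are placeholders; pass None instead of 123 to leave the RNGs untouched):
```python
from clearml import Task

# Must be called before Task.init(); use None to disable seeding altogether
Task.set_random_seed(123)
task = Task.init(project_name="examples", task_name="seeded-run")  # placeholder names
```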
In order to use multiple credentials, one must use the ClearML SDK, obviously.
Yes 🙂
The number of entries in the dataset cache can be controlled via clearml.conf : sdk.storage.cache.default_cache_manager_size
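In clearml.conf it would look something like this (100 is just an example value):
```
sdk {
    storage {
        cache {
            # maximum number of dataset copies kept in the local cache
            default_cache_manager_size: 100
        }
    }
}
```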
So on the EC2 instance (with the agent running), just install the following prior to running the agent:
apt-get install poppler-utils
So essentially, the server helm chart creates a randomly generated secret pair and deploys it as a shared k8s secret that pods can access.
This is the tricky part: for the helm chart to be able to create it, it would need to log in to the server, which means there is a secret embedded in the helm chart that lets you access the default server. You see my point?
Yes please, just to verify my hunch.
I think that somehow the docker mounts the agent is creating are (for some reason) messing it up.
Basically you can just run the following (it will do everything automatically) (replace the <TASK_ID_HERE> with the actual one)
docker run -it --gpus "device=1" -e CLEARML_WORKER_ID=Gandalf:gpu1 -e CLEARML_DOCKER_IMAGE=nvidia/cuda:11.4.0-devel-ubuntu18.04 -v /home/dwhitena/.git-credentials:/root/.git-credentials -v /home/dwhitena/.gitconfig:/root/.gitconfig ...
give me a minute to test
You might only see it when the upload is done
Yup, I just wanted to mark it completed, honestly. But then when I run it, Colab crashes.
task.close() will do that
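Something like this minimal sketch (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="colab-run")  # placeholder names
# ... notebook / training code ...
task.close()  # flushes everything and marks the task completed
```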
BTW what's the exception you are getting ?
SkinnyPanda43 could it be the clearml.conf is too large? how come it exceeds 16kb ?
Any hint on how you start the AWS autoscaler ?
How do I tell from the ClearML UI which dataset version I am using?
Hi SubstantialElk6 , what exactly do you mean by "ClearML UI which datasets am I using" ? Do you mean, is there auto magic adding the dataset ID when you call Dataset.get() in your code ? (because if so, I specifically remember discussing adding this feature a few days ago, so you just bumped its priority 😉 )
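For reference, a minimal sketch of pulling a dataset and checking which ID/version you got (project and dataset names are placeholders):
```python
from clearml import Dataset

ds = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")  # placeholder names
print(ds.id)                      # the dataset ID you can look up in the UI
local_path = ds.get_local_copy()  # cached local copy of the dataset files
```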
Hi DeliciousBluewhale87
Hmm, good question.
Basically the idea is that if you have an ingestion service on the pods (i.e. as part of the yaml template used by the k8s glue), you can specify the exposed ports to the glue, so it knows (1) the maximum number of instances it can spin up, e.g. one per port, and (2) it will set the external port number on the Task, so that the running agent/code is aware of the exposed port.
A use case for it would be combining the clearml-session with the k8s glue...
Hmm, I'm not sure, there is no reason why it would get stuck.
Removing all the auto loggers can be done with
Task.init(..., auto_connect_frameworks=False)
which would disconnect all the automatic loggers (Hydra etc.). Of course, this is for debugging purposes
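A minimal sketch (project/task names are placeholders; the dict form lets you switch off only specific frameworks):
```python
from clearml import Task

# Disable all automatic framework logging (handy while debugging):
task = Task.init(project_name="examples", task_name="debug-run",
                 auto_connect_frameworks=False)

# Or disable only selected frameworks, e.g. Hydra, and keep the rest:
# task = Task.init(..., auto_connect_frameworks={"hydra": False})
```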
JitteryCoyote63 you mean in runtime where the agent is installing? I'm not sure I fully understand the use case?!
Hi MortifiedCrow63 , thank you for pinging! (seriously greatly appreciated!)
See here:
https://github.com/googleapis/python-storage/releases/tag/v1.36.0
https://github.com/googleapis/python-storage/pull/374
Can you test with the latest release, see if the issue was fixed?
https://github.com/googleapis/python-storage/releases/tag/v1.41.0
PanickyMoth78 thank you for the mock code, I can verify it reproduces the issue. It seems that for some reason (a bug) when the same function is called multiple times it "collects" parents, hence the odd graph.
BTW: if you want to see exactly what is passed to the step you can press on the step's full_details, and see the hyperparameter section.
I'll make sure we fix this bug in the next RC.
Generally speaking I would say the Nvidia deep-learning AMI:
https://aws.amazon.com/marketplace/pp/prodview-7ikjtg3um26wq
Hi SuperficialGrasshopper36
/home/ubuntu/.clearml/venvs-builds.1/3.8/task_repository/repository_name/.venv
This is the problem, they should not be installed there, they should be in /home/ubuntu/.clearml/venvs-builds.1/3.8/
Could you post the poetry.lock file? Maybe it is something there?
What's the poetry version and clearml-agent version ?
FYI:
ssh -R 8080:localhost:8080 -R 8008:localhost:8008 -R 8081:localhost:8081 replace_with_username@ubuntu_ip_here
solved the issue 🙂
SubstantialElk6
The ~<package name with the first letter dropped> == a.b.c is a known conda/pip temporary install issue (some leftover from a previous package install).
The easiest way is to find the site-packages folder and delete the package, or create a new virtual environment
BTW:
pip freeze will also list these broken packages
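If it helps, a quick sketch for spotting those leftover folders (assumes the standard site-packages layout):
```python
import site
from pathlib import Path

# An interrupted install leaves the old package folder renamed with a leading "~";
# list them so you know what to delete (or just recreate the virtual environment)
for sp in site.getsitepackages():
    for leftover in Path(sp).glob("~*"):
        print(leftover)
```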