Unfortunately this needs backend support and is only available in the enterprise version. What is your use case for it? (It was designed to allow out-of-the-box bare-metal multi-GPU dynamic allocation: think of a DGX with 8 GPUs where, instead of spinning down agents when you want to change the queue->num-gpu mapping, you can do it on the fly.)
Hi @<1523701168822292480:profile|ExuberantBat52>
We log all history anonymously here:
https://faq.clear.ml/
See if you can find it there
Replace it with:
git+
No need for the repository name; this will ensure you always reinstall it (again, a pip feature).
BTW trains agent will not delete the venv until the next run, so you can check exactly what's missing there
and what is --storage s3//:inference ?
if you are using minio it should be something like None
Notice you have to specify the IP:port otherwise it thinks it is an AWS endpoint
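A minimal sketch of how such a storage URI could be assembled, assuming the `s3://<host>:<port>/<bucket>` form for non-AWS endpoints. The helper name `minio_uri`, the address `192.168.1.10:9000`, and the bucket name are illustrative assumptions, not values from this thread.

```python
def minio_uri(host: str, port: int, bucket: str, prefix: str = "") -> str:
    """Build an s3:// URI that carries an explicit host:port.

    The explicit port is what distinguishes a self-hosted endpoint
    (e.g. MinIO) from a plain AWS bucket name.
    """
    if not host or not (0 < port < 65536):
        raise ValueError("a host and a valid port are required")
    uri = f"s3://{host}:{port}/{bucket.strip('/')}"
    if prefix:
        uri += "/" + prefix.strip("/")
    return uri


# Hypothetical values, for illustration only
print(minio_uri("192.168.1.10", 9000, "inference"))
```

The resulting string is what you would pass wherever a storage location is expected; the matching credentials still have to be configured separately.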
Wait, that makes no sense to me. The API from python and the API from the UI are getting the same data from the backend ...
What are you getting with:
from clearml import Task
task = Task.get_task(task_id=<put task id here>)
print(task.models)
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
Could it be something else is missing and hence the import fails ?
Hi SubstantialElk6
Could you test with the latest RC6 ?
pip install clearml==0.17.5rc6
Hi @<1570220858075516928:profile|SlipperySheep79>
Is there a way to specify the working dir from the decorator?
Not directly, but why would that change anything? I mean the component code will be created in the git root, and you can still access files inside the subfolders:
from .subfolder import something
what am I missing?
I mean what is the actual link?
File:// is a path to a file.
If your machine cannot access that path you get an error.
For example:
file:///home/user/file.bin
translates to /home/user/file.bin
If you do not have the file /home/user/file.bin on your machine you get an error.
GrievingTurkey78 does that make sense?
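The translation described above can be checked with the standard library alone. This is a small sketch; the function names `file_url_to_path` and `check_local_file` are made up for illustration.

```python
import os
from urllib.parse import unquote, urlparse


def file_url_to_path(url: str) -> str:
    """Translate a file:// URL into the local path it points to."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        raise ValueError(f"not a file:// URL: {url}")
    # The path component of file:///home/user/file.bin is /home/user/file.bin
    return unquote(parsed.path)


def check_local_file(url: str):
    """Return (path, exists) so a missing file can be reported clearly."""
    path = file_url_to_path(url)
    return path, os.path.isfile(path)


path, exists = check_local_file("file:///home/user/file.bin")
print(path, "exists" if exists else "is missing on this machine")
```

If the second value is false, the machine resolving the link simply does not have that file, which is the error case described above.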
Note that by default trains / clearml will not upload your weights file anywhere; it will do that only if you set "output_uri" to a specific location.
You can always log it manually:
from clearml import InputModel
input_model = InputModel.import_model(weights_url='/tmp/keras_example/weight.6.hdf5')
. but when we try to do a "New Run" from UI, it tries to follow the DAG of previous run (the run with all child nodes skipped) and the new run fails too.
This is odd, is this reproducible ? what's the clearml python package version ?
SubstantialElk6 could you try with the latest (just released)?
pip install clearml-agent==0.17.2
Then, if possible, could you attach the full log of the agent's execution (Task->results->Console)?
How can I specify the agent to use a specific conda environment inside the docker?
Hi CrookedWalrus33
By default it will pick the highest python in the PATH.
Then, if you have a python version (in PATH) that matches the one requested on the Task, it will look for it.
Do you want to limit it to a specific python binary ?
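To make the lookup order concrete, here is a simplified sketch of the idea. It is not the agent's actual code: the function name `pick_python` and the candidate list are assumptions, and the fallback to a generic `python3` stands in for "the highest python in the PATH".

```python
import shutil
from typing import Callable, Optional


def pick_python(
    requested: Optional[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Prefer a binary matching the requested version, else a generic one."""
    candidates = []
    if requested:
        # e.g. requested "3.9" -> look for "python3.9" in PATH first
        candidates.append(f"python{requested}")
    candidates += ["python3", "python"]
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


print(pick_python("3.9"))
```

The `which` parameter exists only so the lookup can be exercised without depending on what is installed on the machine.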
EnviousStarfish54 regarding the file server, you have one built into the trains-server, and this will be the default location to store all artifacts. You can also use external solutions like S3, GS, Azure, etc.
Regarding the models, any model store / load is automatically logged as long as you are using one of the supported frameworks (TF, Keras, PyTorch, scikit-learn).
If you want your model to be automatically uploaded, just add output_uri:
task=Task.init('examples', 'model', output_uri=' http://trai...
How does it work with k8s?
You need to install the clearml-glue and then request the container on the Task. Notice you need to preconfigure the glue with the correct Job YAML.
SolidSealion72 EcstaticGoat95 I'm hoping the issue is now resolved 🤞
Can you verify with:
pip install git+
- Yes, Task.init should be called on each subprocess (because torch forks them before they are patched)
- I think the main issue is that we patch the argparse on the subprocess (this is assuming you did not manually parse non-argv arguments)
- If you can create a mock test I think we can work around the issue, as long as the way you spin it is the standard pytorch distributed way
Ohh, if this is the case then it kind of makes sense to store on the Task itself. Which means the Task object will have to store it, and then the UI will display it :(
I think the actual solution is a vault, per user, which would allow users to keep their credentials on the server, and the agent to pass those to the Task when it spins it, based on the user. Unfortunately the vault feature is only available on the paid/enterprise version (with RBAC etc.).
Does that make sense?
I keep getting a "failed getting token" error
MiniatureCrocodile39 what's the server you are using ?
As long as you import clearml in the main script, it should work. Regarding the Nvidia container, it should not interfere with any running processes; the only issue is the memory limit. BTW, any reason not to spin an agent on a dedicated machine? What is the GPU used for on the clearml server machine?
in Your Additional ClearML Configuration (which is basically clearml.conf configuration)
Add the following:
environment {
    GOOGLE_APPLICATION_CREDENTIALS="~/gs.cred"
}
files {
    gsc {
        contents: "<this is your GCP storage credentials file>"
        path: "~/gs.cred"
    }
}
Reference:
https://github.com/allegroai/clearml-agent/blob/a5a797ec5e5e3e90b115213c0411a516cab60e83/docs/clearml.conf#L421
PompousBeetle71 the code is executed without arguments; at run-time trains / trains-agent will pass the arguments (as defined on the task) to the argparser. This means that you get the ability to change them and also type checking 🙂
PompousBeetle71 if you are not using argparser how do you parse the arguments from sys.argv? manually?
If that's the case, post parsing, you can connect a dictionary to the Task and you will have the desired behavior
` task.connect(dict_with_arguments...
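A rough sketch of that flow, assuming arguments are passed as `--key value` or `--key=value`. The parser `parse_argv`, the wrapper `connect_arguments`, and the project/task names are hypothetical; only `Task.init` and `task.connect` are the library calls, and the wrapper needs a configured clearml installation to run.

```python
import sys
from typing import Dict, List


def parse_argv(argv: List[str]) -> Dict[str, str]:
    """Manually collect --key value / --key=value pairs into a dict."""
    args: Dict[str, str] = {}
    i = 0
    while i < len(argv):
        token = argv[i]
        if token.startswith("--"):
            key = token[2:]
            if "=" in key:
                key, value = key.split("=", 1)
            elif i + 1 < len(argv) and not argv[i + 1].startswith("--"):
                value = argv[i + 1]
                i += 1
            else:
                value = "true"  # bare flag with no value
            args[key] = value
        i += 1
    return args


def connect_arguments(argv: List[str]) -> Dict[str, str]:
    """Parse first, then connect the dict so it shows up on the Task."""
    from clearml import Task  # imported here: requires a configured clearml setup

    task = Task.init(project_name="examples", task_name="manual argv")  # hypothetical names
    args = parse_argv(argv)
    task.connect(args)  # the connected dict becomes editable on the Task
    return args


if __name__ == "__main__":
    print(parse_argv(sys.argv[1:]))
```

Connecting the dictionary after parsing is what gives the "change it from the UI and rerun" behavior for scripts that do not use argparse.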
@<1523711619815706624:profile|StrangePelican34> are you saying that after the " with " block the task is marked completed? how is that possible? is this done manually ?
Plan is to have it out in the next couple of weeks.
Together with a major update in v0.16
They all "inherit" the same user / environment from one another