it seems it's following the path of the script i'm using to task.create, eg:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn," )
A wrong path would imply that it is not finding the correct repository, is that the case?
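For reference, a minimal sketch of that call (the repo URL and script path below are placeholders, not your actual values):

from clearml import Task

# the working folder the agent uses is derived from the script path
task = Task.create(
    project_name="examples",                  # placeholder
    task_name="remote run",                   # placeholder
    repo="https://github.com/user/repo.git",  # placeholder repo
    script="subfolder/ep_fn.py",              # placeholder; the folder comes from here
)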
https://github.com/allegroai/clearml/blob/master/clearml/automation/trigger.py
Example coming soon, with docs :)
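Until the official example lands, a rough sketch based on the linked trigger.py (treat the exact argument names as assumptions until the docs are out):

from clearml.automation import TriggerScheduler

# poll the backend for trigger events every few minutes
trigger = TriggerScheduler(pooling_frequency_minutes=3)
# clone + enqueue an existing task whenever a model in the project is published
trigger.add_model_trigger(
    name="retrain-on-publish",              # hypothetical trigger name
    schedule_task_id="<template-task-id>",  # placeholder task to clone and enqueue
    schedule_queue="default",
    trigger_project="examples",
    trigger_on_publish=True,
)
trigger.start()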
You mean the entire organization already has Kubeflow, or that it would help better organize something? (If the second, what are we organizing, pipelines?)
I see what you mean.
an_optimizer = HyperParameterOptimizer(
    base_task_id='39d2c27baa8145929b2e21f686a17046',
    hyper_parameters=[],
    objective_metric_title='epoch_accuracy',
    objective_metric_series='epoch_accuracy',
    objective_metric_sign='max',
    optimizer_class=aSearchStrategy,
    max_iteration_per_job=0,
    total_max_jobs=0,
    auto_connect_task=False,
)
print(an_optimizer.get_top_experiments(top_k=5))
I think it would be nicer if the CLI had a subcommand to show the content of ~/.clearml_data.json.
Actually, it only stores the last dataset id at the moment, so not much 🙂
But maybe we should have a command-line option that just outputs the current dataset id; that would make it easier to grab and pipe
WDYT?
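Meanwhile, something like this could be piped (a sketch; it assumes the last dataset id sits under a key like "latest_dataset" inside ~/.clearml_data.json, which may not match the actual file layout):

import json
from pathlib import Path

# print the last dataset id recorded by clearml-data, so it can be piped onward
state = json.loads((Path.home() / ".clearml_data.json").read_text())
print(state.get("latest_dataset", ""))  # key name is an assumption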
Hi @<1673501379764686848:profile|VirtuousSeaturtle4>
What I don't get is that the example does not refer to a bucket path. What bucket path should I specify?
you mean to store data?
But I do not have anything linked correctly since I rely on conda installing cuda/cudnn for me
From the log it installed: cudatoolkit==11.1.1
based on the CUDA it found on the host machine: agent.cuda_version = 110
But for some reason it installed pytorch from the conda "pytorch" channel without cuda support.
Hi @<1523701868901961728:profile|ReassuredTiger98>
Could you send the full log? Also, what's the clearml-agent version?
Hi @<1523701868901961728:profile|ReassuredTiger98>
This should have worked; it seems conda is not fetching the correct pytorch version (even though the conda env contains the cuda version they specify)
Let's try something, reset the Task, then edit the "Installed packages" and add:
cudatoolkit==11.1.1
Then try again.
Let's see what we get.
(The idea is that I think conda forgets it just installed cudatoolkit and assumes the env is CPU-only)
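So after the edit, the top of "Installed packages" would read something like this (the torch pins are illustrative, keep whatever versions are already listed there):

cudatoolkit==11.1.1
torch==1.8.1        # illustrative
torchvision==0.9.1  # illustrative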
Okay. And 110 means 11.1 and not 11.0?
110 means 11.0. The odd thing is, it actually installed 11.1, and from the pytorch website this is exactly how they suggest installing with conda...
Let me know if forcing the CUDA version changes anything
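For reference, forcing it would be something like this in clearml.conf on the agent machine (a sketch; same agent.cuda_version key that appeared in your log):

agent {
    # override the auto-detected "110" (=11.0) and force 11.1
    cuda_version: 111
}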
Hmm, you are correct
Which means this is some conda issue; basically, when installing from an env file, conda is not resolving the correct pytorch version 🙂
Not sure why... Could you try to upgrade conda?
@<1523701868901961728:profile|ReassuredTiger98> in the UI can you see it in the "Installed packages" section under the Execution tab?
@<1523701868901961728:profile|ReassuredTiger98> it works on my machine 🙂
of what task? i'm running lots of them and benchmarking
If you are skipping every installation it should be the same
because if you set CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 it will not install anything at all
This is why it's odd to me...
wdyt?
what if the preexisting venv is just the system python? my base image is python:3.10.10 and i just pip install all requirements in that image. Does that not avoid venv still?
it will basically create a new venv inside the container, forking the existing preinstalled stuff (i.e. the new venv already has everything the system python has preinstalled)
then it will call "pip install" on all the "Installed packages" of the Task.
Which should just check everything is there and install nothing...
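Conceptually it is close to this (a sketch, not the agent's literal commands):

# new venv that sees everything preinstalled in the system python
python -m venv --system-site-packages /tmp/task_venv
# re-checking the Task's "Installed packages"; a no-op if everything is already there
/tmp/task_venv/bin/pip install -r task_installed_packages.txt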
BTW: you can also just add -e "CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1" to the docker args (under the Execution tab) to override the setting of the docker.
you can also add "export" to the docker startup bash script section (do not add "#!/bin/bash", just the actual script) to get a list of all the environment variables inside the docker, just in case
Where is the log of the Task that is created for the component?
The error you are getting is the Pipeline itself throwing an error because the component Task failed; the question is why the component Task failed
Hi @<1572395184505753600:profile|GleamingSeagull15>
Is there an official place to report bugs and add feature requests for the app.clear.ml website?
GitHub issues is usually the place, or the
Assuming GitHub, but just making sure you don't have another PM tool you'd rather use.
Really appreciate you asking! It is always hard to keep track 🙂
Suppose that a new model version 2 is trained, but it does not fulfill our target metrics. Is it possible to just save the model to the model repo and not serve it, if model version 1 is already being served?
Sure, just do not "publish" the model; it will be stored in the model repository, fully accessible, but clearml-serving will not serve it 🙂
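For example (a sketch; OutputModel here stands in for however your training code registers the model):

from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="train v2")  # placeholder names
model = OutputModel(task=task, name="my-model")
model.update_weights("model_v2.pt")  # stored in the model repository, visible in the UI
# intentionally NOT calling model.publish() -> clearml-serving keeps serving v1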
Maybe the only thing to worry about is making sure the IP address is stable, so if k8s replaces the node, you do not have to reconfigure the clients 🙂
Hmm, are you running the clearml-agent on this machine? (This is the orchestration module, it will spin up the Tasks and the dockers on the GPUs)
Hi @<1555362936292118528:profile|AdventurousElephant3>
I think your issue is that Task supports two types of code:
- single script/jupyter notebook
- git repo + git diff
In your example (if I understand correctly) you have a notebook calling another notebook, which means the first notebook will be stored on the Task, but the second notebook (not being part of a repository) will not be stored on the task, and this is why when the agent is running the code it fails to find the second notebook...
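To make it concrete, a minimal sketch of the failing pattern (file names are hypothetical):

# first_notebook.ipynb -- captured by the Task as a single script
from clearml import Task
task = Task.init(project_name="examples", task_name="parent notebook")

# assumes second_notebook.ipynb sits next to this one locally;
# the agent only materializes the first notebook, so remotely this file
# does not exist and the run fails here
%run second_notebook.ipynb

Moving both notebooks into a git repository (the second code type above) means the agent clones the whole repo, so the sibling notebook is present.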