SmarmyDolphin68 sadly if this was not executed with trains (i.e. the offline option of trains), this is not really doable (I mean it is, if you write some code and parse the TB 😉 but let's assume this is way too much work)
A few options:
On the next run, use the clearml OFFLINE option (i.e. in your code call Task.set_offline(), or set the env variable CLEARML_OFFLINE_MODE=1).
You can compress and upload the checkpoint folder manually, by passing the checkpoint folder, see
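A minimal sketch of the two ways to switch on offline mode mentioned above, assuming the `clearml` package is installed. The helper names and the project/task names passed in are placeholders for illustration, not part of ClearML:

```python
import os


def enable_offline_via_env():
    """Option 1: set the env variable (do this before clearml is imported)."""
    os.environ["CLEARML_OFFLINE_MODE"] = "1"


def start_offline_task(project_name, task_name):
    """Option 2: switch to offline mode in code, before calling Task.init()."""
    # Imported lazily so the env-variable helper above works on its own.
    from clearml import Task

    Task.set_offline(offline_mode=True)
    return Task.init(project_name=project_name, task_name=task_name)
```

In offline mode the run is stored locally instead of being sent to the server, so it can be imported later (for example with `Task.import_offline_session`).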
Notice that the actual configuration that is used is the
But it is created here:
ClumsyElephant70 the odd thing is the error here: docker: Error response from daemon: manifest for nvidia/cuda:latest not found: manifest unknown: manifest unknown.
I would imagine it will be with "nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04" but the error is saying "nvidia/cuda:latest"
How could that be ?
Also can you manually run the same command (i.e. docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash) ?
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
Looks great, let me see if I can understand what's missing, because it should have worked ...
Hi ColossalAnt7
Following up on SuccessfulKoala55's answer
I saw that there is a config file where you can specify specific users and passwords, but it currently requires
- mount the configuration file (the one holding the user/pass) into the pod from a persistent volume .
I think the k8s way to do this would be to use mounted config maps and secrets.
You can use ConfigMaps to make sure the routing is always correct, then add a load-balancer (a.k.a a fixed IP) for the users a...
the Task scheduler itself is a Task. What we did is we added a new parameter section on the Task (the task.connect call), so that we can later clone and modify it and use the new value in runtime
(Task.connect will put the data from the Task/UI back into the dict when the agent is running the Scheduler)
Does that make sense?
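A rough sketch of the `task.connect` pattern described above. The function name, the "scheduler" section name and the parameter key are hypothetical; `task` is assumed to be a ClearML Task (anything with a compatible `connect` method works):

```python
def connect_scheduler_params(task, defaults):
    """Register `defaults` as a parameter section on the Task.

    When run manually, the values are stored on the Task. When an agent runs
    a cloned Task, connect() puts the values edited in the UI back into the dict.
    """
    params = dict(defaults)
    task.connect(params, name="scheduler")  # section name is a placeholder
    return params
```

With a real Task this would be called right after `Task.init(...)`, and the returned dict is what the rest of the code reads at runtime.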
Can you explain what you meant by
entropy point file?
There is no need to specify entry point file.
It is automatically detected when you run the Code manually on your machine.
My assumption was that the file "src/" (based on your log) is just a test file, and hence was not added to the repository. So the agent failed to actually restore it from the git (files that are not added are not considered part of the git diff, this is usually git behavio...
Any idea why the Pipeline Controller is Running despite the task passing?
What do you mean by "the task passing"
okay that's good, that means the agent could run it.
Now it is a matter of matching the TF with cuda (and there is no easy solution for that). Basically I think that what you need is "nvidia/cuda:10.2-cudnn7-runtime-ubuntu16.04"
... Would not work for huge llm style models.
yes I agree... but then if the model is small enough then you can just keep it in memory ...
Hi @<1546303293918023680:profile|MiniatureRobin9>
Im not sure to understand the difference between a worker and an agent.
hmm we should probably make that clearer 🙂
agent = the clearml-agent instance running on the machine
worker is the system term representing the instance of the agent
You can have one machine with multiple agents (i.e. multiple workers) running on it.
Does that make sense ?
I have install a python environment by virtualenv tool, let's say
and python is
How to reuse the virtualenv by setting clearml agent?
So the agent is already caching the entire venv for you, nothing to worry about, just make sure you have this line in clearml:
No need to provide it an existing...
Thanks JumpyPig73
Yeah this would explain it ... (if hydra is setting something else we can tap into that as well)
copy paste the trains.conf from any machine, it just needs the definition of the trains-server address.
Specifically if you run in offline mode, there is no need for the trains.conf and you can just copy the one on GitHub
hmm that would explain it failing
Hi WickedStarfish97
As a result, I don’t want the Agent to parse what imports are being used / install dependencies whatsoever
Nothing to worry about here, even if the agent detects the python packages, they are installed on top of the preexisting packages inside the docker. That said if you want to override it, you can also pass packages=[]
Hi @<1541954607595393024:profile|BattyCrocodile47>
But the files API is still open to the world, right?
No, of course not 🙂 (i.e. API is authenticated with JWT header, this is why you need to generate the secret/key in the UI)
That said, the login process itself is user/pass stored on the server, but other than that the web/api are secured. The file server on the other hand is plain http storage and does not verify the connection like the API does. So if you are going the self-ho...
One example is a node that resizes the images, this node receives as input a Dataset and iterates over each image, resizes it and outputs a new Dataset, which is used in the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structure
and see for yourself (this pipeline has two nodes
Interesting! let me dive into that and ...
In the documentation it warns about
"Only call Task.close if you are certain the Task is not needed."
Maybe this is not clear enough, this means you do not need to automatically Add/Log/Track things into the Task in the current process.
This does Not mean you cannot access the Task or its artifacts
Mark closed means to externally (i.e. not from the process that created the Task, maybe even from a different machine) close and mark the task as completed (this...
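To illustrate the point that a closed Task is still accessible, here is a small sketch. The function name is made up for this example, and `get_task` is assumed to behave like clearml's `Task.get_task`:

```python
def close_and_list_artifacts(task, get_task):
    """Close `task` in this process, then fetch it again by id and list its artifacts."""
    task_id = task.id
    task.close()  # stops automatic logging/tracking from this process
    fetched = get_task(task_id=task_id)  # the Task itself is still accessible
    return sorted(fetched.artifacts.keys())
```

With the real library this would be called as `close_and_list_artifacts(task, Task.get_task)`.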
RobustRat47 I think you have to use the latest clearml package for that (1.6.0)
Ohh I see now the force SSH did not replace the user in the SSH link (only if the original was http), right ?
Yes that was a tricky one, basically always blame pip 🙂
One machine (original parent)
agent.package_manager.type = pip
agent.package_manager.pip_version =
Which would not upgrade the pip and use the preinstalled one, i.e. Unpacking python-pip-whl (20.0.2-5ubuntu1.10)
The other one has:
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >= '3.10'
and it instal...
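To make the difference between the two configurations concrete, here is a sketch of how the two `python_version` markers above pick a pip constraint. This only illustrates the marker logic; it is not the agent's actual implementation, and the names are made up for this example:

```python
import sys

# (pip constraint, predicate on the (major, minor) python version),
# mirroring the two pip_version entries quoted above.
PIP_VERSION_RULES = [
    ("<20.2", lambda v: v < (3, 10)),
    ("<22.3", lambda v: v >= (3, 10)),
]


def select_pip_constraint(python_version=None):
    """Return the pip constraint whose marker matches the given python version."""
    version = tuple(python_version or sys.version_info[:2])
    for constraint, applies in PIP_VERSION_RULES:
        if applies(version):
            return constraint
    return None
```

So on python 3.8 the second machine would ask for pip `<20.2`, and on 3.10 or newer for pip `<22.3`, while the first machine (empty `pip_version`) keeps whatever pip is preinstalled.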
Copy paste it here 🙂
as i also noticed that uploads are sometimes slow, and i see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on _AzureBlobServiceStorageDriver.upload_object
and an additional variable on the AzureContainerConfigurations)
Could you PR a tested draft ? we will be able to take it from there
Hi HappyDove3
Are you passing it this way? task.upload_artifact(name="my artifact", artifact_object=np.eye(3,3))
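For reference, a tiny sketch of the keyword-argument form asked about above. The wrapper function is hypothetical; `task` is assumed to be a ClearML Task created with `Task.init(...)`:

```python
def upload_matrix_artifact(task, matrix):
    """Upload `matrix` under the artifact name used in the question above."""
    # Keyword arguments make it explicit which value is the name and which is the object.
    return task.upload_artifact(name="my artifact", artifact_object=matrix)
```

With numpy installed, `upload_matrix_artifact(task, np.eye(3, 3))` is the same call as the one-liner in the question.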
AbruptHedgehog21 could it be the console log itself is huge ?