Reputation
Badges 1
25 × Eureka!Thanks! Let me check something
It just seems frozen at the place where it should be spinning up the tasks within the pipeline
And is there an agent for those ? usually there is one agent for running logic tasks (like pipelines) running with --services-mode
which means multiple Tasks can be executed by the same agent. And other agents for compute Tasks that are a signle Task per agent (but you can run multiple agents on the same machine)
Yes that was a tricky one, basically always blame pip 🙂
One machine (original parent)
agent.package_manager.type = pip
agent.package_manager.pip_version =
Which would not upgrade the pip and use the preinstalled Unpacking python-pip-whl (20.0.2-5ubuntu1.10)
The other one has:
agent.package_manager.type = pip
agent.package_manager.pip_version.0 = <20.2 ; python_version < '3.10'
agent.package_manager.pip_version.1 = <22.3 ; python_version >\= '3.10'
and it instal...
the first runs perfectly fine,
Just making sure, running in an agent?
the second crashes
Running inside the same container as the first one ?
Woo, what a doozy.
yeah those "broken" pip versions are making our life hard ...
SourLion48 you mean the wraparound ?
https://github.com/allegroai/clearml/blob/168074acd97589df58436a3ec122a95a077620c2/docs/clearml.conf#L33
CrookedWalrus33 this is odd I tested the exact same code.
I suspect something with the environment maybe?
Whats the python version / OS ? also can you send full pipe freeze?2022-07-17 07:59:40,339 - clearml.storage - ERROR - Failed uploading: Parameter validation failed: Invalid type for parameter ContentType, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
Yes this is odd, it should add the content-type of the file (for example "application/x-tar" but you are getting N...
Found the issue, fix in the next RC (soon to be out)
CrookedWalrus33 can you post the clearml.conf you have on the agent machine?
Are you using tensorboard or do you want to log directly to trains ?
SteadySeagull18 btw: in post-callback the node.job will be completed
because it is a called after the Task is completed
I'm a bit confused between the distinction / how to use these appropriately --
Task.init
does not have
repo
/
branch
args to set what code the task should be running.
It detects it automatically at run time 🙂 based on what is actually being used
My ideal is that I do exactly what
Task.create
does, but the task only goes into the pipeline section rather than making a new one in the experiments section.
Do y...
When I'm setting up my Pipeline, I can't go "here are some brand new tasks, please run them",
I think this is the main point. Can you create those Tasks via Task.create and get what you want? If so, then sure you can do that:
` def create_step_task(a_node):
task = Task.create(...)
return task
pipe.add_step(
name="stage_process",
parents=["stage_data"],
base_task_factor=create_step_task
) `wdyt?
As for the node, this confusing bit is that this is text from the docs...
this is very odd, can you post the log?
Hi CloudySwallow27
how can I just "define" it on my local PC, but not actually run it.
You can use the clearml-task
CLI
https://clear.ml/docs/latest/docs/apps/clearml_task#how-does-clearml-task-work
Or you can add the following line in your code, that will cause the execution to stop, and to continue on a remote machine (basically creating the Task and pushing it into an execution queue, or just aborting it)task = Task.init(...) task.execute_remotely()
https://clear.ml/do...
I think you can watch it after GTC on the nvidia website, and a week after that we will be able to upload it to the youtube channel 🙂
BTW
Grafana Visualizing endpoint request latency as well as prediction result value distributions
Right, so this "vault" design is built into the paid tiers of ClearML to achieve exactly that. Long story short, users can put their credentials/configs on the clearml-server and the agent (or the clients) will pull and merge them into the execution.
It's very cool and works really nice, but not part of the open source (or the SaaS tier).
What you could do is store these configurations on the Task itself (one way o r another). Maybe for example have an empty definitions.py
file part of ...
CloudySwallow27 okay essentially this defs file is kind of a user "secret vault" for access credentials, is that correct?
ohh, could it be a 32bit version of python ?
and pip install clearml-agent
fails?
In the installed packages section it includes
pywin32 == 303
even though that is not in my requirements.txt.
So for some reason it is being detected (meaning your code base actually imports it in code)
But you can just remove it, either by manually editing the cloned Task (right click, reset, then you can edit the section), or via codeTask.ignore_requirements("pywin32") task = Task.init(...)
So the agent installed okay. It's the specific Task that the agent is failing to create the environment for, correct?
if this is the case, what do you have in the "Installed Packages" section of the Task (see under the Execution tab)
Hi CloudySwallow27
This error occurs randomly during training (in other words training does successfully start).
What's the cleamrl-agent version you are using, and the clearml version ?
Sure go to the "All Projects" and filter by Task Type, application / service
Also this message suggests that I can change the configuration, but as said I can't find it anywhere and wouldn't know hot to change the configuration.
This means that you can launch a new one (i.e. abort, clone, edit, enqueue) directly from the web UI and in the UI edit the configuration. Unfortunately it does not support changing the configuration "live"
Yes the one you create manually is not really of the same "type" as the one you create online, this is why you do not see it there 😞
Any recommendation or working combinations of AMI
I would take the deeplearning AMIs from Nvidia AWS , I think they work on both CPU and GPU machines.
In terms of dockers, python dockers for CPU and nvidia runtime for GPU
[https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86d[…]d2c01646d599352f6ddd9893420eb815a06c3b90619f8?context=explore](https://hub.docker.com/layers/library/python/3.11.2-bullseye/images/sha256-6128ea86db7f6b1b286d2c01646d599352f6ddd98...