Hi @<1546303293918023680:profile|MiniatureRobin9>
I can still see the metrics in Grafana. I
It will not delete it from Grafana; it just means it is no longer being collected. Makes sense?
Hmm so is the problem having the gituser inside the code? or the k8s_glue print ?
Hmm, so yes, that is true: if you are changing the bucket values you will also have to adjust them manually in Grafana. I wonder if there is a shortcut here; the data is stored in Prometheus, and I would rather avoid deleting old data. Wdyt?
however can you see the inconsistency between the key and the name there:
Yes that was my point on "uniqueness" ... 🙂
the model key must be unique, and it is based on the filename itself (the context is known, since it is inside the Task), but the Model Name is an entity, so it should include the Task Name as part of the entity name. Does that make sense?
BTW updating the values in Grafana is basically configuration of the heatmap graph, so it is fairly easy to do, just not automatic
WickedGoat98
The trains-agent-services docker is always CPU-only; the idea is to put long-lasting services there (like the auto cleanup, Slack integration, HPO, etc.).
To spin up an agent with a GPU on any machine (regardless of where the trains-server is) you can check the trains-agent readme:
https://github.com/allegroai/trains-agent#running-the-trains-agent
SubstantialElk6 could you post the "Installed Packages" section under Execution of this specific Task?
In the UI you can edit the base container image + add "SETUP SHELL SCRIPT", with any missing "apt update && apt-get install -y ..."
Hi @<1523701260895653888:profile|QuaintJellyfish58>
Is there a way or a trigger to detect when the number of workers in a queue reaches zero?
You mean to spin them down? What's the rationale?
I'd like to implement a notification system that alerts me when there are no workers left in the queue.
How are they "dropping" ?
Specifically to your question, let me check; I'm sure there is an API that gets that data, because you can see it in the UI 🙂
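For example, a minimal sketch of polling for that, assuming the `workers.get_all` API exposes the queues each worker serves (the queue name and the alert hook below are just placeholders):

```python
from clearml.backend_api.session.client import APIClient

QUEUE_NAME = "default"  # placeholder queue name

client = APIClient()
workers = client.workers.get_all()

# count workers that list our queue among the queues they serve
# (the exact response fields may differ between server versions)
serving = [
    w for w in workers
    if any(getattr(q, "name", None) == QUEUE_NAME for q in (getattr(w, "queues", None) or []))
]

if not serving:
    # placeholder notification - replace with Slack / email / webhook, etc.
    print(f"No workers are currently serving queue '{QUEUE_NAME}'")
```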
ReassuredTiger98
How can I make clearml-agent use pre-installed version from the nvidia/pytorch
If the same version is required, the agent will not try to reinstall it (the new venv the agent creates inside the container inherits from the preinstalled system packages).
Comes with PyTorch Version 1.12 based on a commit
. I tried
torch >= 1.11
,
torch == 1.12
If in your installed packages you have torch==1.12
the agent should not tr...
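For reference, a minimal sketch of pinning the version so the agent keeps the container's preinstalled torch (the version string here is just an example, match it to your container):

```python
from clearml import Task

# must be called before Task.init(); this is what ends up under "Installed Packages",
# so the agent sees the same version that is already preinstalled in the container
Task.add_requirements("torch", "1.12")  # example version

task = Task.init(project_name="examples", task_name="torch-preinstalled")
```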
Is the clearml-agent queue not available in the open source?
Fully available in the open source; what is missing is the SLURM connection. In the open source version the daemon is installed per machine (node) and spins up containers/venvs on that machine. The enterprise version adds support so it uses SLURM to provision the node. I hope it helps 🙂
so do you think it would be possible to spin up another daemon, which listens to this daemon, which then runs a slurm job?
This is exactly what the ...
Ohh try to add --full-monitoring
to the clearml-agent execute
One last question: Is it possible to set the pip_version task-dependent?
no... but why would it matter on a Task basis ? (meaning what would be a use case to change the pip version per Task)
Hi @<1600661428556009472:profile|HighCoyote66>
However, we need to allocate resources to ourselves manually, using an
srun
command or
sbatch
Long story short, there is a full SLURM integration: basically you push a job into the ClearML queue and it produces a SLURM job that uses the agent to set up the venv/container and run your Task, but this is only part of the enterprise version 🙂
You can however do the following (notice this is ...
Oh I see, this seems like Triton configuration issue, usually dim -1 means flexible. I can also mention that serving 1.1 should be released later this week with better multiple input support for triton. Does that make sense?
The warning just lets you know that the current process stopped and it is being launched on a remote machine.
What am I missing? Is the agent failing to run the job that you create manually ?
(notice that when creating a job manually, there is no "execute_remotely", you just enqueue it, as it is not actually "running locally")
Make sense ?
MagnificentSeaurchin79 do you have the "." package listed under "installed packages" after you reset the Task ?
Oh, then just make sure you call Task.init in your code;
as long as you have clearml.conf in the container or pass the ENV variables to configure ClearML, it should just work
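Something like this minimal sketch (project and task names are just placeholders):

```python
from clearml import Task

# picks up the server settings from clearml.conf, or from the
# CLEARML_API_HOST / CLEARML_API_ACCESS_KEY / CLEARML_API_SECRET_KEY environment variables
task = Task.init(project_name="examples", task_name="my-experiment")
```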
Hi @<1549202366266347520:profile|GorgeousMonkey78>
how do I integrate sagemaker with clearml ,
you mean to launch an experiment, or just to log it?
Hi RoughTiger69
unfortunately, the model was serialized with a different module structure - it was originally placed in a (root) module called
model
....
Is this like a pickle issue?
Unfortunately, this doesn't work inside clear.ml since there is some mechanism that overrides the import mechanism using
import_bind
.
__patched_import3
What error are you getting? (meaning why isn't it working)
Hmm should not make a diff.
Could you verify it still doesn't work with TF 2.4 ?
No worries 🙂
Is this what you were looking for ?
I commented the upload_artifact at the end of the code and it finishes correctly now
upload_artifact caused the "failed" issue ?
Yes
Are you trying to upload_artifact to a Task that is already completed ?
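For reference, a minimal sketch of uploading the artifact while the Task is still running (the artifact name and object are just examples):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact-upload")

results = {"accuracy": 0.93}  # example artifact object

# upload while the Task is still running, i.e. before it is closed/completed
task.upload_artifact(name="results", artifact_object=results)

task.close()
```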
Hi @<1627478122452488192:profile|AdorableDeer85>
I'm sorry I'm a bit confused here, any chance you can share the entire notebook ?
Also any reason why this is pointing to "localhost" and not IP/host of the clearml-server ? is the agent running on the same machine ?
Okay, could you try to run again with the latest clearml package from GitHub?
pip install -U git+
BTW:
Task.add_requirements('tensorflow', '2.2') will make sure you get the specified version 🙂
Hmm let me check something
Yes, that makes sense. I think what happened is that one of the processes completed the Task (i.e. closed it) before the others did, and so the others threw an exception.
I switched to have all tasks in a separate process
I think that's probably the best (performance wise as well), nice!
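For illustration, a minimal sketch of that pattern, where each process creates and closes its own Task (names and process count are just placeholders):

```python
from multiprocessing import Process

from clearml import Task


def worker(idx):
    # each process owns its Task, so no process can close a Task another one is still using
    task = Task.init(project_name="examples", task_name=f"worker-{idx}")
    # ... actual work goes here ...
    task.close()


if __name__ == "__main__":
    processes = [Process(target=worker, args=(i,)) for i in range(3)]
    for p in processes:
        p.start()
    for p in processes:
        p.join()
```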