Latest (1.5.1, I believe?). Full log incoming, but it's as I've posted elsewhere already 🤔
It just sets up the environment and immediately crashes when trying to run the code.
The setup itself is done correctly.
Perfect now 👌 (also, nice cleanup of the default_new_data_root duplicate code :D)
No, that does not seem to work. I get:
task.execute_remotely(queue_name="default")
2024-01-24 11:28:23,894 - clearml - WARNING - Calling task.execute_remotely is only supported on main Task (created with Task.init)
Defaulting to self.enqueue(queue_name=default)
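For reference, the pattern the warning seems to be pointing at is roughly this (a sketch only; project/task names are placeholders, and it assumes the script's entry point owns the main Task):

```python
from clearml import Task

# Per the warning above, execute_remotely() is only supported on the main
# Task returned by Task.init(), not on sub-tasks or tasks created otherwise.
task = Task.init(project_name="examples", task_name="remote-run")  # hypothetical names

# Stops local execution and enqueues this task on the "default" queue;
# a clearml-agent listening on that queue picks it up from there.
task.execute_remotely(queue_name="default")
```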
Any follow-up thoughts, @<1523701070390366208:profile|CostlyOstrich36> , or maybe @<1523701087100473344:profile|SuccessfulKoala55> ? 🤔
Perfect, thanks for the answers, Valeriano. These small details are missing from the documentation, but I now feel much more confident in setting this up.
Sorry AgitatedDove14, I forgot to get back to this.
I've been trying to convince my team to drop poetry 😄
Added the following line under volumes for apiserver, fileserver, and agent-services:
- /data/clearml:/data/clearml
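In other words, the docker-compose.yml fragment ends up looking something like this (a sketch; all other service keys omitted):

```yaml
# Sketch of the docker-compose.yml change described above.
services:
  apiserver:
    volumes:
      - /data/clearml:/data/clearml
  fileserver:
    volumes:
      - /data/clearml:/data/clearml
  agent-services:
    volumes:
      - /data/clearml:/data/clearml
```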
Hey @<1523701070390366208:profile|CostlyOstrich36> , thanks for the reply!
I’m familiar with the above repo, we have the ClearML Server and such deployed on K8s.
What’s lacking is documentation regarding the clearml-agent helm chart. What exactly does it offer, etc.
We’re interested in e.g. using karpenter to scale our deployments per demand, effectively replacing the AWS autoscaler.
Setting the endpoint will not be the only thing missing though, so unfortunately that's insufficient 😞
Yes, I’ve found that too (as mentioned, I’m familiar with the repository). My issue is still that there is no documentation as to what this actually offers.
Is this simply a helm chart to run an agent on a single pod? Does it scale in any way? Basically - is it a simple agent (similar to on-premise agents, running in the background, but here on K8s), or is it a more advanced one that offers scaling features? What is it intended for, and how does it work?
The official documentation is very spa...
We load the endpoint (and S3 credentials) from a .env file, so they're not immediately available at the time of from clearml import Task.
It's a convenience thing, rather than exporting many environment variables that are tied together.
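Concretely, it's something along these lines (a sketch assuming python-dotenv; the variable names in the comment are just examples of what the .env might set):

```python
# Sketch: load the ClearML endpoint + S3 credentials from a .env file
# *before* importing clearml, so that Task.init() sees them.
from dotenv import load_dotenv

load_dotenv()  # e.g. sets CLEARML_API_HOST, CLEARML_API_ACCESS_KEY, AWS_ACCESS_KEY_ID, ...

from clearml import Task  # imported only after the environment is populated

task = Task.init(project_name="examples", task_name="env-from-dotenv")  # hypothetical names
```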
There's code that strips the type hints from the component function; I just think it should be applied to the helper functions too :)
Yes exactly 👍 Good news.
Hey SuccessfulKoala55! Is the configuration file needed for Task.running_locally()? This is tightly related to issue #395, where we need additional files for remote execution but have no way to attach them to the task other than using the StorageManager as a temporary cache.
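i.e. the workaround looks roughly like this (a sketch; the project/task names and the S3 URL are placeholders):

```python
# Sketch of the workaround mentioned above: fetch an extra config file
# through StorageManager's local cache instead of attaching it to the Task.
from clearml import Task, StorageManager

task = Task.init(project_name="examples", task_name="needs-extra-files")  # hypothetical names

# Downloads (and caches) the remote file, returning a local path.
local_config = StorageManager.get_local_copy(remote_url="s3://my-bucket/configs/extra.yaml")
```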
Anything else you’d recommend paying attention to when setting up the clearml-agent helm chart?
But... which queue does it listen to, which type of instances will it use, etc.?
I see, okay that already clarifies some stuff, I'll dig a bit more into this then! Thanks!
That's what I thought too; it should only look for the CLEARML_TASK_ID environment variable?
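i.e., the detection I'd expect is something like this (a sketch of my assumption, not confirmed behavior):

```python
# When a clearml-agent runs the task, it sets CLEARML_TASK_ID in the
# environment; when running locally, the variable should be absent.
import os

running_under_agent = os.environ.get("CLEARML_TASK_ID") is not None
print("remote execution" if running_under_agent else "local execution")
```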
Or is it just integrated in the ClearML Slack space, and for some reason it's showing the clearml address then?
Right, so where can one find documentation about it?
The repo just has the variables, without much explanation.
Much much appreciated 🙏
SuccessfulKoala55 CostlyOstrich36 actually it is the import statement; I just finally got around to the traceback:
File "/home/.../ccmlp/configs/mlops.py", line 4, in <module>
    from clearml import Task
File "/home/.../.venv/lib/python3.8/site-packages/clearml/__init__.py", line 4, in <module>
    from .task import Task
File "/home/.../.venv/lib/python3.8/site-packages/clearml/task.py", line 31, in <module>
    from .backend_interface.metrics import Metrics
File "/home/......
We’re using karpenter (more magic keywords for me), so my understanding is that it will manage the scaling part.
Maybe @<1523701827080556544:profile|JuicyFox94> can answer some questions then…
For example, what’s the difference between agentk8sglue.nodeSelector and agentk8sglue.basePodTemplate.nodeSelector?
Am I correct in understanding that the former decides the node type that runs the “scaler” (listening to the given agentk8sglue.queue), and the latter decides it for any newly booted instance/pod that will actually run the agent and the task?
Read: The former can be kept lightweight, as it does no...
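To make the question concrete, I mean values like these (a sketch only; the key paths are the ones named above, but the node labels are made up):

```yaml
# Sketch of a clearml-agent chart values.yaml illustrating the question.
agentk8sglue:
  queue: default            # the queue the glue agent listens on
  nodeSelector:             # where the long-running "scaler" pod itself runs
    pool: system
  basePodTemplate:
    nodeSelector:           # where the per-task pods it spawns will run
      pool: gpu-workers
```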