
So it was definitely related to the symlinks in some form.
Could it be that it actually deleted the cache? How many agents are running on the same machine?
Basically there are two options. One: spin up the clearml-k8s-glue as a k8s service.
This service takes ClearML jobs and creates k8s jobs on your cluster.
The second option is to spin up agents inside pods statically; then inside the pods the agent works in venv mode.
I know the enterprise edition has more sophisticated k8s integration, where the glue also retains the ClearML scheduling capabilities.
https://github.com/allegroai/clearml-agent/#kubernetes-integration-optional
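A rough sketch of option one, based on the examples/k8s_glue_example.py script in that repo (the K8sIntegration arguments vary between clearml-agent versions, so treat the values as placeholders):
` from clearml_agent.glue.k8s import K8sIntegration

k8s = K8sIntegration(
    namespace='clearml',   # k8s namespace the job pods are created in
    max_pods_limit=10,     # cap on concurrently running pods
)
# pull Tasks from this ClearML queue and create a k8s pod per Task
k8s.k8s_daemon('k8s_scheduler')  # queue name is hypothetical `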
Yes, looks like it. Is it possible?
Sounds odd...
What's the exact project/task name?
And what is the output_uri?
Because a pipeline is composed of multiple tasks, different tasks in the pipeline could run on different machines.
Yes!
Or more specifically, they could run on different queues; as you said in your other response, we could have a queue for smaller CPU-based instances, and another queue for larger GPU-based instances.
Exactly!
I like the idea of having a queue dedicated to CPU-based instances that has multiple agents running on it simultaneously. Like maybe four agents.
Th...
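For example, different pipeline components can target those queues. A sketch using the PipelineDecorator interface (queue names are hypothetical):
` from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(execution_queue='cpu_queue')  # smaller CPU instances
def preprocess(n: int):
    return list(range(n))

@PipelineDecorator.component(execution_queue='gpu_queue')  # larger GPU instances
def train(data):
    return sum(data)

@PipelineDecorator.pipeline(
    name='demo pipeline', project='examples',
    pipeline_execution_queue='services',  # where the (mostly idle) pipeline logic runs
)
def pipeline_logic(n: int = 10):
    train(preprocess(n))

if __name__ == '__main__':
    pipeline_logic() `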
SmallDeer34 No worries, I'm happy to hear the issue disappeared 🙂
Having the ability to pack jobs/tasks onto the same "resource" (underlying server/EC2 instance)
This is essentially a "queue". Basically a queue is a way to abstract a specific type of resource, so that you can achieve exactly what you described.
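For instance, a minimal sketch of pushing work onto such a queue (project, task and queue names are hypothetical):
` from clearml import Task

# clone a template task and push the clone onto a named queue;
# any agent listening on that queue will pick it up
template = Task.get_task(project_name='examples', task_name='train')
cloned = Task.clone(source_task=template)
Task.enqueue(cloned, queue_name='cpu_queue') `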
open up a streaming use case, wherein batch (offline) inference could be done directly inside of a ClearML pipeline in reaction to an event/trigger (like new data landing in your data lake).
Yes, that's exactly how clearml is designed, a...
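A rough sketch of that trigger flow, assuming clearml's TriggerScheduler (the task ID and project name are placeholders):
` from clearml.automation import TriggerScheduler

scheduler = TriggerScheduler(pooling_frequency_minutes=3)
# when a new dataset version appears in the watched project,
# clone the pipeline task and enqueue it
scheduler.add_dataset_trigger(
    schedule_task_id='<pipeline-task-id>',
    schedule_queue='services',
    trigger_project='my_datasets',
)
scheduler.start() `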
Hi DeliciousBluewhale87
This is the latest clearml-serving (stable release at GTC at the end of the month)
https://github.com/allegroai/clearml-serving/tree/dev
Generally speaking, clearml-serving is a control plane: preprocessing and ML inference, with Nvidia Triton for DL inference (fully transparent).
It allows you to spin up an entire fully dynamic & scalable serving stack on top of a k8s cluster. Once you spin up the base containers, you can configure them live with a CLI, this includes adding new en...
Yes you can 🙂 (though not on the open-source version)
Okay I think I know what's going on (there is a race that for some reason on CoLab acts differently).
As a quick hack you can do the following:
` Task._report_subprocess_enabled = False
task = Task.init(...)
task.set_initial_iteration(0) `
Hi ElegantCoyote26
If there is, it will have to be using docker mode, but I do not think this is actually possible because this is not a feature of docker. It is possible to do on k8s, but that's a different level of integration 🙂
EDIT:
FYI we do support k8s integration
Feel free to add to the UI request list:
https://github.com/allegroai/trains/issues/81
but maybe hyperparam aborts in those cases?
From the hyperparameter optimizer's perspective it will be trying to optimize the global minimum, basically "ignoring" the last value reported. Does that make sense?
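A sketch of what that looks like with the HyperParameterOptimizer (template task ID and parameter name are hypothetical); the 'min_global' sign scores a trial by its best reported value rather than its last:
` from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id='<template-task-id>',
    hyper_parameters=[
        UniformParameterRange('General/lr', min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title='validation',
    objective_metric_series='loss',
    objective_metric_sign='min_global',  # best (global minimum) value wins
    execution_queue='default',
)
optimizer.start() `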
We already redesigned the implementation, so it should be quite easy to extend to GCP and Azure. What are you planning?
Can you verify this example is not working for you?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/hydra/hydra_example.py
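Roughly, the linked example boils down to this (a sketch; the config name is hypothetical):
` import hydra
from omegaconf import DictConfig
from clearml import Task

@hydra.main(config_path='.', config_name='config')
def main(cfg: DictConfig) -> None:
    # ClearML picks up the Hydra configuration automatically
    task = Task.init(project_name='examples', task_name='hydra example')
    print(cfg)

if __name__ == '__main__':
    main() `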
But from your other answer, I think I'm understanding that you can have multiple agents on a single instance listening to the same queue.
Correct
So we could maybe initialize 4 instances of the agent on a single EC2 instance, which would allow us to handle a higher volume of small batches concurrently without tying up the entire instance.
Correct (that said I do not understand how come a single Task does not utilize the CPU, I was under the impression it is run...
Yes, that seems to be the case. That said, they should have different worker IDs, agent-0 and agent-1 ...
What's your trains-agent version?
Hi ObedientDolphin41
However, all of the pipeline's tasks are run on the same queue. Could I be missing something?
The pipeline Task itself is running on a dedicated queue (meaning agent/s), usually because the pipeline logic is mostly idling, whereas the components themselves are doing the actual compute.
Specifically you can control the pipeline logic queue with pipeline_execution_queue
https://github.com/allegroai/clearml/blob/7016138c849a4f8d0b4d296b319e0b23a1b7bd9e/clearm...
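With the PipelineController interface the same split looks roughly like this (queue and task names are hypothetical):
` from clearml import PipelineController

pipe = PipelineController(name='demo pipeline', project='examples', version='1.0')
pipe.add_step(
    name='train',
    base_task_project='examples',
    base_task_name='train task',
    execution_queue='gpu_queue',  # where this component runs
)
pipe.start(queue='services')      # where the pipeline logic itself runs `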
This smells like a driver/image issue on the instance VM
What are you getting if you add this inside your code?
os.system('nvidia-smi')
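If nvidia-smi works, a quick follow-up check (assuming PyTorch is installed in the image) can confirm the runtime sees CUDA:
` import torch

# both should be truthy if the driver and CUDA runtime match
print(torch.cuda.is_available())
print(torch.version.cuda) `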
So what is the difference?!
1e876021bbef49a291d66ac9a2270705
just make sure you reset it 🙂
How are you getting:
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
Is this what you had on the original manual execution? (i.e. not the one executed by the agent) - you can also look under the "org_pip" dropdown in the "installed packages" of the failed Task.
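If the file:/// line came from a conda-built environment, one hedged workaround is to pin the package explicitly before Task.init (the version spec here is hypothetical):
` from clearml import Task

# override the auto-detected requirement with a pip-installable pin;
# must be called before Task.init()
Task.add_requirements('beautifulsoup4', '>=4.11.0')
task = Task.init(project_name='examples', task_name='train') `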
You can see the class here:
https://github.com/allegroai/clearml/blob/9b962bae4b1ccc448e1807e1688fe193454c1da1/clearml/binding/frameworks/__init__.py#L52
Basically you do:
` def my_callback(load_or_save, model):
    # type: (str, WeightsFileHandler.ModelInfo) -> Optional[WeightsFileHandler.ModelInfo]
    assert load_or_save in ('load', 'save')
    # do something, e.g. inspect the model info to decide whether to register it
    skip = False  # replace with your filtering condition
    if skip:
        return None  # returning None skips this model
    return model

WeightsFileHandler.add_pre_callback(my_callback) `
RipeGoose2 models are automatically registered
i.e. added to the models artifactory, but it only points to where the files are stored
Only if you are passing the output_uri argument to Task.init will they actually be uploaded.
If you want to disable this behavior you can pass:
` Task.init(..., auto_connect_frameworks={'pytorch': False}) `
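For example (the bucket name is hypothetical):
` from clearml import Task

# registered model snapshots are uploaded to this destination
task = Task.init(
    project_name='examples',
    task_name='train',
    output_uri='s3://my-bucket/models',
) `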
Okay, good news: there is a fix. Bad news: the sync to GitHub will only be tomorrow.
Hmm, so there is a way to add callbacks (somewhat cumbersome, and we would love feedback) so you can filter them out.
What do you think, would that work?
Task deletion failed: unhashable type: 'dict'
Hi FlutteringWorm14, trying to figure out where this is coming from, give me a sec.
Hmm that is odd, but at least we have a workaround 🙂
What's the matplotlib backend ?
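You can check it with:
` import matplotlib
print(matplotlib.get_backend()) `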