Hi @<1546665666675740672:profile|AttractiveFrog67>
- Make sure you stored the model's checkpoint (either pass `output_uri=True` in `Task.init` or manually upload)
- When you call `Task.init` pass `continue_last_task=True`
- Now you can do `last_checkpoint = task.models["output"][-1].get_local_copy()` and all you need is to load `last_checkpoint`
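For reference, here's a minimal sketch of the whole flow (the project/task names are placeholders, and the PyTorch load at the end is an assumption about your framework):

```python
from clearml import Task

# Resume the previous task instead of creating a new one,
# and make sure checkpoints are uploaded (output_uri=True)
task = Task.init(
    project_name="my_project",   # placeholder
    task_name="my_experiment",   # placeholder
    output_uri=True,
    continue_last_task=True,
)

# Grab the latest stored checkpoint and load it
last_checkpoint = task.models["output"][-1].get_local_copy()
# e.g. with PyTorch (assumed framework):
# model.load_state_dict(torch.load(last_checkpoint))
```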
JitteryCoyote63 see if upgrading the packages as they suggest somehow fixes it.
I have the feeling this is the same problem (the first error might be trains masking the original error)
the only problem with it is that it will start the task even if the task is completed
What are the criteria?
Does it handle 2FA if my repo is on GitHub and my account needs 2FA to sign in?
It does not 🙂
ImmensePenguin78 this is probably for a different python version ...
Here, I know the pattern is incomplete and invalid. A less advanced user might not understand what's up.
Basically like your suggestion: if the request fails while typing, instead of the error popup, the search bar will turn "dark red", and on the next keystroke it will be "cleaned"?
Hi MassiveBat21
CLEARML_AGENT_GIT_USER is actually a git personal access token.
The easiest is to have a read-only user/token for all the projects.
Another option is to use the ClearML vault (unfortunately not part of the open source) to automatically apply these configurations on a per-user basis.
wdyt?
LovelyHamster1 NICE! 🙂
Hi IrritableJellyfish76
https://clear.ml/docs/latest/docs/references/sdk/task#taskget_tasks
task_name (str) – The full name or partial name of the Tasks to match within the specified project_name (or all projects if project_name is None). This method supports regular expressions for name matching. (Optional)
You are right, this is a bit confusing, I will make sure that we add in the docstring an examp...
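Something along these lines (a quick sketch; the project name and pattern are placeholders):

```python
from clearml import Task

# task_name is matched as a regular expression, so ".*train.*"
# returns every Task whose name contains "train"
tasks = Task.get_tasks(
    project_name="examples",  # placeholder
    task_name=".*train.*",
)
for t in tasks:
    print(t.id, t.name)
```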
The additional edges in the graph suggest that these steps somehow contain dependencies that I do not wish them to have.
PanickyMoth78 I think I understand what you are saying, but it is hard to see if there is a "bug" here or a feature...
Can you post the full code of the pipeline?
JitteryCoyote63 I think that without specifically adding torch to the requirements, the agent will not be able to automatically resolve the correct cuda/torch version. Basically you should add torch to the `requirements.txt` file and provide it to `Task.create`, or use `Task.force_requirements_env_freeze`.
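A minimal sketch of the freeze option (the requirements file path is an assumption):

```python
from clearml import Task

# Use the given requirements.txt verbatim (instead of automatic package
# analysis), so torch and its exact cuda build are pinned for the agent.
# Must be called before Task.init
Task.force_requirements_env_freeze(requirements_file="requirements.txt")

task = Task.init(project_name="examples", task_name="train")  # placeholders
```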
This really makes little sense to me...
Can you send the full clearml-session --verbose console output ?
Something is not working as it should obviously, console output will be a good starting point
Would you have an example of this in your code blogs to demonstrate this utilisation?
Yes! I definitely think this is important, and hopefully we will see something there 🙂 (or at least in the docs)
I'm running agent inside docker.
So this means venv mode...
Unfortunately, right now I cannot attach the logs; I will attach them a little later.
No worries, feel free to DM them if you feel this is too much to post here
This doesn't seem to be running inside a container...
What's the clearml-agent launch command you are using ? (i.e. do you have --docker flag)
Internally we use `blob.upload_from_file`; it has a default 60-second timeout on the connection (I'm assuming the upload could take longer).
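If you want to test it directly, here's a sketch with the google-cloud-storage client (bucket and file names are placeholders):

```python
from google.cloud import storage

# Upload with an explicit timeout longer than the 60 sec default
client = storage.Client()
bucket = client.bucket("my-bucket")    # placeholder
blob = bucket.blob("models/model.pt")  # placeholder
with open("model.pt", "rb") as f:
    blob.upload_from_file(f, timeout=300)  # seconds
```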
Hi OutrageousSheep60
Do you mean something like:
https://github.com/allegroai/clearml/tree/master/examples/datasets
?
Is this still an issue? (If you provide a queue name, the default tag is not used, so no error should be printed.)
Hi UpsetBlackbird87
I might be wrong, but it seems like ClearML does not monitor GPU pressure when deploying a task to a worker, but rather relies only on its configured queues.
This is kind of accurate. The way the agent works is that you allocate a resource for the agent (specifically a GPU), then set the queues (plural) it listens to (by default priority ordered). Each agent then individually pulls jobs and runs them on the allocated GPU.
If I understand you correctly, you want multiple ...
Hi BattyLizard6
does clearml orchestration have the ability to break gpu devices into virtual ones?
So this is fully supported on A100 with MIG slices. That said, dynamic multi-tenant GPU on Kubernetes is a Kubernetes issue... We do support multiple agents on the same GPU on bare metal, or on shared GPU instances over k8s with:
https://github.com/nano-gpu/nano-gpu-agent
https://github.com/intel/intel-device-plugins-for-kubernetes/tree/main/cmd/gpu_plugin#fractional-resources
http...
wdym 'executed on different machines'?
The assumption is that you have machines (i.e. clearml-agents) connected to clearml, which would be running all the different components of the pipeline. Think out-of-the-box scale-up. Each component will become a standalone Job and the data will be passed (i.e. stored and loaded) automatically on the clearml-server (can be configured to be external object storage as well). This means if you have a step that needs GPU it will be launched on a GPU machine...
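To make that concrete, here's a small sketch with the pipeline decorator (queue and project names are placeholders):

```python
from clearml import PipelineDecorator

# Each component becomes a standalone job on whichever queue you name;
# return values are stored on the server and passed between steps
@PipelineDecorator.component(execution_queue="cpu_queue")  # placeholder queue
def prepare_data():
    return [1, 2, 3]

@PipelineDecorator.component(execution_queue="gpu_queue")  # placeholder queue
def train(data):
    return sum(data)

@PipelineDecorator.pipeline(name="demo", project="examples", version="0.1")
def pipeline_logic():
    data = prepare_data()
    print(train(data))

if __name__ == "__main__":
    pipeline_logic()
```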
There is no way to create an artifact/model/dataset without a task, right?
Models are an entity of their own, and you can actually create one without a Task.
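For example, a sketch using InputModel.import_model (the URL and name are placeholders):

```python
from clearml import InputModel

# Register an existing weights file as a standalone Model entity
model = InputModel.import_model(
    weights_url="https://example.com/weights/model.pt",  # placeholder
    name="standalone-model",                             # placeholder
)
print(model.id)
```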
(just for my own interest: how much does the enterprise version diverge from the open source version? Is it just extended, or are there core changes to the enterprise version?)
It adds a few security layers on top, and adds a few features that are just not part of the open source (RBAC, hyper-datasets, advanced scheduling, cu...
Hmm, so the way the configuration works is: it loads the default configuration (equivalent to the example in the docs), then it adds the ~/clearml.conf on top. That means you can tell your users to just copy-paste the credentials from the UI into a template you make. How is that?
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?
Hi MortifiedCrow63
I finally got GS credentials; there is something weird going on. I can verify the issue: with model upload I get a timeout error, while upload_artifacts just works.
Just updating here that we are looking into it.
Hi MortifiedCrow63
Sorry, getting GS credentials is taking longer than expected 🙂
Nonetheless it should not be an issue (model upload is essentially using the same StorageManager internally)
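In the meantime, if you want to compare, here's a sketch that exercises the same upload path via StorageManager (the bucket path is a placeholder):

```python
from clearml import StorageManager

# Upload a local file to GS directly, the same mechanism model upload uses
remote_url = StorageManager.upload_file(
    local_file="model.pt",                        # placeholder
    remote_url="gs://my-bucket/models/model.pt",  # placeholder
)
print(remote_url)
```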
Maybe that's the issue:
https://github.com/googleapis/python-storage/issues/74#issuecomment-602487082
Hi @<1730396272990359552:profile|CluelessMouse37>
However, the caching doesn't seem to be working correctly. Despite not changing the configuration, the first step runs every time.
How are you creating the cached component?
is this a standalone script or a git repo link?
These parameters are dictionaries of specific configurations (dict of dict) that are identical, but might not be taken into account properly by the caching mechanism.
hmm for the component to be cached (or reuse...
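For reference, a sketch of a cached component (the function body is a placeholder):

```python
from clearml import PipelineDecorator

# cache=True reuses a previous run when the code and the input
# arguments are identical; dict arguments must hash the same way
# for the cache to hit
@PipelineDecorator.component(cache=True)
def preprocess(config: dict):
    # placeholder body
    return config
```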
I see. If this is the case, try to set
`output_uri="file:///full/path/to/dir"`
Notice it has to be the full path, with the `file://` prefix.
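A minimal sketch (the directory path is a placeholder):

```python
from clearml import Task

# Store output models/artifacts under a local (or mounted) directory;
# note the full path and the file:// prefix
task = Task.init(
    project_name="examples",   # placeholder
    task_name="local-output",  # placeholder
    output_uri="file:///full/path/to/dir",
)
```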