Makes sense to add it to docker run by default if GPUs are mentioned in the agent.
I think this is an arch thing; --privileged is not needed on the Ubuntu flavor. That said, you can always have it if you add it here:
https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L149
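For reference, a minimal sketch of what that could look like in the agent section of clearml.conf (assuming the setting behind that link is extra_docker_arguments):
agent {
    # example only - pass extra flags such as --privileged to every docker run the agent issues
    extra_docker_arguments: ["--privileged"]
}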
clearml-agent daemon --gpus 0 --queue default --docker
But docker still sees all GPUs.
Yes, --gpus should be enough. Are you sure regarding the --privileged flag?
Hi ConvincingSwan15
A few background questions:
Where is the code that we want to optimize? Do you already have a Task of that code executed?
"find my learning script"
Could you elaborate? Is this connected to the first question?
Thanks GiganticTurtle0 !
I will try to reproduce with the example you provided. Regardless, I already took a look at the code, and I'm pretty sure I know what the issue is. We will be pushing a few fixes after the weekend; I'm hoping this one will be included as well 🙂
but I can't tell if that is only meant for the services queue, or can I experiment with it?
UnevenOstrich23 I'm not sure what exactly the question is, but if you are asking whether this is limited, the answer is no, it is not limited to that use case.
Specifically you can run as many agents in "services-mode" pulling from any queue/s that you need, and they can run any Task that is enqueued on those queues. There is no enforced limitation. Did that answer the question ?
Hi SpicyOtter88
plt.plot([0, 1], [0, 1], 'r--', label='')
It cannot have a legend without a label, so it gives it an "anonymous" label. I think it should just get "unlabeled 0", wdyt?
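For example, a minimal sketch where the line gets an explicit label so the legend entry is not anonymous ('baseline' is just a placeholder name):
import matplotlib.pyplot as plt

plt.plot([0, 1], [0, 1], 'r--', label='baseline')  # placeholder label instead of label=''
plt.legend()
plt.show()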
so moving b into a won't work if some subfolders are already there
I thought that if they are already there you would merge / overwrite, isn't that what you need?
a/b/c/2.txt seems like the result of moving b from Dataset B into folder b of Dataset A, what am I missing?
(My assumption is that you have both datasets locally on the same machine and that you can just copy the files from b of Dataset B into the b folder of Dataset A)
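If that is the setup, a minimal sketch of the local copy/merge step (paths are placeholders; dirs_exist_ok needs Python 3.8+):
import shutil

# merge folder b from the local copy of Dataset B into Dataset A's b folder,
# overwriting files that already exist there
shutil.copytree("datasets/B/b", "datasets/A/b", dirs_exist_ok=True)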
Hi PanickyMoth78
torch.save(net.state_dict(), PATH)  # auto-uploads to GCS
# get all the models from the Task
output_models = Task.current_task().models["output"]
# get the last one
last_model = output_models[-1]
# set meta-data
last_model.set_metadata(key="my key", value="my value", type="str")
Hi JitteryCoyote63 , let me check, this backwards compatibility might only apply for API version mismatch between the client and server.
ConvolutedChicken69
does it take the agent off the queue? does it know it's not available to take tasks?
You mean will it "release" the GPU? (i.e. the agent will pull another Task) ?
If so, then no it will not. An "Interactive Session" is (from the agent's perspective) a Task that will end at some point, and the agent will continue to monitor and run it until you manually close it. The idea is that you are actually using the GPU, hence no one else can run a job on it.
To shut it down, ...
Thanks!
Hmm from here : None
Could it be you do not have privileges to the resource, or that you did not provide credentials ?
Did that autoscaler work before ?
Hmm I assume it is not running from the code directory...
(I'm still amazed it worked the first time)
Are you actually using "." ?
There are also "completed", "aborted", and "queued".
Archived is actually a tag (a system tag, not a user tag). There is a state machine for moving from one state to the other. The special case is "published", which we probably should have called "locked": the idea is that if a Task/Model is published, you cannot reset it (and even deleting requires the force flag).
I would use additional user tags (or even system tags) to mark the "deployed" state, wdyt?
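For instance, a minimal sketch of marking a task as deployed with a user tag (the task id is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder id
task.add_tags(["deployed"])  # user tag marking the "deployed" state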
Okay that seems to explain it. Now the question is why it installed it in the wrong place.
Hmm, yes this fits the message, which basically says that it gave up on analyzing the code because it ran out of time. Is the execution very short? Or the repo very large?
if they're mission critical, but rather the clearml cache folder?
hmmm... they are important, but only when starting the process. any specific suggestion ?
(and they are deleted after the Task is done, so they are temp)
Only as "default docker + argument" , if you need the "extra_docker_arguments" (which I think a mount point is a good example for), then you have to add it in the conf file
CleanWhale17 nice ... 🙂
So the answer is Trains supports the Pipeline / Automation of it, but lacks that dataset integration (that is basically up to you to manage, with either artifacts or any other method)
The Allegro Enterprise allows you to rerun the code on a new version of the dataset from the UI (or automation) without changing a single line of code 🙂
This will allow them to experiment outside of ClearML and only switch to it when they are in an OK state. This will also help not to pollute ClearML spaces with half-baked ideas.
What's the value of running outside of an experiment management context? Don't you want to log it?
There is no real penalty here, no?!
ResponsiveHedgehong88 so I would suggest using execute_remotely in your code: basically you start locally, make sure everything is passed as intended, then from within the code you call task.execute_remotely(...)
which will stop the current process and enqueue the Task on the selected queue for the agent to execute.
https://github.com/allegroai/clearml/blob/0397f2b41e41325db2a191070e01b218251bc8b2/examples/advanced/execute_remotely_example.py#L127
This way you can both easily test...
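A minimal sketch of that flow (project, task and queue names are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")  # placeholder names
# ... verify locally that arguments / configuration are picked up as intended ...
task.execute_remotely(queue_name="default", exit_process=True)  # stops the local run and enqueues the Task
# everything below this line only runs when the agent executes the Task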
This will fix it, the issue is the "no default value" that breaks the casting:
@PipelineDecorator.component(cache=False)
def step_one(my_arg=""):
You mean like a name of the artifact ?
@PipelineDecorator.component(repo="..")
The imports are not recognized - they are not on the pythonpath of the task that the agent starts.
RoughTiger69 add the imports inside the function itself; you can also specify them on the component:
@PipelineDecorator.component(..., packages=["package", "package==1.2.3"])
or:
@PipelineDecorator.component(...)
def step(...):
    import pandas as pd  # noqa
    ...
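Putting it together, a minimal sketch of a component with in-function imports and explicit packages (the pandas requirement and return value are placeholders, assuming a clearml version where PipelineDecorator is importable from the top-level package):
from clearml import PipelineDecorator

@PipelineDecorator.component(cache=False, packages=["pandas"])
def step_one(my_arg=""):
    import pandas as pd  # noqa - imported inside the component so the agent resolves it
    return pd.DataFrame({"arg": [my_arg]})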
Seems the apiserver is out of connections, this is odd...
SuccessfulKoala55 do you have an idea ?
It was set to true earlier, I changed it to false to see if there would be any difference but doesn't seem like it
I would actually just add:
Task.add_requirements('google.cloud')
Before the Task.init call (notice, it has to be before the init call)
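i.e., a minimal sketch (project and task names are placeholders):
from clearml import Task

Task.add_requirements('google.cloud')  # must be called before Task.init
task = Task.init(project_name="examples", task_name="gcs task")  # placeholder names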
Hi MelancholyElk85
However, when I clone the pipeline from web UI and launch it once again, it works. Is there a way to bypass this?
In both cases, are you seeing a different behavior on the same machine running the agent (i.e. cloning from the UI vs code)?
okay, the odd thing is that git ls-remote --get-url origin should have returned the same...
what's your git version? (git --version)
I have installed a python environment with the virtualenv tool, let's say
/home/frank/env
and python is
/home/frank/env/bin/python3.
How can I reuse this virtualenv by configuring the clearml agent?
So the agent is already caching the entire venv for you, nothing to worry about, just make sure you have this line in your clearml.conf:
https://github.com/allegroai/clearml-agent/blob/249b51a31bee97d63f41c6d5542e657962008b68/docs/clearml.conf#L131
No need to provide it an existing...
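For reference, a sketch of that venv-cache section in the agent's clearml.conf (treat the values as placeholders taken from the default config):
agent {
    venvs_cache: {
        max_entries: 10
        free_space_threshold_gb: 2.0
        # setting the path is what enables the cache
        path: ~/.clearml/venvs-cache
    }
}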
Make sure you have the S3 credentials in your agent's clearml.conf :
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L210
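Something along these lines in the agent's clearml.conf (keys and bucket name are placeholders):
sdk {
    aws {
        s3 {
            # default credentials
            key: "<access-key>"
            secret: "<secret-key>"
            region: ""
            credentials: [
                {
                    # per-bucket credentials (placeholder bucket)
                    bucket: "my-bucket"
                    key: "<access-key>"
                    secret: "<secret-key>"
                }
            ]
        }
    }
}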
Hi ColossalAnt7
Following up on SuccessfulKoala55 's answer
I saw that there is a config file where you can specify specific users and passwords, but it currently requires
- mounting the configuration file (the one holding the user/pass) into the pod from a persistent volume.
I think the k8s way to do this would be to use mounted config maps and secrets.
You can use ConfigMaps to make sure the routing is always correct, then add a load-balancer (a.k.a a fixed IP) for the users a...
I think it should be treated as failed,
I'm not sure where I stand on the default behavior; it could easily be an argument for the pipeline controller.