
But these changes haven't necessarily been merged into main. The correct behavior would be to use the forked repo.
So I would expect the agent to pull from your fork, is that correct? Is that what you want to happen?
PungentLouse55 could you test again with the latest from GitHub? I think the issue should be solved:
pip install git+
Yes, it's a bit confusing. The gist of it is that we wanted the ability to have different configurations for different buckets.
Also, could you explain the difference between trigger.start() and trigger.start_remotely()?
start() will start the trigger process (the one "watching the changes") locally (this makes sense for debugging etc.)
start_remotely() will launch the trigger process on the "services" queue, where it should live forever 🙂
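For reference, a minimal sketch of the two modes, assuming the clearml.automation TriggerScheduler API (the actual trigger registration is omitted):

from clearml.automation import TriggerScheduler

trigger = TriggerScheduler()
# ... register triggers here, e.g. trigger.add_task_trigger(...) ...

# run the watcher process locally (handy for debugging):
trigger.start()

# or enqueue the watcher itself so an agent listening on the
# "services" queue keeps it alive forever:
trigger.start_remotely(queue='services')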
Okay so when I add trigger_on_tags, the repetition issue is resolved.
Nice!
This problem occurs when I'm scheduling a task. Copies of the task keep being put on the queue ...
CrookedWalrus33 I'm testing with the latest RC on a local minio and this is what I'm getting:
clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_3by281j8.tmp => 10.99.0.188:9000/bucket/debug/PyTorch MNIST train.8b6edc440cde4469b82e6da17e74c952/models/mnist_cnn.tar
clearml.Task - INFO - Waiting to finish uploads
clearml.Task - INFO - Completed model upload to
MNIST train.8b6edc440cde4469b82e6da17e74c952/models/mnist_cnn.tar
clearml.Task - INFO - Finished uploading
e...
Task.current_task().connect(training_args, name='huggingface args')
And you should be able to change them when launching remotely 🙂
SmallDeer34 BTW: "set_parameters_as_dict" will replace all the arguments (and is one-way) ...
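A small sketch of the difference, assuming the Task API (the values here are hypothetical):

from clearml import Task

task = Task.current_task()
training_args = {'lr': 0.001, 'epochs': 10}  # hypothetical values

# connect() is two-way: when running remotely, UI overrides flow back
# into the dict
task.connect(training_args, name='huggingface args')

# set_parameters_as_dict() replaces ALL existing parameters and is
# one-way: nothing flows back into the local dict
task.set_parameters_as_dict(training_args)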
@<1523704157695905792:profile|VivaciousBadger56>
Is the idea here the following? You want to use inversion-of-control such that I provide a function f to a component that takes the above dict as an input. Then I can do whatever I like inside the function f and return a different dict as output. If the output dict of f changes, the component is rerun; otherwise, the old output of the component is used?
Yes exactly! This way you...
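Something like this hedged sketch, assuming PipelineDecorator component caching (the function f and the dicts are made up for illustration):

from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(cache=True)
def f(config: dict) -> dict:
    # arbitrary user logic; the returned dict is the component's output
    return {'processed': sorted(config.items())}

# with cache=True, if the component's code and inputs are unchanged the
# previously stored output is reused instead of rerunning the step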
FYI: a hotfix for 1.3.0 (smoothing graphs) was just released, see v1.3.1
I am actually considering rolling back to 1.1.0,
Can you share why?
JitteryCoyote63 notice from the release notes of 1.2:
Important Note!
This release requires a MongoDB migration from previous versions. Please see
for more information.
I'm not sure you can downgrade that easily ...
TRAINS_WORKER_NAME=first_agent trains-agent --gpus 0
and
TRAINS_WORKER_NAME=second_agent trains-agent --gpus 0
LOL, that's the spirit! Making your team happy is key to success in adoption 🙂
BoredHedgehog47 if you are running it on K8s, the setup script runs before everything else, even before an agent appears on the machine. Unfortunately this means the output is not logged yet, hence the missing console lines (I think the next version of the glue will fix that).
In order to test you can do:
export TEST_ME
then inside your code you will be able to see it:
os.environ['TEST_ME']
Make sense?
I can't seem to find a difference between the two; why would matplotlib get listed while pandas does not... Is any other package missing?
BTW: as an immediate "hack", before your Task.init call add the following:
Task.add_requirements("pandas")
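For example, a minimal sketch (the project/task names are hypothetical):

from clearml import Task

# must be called BEFORE Task.init to affect the requirements analysis
Task.add_requirements("pandas")
task = Task.init(project_name="examples", task_name="pandas check")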
No it will not 🙂 the closer is closer to the actual print.
That said, I'm sure it would not be complicated to add.
But I have to wonder: this will really create a mess in the console log. So if someone wants it, it will be global (i.e. also in the visible console, not only in the backend). The case where the console on the machine itself is "clean" but the backend log is full of debug stuff is not clear to me.
If it cannot find the Task ID I'm guessing it is trying to connect to the demo server and not your server (i.e. configuration is missing)
Yes it does. I'm assuming each job is launched using a multiprocessing.Pool (which translates into a subprocess). Let me see if I can reproduce this behavior.
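Roughly this kind of minimal repro, under the Pool assumption above (the worker function and names are hypothetical):

import multiprocessing as mp
from clearml import Task

def job(i):
    # runs inside a pool subprocess; reporting should still be
    # attributed to the parent task
    return i * i

if __name__ == '__main__':
    task = Task.init(project_name='examples', task_name='pool repro')
    with mp.Pool(4) as pool:
        print(pool.map(job, range(8)))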
connect_configuration seems to take about the same amount of time, unfortunately!
I think it is a better solution. That said, from your description it sounds like the issue is the upload bandwidth (i.e. JSON-ing the dict itself), could that be it?
(and even 1000 entries seems like something that would end up as a ~1MB upload, which is not that much)
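A quick way to sanity-check that size estimate (the dict here is hypothetical):

import json

cfg = {'key_%d' % i: i for i in range(1000)}  # hypothetical 1000-entry dict
payload = json.dumps(cfg)
print('%.1f KB' % (len(payload.encode('utf-8')) / 1024))  # rough upload size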
EnviousStarfish54
and the 8 charts are actually identical
Are you plotting the same plot 8 times?
BTW: how did it get there?
Assuming the git repo looks something like:
.git
readme.txt
module
|
+---- script.py
The working directory should be "."
The script path should be: "-m module.script"
And under Configuration/Args, you should have:
args1 = value
args2 = another_value
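For completeness, a hedged sketch of what module/script.py might look like; argparse arguments are picked up automatically by Task.init, which is what populates Configuration/Args (the names here are hypothetical):

# module/script.py
import argparse
from clearml import Task

if __name__ == '__main__':
    task = Task.init(project_name='examples', task_name='module script')
    parser = argparse.ArgumentParser()
    parser.add_argument('--args1', default='value')
    parser.add_argument('--args2', default='another_value')
    args = parser.parse_args()
    print(args.args1, args.args2)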
Make sense?
Hi SubstantialElk6
Yes, you are correct, the glue only needs to change the yaml and it will work.
When you say "Dev end", what do you mean? I was thinking of adding additional glue for multi-node and just adding queues, for example add a 4-node queue and attach a glue to it, wdyt?
Regarding Horovod: Horovod spins up its own nodes, so integration with k8s is not trivial (regardless of ClearML). That said, I know they do have support for Horovod in the Enterprise edition, but I'm not sure ...
This is odd. I was running the example code from:
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
It is stored inside a repo, but the steps that are created (i.e. checking the Task that is created) do not have any repo linked to them.
What's the difference ?
It can also work by running on multiple known nodes.
Horovod sits on top of OpenMPI, which needs SSH to open multiple nodes. I'm not sure how one would connect it without passing the SSH keys from one node to the other and making sure they can communicate directly. (Not saying it is not possible, just a few things to configure before it works; the Enterprise edition removes the need for the direct SSH connection between the nodes.)
How would I add a glue for multinode?
Basic...
SubstantialElk6 could you post the "Installed packages" section under Execution of this specific Task?
Hi @<1687643893996195840:profile|RoundCat60>
Are you running on AWS?
So assuming they are all on the same LB IP, you should do:
LB 8080 (https) -> instance 8080
LB 8008 (https) -> instance 8008
LB 8081 (https) -> instance 8081
It might also work with:
LB 443 (https) -> instance 8080
We're not using a load balancer at the moment.
The easiest way is to add an ELB and have Amazon add the HTTPS on top (basically a few clicks in their console).
Hmm, I would recommend passing it as an artifact, or returning its value from the decorated pipeline function. Wdyt?
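For example, a hedged sketch of both options with pipeline decorators (the step and pipeline names are hypothetical):

from clearml import Task
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component()
def produce():
    data = {'some': 'result'}
    # option 1: store it explicitly as an artifact on the step's task
    Task.current_task().upload_artifact('data', data)
    # option 2: just return it; returned objects are passed between steps
    return data

@PipelineDecorator.pipeline(name='example', project='examples', version='0.1')
def run_pipeline():
    return produce()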