hardware monitoring etc.
This is averaged and sent only every 30 seconds, so it's not a lot of calls.
I just saw that I went through the first 200k API calls rather fast, so that is how I rationalized it.
Yes, that kind of makes sense
Once every 2000 steps, which is every few seconds. So in theory those ~20 scalars should be batched since they are reported more or less at the same time. It's a bit odd that the API calls added up so quickly anyway.
The default flush is ever...
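For context, a minimal sketch of adjusting the flush period through the Logger (assuming the standard set_flush_period API; project/task names are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="reporting-rate")

# Reported scalars are buffered and flushed periodically; a longer flush
# period batches more reports into fewer API calls (value in seconds).
task.get_logger().set_flush_period(30.0)
```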
Hi SaltySpider22
question 1: are parallel writes to a dataset with the same version possible?
When you say parallel, what do you mean? From multiple machines?
What's the recommended way to append to the dataset in a future version?
Once a dataset was finalized the only way to add files is to add another version that inherits from the previous one (i.e. the finalized version becomes the parent of the new version)
If you are worried about multip...
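To illustrate the versioning flow described above, a rough sketch (project/dataset names and the folder are placeholders):
```
from clearml import Dataset

# Get the finalized version we want to extend.
parent = Dataset.get(dataset_project="my_project", dataset_name="my_dataset")

# Create a new version; the finalized version becomes its parent.
child = Dataset.create(
    dataset_project="my_project",
    dataset_name="my_dataset",
    parent_datasets=[parent.id],
)
child.add_files("new_files/")
child.upload()
child.finalize()
```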
Hi CleanPigeon16
can I make the steps in the pipeline use the latest commit in the branch?
Yes:
manually clone the step's Task (in the UI), then in the UI edit the Execution section, change to "last commit on branch" and specify the branch name, or do the same programmatically (as above, clone + edit)
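A rough sketch of the programmatic clone+edit route (task/branch names are placeholders, and Task.set_script is assumed to be available in your clearml version):
```
from clearml import Task

# Clone the pipeline step's Task and point it at a branch.
template = Task.get_task(project_name="my_project", task_name="my_step")
cloned = Task.clone(source_task=template, name="my_step (latest on branch)")

# Clearing the commit is assumed to mean "use the latest commit on the branch".
cloned.set_script(branch="my_branch", commit="")
```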
ValueError: Could not parse reference '${run_experiment.models.output.-1.url}', step run_experiment could not be found
Seems like the "run_experiment" step is not defined. Could that be ...
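For reference, the step name inside the '${...}' reference has to match a step added to the controller. A minimal sketch (project/task names are placeholders):
```
from clearml import PipelineController

pipe = PipelineController(name="my_pipeline", project="examples", version="1.0.0")

# The step name here is what '${run_experiment....}' resolves against.
pipe.add_step(
    name="run_experiment",
    base_task_project="examples",
    base_task_name="train model",
)
pipe.add_step(
    name="deploy",
    parents=["run_experiment"],
    base_task_project="examples",
    base_task_name="deploy model",
    # Reference the last output model URL of the 'run_experiment' step.
    parameter_override={"General/model_url": "${run_experiment.models.output.-1.url}"},
)
pipe.start()
```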
Hi WackyRabbit7
I believe this is fixed in clearml-server 1.1 (this is a plotly color issue), releasing later today or tomorrow
Fixed in pip install clearml==1.8.1rc0
ResponsiveCamel97
BTW: any reason not to allow this flexibility ?
ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again let me know maybe will better understand how it happened (i.e. the "master" trains-agent gets stuck for some reason)
What are the Python, torch, and clearml versions?
Any chance this is reproducible?
What's the full error trace/stack you are getting?
Can you try to debug it to where exactly it fails here?
https://github.com/allegroai/clearml/blob/86586fbf35d6bdfbf96b6ee3e0068eac3e6c0979/clearml/binding/import_bind.py#L48
RoughTiger69 wdyt?
I added the following to the clearml.conf file
the conf file that is on the worker machine?
Hi NastySeahorse61
Did you archive and then delete the experiments from the archive?
BTW: I think this question belongs to
Sounds great! I really like that approach, thanks GrotesqueDog77 !
Hmm I think you are correct:
:param auto_create: Create new dataset if it does not exist yet
it should have created it, this seems like a bug, I'll make sure to pass it along
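For context, this is the call being discussed; a minimal sketch (project/dataset names are placeholders):
```
from clearml import Dataset

# auto_create=True is documented to create the dataset if it does not exist yet.
ds = Dataset.get(
    dataset_project="my_project",
    dataset_name="my_dataset",
    auto_create=True,
)
```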
Can you please tell me if you know whether it is necessary to rewrite the Docker compose file?
Not by default; it should basically work out of the box as long as you create the same data folders on the host machine (e.g. /opt/clearml)
Thanks PompousBaldeagle18 !
Which software did you use to create the graphics?
Our designer, should I pass along your compliments?
You should add which tech is being replaced by each product.
Good point! we are also missing a few products from the website, they will be there soon, hence the "soft launch"
Hi GrotesqueDog77
What do you mean by share resources? Do you mean compute or storage?
Hi SkinnyPanda43
No idea what the ImageId actually is.
That's the AMI image ID that the new EC2 instance will be started with, makes sense?
I guess I just have to make sure that the total memory usage of all parallel processes is not higher than my GPU's memory.
Yep, unfortunately I'm not aware of any way to do that automatically
I think task.init flag would be great!
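In the meantime, a generic workaround sketch that checks free GPU memory with NVML before launching another process (this is not a ClearML feature; pynvml usage and the threshold are assumptions):
```
import pynvml

def gpu_has_free_memory(required_bytes, device_index=0):
    """Return True if the GPU has at least `required_bytes` free."""
    pynvml.nvmlInit()
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return info.free >= required_bytes
    finally:
        pynvml.nvmlShutdown()

# e.g. only start another training process if ~4 GB are free
if gpu_has_free_memory(4 * 1024 ** 3):
    print("enough free GPU memory to launch another process")
```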
ZanyPig66 is this reproducible? This sounds like a bug, what's the TB version and OS you are using?
Is this example working for you (i.e. do you see debug images)?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
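For a quick local check, a minimal sketch along the lines of that example (project/task names are placeholders; anything logged via add_image should appear under the task's Debug Samples):
```
import numpy as np
from torch.utils.tensorboard import SummaryWriter
from clearml import Task

task = Task.init(project_name="examples", task_name="tb-debug-images")
writer = SummaryWriter("runs/debug")

# A random CHW image logged through TensorBoard; ClearML should pick it up
# automatically and show it in the Debug Samples section.
image = np.random.rand(3, 64, 64).astype(np.float32)
writer.add_image("random/example", image, global_step=0)
writer.close()
```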
'relaunch_on_instance_failure'
This argument is not part of the Pipeline any longer; are you running the latest clearml Python package version?
Hope you don't mind linking to that repo
LOL
Hi ExcitedFish86
In Pytorch-Lightning I use DDP
I think a fix for pytorch multi-node / process distribution was committed to 1.0.4rc1, could you verify it solves the issue? (rc1 should fix this specific issue)
BTW: no problem working with clearml-server < 1
Hi PanickyMoth78
can receive access to a GCP project and use GKE to spin up clusters and workers, or would that be on the customer to manage.
It does, and also supports AWS.
That said, only the AWS one is part of the open source; both are part of the paid tier (I think Azure is in testing)
IrritableOwl63 in the profile page, look at the bottom right corner
The only workaround I can think of is: series = series + 'IoU>X'
It doesn't look that bad
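A quick sketch of that workaround with explicit reporting (title/series/threshold are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="series-workaround")
logger = task.get_logger()

iou_threshold = 0.5
value = 0.73
series = "val"
# Workaround: encode the threshold into the series name itself.
series = series + " IoU>{}".format(iou_threshold)
logger.report_scalar(title="IoU", series=series, value=value, iteration=0)
```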
AbruptHedgehog21 the bucket and the full link are registered on the model object itself, you can see them in the ui, under the models tab. The only thing you actually need to pass inside is the credentials. Make sense?
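For illustration, a sketch of pulling that registered link back off the model (the model id is a placeholder; the bucket credentials themselves go into clearml.conf on the machine doing the download):
```
from clearml import InputModel

# The bucket and full link were registered on the model when it was uploaded.
model = InputModel(model_id="<your-model-id>")
print(model.url)  # e.g. an s3:// link to the stored weights

# With credentials configured in clearml.conf, this downloads a local copy.
local_path = model.get_local_copy()
```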
how did you try to restart them ?
Yes, but how did you restart the agent on the remote machine ?