No worries 🙂 glad it worked
Okay, how do I reproduce it?
Hi OutrageousGiraffe8
I was not able to reproduce 🙂
Python 3.8 Ubuntu + TF 2.8
I get both metrics and model stored and uploaded
Any idea?
Thanks BoredHedgehog47 !
And yes, if the Task.init() call was only in main.py, then the TB inside the subprocess (train.py) would, as you observed, not be captured.
Did you by any chance test calling Task.init in both main.py and train.py ?
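To make the suggestion concrete, here is a minimal sketch of calling Task.init in both files; the project/task names and the subprocess call are illustrative, not taken from this thread:
```python
# --- main.py ---
import subprocess
from clearml import Task

task = Task.init(project_name="examples", task_name="main")
subprocess.run(["python", "train.py"], check=True)

# --- train.py ---
from clearml import Task

# Calling Task.init() again inside the subprocess should attach to the task
# created by main.py (clearml passes the master task through environment
# variables), so the TensorBoard output written here is captured as well.
task = Task.init(project_name="examples", task_name="main")
```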
Hi HarebrainedBear62
What's the type of data?
Hi
The Squash operation copies all the data, and the result is no longer linked to previous commits?
Yes. Basically, the idea is that if you have a data version that relies on many parents that need to be merged, the squash will create a merged copy and push it all as a single version; after that, yes, the parent versions are no longer needed.
I thought this operation was like git squash, but it seems to me ...
yeah... we did not want to actually delete the parents because, unlike git, the operation is done ...
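For reference, a minimal sketch of what that looks like from the SDK (assuming the Dataset.squash classmethod; the dataset name and IDs are illustrative):
```python
from clearml import Dataset

# Merge several parent versions into a single flat version.
merged = Dataset.squash(
    dataset_name="my_dataset_merged",
    dataset_ids=["<parent_version_id_1>", "<parent_version_id_2>"],
)
# The result is one self-contained version with the merged content;
# the parent versions are left in place, not deleted.
print(merged.id)
```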
Okay, this is more complicated but possible.
The idea is to write a glue layer (service) that pulls from the (i.e. UI) queue,
submits the SLURM job,
and puts it in a pending queue (so you know the job is waiting in the SLURM scheduler)
There is a template here:
https://github.com/allegroai/trains-agent/blob/master/trains_agent/glue/k8s.py
I would love to help set up a SLURM glue in a similar manner
what do you think?
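To make the idea concrete, a rough sketch of such a service loop; all names and API calls here are my assumption, loosely modeled on the k8s glue template linked above, and untested:
```python
import subprocess
import time

from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
# UI-facing queue that users enqueue into (queue name is hypothetical)
pull_queue_id = client.queues.get_all(name="slurm_submit")[0].id
# Queue that marks tasks as waiting inside the SLURM scheduler
PENDING_QUEUE = "slurm_pending"

while True:
    # Pull the next enqueued task (if any) from the UI-facing queue
    response = client.queues.get_next_task(queue=pull_queue_id)
    if response and getattr(response, "entry", None):
        task_id = response.entry.task
        # Hand the task to SLURM; the job itself just runs the agent
        subprocess.run(
            ["sbatch", f"--wrap=clearml-agent execute --id {task_id}"],
            check=True,
        )
        # Park it in the pending queue so the UI shows it is waiting in SLURM
        Task.enqueue(task_id, queue_name=PENDING_QUEUE)
    else:
        time.sleep(5.0)
```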
This is assuming you can just run two copies of your code, and they will become aware of one another.
It should work 🙂 as long as the versions match; if they don't, the venv will install the version you need (which is great, the only penalty is the install; download-wise it will be cached)
I had no idea it was going to do that and sent your servers over 1.4M API hits unintentionally
Yeah, that is way too much. I think it relates to the frequency at which it updates the console 🙂
Okay, this points to an issue with the k8s glue; I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue?
Hi ShallowArcticwolf27
from the command line to a remote machine while loading a local .env file as a configuration object?
Where would the ".env" go to? Are we trying to pass it to the remote machine somehow?
First let's try to test if everything works as expected. Since 405 really feels odd to me here. Can I suggest following one of the examples start to end to test the setup, before adding your model?
Okay, I found the issue (I think):
If the images are reported very quickly, it will "decide" you are about to override the previous one (i.e. 101 -> overwriting 0, which makes sense; the bug was that it would disable the 101 from uploading and not the 0 🙂)
Test fix:
in /backend_interface/metrics/events.py, line 292, change:
    last_count = self._get_metric_count(self.metric, self.variant, next=False)
    if abs(self._count - last_count) > int(self._file_history_size):
        ...
but when I run the same task again it does not map the keys ...
SparklingElephant70 what do you mean by "map the keys" ?
We should probably have a section on that (i.e. running two agents on the same GPU, then explain how to use it)
Hi CrookedAlligator14
Hi, I just started using clearml, and it is amazing!
Thank you! 🙂
When I enqueue the task, the venv is set up and starts to install all the packages from the requirements.txt file, but at the end I get the following in the console:
Can you try with the latest agent? We improved the support for PyTorch (they now have a proper pypi-compatible repo). Can you see if that solves it?
pip3 install clearml-agent==1.5.0rc0
one can containerise the whole pipeline and run it pretty much anywhere.
Does that mean the entire pipeline will be running on the instance spinning the container?
From here: this is what I understand:
https://kedro.readthedocs.io/en/stable/10_deployment/06_kubeflow.html
My thinking was I can use one command and run all steps locally while still registering all "nodes/functions/inputs/outputs etc" with clearml such that I could also then later go into the interface and clone an...
This is already part of the docker-compose file,
https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
(Venv mode makes sense if running inside a container; if you need docker support you will need to mount the docker socket inside)
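For reference, assuming a standard Docker setup, mounting the socket would look something like adding `-v /var/run/docker.sock:/var/run/docker.sock` to the agent container's volume mounts, so the agent can spin up sibling containers on the host.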
What exactly is the error you're getting from ClearML? And what do you have in the configuration file?
@<1587253076522176512:profile|HollowPeacock33>
Is this a commercial ad? This seems out of scope for this channel
Can you expand?
... the one for the last epoch and not the best one for that experiment,
Well, now we realized there is an option to use "min_global" on the sign, is this what we need?
Yes 🙂 (or max_global)
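For context, a minimal sketch of where that flag goes, assuming the HyperParameterOptimizer interface; everything except objective_metric_sign is illustrative:
```python
from clearml.automation import HyperParameterOptimizer, UniformParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<base_task_id>",  # hypothetical task to optimize
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    # "min_global" selects the best (lowest) value reported over the whole
    # run, instead of the value at the last epoch; "max_global" is the
    # equivalent for metrics you want to maximize.
    objective_metric_sign="min_global",
)
```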
Most likely yes, but I don't see how clearml would have an impact here, I am more inclined to think it would be a pytorch dataloader issue, although I don't see why
These are most certainly dataloader processes. But clearml-agent, when killing the process, should also kill all subprocesses, and it might be that something is going on that prevents it from killing the subprocesses ...
Is this easily reproducible? Can you verify it is still the case with the latest RC of clearml-agent?
Does this require you run the pipeline locally (I see you have set default execution queue) or do any other specific set-up?
Yes, this means the pipeline logic runs manually/locally (logic means launching components, not the actual compute)
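A minimal sketch of that mode, assuming PipelineController.start_locally(); the project, pipeline, and step names are illustrative:
```python
from clearml import PipelineController

pipe = PipelineController(name="demo-pipeline", project="examples", version="1.0")
pipe.set_default_execution_queue("default")  # where the components execute
pipe.add_step(
    name="train",
    base_task_project="examples",
    base_task_name="train task",  # an existing task to clone as the step
)

# start_locally() runs only the pipeline *logic* in this process; with
# run_pipeline_steps_locally=False (the default) each step is still
# enqueued and executed by an agent serving the queue.
pipe.start_locally(run_pipeline_steps_locally=False)
```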
Please have a go at it. I'm sure some quirks in the pseudo code are missing, but it should work, and I'll gladly help set it up
AstonishingRabbit13
https://github.com/googleapis/google-cloud-python/issues/4941#issuecomment-369472576
Check the OpenSSL version and the date; this seems like a low-level SSL error (even before authentication)
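(For example, run `openssl version` and `date` on the affected machine; TLS certificate validation fails when the system clock is far off, which produces this kind of pre-authentication SSL error.)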
Hi @<1623491856241266688:profile|TenseCrab59>
Is it kind of dark, or are you asking about the graphs?