Reputation
Badges 1
25 × Eureka!Thank you @<1523701949617147904:profile|PricklyRaven28> !!!
Let me see if we can reproduce and how to solve it
The main issue is applying the patch requires git clone and that would fail on local (not pushed) commits.
What's the use case itself ?
(btw, if you copy the uncommitted changed into a file and git apply it, it will work)
We do upload the final model manually.
If this is the case just name it based on the parameters, no? am I missing soemthing?
https://github.com/allegroai/clearml/blob/cf7361e134554f4effd939ca67e8ecb2345bebff/clearml/model.py#L1229
I was just wondering if i can make the autologging usable.
It kind of assumes these are different "checkpoints" on the same experiment, and then stores them based on the file name
You can however change the model names later:
` Task.current_task().mo...
TrickyFox41 are you saying that if you add Task.init inthe code it works, but when you are calling "clearml-task" it does not work? (in both cases editing the Args/overrides ?
HiΒ SmoggyGoat53
There is a storage limit on the file server (basically 2GB per file limit), thisΒ is the cause of the error.
You can upload the 10GB to any S3 alike solution (or a shared folder). Just set the "output_uri" on the Task (either at Task.init or with Task.output_uri = " s3://bucket ")
Hi IrateBee40
What do you have in your ~/clearml.conf
?
Is it pointing to your clearml-server ?
Hi @<1523701260895653888:profile|QuaintJellyfish58>
Is there a way or a trigger to detect when the number of workers in a queue reaches zero?
You mean to spin them down? what's the rational ?
Iβd like to implement a notification system that alerts me when there are no workers left in the queue.
How are they "dropping" ?
Specifically to your question, let me check I'm sure there is an API that get's that data becuase you can see it in the UI π
Hi @<1695969549783928832:profile|ObedientTurkey46>
Use --services-mode in the agent , it will run many Tasks on the same machine, this is usually associated with the services queue, but can be run on any queue. This way you could have the same machine easily running those multiple "control" tasks.
wdyt?
restart the notebook kernel ?
Hi WackyRabbit7 ,
Yes we had the same experience with kaggle competitions. We ended up having a flag that skipped the task init :(
Introducing offline mode is on the to do list, but to be honest it is there for a while. The thing is, since the Task object actually interacts with the backend, creating an offline mode means simulation of the backend response. I'm open to hacking suggestions though :)
No idea, I just remember it is relatively old π
because step can be constructed with multiple
sub-components
but not all of them might be added to the UI graph
Just to make sure I fully understand when we decorate with @sub_node we want that to also appear in the UI graph (and have it's own Task / metrics etc)
correct?
Hi ConvolutedSealion94
Just making sure, you spinned the docker-compose of the clearml serving as well ?
Will they get ordered ascending or descending?
Good point, I'll check the docs... but I think they do not specify
https://clear.ml/docs/latst/docs/references/sdk/task#taskget_tasks
From the code it seems the ordered is not guaranteed.
You can however pass '-last_update'
: order_by
which will give you the latest updated first
` task_filter = {
'page_size': 2,
'page': 0,
'order_by': ['last_metrics.{}.{}'.format(title, series), '-last_update']
}
Task.get_tasks(...
JitteryCoyote63 Should be quite safe, there is no major change that I'm aware of on the ClearML side that can effect it.
That said, wait for after the weekend, we are releasing a new ClearML package, I remember there was something with the model logging, it might not directly have something to do with ignite, but worth testing on the latest version.
WickedGoat98 Notice this is not the "clearml-agent-services" docker but "clearml-agent" docker image
Also the default docker image is "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
Other than that quite similar :)
docker mode. they do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard on this folder ? This could lead to "No such file or directory: 'runs'" if one is deleting it, and the other is trying to access, or similar scenarios
HealthyStarfish45 We are now working on improving the k8s glue (due to be finished next week) after that we can take a stab at slurm, it should be quite straight forward. Will you be able to help with a bit of testing (setting up a slurm cluster is always a bit of a hassle π )?
PompousBeetle71 notice that starting with this version when you set model tags they will be stored as user tags , which you can change and edit in UI. So if you still need the system tags you have to access them directly.
PompousBeetle71 so basically exclude parameters that are considered "local" only, so that other people will not accidentally use them?
AstonishingSeaturtle47 that's awesome! Could you explain the hack, it might be helpful for others (I assume :))
π CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve the it ...
Hi @<1624941407783358464:profile|GrievingTiger47>
I think you should try to contact the sales guys here: None
Hi Team,Can i clone experiment shared by some one, via link?
You mean someone that is not in your workspace ? (I'm assuming app.clear.ml ?)
LOL, if this is important we probably could add some support (meaning you will be able to specify it in the "installed packages" section, per Task).
If you find an actual scenario where it is needed, I'll make sure we support it π
My goal is to automatically run the AWS Autoscaler task on a clearml-agent pod when I deploy
LovelyHamster1 this is very cool!
quick question, if you are running on EKS, why not use the EKS autoscaling instead of the ClearML aws EC2 autoscaling ?
Hi @<1570220858075516928:profile|SlipperySheep79>
I think this is more complicated than one would expect. But as a rule of thumb, console logs and metrics are the main ones. I hope it helps? Maybe sort by number of iterations in the experiment table ?
BTW: probable better to ask in channel
Yey!
Out of curiosity, what's the workflow with snowflake?
Hi @<1547028116780617728:profile|TimelyRabbit96>
Trying to do model inference on a video, so first step in
Preprocess
class is to extract frames.
Basically this depends on the RestAPI, usually would will be sending a link to data to be processed and returned Synchronously
What you should have a custom endpoint doing the extraction, send Raw data into another endpoint doing the model inference, basically think "pipeline" end points:
[None](https://github.com/allegro...
@<1523704157695905792:profile|VivaciousBadger56> regrading: None
Is this a discussion or PR ?
(general ranting is saved for our slack channel π )