each of them gets pushed as a separate Model entity, right?
Correct
But there's only one unique model with multiple different versions of it
Do you see multiple lines in the Model repository? (Every line is an entity.) Basically, if you store the model under the same local file it will overwrite the model entry (i.e. reuse it and update the file itself); otherwise you are creating a new model. By "version" do you mean progress over time?
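To illustrate, a minimal sketch of the difference (PyTorch and the file names here are just assumptions; ClearML's automatic framework logging picks up the torch.save calls):

from clearml import Task
import torch
import torch.nn as nn

task = Task.init(project_name='examples', task_name='train')  # hypothetical project/task names
model = nn.Linear(10, 2)

# Saving to the same local file on every epoch reuses a single Model entry (the file is updated in place)
for epoch in range(3):
    torch.save(model.state_dict(), 'model.pt')

# Saving to a different file creates an additional Model entry
torch.save(model.state_dict(), 'model_final.pt')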
Do you mean the Task already exists, or do you want to create a Task from the code?
(Do notice that even though you can spin two agents on the same GPU, the NVIDIA drivers cannot share allocated GPU memory, so if one Task consumes too much memory the other will not have enough free GPU memory to run.)
Basically the same restriction as manually launching two processes using the same GPU
I'm assuming you are looking for the AWS autoscaler, spinning EC2 instances up/down and running daemons on them.
https://github.com/allegroai/clearml/blob/master/examples/services/aws-autoscaler/aws_autoscaler.py
https://clear.ml/docs/latest/docs/guides/services/aws_autoscaler
If this is the case, then we do not change the matplotlib backend
Also:
"I've attempted converting the mpl image to PIL and using report_image to push the image, to no avail."
What are you getting? An error / exception?
BTW: is this on the community server or self-hosted (aka docker-compose)?
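Meanwhile, a minimal sketch of rendering the matplotlib figure to PIL and pushing it with report_image (the project/task names and the plot itself are placeholders):

import io
import matplotlib
matplotlib.use('Agg')  # assumption: headless backend
import matplotlib.pyplot as plt
from PIL import Image
from clearml import Task

task = Task.init(project_name='examples', task_name='mpl to PIL')
fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4])

# Render the figure into an in-memory PNG and open it as a PIL image
buf = io.BytesIO()
fig.savefig(buf, format='png')
buf.seek(0)
pil_image = Image.open(buf)

# Report the PIL image as a debug sample
task.get_logger().report_image(title='plots', series='example', iteration=0, image=pil_image)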
I think you can force it to be started, let me check (I'm pretty sure you can on an aborted Task).
TenseOstrich47 this looks like Elasticsearch is out of space...
Hi @<1541954607595393024:profile|BattyCrocodile47> and @<1523701225533476864:profile|ObedientDolphin41>
"we're already on AWS, why not use SageMaker?"
TBH, I've never gone through the ML workflow with SageMaker.
LOL I'm assuming this is why you are asking 🙂
- First, you can use SageMaker and still log everything to ClearML (2-line integration). At least you will have visibility into everything that is running/failing 🙂
- A SageMaker job is a container, which means for ...
compression=ZIP_DEFLATED if compression is None else compression
wdyt?
Hi ObnoxiousStork61
Is it possible to report, e.g., validation scalars but shifted by 1/2 iteration?
No 🙂 iterations are integers
What's the reason for the shift?
I'm also curious 🙂
Could it be the Args section of the task it clones does not have the "input_train_data" argument?
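As a sanity check, a minimal sketch of what the original script needs so that "input_train_data" ends up under the Args section (everything except the argument name is an assumption):

from argparse import ArgumentParser
from clearml import Task

task = Task.init(project_name='examples', task_name='train')  # hypothetical project/task names

parser = ArgumentParser()
# The argument must be defined (and parsed) for it to appear under the task's Args section
parser.add_argument('--input_train_data', type=str, default='')
args = parser.parse_args()
print(args.input_train_data)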
Docker mode. They do share the same folder with the training data mounted as a volume, but only for reading the data.
Any chance they try to store the TensorBoard logs in this folder? This could lead to "No such file or directory: 'runs'" if one is deleting it while the other is trying to access it, or similar scenarios
Hi SuperiorDucks36
Could you post the entire log?
(could not resolve host seems to be coming from the "git clone" call).
Are you able to manually clone the repository on the machine running trains-agent?
See if this helps
Hi HealthyStarfish45
You can disable the entire TB logging:
Task.init('examples', 'train', auto_connect_frameworks={'tensorflow': False})
Great!
I'll make sure the agent outputs the proper error 🙂
Is there a way to filter experiments in a hyperparameter sweep based on a given range of a parameter/metric in the UI?
Are you referring to the HPO example, or the Task comparison?
Okay let me see if I can think of something...
Basically crashing on the assertion here?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L495
Could it be you are passing "Args/resume" True, but not specifying the checkpoint?
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train.py#L452
I think I know what's going on:
https://github.com/ultralytics/yolov5/blob/d95978a562bec74eed1d42e370235937ab4e1d7a/train...
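If the resume flag is indeed the issue, a rough sketch of cloning the Task and pointing the resume argument at an actual checkpoint (the task id, checkpoint path and queue name are placeholders):

from clearml import Task

# Clone the existing training Task and override the resume argument
cloned = Task.clone(source_task='<task_id>', name='yolov5 resume')
cloned.set_parameters({'Args/resume': 'runs/train/exp/weights/last.pt'})  # point at a real checkpoint, not just True
Task.enqueue(cloned, queue_name='default')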
Any chance you can zip the entire folder? I can't figure out what's missing, specifically "from config_files", i.e. I have no package or file named config_files
Oh dear, I think your theory might be correct, and this is just MongoDB preallocating storage.
Which means the entire /opt/trains just disappeared
Interesting use case, do you already have the connect_configuration call in the code, or do we need to somehow create it?
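If you already have it, a minimal sketch of what I'm thinking of (the file name and project/task names are assumptions):

from clearml import Task

task = Task.init(project_name='examples', task_name='train')
# Register a local configuration file with the Task; when executed by the agent,
# the configuration content is taken from the server instead of the local copy
config_path = task.connect_configuration('config.yaml', name='model config')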
The notebook path goes through a symlink a few levels up the file system (before hitting the repo root, though)
Hmm sounds interesting, how can I reproduce it?
The notebook kernel is also not the default kernel,
What do you mean?
Hi StickyMonkey98
a very large number of running and pending tasks, and doing that kind of thing via the web-interface by clicking away one-by-one is not a viable solution.
Bulk operations are now supported, upgrade the clearml-server to 1.0.2 🙂
Is it possible to fetch a list of tasks via Task.get_tasks,
Sure:
Task.get_tasks(project_name='example', task_filter=dict(system_tags=['-archived']))
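And a quick sketch of iterating over the returned Task objects (just printing a few fields):

from clearml import Task

tasks = Task.get_tasks(project_name='example', task_filter=dict(system_tags=['-archived']))
for t in tasks:
    print(t.id, t.name, t.status)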
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Correct, I think the "vault" option is only available on the paid tier 🙂
but how should we do this for the credentials?
I'm not sure how to pass them; wouldn't it make sense to give the agent all-access credentials?
Basically the default_output_uri will cause all models to be uploaded to this server (with a specific subfolder per project/task)
You can have the same value there as the files_server.
The files_server is where you have all your artifacts / debug samples
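For example, a minimal sketch of setting it per Task from code (the URL is just a placeholder, use your own files server / storage):

from clearml import Task

# All models/artifacts from this Task are uploaded to output_uri,
# under a per-project / per-task subfolder
task = Task.init(project_name='examples', task_name='train',
                 output_uri='http://files.example.com:8081')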
Hi TenseOstrich47, what's the matplotlib version and clearml version you are using?