ScantChimpanzee51
Moderator
15 Questions, 49 Answers
  Active since 10 January 2023
  Last activity 2 months ago

Reputation

0

Badges 1

49 × Eureka!
0 Votes 10 Answers 671 Views
[ClearML with PyTorch-based distributed training] Hi everyone! Is the combination of ClearML with torch.distributed.launch or torchrun actively supported? A ...
one year ago
0 Votes 2 Answers 547 Views
[Auto scaler / API client does not see tasks in queue] We had used the AWS auto scaler (based on the aws_autoscaler.py script in the repo) and it worked grea...
11 months ago
0 Votes 2 Answers 627 Views
[Potential bug where the script path option is changed for remote runs] Hi everyone! We’re still using ClearML quite a bit, usually by running the first, sma...
one year ago
0 Votes 3 Answers 803 Views
one year ago
0 Votes 4 Answers 703 Views
one year ago
0 Votes 16 Answers 662 Views
[Injecting secrets into a ClearML Agent / accessing clearml.conf from within a Task] Hi everyone, we are using the ClearML AWS Autoscaler (still awesome 😉 )...
one year ago
0 Votes 1 Answers 489 Views
Quick question: Is there a way for a task that is executing remotely to find out which ClearML queue it is in or was in?
11 months ago
0 Votes 6 Answers 534 Views
[Errors when migrating ClearML Server from AWS to GCP] Hi everyone! As we’re using ClearML quite a bit, we’d love to take it with us when migrating our cloud...
one year ago
0 Votes 3 Answers 522 Views
[Instance AutoScaler for GCP] In case someone else is interested, we have built an AutoScaler for GCP, too. It works similarly to the AWS one in the ClearML re...
11 months ago
0 Votes 18 Answers 722 Views
How do I view Debug Samples images in the browser when the output_uri is on Google Cloud Storage? Unlike for AWS storage, I do not get a popup windo...
one year ago
0 Votes 5 Answers 668 Views
Hi everyone, quick question: Is there any easy way to get a task's full output directory ? E.g. when I create a task with task = Task.init(..., output_uri=" ...
one year ago
0 Votes 12 Answers 761 Views
[Task gets interrupted / aborted / reset when in offline mode] For local testing, we have added a --no-clearml option to our code that sets task.set_offline(...
one year ago
0 Votes 7 Answers 821 Views
Hi everyone, I’m getting an error during model upload to S3. The error shows up in the console like below and I don’t see any uploaded objects in S3: 2022-10...
one year ago
0 Votes 4 Answers 637 Views
[Caching of environment and storage when using AWS auto scaler] First off : We are aiming to set up ClearML for large-scale DL training for multiple projects...
one year ago
0 Votes 2 Answers 690 Views
[WebUI-based options injection not working] Hey everyone! Since our training repo has gotten quite complex, we configure all setup in an options.yml file whi...
one year ago
0 [Potential Bug Where The

Hi John, thanks for getting back to me!
So it shows up in the UI like shown below. It happens both when “recording” the local run on Mac and on Linux.

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

Hi @<1523701205467926528:profile|AgitatedDove14> , so I’ve managed to reproduce a bit more.
When I run very basic code via torchrun or torch.distributed.run then multiple ClearML tasks are created and visible in the UI (screenshot below). The logs and scalars are not aggregated but the task of each rank reports its own.

If however I branch out via torch.multiprocessing like below, everything works as expected. The “script path” just shows the single python script, all logs an...

one year ago
0 Is It Possible To Run Multiple Agent On Ec2 Machines Started By The Autoscaler? Or Have The One Agent Run Multiple Queue Jobs At Once? E.G. Having The Autoscaler Start 1X P3.8Xlarge (4 Gpu) On Aws Might Be Better Than 4X P3.2Xlarge (1 Gpu) In Terms Of Ava

Yes totally, but we’ve been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it’s easier to get one big one than many small ones, but I’ve never actually checked if that is true 🙂 Thanks anyhow!

one year ago
0 Hi Everyone, I’m Getting An Error During Model Upload To S3. The Error Shows Up In The Console Like Below And I Don’t See Any Uploaded Objects In S3:

So without the flush I got the error apparently at the very end of the script - all commands of my actual Python code had run.

one year ago
0 [Task Gets Interrupted / Aborted / Reset When In Offline Mode] For Local Testing, We Have Added A

Hi AgitatedDove14 , so it took some time but I’ve finally managed to reproduce. The issue seems to be related to writing images via Tensorboard:
    from torch.utils.tensorboard import SummaryWriter
    import torch
    from clearml import Task, Logger

    if __name__ == "__main__":
        task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
        tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
        image_tensor = torch.rand(256, 256, 3)
        for iter in range(10):
            t...

one year ago
0 [Task Gets Interrupted / Aborted / Reset When In Offline Mode] For Local Testing, We Have Added A

It might be broken for me, as I said the program works without the offline mode but gets interrupted and shows the results from above with offline mode. But there might be another issue in between of course - any idea how to debug?
The environment variable is good to know, I will try with that as well and report back.

one year ago
0 Hi Everyone, Quick Question: Is There Any Easy Way To

Unfortunately not, task.data.output just contains <tasks.Output: { "destination": "s3://some_bucket" }>, and when I convert task.data to a string and search for the desired URI, I cannot find it either.
On the other hand, putting the URL together from the task's name, ID, etc. seems to work. It might be a little unsafe if the task gets renamed or something, but otherwise it should be fine.
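
For anyone landing here, a minimal sketch of what "putting the URL together" could look like. The layout `<output_uri>/<project name>/<task name>.<task id>` is an assumption based on how uploaded files appeared in our bucket, not something confirmed in this thread, and `build_task_output_uri` is a hypothetical helper name. Check the result against a real upload before relying on it.

```python
def build_task_output_uri(output_uri: str, project_name: str,
                          task_name: str, task_id: str) -> str:
    """Assemble a task's output directory from its parts.

    Assumed layout (not confirmed by ClearML docs here):
        <output_uri>/<project name>/<task name>.<task id>
    """
    # Drop stray whitespace and a trailing slash so the join is clean.
    base = output_uri.strip().rstrip("/")
    return f"{base}/{project_name}/{task_name}.{task_id}"


if __name__ == "__main__":
    # All values below are placeholders for illustration.
    print(build_task_output_uri("s3://some_bucket", "My Project", "train", "abc123"))
```

As noted above, this breaks if the task is renamed after the files were uploaded, since the name is baked into the path.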

one year ago
0 Hi Everyone, Quick Question: Is There Any Easy Way To

I actually wanted to load a specific artifact, but didn’t think of looking through the tasks output models. I have now changed to that approach which feels much safer, so we should be all done here. Thanks!

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

I’m on Safari actually, but I just checked on Chrome (which shows this insecure connection indicator) and images are activated. Might it still be due to the non-HTTPS connection? We should get on that anyhow.

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

Ah got it - that is already the case though. I’m logged into a Google Account that can access that bucket and I can download the image by clicking on the Download link in the ClearML dashboard and by going through the GCP console to the bucket…

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

Yes and yes. Is that the issue, and would it likely go away if we hosted it via HTTPS?

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

@<1523701070390366208:profile|CostlyOstrich36> , you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I’ve re-started the docker-compose of course.

Did I miss something here?

    google.storage {
        credentials_json: "/home/.../my-crendetials.json"
    }
one year ago
0 [Webui-Based Options Injection Not Working] Hey Everyone! Since Our Training Repo Has Gotten Quite Complex, We Configure All Setup In An

Well duh, now it makes total sense! Should have checked docs or examples more closely 🙏
Yes if that works reliably then I think that option could make sense, it would have made things somewhat easier in my case - but this is just as good.

one year ago
0 Hi Everyone, I’m Getting An Error During Model Upload To S3. The Error Shows Up In The Console Like Below And I Don’t See Any Uploaded Objects In S3:

Yes makes sense, it sounded like that from the start. Luckily, the task.flush(...) way seems to work for now 🙂

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

It was related: special characters also prevented some access.
But it was, and still is, also related to an authentication problem with Google: if you open the dashboard in Chrome and go to the developer console, you see a bunch of failed links to authenticate to. If you open one of them in another tab, it shows the Google sign-in screen, and afterwards you can see the Debug Samples in the dashboard.
None of that works in Safari though, for some reason 🙂

9 months ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

Results of a bit more investigation:

The ClearML example does use the Pytorch dist package but none of the DistributedDataParallel functionality, instead, it reduces gradients “manually”. This script is also not prepared for torchrun as it launches more processes itself (w/o using the multiprocessing of Python or Pytorch.)

When running a simple example (code attached...

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

So my own repo I’m launching with either
torchrun --nproc_per_node 2 --standalone --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
or
python3 -m torch.distributed.launch --nproc_per_node 2 --master_addr 127.0.0.1 --master_port 29500 -m my_folder.my_script --some_option
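
Both launchers start one Python process per rank and export `RANK`, `LOCAL_RANK` and `WORLD_SIZE` as environment variables. A minimal sketch for telling the ranks apart from inside the script, which may help when checking which process created which task. `get_rank` and `is_main_process` are hypothetical helper names, and whether guarding task creation on the rank is the right approach for ClearML is not settled in this thread.

```python
import os


def get_rank(env=None) -> int:
    """Return the global rank exported by the launcher, or 0 if unset."""
    env = os.environ if env is None else env
    return int(env.get("RANK", "0"))


def is_main_process(env=None) -> bool:
    """True for rank 0, and also for a plain `python script.py` run."""
    return get_rank(env) == 0


if __name__ == "__main__":
    print(f"rank={get_rank()} main={is_main_process()}")
```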

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

Sorry that these issues go quite deep and chaotic - we would appreciate any help or ideas you can think of!

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch ?

one year ago
0 [Task Gets Interrupted / Aborted / Reset When In Offline Mode] For Local Testing, We Have Added A

I meant that maybe activating offline mode somehow changes something else in the runtime, and that in turn leads to the interruption. Let me try to build a minimal reproducible version 🙂

one year ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

Ok I see, that is what I thought. But do you have any idea why I am not seeing these images? I am logged into my Gmail account and into the Google Cloud Console and can access both in another tab of the same browser. Am I missing something here?

11 months ago
0 How Do I View Debug Samples Images In The Browser When The Output_Uri Is On Google Cloud Storage (

If that helps: The URL I get when I copy it out of the ClearML dashboard is the same one as is listed under “Authenticated URL” when looking up the image in Google Cloud storage. And of course that opens the image if I go to it in another tab

11 months ago
0 [Task Gets Interrupted / Aborted / Reset When In Offline Mode] For Local Testing, We Have Added A

By the way, if we don’t wrap other calls in is_offline() we get errors like “DateTime object is not serializable”, but that’s a secondary issue.

one year ago
0 [Injecting Secrets Into A Clearml Agent / Accessing

That was the missing piece - thank you!
Awesome to see all the details you have considered in ClearML 😉

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ... ? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...

one year ago
0 [ClearML With PyTorch-Based Distributed Training] Hi Everyone! Is The Combination Of ClearML With

When running on our bigger research repository which includes saving checkpoints and uploading to S3, the training ends with errors as shown below and a Killed message for the main process (I do not abort the main process manually):

2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
one year ago
0 [Task Gets Interrupted / Aborted / Reset When In Offline Mode] For Local Testing, We Have Added A

So AgitatedDove14 if we use the CLEARML_OFFLINE_MODE environment variable instead the program runs through again.
The only thing is that now we get errors of the form
0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
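
For reference, a rough sketch of how a `--no-clearml` flag could switch to the `CLEARML_OFFLINE_MODE` environment variable instead of calling `task.set_offline(...)`. The helper name `configure_clearml_offline` is hypothetical, the value `"1"` is an assumption, and the sketch assumes the variable has to be set before `clearml` is imported.

```python
import argparse
import os


def configure_clearml_offline(argv=None, env=None):
    """Set CLEARML_OFFLINE_MODE when --no-clearml is passed.

    Returns (offline_requested, remaining_args) so the rest of the
    script can parse its own options from remaining_args.
    """
    env = os.environ if env is None else env
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--no-clearml", action="store_true")
    args, remaining = parser.parse_known_args(argv)
    if args.no_clearml:
        # Assumed value; call this before `from clearml import Task`.
        env["CLEARML_OFFLINE_MODE"] = "1"
    return args.no_clearml, remaining


if __name__ == "__main__":
    offline, rest = configure_clearml_offline()
    print(f"offline={offline} remaining={rest}")
```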

one year ago