Happy to, and thanks!
When running on our bigger research repository, which includes saving checkpoints and uploading to S3, the training ends with the errors shown below and a Killed message for the main process (I do not abort the main process manually):
2023-01-26 17:37:17,527 INFO: Save the latest model.
2023-01-26 17:37:19,158 - clearml.storage - INFO - Starting upload: /tmp/.clearml.upload_model_cvqpor8r.tmp => glass-clearml/RealESR/Glass-ClearML Demo/[Lambda] FMEN distributed check, v10 fileserver u...
One correction here, though: while the secret is indeed hidden in the logs, it is still visible in the “Execution” tab of the experiment, see the two screenshots below.
Once again, I set them with task.set_base_docker(docker_arguments=["..."])
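Roughly how I call it, just as a sketch (project and task names, plus the actual argument value, are placeholders):
` from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="docker args demo")  # hypothetical names
# record the docker arguments locally so a cloned remote run picks them up
task.set_base_docker(
    docker_arguments=["-e MY_ENV_VAR=value"],  # placeholder argument
) `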
Sorry that these issues are quite deep and chaotic - we would appreciate any help or ideas you can think of!
Yes, that for example, or some other way to get the credentials over to the container safely without them showing up in the checked-in code or the web UI.
Results of a bit more investigation:
The ClearML example does use the PyTorch dist package but none of the DistributedDataParallel functionality; instead, it reduces gradients “manually”. The script is also not prepared for torchrun, as it launches additional processes itself (without using the multiprocessing facilities of Python or PyTorch). See the sketch below for what I mean.
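This is my reading of the “manual” reduction, not the exact code from the example:
` import torch.distributed as dist

def average_gradients(model):
    # average each parameter's gradient across all workers,
    # instead of relying on DistributedDataParallel to do it
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            dist.all_reduce(param.grad.data, op=dist.ReduceOp.SUM)
            param.grad.data /= world_size `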
When running a simple example (code attached...
Hi John, thanks for getting back to me!
So it shows up in the UI as shown below. It happens both when “recording” the local run on Mac and on Linux.
Hi AgitatedDove14 , so it took some time but I’ve finally managed to reproduce. The issue seems to be related to writing images via TensorBoard:
` from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task, Logger
if __name__ == "__main__":
task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
image_tensor = torch.rand(256, 256, 3)
for iter in range(10):
t...
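In case the snippet gets cut off above, here is my best guess at the full reproduction - the last line of the loop is truncated, so the add_image call is an assumption on my part:
` from torch.utils.tensorboard import SummaryWriter
import torch
from clearml import Task

if __name__ == "__main__":
    task = Task.init(project_name="ClearML-Debug", task_name="[Mac] TB Logger, offline")
    tb_logger = SummaryWriter(log_dir="tb_logger/demo/")
    image_tensor = torch.rand(256, 256, 3)
    for it in range(10):
        # assumption: the truncated loop body writes the image via TensorBoard
        tb_logger.add_image("demo/image", image_tensor, global_step=it, dataformats="HWC")
    tb_logger.close() `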
Hi @<1523701087100473344:profile|SuccessfulKoala55> , sorry there was a mistake on my end - clearml.conf pointed to the wrong URL 🙈
By the way, if we don’t wrap other calls in is_offline() we get errors like “DateTime object is not serializable”, but that’s a secondary issue.
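For context, this is roughly how we guard those calls (a sketch; the guarded call itself is just a placeholder):
` from clearml import Task

task = Task.init(project_name="ClearML-Debug", task_name="offline guard demo")  # hypothetical names
if not Task.is_offline():
    # skip calls that fail to serialize in offline mode (placeholder call)
    task.upload_artifact("stats", artifact_object={"epoch": 1}) `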
Hi SuccessfulKoala55 , thanks for getting back to me!
In the docs of Task.set_base_docker() it says “When running remotely the call is ignored”. Does that mean that the call is executed when running locally to “record” the arguments, and then, when I clone the experiment and run it remotely, the call is ignored and the recorded values are used?
AgitatedDove14 maybe to come at this from a broader angle:
Is ClearML combined with DataParallel or DistributedDataParallel officially supported / should that work without many adjustments? If so, would it be started via python ... or via torchrun ... ? What about remote runs, how will they support the parallel execution? To go even deeper, what about the machines started via ClearML Autoscaler? Can they either run multiple agents on them and/or start remote distribu...
@<1523701070390366208:profile|CostlyOstrich36> thank you, now everything works so far!
Last thing: Is there any way to change all the links in the new ClearML server such that an artifact that was previously under s3://… is now taken from gs://… ? The actual data is already available under the gs:// link, of course.
SuccessfulKoala55 AgitatedDove14 So I’ve tried the approach and it does work, however, this of course results in the credentials being visible in the ClearML web interface output, which comes close to just hard-coding them in…
Is there any way to send the secrets safely?
Is there any way to access the clearml.conf file contents from within code? (afaik, the file does not get sent over to the container - otherwise I could just yml-read it myself…)
Is there, for example, a good way to start the new version without any legacy database and then migrate the data more or less manually? Is it enough to migrate the MongoDB to see our old tasks again, and are the schemas still compatible?
Hi Jake, yes I’d love to! Just a question: how clean and complete does the example need to be? For example, this code relies on you building a correct Machine Image on GCP (which is somewhat unrelated to ClearML) and it does not get the logs from the agent instances - is that still good enough?
So AgitatedDove14 , if we use the CLEARML_OFFLINE_MODE environment variable instead, the program runs through again.
The only thing is that now we get errors of the form
` 0%| | 0/18 [00:00<?, ?image/s]ClearML running in offline mode, session stored in /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486
2022-11-07 07:49:06,986 - clearml.metrics - WARNING - Failed uploading to /home/manuel/.clearml/cache/offline/offline-167ceb1cd3c946df8abc7206b781b486/...
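For completeness, this is roughly how we set the variable (a sketch; project and task names are placeholders):
` import os
os.environ["CLEARML_OFFLINE_MODE"] = "1"  # must be set before Task.init

from clearml import Task
task = Task.init(project_name="ClearML-Debug", task_name="offline run") `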
New state: After starting with the old YML again, the web app looks new (presumably because the image allegroai/clearml:latest is used), but the server version still lists WebApp: 2.1.0-664 • Server: 1.9.2-317 • API: 2.23.
Creating tasks and reporting things works again, but I regularly see the UI error shown in the attached screenshot. Any way to resolve this?
Sorry to ask again, but the values are still showing up in the WebUI console logs this way (see screenshot.)
Here is the config that I paste into the EC2 Autoscaler Setup:
` agent {
extra_docker_arguments: ["-e AWS_ACCESS_KEY_ID=XXXXXX", "-e AWS_SECRET_ACCESS_KEY=XXXXXX"]
hide_docker_command_env_vars {
enabled: true
extra_keys: ["AWS_SECRET_ACCESS_KEY"]
parse_embedded_urls: true
}
} `
Never mind, it came from setting the options wrong, it has to be ...
@<1523701070390366208:profile|CostlyOstrich36> , you mean the ClearML server needs access to Cloud Storage in its clearml.conf file?
Just tried it by creating a ~/clearml.conf file and setting the entry as below - unfortunately the same result. I’ve re-started the docker-compose of course.
Did I miss something here?
google.storage {
credentials_json: "/home/.../my-crendetials.json"
}
That was the missing piece - thank you!
It’s awesome how many details you have considered in ClearML 😉
Hi @<1523701070390366208:profile|CostlyOstrich36> , thank you for answering!
We are upgrading from v. 1.9 or so (I think) to the most recent one.
Attached below are three logs from the API server, Elasticsearch, and the file server - does this help to debug?
Also, as an update, I tried to start the containers one by one and resolve the errors that came up. The only real one I found was that Redis crashed when loading the previous database. Since I figured I wouldn’t necessarily need the cache, I cleared the dump file, and then all the services started - this all refers to the old server version.
When trying the same with the most recent docker-compose.yml , the services all started by themselves, but when I start the full docker compose, the das...
Ok great! I will debug starting with a simpler training script.
Just as a last question, is torchrun also supported rather than the (now deprecated but still usable) torch.distributed.launch ?
To recap, the server started up on GCP as expected before migrating the data over. The migration was done by
- deleting the current data: sudo rm -fR /opt/clearml/data/*
- unpacking the backup: sudo tar -xzf ~/clearml_backup_data.tgz -C /opt/clearml/data
- setting permissions: sudo chown -R 1000:1000 /opt/clearml
Unfortunately not, task.data.output just contains <tasks.Output: { "destination": "s3://some_bucket" }> and when I convert task.data to a string and search for the desired URI, I cannot find it either.
But on the other hand, putting the URL together from the task’s name, id, etc. seems to work (see the sketch below) - it might be a little unsafe if the task gets renamed or something, but otherwise it should be fine.
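Roughly what I mean by putting the URL together (a sketch; the bucket name and path layout are assumptions based on how our outputs look, not a documented ClearML API):
` from clearml import Task

task = Task.get_task(task_id="0123456789abcdef")  # hypothetical task ID
# layout observed in our setup: <bucket>/<project>/<task name>.<task id>/models
model_dir = "s3://some_bucket/{}/{}.{}/models".format(
    task.get_project_name(), task.name, task.id
) `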
Yes and yes - is that the issue, and would it likely go away if we host it via HTTPS?
SuccessfulKoala55 just in case you have any more thoughts, but we could also continue as is 😊
Yes totally, but we’ve been having problems getting these GPUs specifically (even manually in the EC2 console and across regions), so I thought maybe it’s easier to get one big one than many small ones, but I’ve never actually checked whether that is true 🙂 Thanks anyhow!