Code to enqueue:
from clearml import Task

task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    docker_args=["--network=host", "--ipc=host", "--shm_size=55G"],
)
task.enqueue(task, "default")
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/script.py", line 36, in <module>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protoco...
Thanks @<1523701205467926528:profile|AgitatedDove14> , will take a look
Setting agent.venvs_cache path back to ~/.clearml/venvs-cache seems to have done the trick!
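For reference, that corresponds to the venvs_cache section of the agent's clearml.conf (a minimal sketch, assuming the default config layout):
agent {
    venvs_cache: {
        # point venv caching back at the default location
        path: ~/.clearml/venvs-cache
    }
}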
Our current setup is one clearml agent per GPU on the same machine
Seems to work!
Trying this:
import os
import shutil

from clearml import Dataset

# config (a dict-like object) and model (an ultralytics YOLO instance) are defined earlier in the script
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
destination_dir = os.path.join("/datasets", os.path.basename(dataset_dir))
shutil.copytree(dataset_dir, destination_dir)
results = model.train(
    data=destination_dir + "/data.yaml", epochs=config.get("epochs"), device=0
)
How to replicate on ClearML:
task = Task.create(
    script="myscript.py",
    packages=["opencv-python==4.6.*", "ultralytics"],
    docker="nvcr.io/nvidia/pytorch:22.12-py3",
)
Contents of myscript.py:
from ultralytics import YOLO
Although that's not ideal as it turns off CPU parallelisation
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors
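Something like this (a minimal sketch, assuming a Hugging Face transformers Trainer setup; output_dir and the other values are placeholders):
from transformers import TrainingArguments

# save plain PyTorch checkpoints instead of safetensors files
training_args = TrainingArguments(
    output_dir="outputs",       # placeholder output directory
    save_safetensors=False,
)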
Setting ultralytics workers=0 seems to work as per the thread above!
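i.e. roughly like this (a minimal sketch; the checkpoint and data.yaml paths are placeholders):
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint
results = model.train(
    data="/datasets/my_dataset/data.yaml",  # placeholder dataset config
    epochs=10,
    device=0,
    workers=0,  # no dataloader worker processes, avoids the shm error
)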
Thank you so much for your help @<1523701205467926528:profile|AgitatedDove14> !
@<1523701070390366208:profile|CostlyOstrich36> I don't think it's related to disk, I think it's related to shm
On local I am able to import ultralytics in this docker image:
docker run --gpus 1 -it nvcr.io/nvidia/pytorch:22.12-py3
# pip install opencv-python==4.6.* ultralytics
# python
>>> from ultralytics import YOLO
>>>
What does ClearML do differently that leads to a failure here?
To run both the agent and the deployment on the same machine, adding --network=host to the run arguments solved it!
@<1523701070390366208:profile|CostlyOstrich36> I'm now running the agent with --docker, and I'm using Task.create(docker="nvidia/cuda:11.0.3-cudnn8-runtime-ubuntu20.04")
pip install --pre torchvision --force-reinstall --index-url None
Hi @<1523701070390366208:profile|CostlyOstrich36> I am not specifying a version 🙂
Full log for the failed clone
Hey, yes I can see machine statistics on the experiments themselves
I am trying task.create like so:
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)