
Try save_safetensors=False in TrainingArguments. Not sure if ClearML supports safetensors.
ERROR: Unexpected bus error encountered in worker. This might be caused by insufficient shared memory (shm).
Traceback (most recent call last):
  File "/root/.clearml/venvs-builds/3.10/task_repository/script.py", line 36, in <module>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/queues.py", line 244, in _feed
    obj = _ForkingPickler.dumps(obj)
  File "/opt/conda/lib/python3.10/multiprocessing/reduction.py", line 51, in dumps
    cls(buf, protoco...
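Since the bus error above points at shared memory, one quick check is how much space the tmpfs at /dev/shm actually has inside the container. A minimal sketch, assuming a Linux container; shm_free_gib is a hypothetical helper name, not part of ClearML or PyTorch:

```python
import os
import shutil


def shm_free_gib(path="/dev/shm"):
    """Return free space at `path` in GiB, or None if the path does not exist."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return usage.free / 2**30


if __name__ == "__main__":
    free = shm_free_gib()
    if free is None:
        print("/dev/shm not found (not a Linux container?)")
    else:
        print(f"/dev/shm free: {free:.2f} GiB")
```

Running this at the top of the training script would show whether the agent's container really received the shared memory size that was requested.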
DEBUG Installing build dependencies ... done
Getting requirements to build wheel ... error
error: subprocess-exited-with-error
× Getting requirements to build wheel did not run successfully.
│ exit code: 1
╰─> [21 lines of output]
    Traceback (most recent call last):
      File "/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/pip/_vendor/pyproject_hooks/_i...
Seems to work!
Hey, yes I can see machine statistics on the experiments themselves
@<1523701070390366208:profile|CostlyOstrich36> I don't think it's related to disk, I think it's related to shm
I think it might be related to the new run overwriting in this location
How are you getting:
beautifulsoup4 @ file:///croot/beautifulsoup4-split_1681493039619/work
This comes with the docker image ultralytics/ultralytics:latest
The original run completes successfully, it's only the runs cloned from the GUI which fail
agent.package_manager.pip_version=""
Trying this:
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
destination_dir = os.path.join("/datasets", os.path.basename(dataset_dir))
shutil.copytree(dataset_dir, destination_dir)
results = model.train(
    data=destination_dir + "/data.yaml", epochs=config.get("epochs"), device=0
)
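If the destination under /datasets already exists from an earlier run, shutil.copytree raises FileExistsError by default. A sketch of a copy that tolerates an existing destination, assuming Python 3.8+ for dirs_exist_ok; copy_dataset is a hypothetical helper, and whether an existing directory is really what breaks the cloned runs is unconfirmed:

```python
import os
import shutil


def copy_dataset(src_dir, dest_root):
    """Copy src_dir under dest_root, reusing the destination if it already exists."""
    dest_dir = os.path.join(dest_root, os.path.basename(os.path.normpath(src_dir)))
    # dirs_exist_ok=True (Python 3.8+) copies into an existing tree instead of raising
    shutil.copytree(src_dir, dest_dir, dirs_exist_ok=True)
    return dest_dir
```

With this, destination_dir = copy_dataset(dataset_dir, "/datasets") would replace the os.path.join and shutil.copytree lines.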
Resetting and enqueuing task which has built successfully also fails 😞
As I get a bunch of these warnings in both of the clones that failed
Hi @<1523701205467926528:profile|AgitatedDove14>
ClearML Agent 1.9.0
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
What I don't understand is how to tell ClearML to install this version of pytorch and torchvision, with cu118.
Although that's not ideal as it turns off CPU parallelisation
Thank you for your help @<1523701205467926528:profile|AgitatedDove14>
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas?
Container nvcr.io/nvidia/pytorch:22.12-py3
But that doesn't explain why the model JSON files are missing.
@<1523701070390366208:profile|CostlyOstrich36> do you have any ideas? Thank you
I can install on the server with this command
Our current setup is one clearml agent per GPU on the same machine
Code to enqueue
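from clearml import Task

task = Task.create(
    script="script.py",
    docker="ultralytics/ultralytics:latest",
    # Docker run flags: host networking, host IPC namespace, 55G shared memory
    docker_args=["--network=host", "--ipc=host", "--shm-size=55G"],
)
# Task.enqueue is a classmethod that takes the task and the target queue name
Task.enqueue(task, queue_name="default")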
Setting ultralytics workers=0 seems to work as per the thread above!