
Setting ultralytics workers=0 seems to work, as per the thread above!
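For reference, the same fix can be pinned in an Ultralytics override file instead of the train call. This is a sketch assuming the standard train-argument names from Ultralytics' default.yaml; the file name custom_cfg.yaml is hypothetical:

```yaml
# custom_cfg.yaml (hypothetical) — pass via model.train(cfg="custom_cfg.yaml", ...)
workers: 0  # disable DataLoader worker subprocesses; sidesteps the /dev/shm limit inside the container
```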
Collecting pip<20.2
Using cached pip-20.1.1-py2.py3-none-any.whl (1.5 MB)
Installing collected packages: pip
Attempting uninstall: pip
Found existing installation: pip 20.0.2
Not uninstalling pip at /usr/lib/python3/dist-packages, outside environment /usr
Can't uninstall 'pip'. No files were found to uninstall.
It's hanging at
Installing collected packages: zipp, importlib-resources, rpds-py, pkgutil-resolve-name, attrs, referencing, jsonschema-specifications, jsonschema, certifi, urllib3, idna, charset-normalizer, requests, pyparsing, PyYAML, six, pathlib2, orderedmultidict, furl, pyjwt, psutil, python-dateutil, platformdirs, distlib, filelock, virtualenv, clearml-agent
Successfully installed PyYAML-6.0.2 attrs-23.2.0 certifi-2024.7.4 charset-normalizer-3.3.2 clearml-agent-1.8.1 distlib-0.3....
I am running the agent with: clearml-agent daemon --queue training
Thank you for getting back to me
I have set agent{cuda_version: 11.2}
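For anyone following along, that setting lives in the agent's clearml.conf; a sketch, assuming a default config layout:

```
# ~/clearml.conf — force the agent's CUDA version instead of auto-detecting it
agent {
    cuda_version: "11.2"
}
```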
What does ClearML do differently that leads to a failure here?
Looks okay there
If I run nvidia-smi, it returns valid output and reports CUDA version 11.2
Isn't the problem that CUDA 12 is being installed?
Solved that by setting docker_args=["--privileged", "--network=host"]
WARNING:clearml_agent.helper.package.requirements:Local file not found [torch-tensorrt @ file:///opt/pytorch/torch_tensorrt/py/dist/torch_tensorrt-1.3.0a0-cp38-cp38-linux_x86_64.whl], references removed
The final answer was:
docker="ultralytics/ultralytics:latest",
docker_args=["--network=host", "--ipc=host"],
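The same default can also be made permanent on the agent side rather than per task; a sketch, assuming the standard agent.default_docker section of clearml.conf:

```
# ~/clearml.conf — default container for every task this agent runs
agent.default_docker {
    image: "ultralytics/ultralytics:latest"
    arguments: ["--network=host", "--ipc=host"]
}
```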
@CostlyOstrich36 I don't think it's related to disk, I think it's related to shm
But that doesn't explain why the model JSON files are missing.
@CostlyOstrich36 do you have any ideas? Thank you
We are getting the dataset like this:
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
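To narrow down the missing model JSON files, it may help to list what the local copy actually contains right after get_local_copy(). A minimal sketch — verify_dataset and the required file names are hypothetical:

```python
import os

def verify_dataset(dataset_dir, required=("data.yaml",)):
    """Return the names from `required` that are missing anywhere under dataset_dir."""
    present = set()
    for _root, _dirs, files in os.walk(dataset_dir):
        present.update(files)
    return [name for name in required if name not in present]
```

Calling it with required=("data.yaml", "model.json") right after get_local_copy() would show whether the files were lost during the download or only later during the copy.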
On local I am able to import ultralytics in this docker image:
docker run --gpus 1 -it nvcr.io/nvidia/pytorch:22.12-py3
# pip install opencv-python==4.6.* ultralytics
# python
>>> from ultralytics import YOLO
>>>
Trying this:
clearml_dataset = Dataset.get(
    dataset_id=config.get("dataset_id"), alias=config.get("dataset_alias")
)
dataset_dir = clearml_dataset.get_local_copy()
destination_dir = os.path.join("/datasets", os.path.basename(dataset_dir))
shutil.copytree(dataset_dir, destination_dir)
results = model.train(
    data=destination_dir + "/data.yaml", epochs=config.get("epochs"), device=0
)
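One caveat with the copy step above: shutil.copytree raises FileExistsError if /datasets/<name> is left over from a previous run. On Python 3.8+, dirs_exist_ok=True tolerates reruns; a sketch, where copy_dataset is a hypothetical helper name:

```python
import os
import shutil

def copy_dataset(dataset_dir, datasets_root="/datasets"):
    """Copy the dataset into datasets_root, overwriting files left by earlier runs."""
    destination_dir = os.path.join(datasets_root, os.path.basename(dataset_dir))
    # dirs_exist_ok=True (Python 3.8+) keeps reruns from failing on an existing dir
    shutil.copytree(dataset_dir, destination_dir, dirs_exist_ok=True)
    return destination_dir
```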
I am trying Task.create like so:
task = Task.create(
    script="test_gpu.py",
    packages=["torch"],
)