So it definitely seems to be a problem with Docker and not with ClearML. However, I do not get why it works for you but on none of my machines (all Ubuntu 20.04 with Docker 20.10).
No. Here is a better example. I have two types of workstations: Type X can execute tasks of type A and B. Type Y can execute tasks of type B. This could be the case if type X workstations have for example more VRAM, newer drivers, etc...
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
Enqueue the first task of type B. Workstations of type X will run this task. Enqueue the second task of type A. Workstation ...
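Here is roughly how I set up the agents for that (just a sketch; the queue names and the --docker flag are only how I happen to run it):

```
# Type X workstation: its agent listens to both queues, so it can pick up task types A and B
clearml-agent daemon --queue queue_a queue_b --docker

# Type Y workstation: its agent listens only to queue B, so it never gets type A tasks
clearml-agent daemon --queue queue_b --docker
```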
[2021-05-07 10:53:00,566] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.061s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", lin...
So I just tried again, but with manual deleting via Web UI.
I am wondering where to put my experiment logic so that it gets lazily executed and not at task-definition time (i.e. how do I get my experiment logic into get_task_experiment() without running it there).
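Something like this is what I am after (just a sketch of the idea; run_experiment is a placeholder for my logic, and I am assuming execute_remotely only returns once the task actually runs on the worker):

```python
from clearml import Task

def run_experiment():
    # placeholder for the actual experiment logic; it must not run at task-definition time
    ...

task = Task.init(project_name="examples", task_name="lazy experiment")
# locally this enqueues the task and exits the process; on the agent it is a no-op and execution continues
task.execute_remotely(queue_name="default", clone=False, exit_process=True)
run_experiment()  # only reached on the worker, so the logic is executed lazily
```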
Alright, that's unfortunate. But thank you very much!
With remote execution it is command="[...]", but locally it is command='train', like it is supposed to be.
But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
This installs the GPU build.
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down but does not kill anything. Unfortunately, the kernel logs also do not show much (maybe I have my server logs misconfigured, I am no expert).
What is interesting, though, is that docker showed only my nginx, minio and docker-registry containers as having exited, while all the clearml containers were still running. I restarted ...
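In case it is useful, this is roughly how I plan to check for OOM kills next time (assuming journalctl is available on the host; the commands are just a sketch):

```
# look for OOM-killer entries in the kernel log
dmesg -T | grep -i -E "out of memory|killed process"
journalctl -k --since "today" | grep -i oom

# show which containers exited and when
docker ps -a --format "table {{.Names}}\t{{.Status}}"
```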
Any idea why deletion of artifacts on my second fileserver does not work?
fileserver_datasets:
  networks:
    - backend
    - frontend
  command:
    - fileserver
  container_name: clearml-fileserver-datasets
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver-datasets:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
  ports:
    - "8082:8081"
ClearML successfu...
I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.
Mhhm, good hint! Unfortunately, I can see nowhere in the logs when the server creates a delete request.
Okay, thanks for explaining!
Yea, but doesn't this feature make sense on a task level? If I remember correctly, some dependencies will sometimes require different pip versions, and dependencies are defined on a per-task basis.
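For context, as far as I understand this currently lives at the agent level in clearml.conf, e.g. (the value here is just an example):

```
agent {
    package_manager {
        # applies to every task this agent runs, not per task
        pip_version: "<20.2"
    }
}
```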
Can you tell me how to create tasks correctly? PipelineController.add_step takes a task-id/task-name, but I would rather just define a function that returns the task directly, since the base task may not already be on the clearml-server.
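What I would like is something along these lines (just a sketch; I am assuming PipelineController.add_function_step works roughly like this, and all names are illustrative):

```python
from clearml import PipelineController

def step_one(param: int = 1):
    # the controller turns this function into a task, so no base task
    # has to exist on the clearml-server beforehand
    return param * 2

pipe = PipelineController(name="pipeline demo", project="examples", version="0.1")
pipe.add_function_step(
    name="step_one",
    function=step_one,
    function_kwargs=dict(param=3),
    function_return=["doubled"],
)
pipe.start(queue="default")
```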
Nvm. I forgot to start my agent with --docker. So here comes my follow-up question: it seems like there is no way to define that a Task requires docker support from an agent, right?
And how do I specify this fileserver as the output_uri?
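Something like this is what I have in mind (a sketch; the host name is a placeholder, and 8082 is the port mapped in the compose snippet above):

```python
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="artifacts example",
    output_uri="http://my-clearml-host:8082",  # artifacts/models go to the second fileserver
)
```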
# Connecting ClearML with the current process,
# from here on everything is logged automatically
task = Task.init(project_name="examples", task_name="artifacts example")
task.set_base_docker(
    "my_docker",
    docker_arguments="--memory=60g --shm-size=60g -e NVIDIA_DRIVER_CAPABILITIES=all",
)
if not running_remotely():
    task.execute_remotely("docker", clone=False, exit_process=True)
timer = Timer()
with timer:
    # add and upload Numpy Object (stored as .npz file)
    task.upload_a...
Obviously, a lot of stuff is missing in my examples. I just want to show that the user should be able to replicate Task.init easily, so it can be configured in every way while still making use of the magic that clearml has for everything that does not differ from the comfort way.
Mhhm, now conda env creation takes forever, since it probably resolves conflicts. At least that is what happened when I tried to manually install my environment.
I just manually went into the docker container and ran python -m venv env --system-site-packages and activated the virtual env.
When I then run pip list, it correctly shows the preinstalled packages, including torch 1.12.0a0+2c916ef.
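For reference, the exact steps inside the container were roughly:

```
python -m venv env --system-site-packages
source env/bin/activate
pip list    # correctly shows the preinstalled packages, including torch 1.12.0a0+2c916ef
```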
Hi TimelyMouse69 Thank you for your answer.
I use 3.10.8 locally and 3.10.6 remotely. Everything is run in a docker container, locally and remotely on the docker-agent (exactly the same docker image).
Thank you for looking into the disappearing dev. It seems like this is the reason why pip tries to install a stable version of 1.14, which only exists as a nightly build.
btw: Could you check whether agent.package_manager.system_site_packages is true or false in your config and in the summary that the agent gives before execution?
I start my agent in --foreground mode for debugging and it clearly shows false, but in the summary that the agent prints before the task is executed, it shows true.
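This is the clearml.conf section I am talking about (the value here is just for illustration):

```
agent {
    package_manager {
        # if true, the created venv inherits the packages already installed in the docker image
        system_site_packages: true
    }
}
```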
You mean I can add exactly what you wrote (--extra-index-url, clearml, torch == 1.14.0.dev20221205+cu117, torchvision == 0.15.0.dev20221205+cpu) to the installed packages section?
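i.e. the Installed Packages section would then look roughly like this (the index URL is left as a placeholder, since it is not spelled out above):

```
--extra-index-url <nightly index URL from your message>
clearml
torch == 1.14.0.dev20221205+cu117
torchvision == 0.15.0.dev20221205+cpu
```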