Thank you very much, good to know!
But it is not possible to aggregate scalars, right? Like taking the mean, median or max of the scalars of multiple experiments.
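Roughly what I mean, as a sketch (the task ids and the "loss" title/series names are placeholders, and I am assuming get_reported_scalars() is the right call to pull the values):

import numpy as np
from clearml import Task

task_ids = ["<id-1>", "<id-2>", "<id-3>"]  # placeholder ids of the experiments to aggregate
values = []
for task_id in task_ids:
    scalars = Task.get_task(task_id=task_id).get_reported_scalars()
    # take the last reported value of the "loss" scalar (title/series are placeholders)
    values.append(scalars["loss"]["loss"]["y"][-1])

print("mean:", np.mean(values), "median:", np.median(values), "max:", np.max(values))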
So it definitely seems to be a problem with docker and not with clearml. However, I do not get why it works for you but on none of my machines (all Ubuntu 20.04 with docker 20.10).
No. Here is a better example. I have two types of workstations: type X can execute tasks of type A and B, and type Y can only execute tasks of type B. This could be the case if, for example, type X workstations have more VRAM, newer drivers, etc.
I have two queues. Queue A and Queue B. I submit tasks of type A to queue A and tasks of type B to queue B.
Here is what can happen:
Enqueue the first task of type B. Workstations of type X will run this task. Enqueue the second task of type A. Workstation ...
[2021-05-07 10:53:00,566] [9] [WARNING] [elasticsearch] POST ` [status:N/A request:60.061s]
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 445, in _make_request
six.raise_from(e, None)
File "<string>", line 3, in raise_from
File "/usr/local/lib/python3.6/site-packages/urllib3/connectionpool.py", line 440, in _make_request
httplib_response = conn.getresponse()
File "/usr/lib64/python3.6/http/client.py", lin...
So I just tried again, but with manual deleting via Web UI.
Yes, from the documentation:
Creates a new Task (experiment) if:
The Task never ran before. No Task with the same task_name and project_name is stored in ClearML Server.
The Task has run before (the same task_name and project_name), and (a) it stored models and / or artifacts, or (b) its status is Published, or (c) it is Archived.
A new Task is forced by calling Task.init with reuse_last_task_id=False.
Otherwise, the already initialized Task object for the same task_nam...
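That is how I read it; forcing a fresh task then just means passing the flag explicitly (project/task names taken from the example above):

from clearml import Task

# force a new Task instead of reusing the previous (unfinished, artifact-free) one
task = Task.init(
    project_name="examples",
    task_name="artifacts example",
    reuse_last_task_id=False,
)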
I am wondering where to put my experiment logic so that it gets lazily executed and not at task-definition time (i.e., in get_task_experiment(), how do I get my experiment logic in there without running it?).
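Something like this is what I have in mind; get_task_experiment() is my own hypothetical helper, and experiment_logic is just a placeholder name:

from clearml import Task

def experiment_logic():
    # the actual training/evaluation code; should only run at execution time
    ...

def get_task_experiment():
    task = Task.init(project_name="examples", task_name="lazy example")
    # hand back the callable instead of calling it here,
    # so defining the task does not execute the experiment
    return task, experiment_logic

if __name__ == "__main__":
    task, run = get_task_experiment()
    run()  # only now does the experiment logic actually execute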
Alright, that's unfortunate. But thank you very much!
When executed remotely it is command="[...]", but locally it is command='train', as it is supposed to be.
But here is the funny thing:
channels:
- pytorch
- conda-forge
- defaults
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
This installs the GPU build.
SuccessfulKoala55 I just had the issue again. The logs show nothing of interest. It looks like OOM to me, but I will test this again with a much larger swap, so the server only slows down instead of killing something. Unfortunately, the kernel logs also do not show much (maybe my server logs are misconfigured; I am no expert).
What is interesting, though, is that docker showed only my nginx, minio and docker-registry containers as exited, while all the clearml containers were still running. I restarted ...
Any idea why deletion of artifacts on my second fileserver does not work?
fileserver_datasets:
  networks:
    - backend
    - frontend
  command:
    - fileserver
  container_name: clearml-fileserver-datasets
  image: allegroai/clearml:latest
  restart: unless-stopped
  volumes:
    - /opt/clearml/logs:/var/log/clearml
    - /opt/clearml/data/fileserver-datasets:/mnt/fileserver
    - /opt/clearml/config:/opt/clearml/config
  ports:
    - "8082:8081"
ClearML successfu...
I guess this is from clearml-server and seems to be bottlenecking artifact transfer speed.
Mhhm, good hint! Unfortunately, I cannot see anywhere in the logs when the server creates a delete request.
Okay, thanks for explaining!
Yeah, but doesn't this feature make sense on a task level? If I remember correctly, some dependencies sometimes require different pip versions, and dependencies are defined per task.
Can you tell me how to create tasks correctly? PipelineController.add_step takes the task id/task name, but I would rather just define a function that returns the task directly, since the base task may not already be on the clearml-server.
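For reference, this is how I understand the current add_step API; the project/task names are placeholders, and it assumes the base task already exists on the server, which is exactly what I would like to avoid:

from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")
pipe.add_step(
    name="stage_train",
    base_task_project="examples",      # placeholder project of the base task
    base_task_name="base train task",  # placeholder name; must already exist on the server
)
pipe.start()  # launches the controller (services queue by default)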
Nvm. I forgot to start my agent with --docker. So here comes my follow-up question: it seems like there is no way to declare that a Task requires docker support from an agent, right?
And how do I specify this fileserver as output_uri?
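I would have expected something along these lines, with the hostname being a placeholder for wherever clearml-fileserver-datasets is reachable on port 8082:

from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="artifacts example",
    # placeholder host; the second fileserver from the compose file listens on 8082
    output_uri="http://my-server:8082",
)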
# Connecting ClearML with the current process,
# from here on everything is logged automatically
from clearml import Task
# running_remotely() and Timer come from the rest of my script (omitted here)

task = Task.init(project_name="examples", task_name="artifacts example")
task.set_base_docker(
    "my_docker",
    docker_arguments="--memory=60g --shm-size=60g -e NVIDIA_DRIVER_CAPABILITIES=all",
)

if not running_remotely():
    task.execute_remotely("docker", clone=False, exit_process=True)

timer = Timer()
with timer:
    # add and upload Numpy Object (stored as .npz file)
    task.upload_a...
Obviously, a lot of stuff is missing in my examples. I just want to show that the user should be able to replicate Task.init easily, so it can be configured in every way, while still making use of the magic that clearml has for everything that does not differ from the convenient default path.
Mhhm, now conda env creation takes forever, since it is probably resolving conflicts. At least that is what happened when I tried to install my environment manually.
I just manually went into the docker container and ran python -m venv env --system-site-packages and activated the virtual env.
When I then run pip list, it correctly shows the preinstalled packages, including torch 1.12.0a0+2c916ef.