Hey Natan, good point! But I have actually set both
docker-compose with an entrypoint.sh that runs python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} --queue office
it appears in multiple places. It seems like the mapping of the pip and apt caches does work, but the access rights are now an issue (see the permissions sketch after the config below):
```
# pip cache folder mapped into docker, used for python package caching
docker_pip_cache = /clearml-cache/pip-cache
# apt cache folder mapped into docker, used for ubuntu package caching
docker_apt_cache = /clearml-cache/apt-cache
docker_internal_mounts {
    apt_cache: "/clearml-cache/apt-cache"
    pip_cache: "/clearml-cache/pip-cache"
    vcs_cache: "/clearml-cache/vcs-cache"
    venv_build: "/clearml-cache/venvs-builds"
    pip_download: "/cl...
```
The strange thing was that my agents were running in the morning but just disappeared from the ClearML server UI under Workers & Queues. So I did docker-compose down / up and then I got this error.
So I don't need docker_internal_mounts at all?
tensorflow/stream_executor/platform/default/dso_loader.cc:60] Could not load dynamic library 'libcusolver.so.10'; dlerror: libcusolver.so.10: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /usr/local/nvidia/lib:/usr/local/nvidia/lib64
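libcusolver.so.10 ships with CUDA 10.x, so this error usually means the base image carries a different CUDA version than the one TensorFlow was built against. A quick check inside the task container (a sketch; the CUDA path is an assumption):

```
# list the libcusolver versions the image actually provides
find / -name "libcusolver.so*" 2>/dev/null
# if it exists outside the default search path, expose it to the loader
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
```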
Ok, if I would like to have a different behaviour, I would need one agent per task, right?
the error you are citing happens when running clearml-agent daemon --gpus 0 --queue default --docker nvidia/cuda
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
It is working now, it seemed like I pointed to a wrong entrypoint.sh in the docker-compose file. Still strange...
Hi AgitatedDove14, I get an error when running a task on my worker. I have looked into /home/user/.clearml/venvs-builds but it is empty. Any idea why this happens? I actually don’t know what I changed to cause this issue… I’m running clearml-agent v1.0.0
clearml_agent: ERROR: Command '['python3.6', '-m', 'virtualenv', '/home/user/.clearml/venvs-builds/3.6']' returned non-zero exit status 1.
When using clearml-agent daemon --queue default --docker it is running. In this setup I always had issues as soon as I added the --gpus flag.
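When the CPU-only daemon works but the GPU flag fails, the host-side NVIDIA container toolkit is the usual suspect. A sanity check on the host, independent of ClearML (the image tag is an assumption; any CUDA base image will do):

```
# if this fails, the agent's --gpus flag cannot work either
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```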
SuccessfulKoala55 I'm currently inside the docker container to recover the ckpt files. But /root/.clearml/venvs-builds seems to be empty. Any idea where I could then find the ckpt files?
I can see the following using docker ps:
d5330ec8c47d allegroai/clearml-agent "/usr/agent/entrypoi…" 3 weeks ago Up 3 weeks clearml
I execute the following to access the container: docker exec -u root -t -i clearml /bin/bash
I went to /root/.clearml/venvs-builds but it is empty
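Rather than guessing paths, searching the whole container filesystem for checkpoints is quicker. A sketch; the .ckpt extension is assumed from the files being recovered:

```
# inside the container: list every checkpoint file regardless of location
find / -name "*.ckpt" 2>/dev/null
```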
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":
using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run...
```
using top inside the elasticsearch container shows elastic+ 20 0 17.0g 8.7g 187584 S 2.3 27.2 1:09.18 java, i.e. the 8g are reserved. So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
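To confirm the JVM really picked up the new limits rather than inferring it from top, the node stats API reports the configured heap ceiling. A sketch, assuming the container name clearml-elastic from docker ps:

```
# heap_max_in_bytes should be ~8589934592 once -Xmx8g takes effect
docker exec clearml-elastic curl -s 'localhost:9200/_nodes/stats/jvm?filter_path=nodes.*.jvm.mem.heap_max_in_bytes'
```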
```
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point https://github.com/facebookresearch/fastMRI/bl...
python3.6 -m virtualenv /home/tobias_vitt/.clearml/venvs-builds/3.6 returns StopIteration:
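One way to surface the real failure hidden behind the bare StopIteration is to rerun the same command with maximum verbosity and refresh the tooling; the upgrade step is an assumption, not a confirmed fix:

```
# rerun verbosely to see which step raises StopIteration
python3.6 -m virtualenv -vvv /home/tobias_vitt/.clearml/venvs-builds/3.6
# stale pip/virtualenv versions are a common culprit on python3.6
python3.6 -m pip install --upgrade pip virtualenv
```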
Try to restart ES and see if it helps
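For reference, a minimal way to restart only the Elasticsearch service of the ClearML server stack, assuming the standard docker-compose file under /opt/clearml:

```
cd /opt/clearml
docker-compose restart elasticsearch
```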
docker-compose down / up does not help
```
Process failed, exit code 1
task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes ...
```
since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
```
root@ubuntu:/opt/clearml# sudo docker logs clearml-elastic
OpenJDK 64-Bit Server VM warning: Option UseConcMarkSweepGC was deprecated in version 9.0 and will likely be removed in a future release.
{"type": "server", "timestamp": "2021-11-09T12:49:13,403Z", "level": "INFO", "component": "o.e.e.NodeEnvironment", "cluster.name": "clearml", "node.name": "clearml", "message": "using [1] data paths, mounts [[/usr/share/elasticsearch/data (//some_ip/clearml-server-data)]], net usable_space [3.4tb]...
```
are they in conflict?
The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case
Solving the replica issue now allowed me to get better insights into why the one index is red.
```
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a...
```
I already increased the memory to 8GB after reading about similar issues here on the Slack
Just making sure, how exactly did you do that?
docker-compose down
```
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
```
docker-compose up -d