```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
When using `clearml-agent daemon --queue default --docker` it runs. In this case I always had some issues when adding the `--gpus` flag.
So I don't need docker_internal_mounts at all?
It is working now; it seems I was pointing to the wrong entrypoint.sh in the docker-compose file. Still strange...
probably found the issue
I'm running the following agent:
```
clearml-agent --config-file /clearml-cache/config/clearml-cpu.conf daemon --queue cpu default services --docker ubuntu:20.04 --cpu-only --services-mode 4 --detached
```
The goal is to have an agent that can run multiple CPU-only tasks at the same time. I noticed that when enqueueing multiple tasks, all except one stay pending until the first one has finished downloading all packages and started executing code. Only then do the tasks switch one by one to "run...
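For context, the enqueueing side could look roughly like the sketch below using the clearml-task CLI; the project, script and task names are placeholders, and the `cpu` queue is assumed to be the one served by the agent above.

```bash
# Hypothetical repro: push several CPU-only tasks into the same queue served
# by the services-mode agent (project, script and task names are placeholders).
for i in 1 2 3 4; do
  clearml-task \
    --project cpu-tests \
    --name "cpu-task-$i" \
    --script train.py \
    --queue cpu
done
```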
Yes, this one is running in a venv and not docker, because I had some issues with CUDA and docker. There is virtualenv==20.4.6 in the requirements.txt. I think it broke after installing clearml-serving in the same env.
We run a lot of pipelines that are CPU-only with some parallel steps. It's just about improving the execution time.
Can you send a more comprehensive log? Perhaps there are other related messages.
Which logs do you need?
The output seen above indicates that the index is corrupt and probably lost, but that is not necessarily the case.
Yes, this happened when the disk got filled up to 100%
I already increased the memory to 8GB after reading about similar issues here on the Slack.
Just making sure, how exactly did you do that?
```
docker-compose down
```
with the elasticsearch service in the docker-compose file changed to:
```yaml
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
```
and then:
```
docker-compose up -d
```
Did you wait for all the other indices to reach yellow status?
yes I waited until everything was yellow
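For reference, the index health can be checked directly against Elasticsearch (assuming the default port 9200 on the host running the container):

```bash
# Overall cluster status (green / yellow / red)
curl -s 'localhost:9200/_cluster/health?pretty'
# Per-index health overview
curl -s 'localhost:9200/_cat/indices?v&h=health,status,index'
```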
I will try to recover it, but anyway the learning is to fully separate the fileserver and any output location from mongo, redis and elastic. Also, maybe it makes sense to improve the ES setup to have replicas.
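If a second Elasticsearch node is ever added, replicas could be switched back on with the standard settings API; a minimal sketch (same endpoint as the replica command further down):

```bash
# Hypothetical: once a second ES node exists, keep one replica per index so a
# lost primary can be rebuilt from its copy.
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' \
  -d '{"index": {"number_of_replicas": 1}}'
```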
SuccessfulKoala55 do you have any example? I guess a lot of people face this issue
```
W: chown to _apt:root of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
W: chmod 0700 of directory /var/cache/apt/archives/partial failed - SetupAPTPartialDirectory (1: Operation not permitted)
Collecting pip==20.1.1
```
AgitatedDove14 I created a new clean venv and freshly installed the clearml-agent under python / pip 3.8 and now it is working again. Still don't know what caused this issue. Thank you very much for helping!
Process failed, exit code 1
```
task ab1a90dacb9042eea8e4a6a16640d7f4 pulled from 8f06b6b160c14a3591d791c1885b309e by worker test:gpu1
Running task 'ab1a90dacb9042eea8e4a6a16640d7f4'
Storing stdout and stderr log to '/tmp/.clearml_agent_out.kbkz1n40.txt', '/tmp/.clearml_agent_out.kbkz1n40.txt'
Current configuration (clearml_agent v1.0.0, location: /tmp/.clearml_agent.3e6l7juj.cfg):
sdk.storage.cache.default_base_dir = ~/.clearml/cache
sdk.storage.cache.size.min_free_bytes ...
```
Solving the replica issue now allowed me to get better insights into why the one index is red.
```json
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a...
```
I think Anna means that if artifacts and models are stored on the ClearML fileserver, their path will contain the IP or domain of the fileserver. If you then move the fileserver to a different host, all the URLs are broken since the host changed.
```
curl -XPUT -H 'Content-Type: application/json' 'localhost:9200/_settings' -d '{"index" : {"number_of_replicas" : 0}}'
```
This command made all my indices, except the broken one which is still red, turn green again. It comes from https://stackoverflow.com/questions/63403972/elasticsearch-index-in-red-health/63405623#63405623 .
Exactly, all agents should share the cache that is mounted via nfs. I think it is working now 🙂
SuccessfulKoala55 so you say deleting other old indices that I don't need could help?
```
docker run --gpus device=0 --rm -it nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 bash
```
worked; I could run nvidia-smi inside it and see GPU 0.
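For comparison, the corresponding agent invocation pinned to that GPU would look roughly like this (the queue name and image are assumptions carried over from earlier messages):

```bash
# Hypothetical: start the agent limited to GPU 0, mirroring the docker test above.
clearml-agent daemon --queue default --docker nvidia/cuda:11.3.0-cudnn8-runtime-ubuntu18.04 --gpus 0
```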
Using top inside the elasticsearch container shows
```
elastic+ 20 0 17.0g 8.7g 187584 S 2.3 27.2 1:09.18 java
```
i.e. the 8g are reserved. So setting ES_JAVA_OPTS: -Xms8g -Xmx8g should work.
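The heap size can also be confirmed from Elasticsearch itself instead of top, e.g.:

```bash
# Show the configured and currently used JVM heap of the ES node
curl -s 'localhost:9200/_nodes/stats/jvm?pretty' | grep -E 'heap_(used|max)_in_bytes'
```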
```yaml
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
    bootstrap.memory_lock: "true"
    cluster.name: clearml
    cluster.routing.allocation.node_initial_primaries_recoveries: "500"
    cluster.routing.allocation.disk.watermark.low: 500mb
    cluster.routing.allocation.disk.watermark.high: 500mb
    cluster.routing.allocation.disk.watermark.flood_stage: 500mb
    discovery.zen.minimum_master_no...
```
Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
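If no valid shard copy exists anywhere, the usual last resort is to force-allocate an empty primary via the reroute API, explicitly accepting that the data in that index is lost; a sketch, with the node name as a placeholder:

```bash
# Last resort: allocate an empty primary for the red index (data loss for this
# index is accepted). NODE_NAME is a placeholder for the actual ES node name.
curl -XPOST -H 'Content-Type: application/json' 'localhost:9200/_cluster/reroute' -d '{
  "commands": [{
    "allocate_empty_primary": {
      "index": "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
      "shard": 0,
      "node": "NODE_NAME",
      "accept_data_loss": true
    }
  }]
}'
```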