This did not help, I still have the same issue
I will try to recover it, but in any case the lesson is to fully separate the fileserver and any output location from mongo, redis and elastic. It probably also makes sense to improve the ES setup to have replicas.
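For reference, a minimal sketch of how replicas could be enabled per index through the index settings API, assuming a second ES data node is added later (on a single node the replica shards would just stay unassigned); the host and index name are placeholders:

```python
import requests

ES = "http://localhost:9200"  # assumed ES endpoint of the clearml-elastic container
index = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"  # placeholder

# Ask ES to keep one replica copy of every shard of this index.
resp = requests.put(
    f"{ES}/{index}/_settings",
    json={"index": {"number_of_replicas": 1}},
)
resp.raise_for_status()
print(resp.json())  # {"acknowledged": true} on success
```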
Since it is a single node, I guess it will not be possible to recover or partially recover the index, right?
Can you send a more comprehensive log? Perhaps there are other related messages.
Which logs would you like?
Try to restart ES and see if it helps
docker-compose down / up does not help
SuccessfulKoala55 I'm currently inside the docker container to recover the ckpt files. But /root/.clearml/venvs-builds seems to be empty. Any idea where I could then find the ckpt files?
SuccessfulKoala55 Hey, for us the artifact download URLs, model download URLs, images in plots and debug image URLs are broken. In the linked example I can see a solution for the debug images and potentially the plot images, but I can't find the artifact and model URLs inside ES. Are those URLs maybe stored inside the mongodb? Any idea where to find them?
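In case it helps to narrow this down, a rough sketch of how one could scan the MongoDB for documents that still contain the old URL prefix, without assuming any database or collection names (the connection string and prefix are placeholders):

```python
import json
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
OLD_PREFIX = "http://old-fileserver:8081"          # placeholder for the broken URL prefix

# Walk every database/collection and count documents whose JSON dump
# contains the old prefix, to find out where the URLs actually live.
for db_name in client.list_database_names():
    if db_name in ("admin", "config", "local"):
        continue
    db = client[db_name]
    for coll_name in db.list_collection_names():
        hits = sum(
            1
            for doc in db[coll_name].find()
            if OLD_PREFIX in json.dumps(doc, default=str)
        )
        if hits:
            print(f"{db_name}.{coll_name}: {hits} documents reference the old prefix")
```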
SuccessfulKoala55 so you say deleting other old indices that I don't need could help?
CostlyOstrich36 Thank you for your response, is there something like a public project roadmap?
Actually I saw that the RuntimeError: context has already been set appears when the task is initialised outside if __name__ == "__main__":
using this code in https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/scripts/pretrain.py
```python
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
My code now produces an error inside one of the threads, but that should be an issue on my side. Still, this failure inside a child thread was not detected and the training task ended up as "completed". This error now happens with Task.init inside the if __name__ == "__main__": block, as seen in the code snippet above.
```
RuntimeError: stack expects each tensor to be equal size, but got [15, 640, 372, 2] at entry 0 and [15, 322, 640, 2] at entry 1
Detected an exited process, so exiting main
terminating child processes
exiting
```
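For what it's worth, a minimal sketch of how the parent process could surface such a child failure itself, assuming a multiprocessing spawn setup roughly like spawn_dist (the worker function here is made up):

```python
import multiprocessing as mp
import sys


def worker(rank: int) -> None:
    # placeholder for the real per-GPU training entry point
    raise RuntimeError(f"simulated failure in rank {rank}")


if __name__ == "__main__":
    ctx = mp.get_context("spawn")
    procs = [ctx.Process(target=worker, args=(rank,)) for rank in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

    # Propagate child failures to the parent process, so the ClearML task
    # is marked as failed instead of "completed".
    failed = [p.exitcode for p in procs if p.exitcode != 0]
    if failed:
        sys.exit(f"child process(es) exited with codes {failed}")
```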
I'm now running the code shown above and will let you know if there is still an issue.
Solving the replica issue now allowed me to get better insights into why the one index is red.
```json
{
  "index" : "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b",
  "shard" : 0,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "CLUSTER_RECOVERED",
    "at" : "2021-11-09T22:30:47.018Z",
    "last_allocation_status" : "no_valid_shard_copy"
  },
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a...
```
Hey AgitatedDove14, I fixed my code issue and am now able to train on multiple GPUs using https://github.com/facebookresearch/fastMRI/blob/master/banding_removal/fastmri/spawn_dist.py . Since I create the ClearML Task in the main thread, I now can't see any training plots and probably also not the output model. What would be the right approach? I would like to avoid using Task.current_task().upload_artifact() or manual logging; I really enjoy the automatic detection.
What version of clearml is your server?
The docker-compose uses clearml:latest
```python
if __name__ == "__main__":
    task = Task.init(project_name="dummy",
                     task_name="pretraining",
                     task_type=Task.TaskTypes.training,
                     reuse_last_task_id=False)
    task.connect(args)
    print('Arguments: {}'.format(args))
    # only create the task, we will actually execute it later
    task.execute_remotely()
    spawn_dist.run(args)
```
I added it to this script and use it as a starting point https://github.com/facebookresearch/fastMRI/bl...
The strange thing was that my agents were running in the morning but then just disappeared from the clearml server UI under workers-and-queues. So I did docker-compose down / up and then I got this error.
We do have a queue called office and another queue called default, so the agent is not listening on queues that are not defined. Or do I misunderstand something? The server has all the queues defined that the agents are using.
1. ssh into the elasticsearch container
2. identify the id of the index that seems to be broken
3. run `/usr/share/elasticsearch/jdk/bin/java -cp lucene-core*.jar -ea:org.apache.lucene... org.apache.lucene.index.CheckIndex /usr/share/elasticsearch/data/nodes/0/indices/your-id/0/index/ -verbose -exorcise`

This can be dangerous but is the only option if you assume that the data is lost anyway. Either running step 3 repairs the broken segments, or it shows, as in my case, `No problems were detected with this i...
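A rough sketch of steps 2 and 3 put together, assuming the default layout of the official Elasticsearch image (Lucene jars under /usr/share/elasticsearch/lib, data under /usr/share/elasticsearch/data/nodes/0) and ES reachable on localhost:9200; adjust the paths if your setup differs:

```python
import glob
import subprocess
import requests

ES = "http://localhost:9200"  # assumed ES endpoint
INDEX = "events-training_stats_scalar-d1bd92a3b039400cbafc60a7a5b1e52b"

# Step 2: resolve the on-disk id (uuid) of the broken index.
resp = requests.get(f"{ES}/_cat/indices/{INDEX}", params={"format": "json", "h": "uuid"})
uuid = resp.json()[0]["uuid"]

# Step 3: run Lucene's CheckIndex on shard 0 of that index.
# -exorcise drops broken segments, i.e. it can permanently delete data.
lucene_jar = glob.glob("/usr/share/elasticsearch/lib/lucene-core*.jar")[0]  # assumed jar location
subprocess.run(
    [
        "/usr/share/elasticsearch/jdk/bin/java",
        "-cp", lucene_jar,
        "-ea:org.apache.lucene...",  # enable assertions in org.apache.lucene and subpackages
        "org.apache.lucene.index.CheckIndex",
        f"/usr/share/elasticsearch/data/nodes/0/indices/{uuid}/0/index/",
        "-verbose",
        "-exorcise",
    ],
    check=True,
)
```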
The agents also share the clearml.conf file, which causes an issue with the worker_id/worker_name: they all want to be ubuntu:gpu0. Any idea how I can randomize it? Setting the CLEARML_WORKER_ID env var somehow does not work.
or only not for apt and pip?
```yaml
elasticsearch:
  networks:
    - backend
  container_name: clearml-elastic
  environment:
    ES_JAVA_OPTS: -Xms8g -Xmx8g
    bootstrap.memory_lock: "true"
    cluster.name: clearml
    cluster.routing.allocation.node_initial_primaries_recoveries: "500"
    cluster.routing.allocation.disk.watermark.low: 500mb
    cluster.routing.allocation.disk.watermark.high: 500mb
    cluster.routing.allocation.disk.watermark.flood_stage: 500mb
    discovery.zen.minimum_master_no...
```
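To sanity-check that those watermark/recovery settings were actually picked up by the node, something like the following could work (assuming ES is reachable on localhost:9200):

```python
import requests

ES = "http://localhost:9200"  # assumed ES endpoint

# Node-level settings (from elasticsearch.yml / environment) are reported here.
info = requests.get(f"{ES}/_nodes/settings", params={"flat_settings": "true"}).json()
for node in info["nodes"].values():
    settings = node["settings"]
    for key in (
        "cluster.routing.allocation.disk.watermark.low",
        "cluster.routing.allocation.disk.watermark.high",
        "cluster.routing.allocation.disk.watermark.flood_stage",
        "cluster.routing.allocation.node_initial_primaries_recoveries",
    ):
        print(node["name"], key, settings.get(key))
```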
docker-compose with an entrypoint.sh that runs python3 -m clearml_agent daemon --docker "${CLEARML_AGENT_DEFAULT_BASE_DOCKER:-$TRAINS_AGENT_DEFAULT_BASE_DOCKER}" --force-current-version ${CLEARML_AGENT_EXTRA_ARGS:-$TRAINS_AGENT_EXTRA_ARGS} --queue office
I like this approach more, but it still requires the environment variables to be resolved inside the clearml.conf
So now there is a user conflict between the host and the agent inside the container.