The task requires this service, so the task starts it on the machine - then I want to make sure the service is closed by the task upon completion/failure/abort.
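For reference, a minimal sketch of one way to tie a service's lifetime to a task, assuming the service is launched as a subprocess (the service command and `run_training` are hypothetical placeholders):
```python
import signal
import subprocess
import sys

from clearml import Task


def run_training():
    """Placeholder for the task's actual work."""
    pass


task = Task.init(project_name="examples", task_name="task-with-service")

# Hypothetical launcher for the service the task depends on
service = subprocess.Popen(["my-service", "--port", "8080"])

# Make an abort (SIGTERM) unwind through the finally block below;
# a plain finally would not run if the process were simply killed
signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(1))

try:
    run_training()
finally:
    # Runs on completion, failure, and (via the handler above) abort
    service.terminate()
    service.wait(timeout=30)
```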
AgitatedDove14 any chance you found something interesting? 🙂
SuccessfulKoala55 Thanks! If I understood correctly, setting index.number_of_shards = 2 (instead of 1) would create a second shard for the large index, splitting it in two? This answer https://stackoverflow.com/a/32256100 seems to say that it's not possible to change this value after index creation - is that true?
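That SO answer matches my understanding: number_of_shards is fixed when the index is created, so the usual route is to create a new index with the desired shard count and reindex into it. A hedged sketch with the elasticsearch Python client (index names and endpoint are hypothetical):
```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # adjust to the server's ES endpoint

# number_of_shards cannot be changed on a live index, so create a new one
es.indices.create(
    index="events-new",
    body={"settings": {"index": {"number_of_shards": 2}}},
)

# Copy the documents over server-side
es.reindex(
    body={"source": {"index": "events-old"}, "dest": {"index": "events-new"}},
    wait_for_completion=True,
)

# Then switch reads/writes to the new index (e.g. via an alias) and drop the old one
```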
Ooh, that's cool! I could place torch==1.3.1 there
When can we expect the next self-hosted release, btw?
```
Traceback (most recent call last):
  File "devops/train.py", line 73, in <module>
    train(parse_args)
  File "devops/train.py", line 37, in train
    train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
    uri = storage.verify_upload(folder_uri=uri)
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...
```
Hi DeterminedCrab71 Version: 1.1.1-135 • 1.1.1 • 2.14
Ok, deleting the installed packages list worked for the first task
Still failing with the same error 😞
Thanks for the help SuccessfulKoala55, the problem was solved by updating the docker-compose file to the latest version in the repo: https://github.com/allegroai/clearml-server/blob/master/docker/docker-compose.yml
Make sure to run docker-compose down and then docker-compose up -d afterwards, not docker-compose restart
I edited aws_auto_scaler.py - actually I think it's just a typo, I just need to double the brackets
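For context: the auto-scaler presumably builds its startup bash script with Python's str.format, so any literal { or } in the template has to be doubled. A quick illustration (the template string is made up):
```python
# str.format treats single braces as placeholders, so literal braces must be doubled
template = "docker run -e QUEUE={queue} sh -c 'echo ${{HOME}}'"
print(template.format(queue="default"))
# -> docker run -e QUEUE=default sh -c 'echo ${HOME}'
```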
AgitatedDove14 I see https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to ssh into the instance?
SuccessfulKoala55 Could you please point me to where I could quickly patch that in the code?
Thanks for the explanations,
Yes, that was the case. This is also what I would think, although I double-checked yesterday:
- I create a task on my local machine with trains 0.16.2rc0
- This task calls task.execute_remotely()
- The task is sent to an agent running 0.16
- The agent installs trains 0.16.2rc0
- The agent runs the task, clones it and enqueues the cloned task
- The cloned task fails because it has no hyper-parameters/args section (I can see that in the UI)
- When I clone the task manually usin...
My use case is: on a spot instance that AWS has marked for termination (2 min notice), I want to close the task and prevent the clearml-agent from picking up a new task afterwards.
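A minimal sketch of one way to do this, assuming the task closes itself when AWS posts the spot termination notice; the metadata endpoint is the documented one, and stopping the daemon with `clearml-agent daemon --stop` assumes the flags match how the daemon was started:
```python
import subprocess
import time

import requests
from clearml import Task

# Documented EC2 metadata endpoint; returns 404 until the instance is
# marked for termination (IMDSv1 shown - IMDSv2 would need a session token)
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

task = Task.current_task()
while True:
    if requests.get(TERMINATION_URL, timeout=2).status_code == 200:
        task.close()  # flush and close the running task cleanly
        # Ask the agent daemon to stop so it picks up nothing new;
        # the daemon flags here must match the ones it was started with
        subprocess.run(["clearml-agent", "daemon", "--stop"])
        break
    time.sleep(15)
```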
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/)
AgitatedDove14 yes! I now realise that the ignite event callbacks seem not to be fired (I tried printing a debug message from a custom Events.ITERATION_COMPLETED handler) and I cannot see it logged
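For reference, the minimal way to attach and verify a handler in ignite; if this prints nothing, the engine the handler is attached to isn't the one actually running (the toy engine below stands in for the real trainer):
```python
from ignite.engine import Engine, Events

# Toy engine; in the real setup `trainer` is the training Engine
trainer = Engine(lambda engine, batch: None)

@trainer.on(Events.ITERATION_COMPLETED)
def debug_iteration(engine):
    print(f"iteration {engine.state.iteration} completed")

trainer.run([0, 1, 2], max_epochs=1)  # should print three debug lines
```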
I followed https://github.com/NVIDIA/nvidia-docker/issues/1034#issuecomment-520282450 and now it seems to be setting up properly
but not as much as the ELB reports
AgitatedDove14 Yes exactly, I tried the fix suggested in the GitHub issue (urllib3>=1.25.4) and the ImportError disappeared 🙂
AgitatedDove14 Is it possible to shut down the server while an experiment is running? I would like to resize the volume and then restart it (should take ~10 mins)
I will let the team answer you on that one 🙂
AgitatedDove14 How can I filter out archived tasks? I don't see this option
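Programmatically, a hedged sketch using the system_tags filter - archived tasks carry the "archived" system tag, and a leading "-" should exclude it (the project name is hypothetical, and the exclusion syntax is worth verifying against your server version):
```python
from clearml import Task

# Exclude archived tasks via the 'archived' system tag
tasks = Task.get_tasks(
    project_name="my-project",                   # hypothetical project name
    task_filter={"system_tags": ["-archived"]},  # '-' prefix excludes the tag
)
print([t.name for t in tasks])
```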
This is how I start the agent that is running the two experiments in parallel:
```
python3 -m trains_agent --config-file "~/trains.conf" daemon --queue default --log-level DEBUG --detached
```
Hi AgitatedDove14, I don't see any in the https://pytorch.org/ignite/_modules/ignite/handlers/early_stopping.html#EarlyStopping but I guess I could override it and add one?
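Something like this might work - a sketch of subclassing EarlyStopping to fire a callback when it actually stops training, based on the implementation linked above (the on_stop hook is a made-up name):
```python
from ignite.handlers import EarlyStopping


class EarlyStoppingWithCallback(EarlyStopping):
    """EarlyStopping that also invokes a user callback when it triggers."""

    def __init__(self, patience, score_function, trainer, on_stop=None):
        super().__init__(patience=patience, score_function=score_function, trainer=trainer)
        self.on_stop = on_stop  # hypothetical hook, fired when training is stopped

    def __call__(self, engine):
        super().__call__(engine)
        # The parent calls trainer.terminate() once patience is exhausted,
        # which sets trainer.should_terminate
        if self.trainer.should_terminate and self.on_stop is not None:
            self.on_stop(engine)
```
It would then be attached the same way as the stock handler, e.g. `evaluator.add_event_handler(Events.COMPLETED, handler)`.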