Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase max_spin_up_time_min
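For context, a hedged sketch of where such a setting might live, assuming the autoscaler is driven by a plain dict of hyperparameters as in the ClearML AWS autoscaler example script; only max_spin_up_time_min comes from the message above, the other keys and values are assumptions for illustration.

```python
# Hedged sketch: autoscaler timing knobs as a dict of hyperparameters.
# Only max_spin_up_time_min is taken from the message above; the other keys,
# the values and the dict layout itself are assumptions.
hyper_params = {
    "max_spin_up_time_min": 30,      # give slow-booting instances more time before retrying
    "max_idle_time_min": 5,          # assumed companion setting: shut down idle workers
    "polling_interval_time_min": 5,  # assumed: how often the queues are polled
}
```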
Hi SoggyFrog26 , https://github.com/allegroai/clearml/blob/master/docs/datasets.md
Traceback (most recent call last):
File "devops/train.py", line 73, in <module>
train(parse_args)
File "devops/train.py", line 37, in train
train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
uri = storage.verify_upload(folder_uri=uri)
File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...
Yes, super thanks AgitatedDove14 !
Hi TimelyPenguin76 ,
trains-server: 0.16.1-320
trains: 0.15.1
trains-agent: 0.16
I made sure before deleting the old index that the number of docs matched
But clearml does read from env vars as well, right? It's not just delegating resolution to the aws cli, so it should be possible to specify the region to use for the logger, right?
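A minimal check of that assumption (namely that S3 access ultimately goes through boto3, which does resolve the region from the environment); the region value is hypothetical.

```python
import os
import boto3

# boto3 picks up AWS_DEFAULT_REGION from the environment, so exporting it
# before the task/logger starts should be enough for region resolution.
os.environ["AWS_DEFAULT_REGION"] = "eu-west-1"  # hypothetical region
print(boto3.session.Session().region_name)      # -> eu-west-1
```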
I will try to isolate the bug; if I can, I will open an issue in trains-agent
I actually need to be able to overwrite files, so in my case it makes sense to grant the DeleteObject permission in S3. But for other cases, why not simply catch this error, display a warning to the user, and store internally that delete is not possible?
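A sketch of the suggested fallback under those assumptions: attempt the delete once, and if the bucket policy does not grant s3:DeleteObject, warn and remember that deletes are unavailable instead of failing. This is illustrative client-side code, not the SDK's actual implementation; bucket and key names are whatever the caller passes.

```python
import logging

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
_delete_supported = True  # remembered once we learn deletes are forbidden


def try_delete(bucket: str, key: str) -> None:
    """Delete an object if permitted, otherwise warn once and give up on deletes."""
    global _delete_supported
    if not _delete_supported:
        return
    try:
        s3.delete_object(Bucket=bucket, Key=key)
    except ClientError as err:
        if err.response["Error"]["Code"] == "AccessDenied":
            logging.warning("No s3:DeleteObject permission on %s; disabling deletes", bucket)
            _delete_supported = False
        else:
            raise
```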
It would be nice if Task.connect_configuration could support custom YAML file readers for my use case
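A minimal sketch of a workaround under the current API, assuming the custom reader is only needed to turn the file into a dict: parse the YAML yourself and pass the resulting dict to connect_configuration, which also accepts dicts. The project, task and file names are hypothetical.

```python
import yaml
from clearml import Task

task = Task.init(project_name="examples", task_name="custom yaml config")  # hypothetical names


def read_config(path: str) -> dict:
    # swap in whatever custom Loader / reader is needed here
    with open(path, "r") as stream:
        return yaml.load(stream, Loader=yaml.FullLoader)


# connect_configuration accepts a dict, so the custom reader stays on our side
config = task.connect_configuration(read_config("config.yaml"), name="config")
```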
I get the same error when trying to run the task using clearml-agent services-mode with docker, which is weird
In my CI tests, I want to reproduce a run in an agent, because the environment changes and some things break in agents but not locally
Awesome! (Broken link in migration guide, step 3: https://allegro.ai/docs/deploying_trains/trains_server_es7_migration/ )
AgitatedDove14 This looks awesome! Unfortunately it would require a lot of changes in my current code; for that project I found a workaround. But I will surely use it for the next pipelines I build!
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
AgitatedDove14 I see at https://github.com/allegroai/clearml-session/blob/main/clearml_session/interactive_session_task.py#L21 that a key pair is hardcoded in the repo. Is it being used to ssh into the instance?
with open(path, "r") as stream:
    return yaml.load(stream, Loader=yaml.FullLoader)
AgitatedDove14 I made some progress:
In the agent's clearml.conf, I set sdk.development.report_use_subprocess = false (because I had the feeling that Task._report_subprocess_enabled = False wasn't taken into account). I've also set task.set_initial_iteration(0). Now I was able to get the following graph after resuming -
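Roughly, the resume setup described above might look like the sketch below; continue_last_task and set_initial_iteration are existing Task APIs, while the project/task names and the assumption that the run is resumed this way are mine.

```python
from clearml import Task

task = Task.init(
    project_name="examples",         # hypothetical
    task_name="resumable training",  # hypothetical
    continue_last_task=True,         # assumption: the run is resumed into the same task
)
task.set_initial_iteration(0)        # report iterations from 0 instead of offsetting them
```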
This https://discuss.elastic.co/t/index-size-explodes-after-split/150692 seems to say that with the _split API such a situation can happen and resolves itself after a couple of days; maybe it is the same case for me?
SuccessfulKoala55 , This is not the exact corresponding request (I refreshed the tab since then), but the request is an events.get_task_logs , with the following content:
I think that somehow somewhere a reference to the figure is still living, so plt.close("all") and gc cannot free the figure and it ends up accumulating. I don't know where yet
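One way to confirm that suspicion is sketched below: keep only a weak reference to a figure, close everything, force a collection, and if the figure survives, ask gc who still refers to it. This is a generic debugging sketch, not tied to any ClearML API.

```python
import gc
import weakref

import matplotlib
matplotlib.use("Agg")  # headless backend for the sketch
import matplotlib.pyplot as plt

fig = plt.figure()
ref = weakref.ref(fig)

plt.close("all")
del fig
gc.collect()

leaked = ref()
if leaked is None:
    print("figure was collected")
else:
    # something still holds the figure; list the suspected holders
    for holder in gc.get_referrers(leaked):
        print(type(holder))
```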
I also don't understand what you mean by "unless the domain is different"... The same way SSH keys are global, I would have expected the git credentials to be used for any git operation
Ha I just saw in the logs:
WARNING:py.warnings:/root/.clearml/venvs-builds/3.8/lib/python3.8/site-packages/torch/cuda/__init__.py:145: UserWarning:
NVIDIA A10G with CUDA capability sm_86 is not compatible with the current PyTorch installation.
The current PyTorch install supports CUDA capabilities sm_37 sm_50 sm_60 sm_70.
If you want to use the NVIDIA A10G GPU with PyTorch, please check the instructions at
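A quick diagnostic for that warning, simply comparing the CUDA build of the installed wheel with the GPU's compute capability (nothing here is ClearML-specific):

```python
import torch

print(torch.__version__, torch.version.cuda)    # installed wheel and the CUDA it was built for
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))
    print(torch.cuda.get_device_capability(0))  # e.g. (8, 6), i.e. sm_86, for an A10G
```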
AgitatedDove14 ok, but this happens in my local machine, not in the agent
So previous_task actually ignored the output_uri
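For completeness, a hedged sketch of forcing the destination explicitly at init time so it cannot be silently dropped; the assumption that the task is resumed with continue_last_task, plus the names and bucket path, are mine.

```python
from clearml import Task

task = Task.init(
    project_name="examples",                # hypothetical
    task_name="training",                   # hypothetical
    continue_last_task=True,                # assumption about how previous_task is resumed
    output_uri="s3://my-bucket/artifacts",  # set explicitly at init time
)
```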