Thanks! (Maybe it could be added to the docs?) 🙂
Yes, that was my assumption as well. There could be several causes, to be honest, now that I see that matplotlib itself is also leaking 😄
And after the update, the loss graph appears
Unfortunately, this one doesn't have _to_dict
Yes, that makes sense, I will do that. Thanks!
Hi CostlyOstrich36, I mean inserting temporary access keys
They are, but this doesn't work. I guess it's because temporary IAM credentials include an extra session token that should be passed as well, but there is no such option in the web UI, right?
I get the following error:
I am confused now, because I see that in the master branch the clearml.conf file has the following section:
# Or enable credentials chain to let Boto3 pick the right credentials.
# This includes picking credentials from environment variables,
# credential file and IAM role using metadata service.
# Refer to the latest Boto3 docs
use_credentials_chain: false
So it states that an IAM role using the metadata service should be supported, right?
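In other words, I'd expect enabling the flag to let Boto3 fall back to the instance's IAM role. A minimal sketch of what I mean (the nesting under sdk.aws.s3 follows how the flag appears in the default clearml.conf; the exact behavior is my assumption):

```
sdk {
    aws {
        s3 {
            # assumption: with the chain enabled, Boto3 can pick up
            # credentials from env vars, the credentials file, or the
            # IAM role via the metadata service
            use_credentials_chain: true
        }
    }
}
```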
Btw SuccessfulKoala55, the parameter is not documented in https://allegro.ai/clearml/docs/docs/references/clearml_ref.html#sdk-development-worker
This is what I get with mprof on the snippet above (I killed the program after the bar reached 100%, otherwise it hangs trying to upload all the figures).
Well, no luck: using matplotlib.use('agg') in my training codebase doesn't solve the memory leak.
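For reference, here is roughly what I tried (a minimal sketch of my own; the key points are that the backend must be set before pyplot is imported, and that figures are closed explicitly):

```python
# Sketch of the attempted fix: force the non-interactive Agg backend
# before pyplot is imported, and close each figure after reporting it.
import matplotlib
matplotlib.use('agg')  # must run before `import matplotlib.pyplot`
import matplotlib.pyplot as plt

for i in range(100):
    fig = plt.figure()
    plt.plot(range(10))
    # ... report/upload the figure here ...
    plt.close(fig)  # release the figure's memory explicitly
```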
Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase the max_spin_up_time_min
Yes, it did spin two instances for the same task
Here is what happens with polling_interval_time_min=1 when I add one task to the queue: the instance takes ~5 min to start and connect. During this timeframe, the autoscaler starts two new instances, then spins them down. So it acts as if max_spin_up_time_min=10 is not taken into account.
Why would it solve the issue? max_spin_up_time_min should be the parameter defining how long to wait after starting an instance, not polling_interval_time_min, right?
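For context, this is how I understand the two knobs, laid out in the style of the clearml AWS autoscaler example (the key names follow that example; the idle-timeout entry and the exact semantics are my assumptions):

```python
# Sketch of the autoscaler settings being discussed (values from my runs).
hyper_params = {
    "polling_interval_time_min": 1,  # how often the autoscaler polls the queues
    "max_spin_up_time_min": 10,      # how long to wait for a new instance to come up
    "max_idle_time_min": 15,         # assumption: idle time before an instance is spun down
}
```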
I will try with that and keep you updated
Thanks! Corrected both, now it's building.
Btw, I tried with alpine instead of ubuntu:18.04 and got:
Unable to find image 'alpine:latest' locally
latest: Pulling from library/alpine
df20fa9351a1: Pulling fs layer
df20fa9351a1: Verifying Checksum
df20fa9351a1: Download complete
df20fa9351a1: Pull complete
Digest: sha256:185518070891758909c9f839cf4ca393ee977ac378609f700f60a771a2dfe321
Status: Downloaded newer image for alpine:latest
docker: Error response from daemon: OCI runtime create failed: container_linux.go:349: starting containe...
Ok, so what worked for me in the end was:
config = task.connect_configuration(read_yaml(conf_path))
cfg = OmegaConf.create(config._to_dict())
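Spelled out as a self-contained sketch (read_yaml, the project/task names, and the config path are placeholders of mine; note that _to_dict() is a private API, so this may break in future versions):

```python
import yaml
from clearml import Task
from omegaconf import OmegaConf

def read_yaml(path):
    # placeholder helper: load a YAML file into a plain dict
    with open(path) as f:
        return yaml.safe_load(f)

task = Task.init(project_name="examples", task_name="omegaconf-config")
# connect_configuration returns a dict-like proxy tracked by the task;
# _to_dict() (private API) turns it back into a plain dict for OmegaConf.
config = task.connect_configuration(read_yaml("conf.yaml"))
cfg = OmegaConf.create(config._to_dict())
```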
Are you planning to add a server-backup service task in the near future?
Ok, I won't have time to venture into checking the different database components; the first option (shutting down the server) sounds like the easiest one for me. I would then manually run the script once a month or so.
(Just to know if I should wait a bit or go with the first solution)
Btw, task._get_task_property('hyperparams') also gives me ValueError: Task has no hyperparams section defined
I can probably have a Python script that checks if there are any tasks running/pending and, if not, runs docker-compose down to stop the clearml-server, then uses boto3 to trigger the creation of an EBS snapshot, waits until it is finished, and then restarts the clearml-server. Wdyt?
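Something along these lines (a rough sketch only: the volume ID, the compose directory, and the task-status filter keys are my assumptions):

```python
import subprocess
import boto3
from clearml import Task

EBS_VOLUME_ID = "vol-0123456789abcdef0"  # placeholder: the server's data volume
COMPOSE_DIR = "/opt/clearml"             # placeholder: where docker-compose.yml lives

# Skip the backup if anything is still running or queued
# (the filter keys follow my reading of the API and may need adjusting).
active = Task.get_tasks(task_filter={"status": ["in_progress", "queued"]})
if not active:
    subprocess.run(["docker-compose", "down"], cwd=COMPOSE_DIR, check=True)
    ec2 = boto3.client("ec2")
    snapshot = ec2.create_snapshot(
        VolumeId=EBS_VOLUME_ID,
        Description="clearml-server monthly backup",
    )
    # Block until the snapshot is complete before restarting the server.
    ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot["SnapshotId"]])
    subprocess.run(["docker-compose", "up", "-d"], cwd=COMPOSE_DIR, check=True)
```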
The task is created using Task.clone(), yes.
Could also be related to https://allegroai-trains.slack.com/archives/CTK20V944/p1597928652031300