Yes I did, I found the problem: docker-compose was using trains-server 0.15 because it didn't see the new version of trains-server. Hence I had trains-server 0.15 running with ES7.
-> I deleted all the containers and it successfully pulled trains-server 0.16. Now everything is running properly 🙂
continue_last_task is almost what I want; the only problem with it is that it will restart the task even if the task is already completed
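For reference, this is roughly how I'm calling it (the project/task names are placeholders):
```python
from clearml import Task

# continue_last_task reopens the most recent task with this name and resumes
# logging into it; the catch is that it does so even if that task already completed.
task = Task.init(
    project_name="my-project",   # placeholder
    task_name="my-task",         # placeholder
    continue_last_task=True,
)
```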
nothing wrong on the ClearML side 🙂
I'll try to pass these values using the env vars
Alright, thanks for the answer! Seems legit then 🙂
So I updated the config with:
```
resource_configurations {
  A100 {
    instance_type = "p3.2xlarge"
    is_spot = false
    availability_zone = "us-east-1b"
    ami_id = "ami-04c0416d6bd8e4b1f"
    ebs_device_name = "/dev/xvda"
    ebs_volume_size = 100
    ebs_volume_type = "gp3"
    key_name = "<my-key-name>"
    security_group_ids = ["<my-sg-id>"]
    subnet_id = "<my-subnet-id>"
  }
}
```
but then I get the following in the autoscaler logs:
```
Warning! exception occurred: An error occurred (InvalidParam...
```
TimelyPenguin76, no, I’ve only set the `sdk.aws.s3.region = eu-central-1` param
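i.e., in my conf file, something like:
```
sdk {
  aws {
    s3 {
      region: "eu-central-1"
    }
  }
}
```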
…causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down? Or does the server infer that the agent is down after a timeout?
`extra_configurations = {'SubnetId': "<subnet-id>"}` with brackets, right?
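For reference, this is where I'd put it; my understanding is that extra_configurations gets passed through to the EC2 instance-creation call:
```
resource_configurations {
  A100 {
    instance_type = "p3.2xlarge"
    # ...same keys as in the config above...
    extra_configurations = {'SubnetId': "<subnet-id>"}
  }
}
```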
It worked for the other folder, so I assume yes: I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, fixed the permissions, and now it works
I was asking to exclude this possibility from my debugging journey 😁
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 minutes: when a worker is killed by the autoscaler, I have to wait 30 minutes before the autoscaler spins up a new machine, because it still thinks there are enough agents available while in reality the agent is down
Thanks SuccessfulKoala55 for the answer! One followup question:
When I specify:
```
agent.package_manager.pip_version: '==20.2.3'
```
in the trains.conf, I get:
```
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
```
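Could the single quotes be the issue? If the conf format is HOCON (which only accepts double-quoted strings), this form might parse instead:
```
agent.package_manager.pip_version: "==20.2.3"
```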
ok, what is your problem then?
So this message appears when I try to ssh directly into the instance
Sure, I opened an issue: https://github.com/allegroai/clearml/issues/288 . Unfortunately I don't have time to open a PR 🙏
I would like to try it to see if it solves the problem of some dependencies not being found, even though they are installed, when using --system-site-packages
Thanks! Unfortunately it's still not working; here is the log file from after I started clearml-session:
But I am not sure it will connect the parameters properly; I will check now
Hi, /opt/clearml is ~40 MB, /opt/clearml/data is ~50 GB
I didn’t use the ignite callbacks; for future reference:
```python
from ignite.engine import Events
from ignite.handlers import EarlyStopping

# engine and clearml_logger come from the surrounding training setup
early_stopping_handler = EarlyStopping(...)

def log_patience(_):
    # report the early-stopping counter so the remaining patience is visible in ClearML
    clearml_logger.report_scalar("patience", "early_stopping",
                                 early_stopping_handler.counter, engine.state.epoch)

engine.add_event_handler(Events.EPOCH_COMPLETED, early_stopping_handler)
engine.add_event_handler(Events.EPOCH_COMPLETED, log_patience)
```