I'll try to pass these values using the env vars
Alright, thanks for the answer! Seems legit then 🙂
TimelyPenguin76, no, I’ve only set the sdk.aws.s3.region = eu-central-1 param
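For reference, a minimal sketch of where that setting sits in the conf file (it's HOCON), assuming the standard sdk.aws.s3 section:
```
sdk {
    aws {
        s3 {
            # bucket region used by the SDK (value from the message above)
            region: "eu-central-1"
        }
    }
}
```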
…causing it to unregister from the server (and thus not remain there).
Do you mean that the agent actively notifies the server that it is going down, or does the server infer that the agent is down after a timeout?
extra_configurations = {'SubnetId': "<subnet-id>"} with brackets, right?
it worked for the other folder, so I assume yes --> I archived /opt/trains/data/mongo, sent the archive via scp, unarchived it, updated the permissions, and now it works
I was asking to exclude this possibility from my debugging journey 😁
Yes, it would be very valuable to be able to tweak that param. Currently it's quite annoying because it's set to 30 mins: when a worker is killed by the autoscaler, I have to wait 30 mins before the autoscaler spins up a new machine, because it thinks there are already enough agents available while in reality the agent is down
Thanks SuccessfulKoala55 for the answer! One followup question:
When I specify agent.package_manager.pip_version: '==20.2.3' in the trains.conf, I get:
trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
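In case it helps anyone later: the conf file is parsed as HOCON, which only accepts double-quoted (or unquoted) strings, so the single quotes are the most likely culprit for that ParseException. A minimal sketch, assuming that is indeed the cause:
```
agent {
    package_manager {
        # double quotes so the HOCON parser doesn't trip on the '=' characters
        pip_version: "==20.2.3"
    }
}
```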
ok, what is your problem then?
So this message appears when I try to ssh directly into the instance
Sure, I opened an issue: https://github.com/allegroai/clearml/issues/288. Unfortunately I don't have time to open a PR 🙏
I would like to try it to see if it solves some dependencies not being found even though they are installed, when using --system-site-packages
Thanks! Unfortunately still not working, here is the log file:
After I started clearml-session
But I am not sure it will connect the parameters properly, I will check now
Hi, /opt/clearml is ~40 MB, /opt/clearml/data is about ~50 GB
I didn’t use the ignite callbacks; for future reference:
```
from ignite.engine import Events
from ignite.handlers import EarlyStopping

early_stopping_handler = EarlyStopping(...)

def log_patience(_):
    # clearml_logger is the ClearML Logger and engine is the ignite trainer, both created elsewhere
    clearml_logger.report_scalar("patience", "early_stopping", early_stopping_handler.counter, engine.state.epoch)

engine.add_event_handler(Events.EPOCH_COMPLETED, early_stopping_handler)
engine.add_event_handler(Events.EPOCH_COMPLETED, log_patience)
```
The main issue is that task_logger.report_scalar() is not reporting the scalars
and I didn't have this problem before, because when cu117 wheels were not available the agent would pick the wheel with the closest cu version and fall back to 1.11.0+cu115, and that one was working
Yes, I guess that's fine then - Thanks!
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
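A rough sketch of that workaround, assuming the controller's task id is handed to the last step as a plain string parameter (the function name and parameter here are illustrative, not ClearML API):
```
from clearml import Task

def last_step(controller_task_id: str):
    # fetch the pipeline controller task by id and pull down all of its artifacts
    controller = Task.get_task(task_id=controller_task_id)
    return {
        name: artifact.get_local_copy()
        for name, artifact in controller.artifacts.items()
    }
```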
Thanks! (Maybe it could be added to the docs?) 🙂