extra_configurations = {'SubnetId': "<subnet-id>"} with brackets, right?
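For context, a rough sketch of where I'm putting that dict in the autoscaler config; the surrounding resource_configurations keys are just placeholders from a typical autoscaler example, and I'm assuming extra_configurations gets forwarded to boto3's run_instances call:
` # hypothetical autoscaler resource configuration (keys other than
# 'extra_configurations' are placeholders, not my real values)
resource_configurations = {
    "aws_gpu_machine": {
        "instance_type": "g4dn.xlarge",
        "ami_id": "<ami-id>",
        "availability_zone": "us-east-1b",
        "ebs_device_name": "/dev/sda1",
        "ebs_volume_size": 100,
        "ebs_volume_type": "gp3",
        # assumption: anything here is passed verbatim to run_instances(),
        # so SubnetId pins the instance to a specific subnet
        "extra_configurations": {"SubnetId": "<subnet-id>"},
    }
} `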
I was asking in order to exclude this possibility from my debugging journey
Thanks SuccessfulKoala55 for the answer! One follow-up question:
When I specify: agent.package_manager.pip_version: '==20.2.3'
in trains.conf, I get: trains_agent: ERROR: Failed parsing /home/machine1/trains.conf (ParseException): Expected end of text, found '=' (at char 326), (line:7, col:37)
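In case it helps anyone later, the form I would try next is double quotes around the value, since I suspect the HOCON parser behind trains.conf only accepts double-quoted strings and chokes on the '==' otherwise (just a sketch of the relevant section):
` agent {
  package_manager {
    # double quotes rather than single quotes (assumption: the parser
    # treats '==20.2.3' as an unquoted token and stops at the '=')
    pip_version: "==20.2.3"
  }
} `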
ok, what is your problem then?
So this message appears when I try to ssh directly into the instance
Sure, I opened an issue: https://github.com/allegroai/clearml/issues/288 . Unfortunately I don't have time to open a PR
I would like to try it to see if it solves the issue of some dependencies not being found even though they are installed, when using --system-site-packages
After I started clearml-session
But I am not sure it will connect the parameters properly, I will check now
Hi, /opt/clearml is ~40 MB, /opt/clearml/data is about ~50 GB
I didn't use the ignite callbacks; for future reference:
` early_stopping_handler = EarlyStopping(...)
# report the early-stopping patience counter at the end of every epoch
def log_patience(_):
    clearml_logger.report_scalar("patience", "early_stopping",
                                 early_stopping_handler.counter, engine.state.epoch)
engine.add_event_handler(Events.EPOCH_COMPLETED, early_stopping_handler)
engine.add_event_handler(Events.EPOCH_COMPLETED, log_patience) `
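For completeness, a runnable toy version of the same idea; the dummy engines, the score function, and the project/task names are placeholders for illustration, not my real setup:
` from clearml import Task
from ignite.engine import Engine, Events
from ignite.handlers import EarlyStopping

task = Task.init(project_name="examples", task_name="patience-logging")
clearml_logger = task.get_logger()

# stand-in training / validation loops
trainer = Engine(lambda engine, batch: None)
evaluator = Engine(lambda engine, batch: None)

def score_function(engine):
    # EarlyStopping expects "higher is better", so negate the validation loss
    return -engine.state.metrics.get("loss", 0.0)

early_stopping_handler = EarlyStopping(patience=5, score_function=score_function, trainer=trainer)
evaluator.add_event_handler(Events.COMPLETED, early_stopping_handler)

@trainer.on(Events.EPOCH_COMPLETED)
def log_patience(engine):
    evaluator.run([0])  # run the (dummy) validation loop once per epoch
    clearml_logger.report_scalar("patience", "early_stopping",
                                 early_stopping_handler.counter, engine.state.epoch)

trainer.run([0], max_epochs=3) `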
The main issue is that task_logger.report_scalar() is not reporting the scalars
And I didn't have this problem before, because when cu117 wheels were not available the agent would pick the wheel with the closest cu version, falling back to 1.11.0+cu115, and that one was working
Yes, I guess that's fine then - Thanks!
I guess I can have a workaround by passing the pipeline controller task id to the last step, so that the last step can download all the artifacts from the controller task.
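Roughly what I have in mind for that last step, assuming the controller task id is passed in as a plain parameter (the parameter and artifact names here are placeholders):
` from clearml import Task

def merge_step(controller_task_id: str):
    # fetch the pipeline controller task by the id passed in as a parameter
    controller_task = Task.get_task(task_id=controller_task_id)
    # download every artifact registered on the controller, then merge them
    local_copies = {
        name: artifact.get_local_copy()
        for name, artifact in controller_task.artifacts.items()
    }
    # ... merge the downloaded files here ...
    return local_copies `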
Thanks! (Maybe this could be added to the docs?)
Indeed, I actually had the old configuration that was not JSON - I converted it to JSON and now it works
Both are repos for Python modules (one for the experiment itself and one for a dependency of the experiment)
Ha, wait, I removed the http:// from the host and it worked
No space at the end of the diff file:
` diff --git a/configs/2.2.2_from_scratch.yaml b/configs/2.2.2_from_scratch.yaml
index 9fece48..5816f78 100644
--- a/configs/2.2.2_from_scratch.yaml
+++ b/configs/2.2.2_from_scratch.yaml
@@ -136,7 +136,7 @@ data_processing:
   optimizer:
     type: 'RMSprop'
     args:
-      lr: 2.5e-4
+      lr: 1.5e-5
       momentum: 0
       weight_decay: 0 `
Ok, now I get ERROR: No matching distribution found for conda==4.9.2 (from -r /tmp/cached-reqscaw2zzji.txt (line 13))
And since I ran the task locally with python3.9, it used that version in the docker container
In all the steps I want to store them as artifacts in S3, because it's very convenient.
The last step should merge them all, i.e. it needs to know about all the artifacts of the previous steps
So I want to be able to visualise it quickly as a table in the UI and also to download it as a dataframe; which of report_media or artifact is better for that?
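As I understand the two options (just a sketch, assuming a pandas DataFrame df; report_table might be the more direct fit for a table than report_media, and upload_artifact keeps a downloadable dataframe):
` import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="table-vs-artifact")
df = pd.DataFrame({"step": [1, 2, 3], "score": [0.1, 0.4, 0.9]})

# option 1: render the dataframe as a table in the web UI
task.get_logger().report_table(title="results", series="summary",
                               iteration=0, table_plot=df)

# option 2: upload it as an artifact that can be downloaded later as a dataframe
task.upload_artifact(name="results_df", artifact_object=df) `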
I am now trying with agent.extra_docker_arguments: ["--network='host'", ] instead of what I shared above
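For reference, this is roughly the clearml.conf form I'm testing; I'm not sure the inner quotes around host are needed, so the plainer variant below is an assumption on my side:
` agent {
  # passed through to docker run; trying it without the inner quotes
  extra_docker_arguments: ["--network=host"]
} `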