Ok, I am asking because I often see the autoscaler starting more instances than the number of experiments in the queues, so I guess I just need to increase the max_spin_up_time_min
Here is what happens with polling_interval_time_min=1 when I add one task to the queue: the instance takes ~5 mins to start and connect. During this timeframe, the autoscaler starts two new instances, then spins them down. So it acts as if max_spin_up_time_min=10 is not taken into account
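For reference, these are the knobs I mean - a minimal sketch following the layout of the aws_autoscaler example config (the values are just the ones I set, and max_idle_time_min is my assumption for the third relevant setting):
```
# hyper-parameters section of my autoscaler config (sketch, not the full file)
hyper_params = {
    "polling_interval_time_min": 1,   # how often the experiment queues are polled
    "max_spin_up_time_min": 10,       # how long to wait for a new instance to register as a worker
    "max_idle_time_min": 5,           # spin an idle instance down after this many minutes
}
```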
(BTW: it will work with elevated credentials, but probably not recommended)
What does that mean? Not sure I understand
Yes, I stayed with an older version for a compatibility reason I cannot remember now 🙂 - just tested with 1.1.2 and it's the same
I tried specifying the bucket directly in my clearml.conf, same problem. I guess clearml just reads from the env vars first
Sorry both of you, my problem was actually lying somewhere else (both buckets are in the same region) - thanks for your time!
I am using clearml_agent v1.0.0 and clearml 0.17.5 btw
```
Traceback (most recent call last):
  File "devops/train.py", line 73, in <module>
    train(parse_args)
  File "devops/train.py", line 37, in train
    train_task.get_logger().set_default_upload_destination(args.artifacts + '/clearml_debug_images/')
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site-packages/clearml/logger.py", line 1038, in set_default_upload_destination
    uri = storage.verify_upload(folder_uri=uri)
  File "/home/machine/miniconda3/envs/py36/lib/python3.6/site...
```
Yes that's correct - the weird thing is that the error shows the right detected region
Also, what is the benefit of having index.number_of_shards = 1 by default for the metrics and log indices? Having more shards would allow scaling and later moving them to separate nodes if needed - with the default heap size of 2 GB it should be possible, no?
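Something like this is what I have in mind - just a sketch with the elasticsearch Python client; the index pattern and shard count are my guesses, not the actual clearml-server template names:
```
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")
# legacy index template raising the shard count for new event indices,
# so primaries can later be relocated to separate data nodes
es.indices.put_template(
    name="events-more-shards",
    body={
        "index_patterns": ["events-*"],
        "settings": {"index.number_of_shards": 3, "index.number_of_replicas": 1},
    },
)
```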
Hi AgitatedDove14, so I ran 3 experiments:
- One with my current implementation (using "fork")
- One using "forkserver"
- One using "forkserver" + the DataLoader optimization
I sent you the results via DM; here are the outcomes:
- fork -> 101 mins, low RAM usage (constant 5 GB), almost no IO
- forkserver -> 123 mins, high RAM usage (16 GB, with fluctuations), high IO
- forkserver + DataLoader optimization -> 105 mins, high RAM usage (from 28 GB down to 16 GB), high IO
CPU/GPU curves are the same for the 3 experiments...
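For completeness, here is roughly how the "forkserver" variant is set up - a minimal sketch with a dummy dataset, where "DataLoader optimization" is my shorthand for persistent workers + pinned memory:
```
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    mp.set_start_method("forkserver", force=True)  # instead of the default "fork"
    dataset = TensorDataset(torch.randn(1024, 8))  # stand-in for the real dataset
    loader = DataLoader(
        dataset,
        batch_size=32,
        num_workers=8,
        persistent_workers=True,  # keep workers alive between epochs
        pin_memory=True,
    )
    for (batch,) in loader:
        pass  # training step goes here
```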
I want to make sure that an agent has finished uploading its artifacts before marking itself as complete, so that the controller does not try to access these artifacts while they are not yet available
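Concretely, the guarantee I am after looks like this - a sketch, with the artifact name and payload as placeholders:
```
from clearml import Task

task = Task.current_task()
task.upload_artifact("output", artifact_object={"score": 0.93})  # placeholder payload
# block until every pending upload has actually finished
task.flush(wait_for_uploads=True)
```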
I finally found a workaround using cache, will detail the solution in the issue 🙂
Super! I'll give it a try and keep you updated here, thanks a lot for your efforts 🙂
I checked the server code diff between 1.1.0 (when it was working) and 1.2.0 (when the bug appeared) and I saw several relevant changes that could have introduced this bug > https://github.com/allegroai/clearml-server/compare/1.1.1...1.2.0
extra_configurations = {"SubnetId": "<subnet-id>"}
That fixed it 🙂
No, I want to launch the second step after the first one is finished and all its artifacts are uploaded
Yes 🙂 Thanks!
I get the following error:
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime, I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output' ... Is that normal? Will it work if I replace the second line with task.refresh()?
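For reference, this is the flow I am aiming for once the fix lands - a sketch using reload(), with the task ID as a placeholder:
```
from clearml import Task

step1 = Task.get_task(task_id="<step1-task-id>")  # placeholder ID
step1.wait_for_status()  # wait for the step to finish
step1.reload()           # refresh the local task object so artifacts are up to date
path = step1.artifacts["output"].get_local_copy()
```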
ubuntu18.04 is actually 64 MB, I can live with that 🙂
my docker-compose for the master node of the ES cluster is the following:
` version: "3.6"
services:
elasticsearch:
container_name: clearml-elastic
environment:
ES_JAVA_OPTS: -Xms2g -Xmx2g
bootstrap.memory_lock: "true"
cluster.name: clearml-es
cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
cluster.routing.allocation.node_initial_primaries_recoveries: "500"
cluster.routing.allocation.disk.watermark.low: 500mb
clust...
Ha I see, it is not supported by the autoscaler > https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
Not of the ES cluster, I only created a backup of the clearml-server instance disk, I didn't think there could be a problem with ES…
Now I am trying to restart the cluster with docker-compose while specifying the last volume - how can I do that?
AppetizingMouse58 the events_plot.json template is missing the plot_len declaration - could you please give me the definition of this field? Reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed"
Amazon Linux