AgitatedDove14 Yes exactly! It is shown in the recording above
Ok, in that case it probably doesn’t work: if the default value is 10 secs, it doesn’t match what I see in the logs of the experiment, where tqdm adds a new line every second
I don’t have a registry to push my image to. I think I can get around it actually: will it work if I just build the image locally once and then start the agent? Docker would recognise that image locally and just use it, right? I won’t need to update that image often anyway
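To be concrete, this is roughly what I have in mind (a sketch; the image name my-custom-image and the queue name default are placeholders):
```bash
# Build the image once on the machine that will run the agent
docker build -t my-custom-image:latest .

# Start the agent in docker mode with that image as the default container;
# since the tag already exists locally, docker should use it without pulling from a registry
clearml-agent daemon --queue default --docker my-custom-image:latest
```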
Erf, I have the same problem with ProxyDictPreWrite 😄 What is the use case of this one?
Ha nice, makes perfect sense, thanks AgitatedDove14!
Nice, thanks!
AgitatedDove14 I see that the default is sample_frequency_per_sec=2, but in the UI I don’t see that resolution (i.e. it logs every ~120 iterations, corresponding to ~30 secs). What is the difference with report_frequency_sec=30?
CostlyOstrich36 good enough, I will fall back to sorting by updated, thanks!
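For the record, this is roughly the fallback I had in mind (a sketch, assuming the query goes through the APIClient and that the update timestamp field is last_update; the project id is a placeholder):
```python
from clearml.backend_api.session.client import APIClient

# Query the server directly and let it sort by the update timestamp,
# most recently updated tasks first (the leading "-" means descending)
client = APIClient()
tasks = client.tasks.get_all(
    project=["<project-id>"],            # placeholder project id
    order_by=["-last_update"],
    only_fields=["id", "name", "last_update"],
)
for t in tasks:
    print(t.id, t.name, t.last_update)
```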
Thanks SuccessfulKoala55 !
Maybe you could add an option to your docker-compose file for limiting the size of the logs: there is no limit by default, so their size will grow forever, which doesn't sound ideal. https://docs.docker.com/compose/compose-file/#logging
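Something along these lines for each service (the 10m / 3-file values are only an example):
```yaml
services:
  apiserver:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"   # rotate each container log file at 10 MB
        max-file: "3"     # keep at most 3 rotated files per container
```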
SuccessfulKoala55, this is not the exact corresponding request (I refreshed the tab since then), but the request is an events.get_task_logs, with the following content:
I get the following error:
This is the issue, I will make sure wait_for_status() calls reload at the end, so when the function returns you have the updated object
That sounds awesome! It will definitely fix my problem 🙂
In the meantime I now do:
task.wait_for_status()
task._artifacts_manager.flush()
task.artifacts["output"].get()
But I still get KeyError: 'output' ... Was that normal? Will it work if I replace the second line with task.refresh()?
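For anyone hitting the same thing, this is roughly what the workaround looks like on the caller side (a sketch, assuming the producing task uploads an artifact named "output"; the task id is a placeholder):
```python
from clearml import Task

task = Task.get_task(task_id="<task-id>")   # placeholder task id
task.wait_for_status()                      # block until the task reaches its final status
task.reload()                               # refresh the local task object from the server
local_path = task.artifacts["output"].get_local_copy()
print(local_path)
```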
ubuntu18.04 is actually 64 MB, I can live with that 😛
my docker-compose for the master node of the ES cluster is the following:
```
version: "3.6"
services:
  elasticsearch:
    container_name: clearml-elastic
    environment:
      ES_JAVA_OPTS: -Xms2g -Xmx2g
      bootstrap.memory_lock: "true"
      cluster.name: clearml-es
      cluster.initial_master_nodes: clearml-es-n1, clearml-es-n2, clearml-es-n3
      cluster.routing.allocation.node_initial_primaries_recoveries: "500"
      cluster.routing.allocation.disk.watermark.low: 500mb
      clust...
```
Ha, I see, it is not supported by the autoscaler: https://github.com/allegroai/clearml/blob/282513ac33096197f82e8f5ed654948d97584c35/trains/automation/aws_auto_scaler.py#L120-L125
Not of the ES cluster; I only created a backup of the clearml-server instance disk. I didn’t think there could be a problem with ES…
Now I am trying to restart the cluster with docker-compose while specifying the last volume; how can I do that?
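To make it concrete, this is roughly what I am trying to express in the compose file (a sketch, assuming the old ES data lives in an existing named docker volume, here called clearml-es-data; the volume name and mount path are assumptions):
```yaml
services:
  elasticsearch:
    volumes:
      # mount the pre-existing volume onto the ES data directory
      - clearml-es-data:/usr/share/elasticsearch/data

volumes:
  clearml-es-data:
    external: true   # reuse the existing volume instead of letting compose create a new one
```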
AppetizingMouse58 the events_plot.json template is missing the plot_len declaration, could you please give me the definition of this field? (Reindexing with dynamic: strict fails with: "mapping set to strict, dynamic introduction of [plot_len] within [_doc] is not allowed".)
Amazon Linux
SuccessfulKoala55 They do have the right filepath, e.g.: https://***.com:8081/my-project-name/experiment_name.b1fd9df5f4d7488f96d928e9a3ab7ad4/metrics/metric_name/predictions/sample_00000001.png
These images are actually stored there and I can access them via the URL shared above (the one written in the pop-up message saying that these files could not be deleted)
/opt/clearml/data/fileserver does not appear anywhere, sorry for the confusion - it’s the actual location where the files are stored
I could delete the files manually with sudo rm (sudo is required, otherwise I get Permission Denied)
I can also access these files directly if I enter the URL in the browser
Adding back clearml logging with matplotlib.use('agg') uses more RAM, but nothing that suspicious
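For context, this is how the non-interactive backend is selected (a minimal sketch; the project/task names and the figure are just placeholders for illustration):
```python
import matplotlib
matplotlib.use("agg")                 # select the non-interactive Agg backend before pyplot is imported
import matplotlib.pyplot as plt

from clearml import Task

# placeholder project/task names
task = Task.init(project_name="examples", task_name="agg backend test")

fig = plt.figure()
plt.plot([0, 1, 2], [0, 1, 4])
plt.title("dummy plot")
plt.show()                            # no window opens with agg; clearml's automatic matplotlib logging should still capture the figure
```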
Yes -> but I still don't understand why the post_packages didn't work; could be worth investigating
The workaround I could find for now is to add the following to CONTAINER > SETUP SHELL SCRIPT:
mkdir -p ~/git/credential
chmod 0700 ~/git/credential
git config --global credential.helper 'cache --socket ~/git/credential/socket'
I also tried setting ebs_device_name = "/dev/sdf" - that didn't work
And now that I restarted the server and went back into the project where I initially deleted the archived experiments, some of them are still there - I will leave them alone, too scared to do anything now 😄
Thanks TimelyPenguin76 and AgitatedDove14! I would like to delete the artifacts/models related to the old archived experiments, but they are stored on S3. Would that be possible?
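In case manual cleanup turns out to be the answer, a minimal sketch of what I could run myself (assuming I first collect the artifact/model URIs from the archived experiments; the bucket and key names below are placeholders):
```python
import boto3

s3 = boto3.client("s3")

# URIs collected from the archived experiments' artifacts/models (placeholders)
uris = [
    "s3://my-clearml-bucket/my-project/experiment.abc123/artifacts/output/data.pkl",
]

for uri in uris:
    bucket, key = uri[len("s3://"):].split("/", 1)   # split "s3://bucket/key" into bucket and key
    s3.delete_object(Bucket=bucket, Key=key)
    print(f"deleted s3://{bucket}/{key}")
```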