
That's with the key at
/root/.ssh/id_rsa
You mean inside the container that the autoscaler spun up?
Notice that by default the agent mounts the host's .ssh over the existing .ssh inside the container. If you do not want this behavior, set: agent.disable_ssh_mount: true
in clearml.conf
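For reference, a minimal sketch of how that would look in clearml.conf (HOCON syntax; assuming a clearml-agent version that supports this key):
```
agent {
    # do not mount the host's ~/.ssh folder into the container
    disable_ssh_mount: true
}
```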
PungentLouse55 you can find the metrics in the "original" (aka base template) experiment.
DistressedGoat23
We are running hyperparameter tuning (using some CV) which might take a long time and might even be aborted unexpectedly due to machine resources.
We therefore want to see the progress
On the HPO Task itself (not the individual experiments, the one controlling it all) there is the global progress of the optimization metric. Is this what you are looking for? Am I missing something?
I think your use case is the original idea behind the "use_current_task" option, it was basically designed to connect the code that creates the Dataset with the Dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
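For illustration, a minimal sketch of connecting the two (project/dataset names and the ./data folder are hypothetical):
```python
from clearml import Task, Dataset

# the Task running the dataset-creation code
task = Task.init(project_name="data", task_name="build dataset")

# use_current_task=True binds the Dataset to this Task,
# instead of spawning a separate "dataset task"
dataset = Dataset.create(
    dataset_name="my_dataset",
    dataset_project="data",
    use_current_task=True,
)
dataset.add_files("./data")  # assumes a local ./data folder
dataset.upload()
dataset.finalize()
```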
WickedGoat98 Notice this is not the "clearml-agent-services" docker but "clearml-agent" docker image
Also the default docker image is "nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04"
Other than that quite similar :)
Hi ReassuredTiger98
However, the clearml-agent also stops working then.
you mean the clearml-agent daemon (the one that spun up the container) is crashing as well?
I think this is the only mount you need:
Data persisted in every Kubernetes volume by ClearML will be accessible in /tmp/clearml-kind folder on the host.
SuccessfulKoala55 is this correct ?
Hi ConvincingSwan15
For the train.py, do I need a setup.py file in my repo to work correctly with the agent? For now it is just the path to train.py
I'm assuming the train.py is part of the repository, no?
If it is, how come the agent after cloning the repository cannot find it ?
Could it be it was accidentally not added to the git repo ?
No, I was pointing out the lack of one
Sounds like a great idea, could you open a GitHub issue (if one isn't already open)? Just so we do not forget
Set the PyTorch Lightning trainer argument log_every_n_steps to 1 (default 50) to prevent the ClearML iteration logger from timing out
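For example, something along these lines (the toy model and data are just placeholders):
```python
import torch
from torch.utils.data import DataLoader, TensorDataset
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(4, 1)

    def training_step(self, batch, batch_idx):
        x, y = batch
        loss = torch.nn.functional.mse_loss(self.layer(x), y)
        self.log("loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.01)

# report scalars every step instead of every 50, so the ClearML
# iteration logger keeps receiving updates
trainer = pl.Trainer(max_epochs=1, log_every_n_steps=1)
data = DataLoader(TensorDataset(torch.randn(64, 4), torch.randn(64, 1)), batch_size=8)
trainer.fit(TinyModel(), data)
```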
Hmm, that should not have an effect on the training time, all logs are sent in the background. That said, checkpoints might slow it a bit (i.e.; i...
JealousParrot68 yes this seems like a correct description.
The main difference between 1 & 2 is the actual data: if this is training/testing data, then a Dataset makes sense; if this is part of a preprocessing pipeline, then artifacts make more sense. (Notice we added pipeline step caching on the artifacts, so that you can reuse steps if they have the same parameters/code, which means you are able to clone a pipeline and rerun it without repeating unnecessary data processing.)
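As a rough sketch of the step-caching part (the project/task names here are hypothetical, and the step template is assumed to exist):
```python
from clearml import PipelineController

pipe = PipelineController(name="prep-and-train", project="examples", version="1.0")

# cache_executed_step=True reuses a previous run of this step when the
# parameters and code are identical, so the data processing is not repeated
pipe.add_step(
    name="preprocess",
    base_task_project="examples",
    base_task_name="preprocess template",
    cache_executed_step=True,
)
pipe.start()
```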
Hi DilapidatedDucks58 ,
I'm not aware of anything of this nature, but I'd like to get a bit more information so we could check it.
Could you send the web-server logs ? either from the docker or the browser itself.
Hi DrabCockroach54
Notice the free GPU memory is global (hence low), but the memory usage (at least with new NVIDIA drivers) is per process. I'm assuming the process using the memory is not a subprocess? Could that be? What's the OS you are running on?
Hi SpicyCrab51 ,
Hmm, how exactly is the Dataset opened?
If the Dataset object is alive for 30h it will keep the dataset alive, why isn't it being closed ?
So I can set output_uri = "s3://<bucket_name>/prefix" and the local models will be uploaded to the s3 bucket by ClearML ?
Yes, magic 🙂
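Something like this (the bucket/prefix are placeholders from the question):
```python
from clearml import Task

# any model checkpoint saved locally by the framework
# will be uploaded to the bucket automatically
task = Task.init(
    project_name="examples",
    task_name="train",
    output_uri="s3://<bucket_name>/prefix",
)
```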
I think this was the issue: None
And that caused the TF binding to skip logging the scalars, and from that point it broke the iteration numbering and so on.
This code will give you one graph titled "loss" with two series: (1) trains (2) loss
SteadyFox10 I suspect you are correct 🙂
CourageousLizard33 see also section (4) here:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md#launching-the-trains-server-docker-in-linux-or-macos
Could you test with the same file? Maybe timeout has something to do with the file size ?
Many thanks! I'll pass on to technical writers 🙂
PompousParrot44 the fundamental difference is that artifacts are uploaded manually (i.e. a user will specifically "ask" to upload an artifact), models are logged automatically and a user might not want them uploaded (imagine debugging sessions, or testing).
By adding the 'upload_uri' argument, you can tell Trains that you want all models to be automatically uploaded (not just logged).
Now here is the nice thing, when running using the trains-agent, you can have:
Always upload the mod...
what does it mean to run the steps locally?
start_locally: means the pipeline code itself (the logic that runs/controls the DAG) runs on the local machine (i.e. no agent), but this control logic creates/clones Tasks and enqueues them; for those Tasks you need an agent to execute them
run_pipeline_steps_locally=True: means the Tasks the pipeline creates, instead of being enqueued and run by an agent, are launched on the same local machine (think debugging, other...
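To make the two modes concrete, a minimal sketch (the pipeline name and steps are hypothetical):
```python
from clearml import PipelineController

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0")
# ... pipe.add_step(...) / pipe.add_function_step(...) calls go here ...

# control logic runs here; the steps are enqueued for agents to execute:
#   pipe.start_locally()
# or: control logic AND the steps all run on this machine (handy for debugging):
pipe.start_locally(run_pipeline_steps_locally=True)
```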
Thanks for checking @<1545216070686609408:profile|EnthusiasticCow4> stable release will be out soon
In order for the sample to work you have to run the template experiment once. Then the HP optimizer will find the best HP for it.
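For context, a minimal sketch of such an optimizer setup (the task ID, parameter name, and metric are placeholders):
```python
from clearml.automation import (
    HyperParameterOptimizer,
    UniformParameterRange,
    RandomSearch,
)

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",  # the template experiment, run once beforehand
    hyper_parameters=[
        UniformParameterRange("General/lr", min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title="validation",
    objective_metric_series="loss",
    objective_metric_sign="min",
    optimizer_class=RandomSearch,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```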
I'm assuming this is related to this thread:
None
SoggyBeetle95 maybe it makes sense to configure the agent with access-all credentials? wdyt?
CloudyHamster42 FYI the warning will not be shown in the next Trains version, the issue is now fixed, thank you 🙂
Regarding the double axes, see if adding plt.clf() helps. It seems the axes are leftover from the previous figure and somehow are still there...
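For example (the plotted values here are just placeholder data):
```python
import matplotlib.pyplot as plt

plt.plot([0.9, 0.5, 0.3])  # first figure
plt.show()

plt.clf()                  # clear the leftover axes before the next figure
plt.plot([0.8, 0.6, 0.4])  # second figure, now on clean axes
plt.show()
```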
Okay we have located the issue, thanks guys! We will push a patch release hopefully later today
as a backup plan: is there a way to have an API key set up prior to running docker compose up?
Not sure I follow, the ClearML API key pair is persistent across upgrades, and the storage access tokens are unrelated (i.e. also persistent). What am I missing?
Not really sure that's easily done... I mean you could query the data, but I'm not sure how you would import it. BTW, why would you move from Pro to self-hosted?