And when running `get`, the files on the parent dataset will be available as links.
BTW: if you call get_mutable_copy() the files will be copied, so you can work on them directly (if you need)
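For reference, a minimal sketch of that flow with the `Dataset` API (the dataset name/project here are hypothetical placeholders; in recent clearml versions the mutable call is `get_mutable_local_copy()`):
```py
from clearml import Dataset

# hypothetical dataset name/project, replace with your own
dataset = Dataset.get(dataset_name="my_dataset", dataset_project="my_project")

# read-only cached copy: files from the parent dataset appear as links
read_only_folder = dataset.get_local_copy()

# mutable copy: files are actually copied, so you can modify them directly
mutable_folder = dataset.get_mutable_local_copy(target_folder="/tmp/my_dataset_copy")
```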
Correct 🙂
I have to admit mounting it to a different drive is a good reason to bring this feature back, the reasoning was it means the agent needs to make sure it manages them (e.g. multiple agents running on the same machine)
Maybe there is a setting in docker to move the space used to a different location?
Not that I know of...
I can simply increase the storage of the first disk, no problem with that
probably the easiest 🙂
But as you described, it looks like an edge case, so I don't mind 🙂
JitteryCoyote63 I think that with 0.17.2 we stopped mounting the venv build to the host machine. Which means it is all stored inside the docker.
Yes (mine isn't and it is working 🙂)
Oh, yes, that might be (the threshold is 3 minutes if there are no reports) but you can change that: `task.set_resource_monitor_iteration_timeout(seconds_from_start=10)`
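A minimal usage sketch (the project/task names are placeholders), lowering the threshold from the default 3 minutes to 10 seconds:
```py
from clearml import Task

task = Task.init(project_name="examples", task_name="resource monitor demo")
# report machine stats even if no scalars were reported in the first 10 seconds
# (default threshold is 3 minutes)
task.set_resource_monitor_iteration_timeout(seconds_from_start=10)
```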
Hi DangerousDragonfly8
is it possible to somehow extract the information about the experiment/task whose status has changed?
From the docstring of `add_task_trigger`:
```py
def schedule_function(task_id):
    pass
```
This means you are getting the Task ID that caused the trigger, now you can get all the info that you need with `Task.get_task(task_id)`
```py
def schedule_function(task_id):
    the_task = Task.get_task(task_id)
    # now we have all the info on the Task tha...
```
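Putting it together, a hedged sketch of wiring that callback into a trigger (the project name and trigger status here are assumptions):
```py
from clearml import Task
from clearml.automation import TriggerScheduler

def schedule_function(task_id):
    # called with the ID of the Task that fired the trigger
    the_task = Task.get_task(task_id)
    print(the_task.name, the_task.status)

scheduler = TriggerScheduler()
scheduler.add_task_trigger(
    schedule_function=schedule_function,
    trigger_project="examples",       # hypothetical project to watch
    trigger_on_status=["completed"],  # fire when a Task completes
)
scheduler.start()
```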
I just set the git credentials in the `clearml.conf` and it works out of the box
git has issues with passing the user/token from the main repo to the submodules, hence my surprise that it is working out-of-the-box.
Do notice that if you are using an ssh-key this is a non-issue.
Nope, no `.netrc` defined anywhere, ...
If this is the case, can you try to add the following to your `extra_vm_bash_script`:
```
echo machine example.com > ~/.netrc && echo log...
```
Yes that's the part that is supposed to only pull the GPU usage for your process (and sub processes) instead of globally on the entire system
containing the Extension module
Not sure I follow, what is the Extension module? What were you running manually that is not just `pip install /opt/keras-hannd`?
I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from
This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
Which is just the default name Pytorch gives the layer
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
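For example, a quick way to list the default names PyTorch assigns to layers (a standalone sketch, not taken from the linked example):
```py
import torch.nn as nn

model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))
# named_modules() yields (name, module) pairs; these default names are
# what the serving configuration has to reference
for name, module in model.named_modules():
    print(repr(name), module.__class__.__name__)
```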
it appears I need to convert it into TorchScript?
Yes, this ...
The main reason we need the above mentioned functionality is because there are some experiments that need to run for a long time. Let's say weeks.
Good point!
We need to temporarily pause (kill or something else) the running HPO task and reassign the resource for other needs.
Oh I see now....
Later, when the more important experiments have been completed, we can continue the HPO task from the same state.
Quick question: when you say the HPO Task, you mean the HPO controller logic Task...
okay that makes sense, if this is the case I would just use `clearml-agent execute --id <task_id here>` to continue the training Task.
Do notice you have to reload your last checkpoint from the Task's models/artifacts to continue 🙂
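A rough sketch of that reload step (the task ID is a placeholder, and this assumes the checkpoint was registered as an output model on the Task):
```py
import torch
from clearml import Task

# placeholder ID of the training Task being continued
task = Task.get_task(task_id="aabb11")

# grab the last checkpoint registered as an output model on the Task
last_model = task.models["output"][-1]
checkpoint_path = last_model.get_local_copy()
state_dict = torch.load(checkpoint_path)
```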
Last question: what is the HPO optimization algorithm? Is it just grid/random search, or Optuna/BOHB? If it is the latter, how do you make it "continue"?
should reload the reported scalars
Exactly (notice it also understands when the last scalars were reported, so it should automatically increase the iteration count, i.e. you will not accidentally overwrite previously reported scalars)
and the task needs to reload last checkpoints only, right?
Correct 🙂
We didn't figure out the best way of continuing for both the grid and optuna. Can you suggest something?
That is a good point, not sure if we have a GH issue for that, but wo...
Hi ElegantCoyote26
is there a way to get a Task's docker container id/name?
you mean like `Task.get_task("task_id_here").get_base_docker()`?
Now a Task's results page also has a plot for this, but I guess it's at the machine level and not the task level?
This is actually on the container level, meaning checked from inside the container. It should be what you are looking for
Correct (basically pip freeze results)
If you have a requirements file then you can specify it: `Task.force_requirements_env_freeze(requirements_file='requirements.txt')`
If you just want the `pip freeze` output to be shown in your "Installed Packages" section then use: `Task.force_requirements_env_freeze()`
Notice that in both cases you should call the function before you call `Task.init()`
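i.e. the call order matters, something like this (project/task names are placeholders):
```py
from clearml import Task

# must come before Task.init()
Task.force_requirements_env_freeze(requirements_file='requirements.txt')

task = Task.init(project_name="examples", task_name="freeze demo")
```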
btw, what do you mean by "Packages will be installed from projects requirements file" ?
SmarmyDolphin68, all looks okay to me...
Could you verify you still get the plot on debug samples as image with the latest trains RC: `pip install trains==0.16.4rc0`
If this is the case, PyTorch really messed things up; this means they removed packages.
Let me check something
Hi JitteryCoyote63 ,
The easiest would probably be to list the experiment folder, and delete its content.
I might be missing a few things but the general gist should be:
```py
from trains.storage import StorageHelper

h = StorageHelper.get('s3://my_bucket')
files = h.list(prefix='s3://my_bucket/task_project/task_name.task_id')
for f in files:
    h.delete(f)
```
Obviously you should have the right credentials 🙂
Could it be it was never allocated to begin with ?
is it planned to add a multicursor in the future?
CheerfulGorilla72 can you expand? What do you mean by multicursor?
Is Task.current_task() creating a task?
Hmm it should not, it should return a Task instance if one was already created.
That said, I remember there was a bug (not sure if it was in a released version or an RC) that caused it to create a new Task if there isn't an existing one. Could that be the case ?
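A quick sanity check of the expected behavior (project/task names are placeholders):
```py
from clearml import Task

assert Task.current_task() is None  # no Task created yet in this process

task = Task.init(project_name="examples", task_name="current task demo")
assert Task.current_task() is task  # returns the existing Task instance
```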
is it also possible to somehow propagate ssh keys to the agent pod? Not sure how to approach that
I would use the k8s secrets manager to do that (there is a way to mount secret files into a pod, and SSH keys are fairly standard to handle this way)
Hi MelancholyChicken65
I'm assuming you need the ssh protocol, not an https user/token; set this one to true 🙂 `force_git_ssh_protocol: true`
https://github.com/allegroai/clearml-agent/blob/76c533a2e8e8e3403bfd25c94ba8000ae98857c1/docs/clearml.conf#L39
You can however pass a specific Task ID and it will reuse it: `reuse_last_task_id="aabb11"`, would that help?
Hmm I'm sorry, it might be `continue_last_task`, can you try: `Task.init(..., continue_last_task="aabb11")`
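i.e. a hedged sketch (the ID and names are placeholders):
```py
from clearml import Task

# resume reporting into an existing Task instead of creating a new one
task = Task.init(
    project_name="examples",
    task_name="demo",
    continue_last_task="aabb11",  # Task ID from the previous run
)
```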
Hi PanickyMoth78
```py
torch.save(net.state_dict(), PATH)  # auto-uploads to GCS

# get all the models from the Task
output_models = Task.current_task().models["output"]

# get the last one
last_model = output_models[-1]

# set meta-data
last_model.set_metadata(key="my key", value="my value", type="str")
```
Hi FierceHamster54
Thanks for bringing it up 🙂
... in terms of secret management/key-value stores
Currently the open-source version does not include Vault support (i.e. secret management); this was added to the enterprise version a few versions ago, and as far as I understand it is a per user/project/company granularity feature (i.e. company-wide vaults merging with project vaults merging with user-specific vaults).
Is this what you are looking for or am I missing something ?