Reputation
Badges 1
25 × Eureka!Create one experiment (I guess in the scheduler)
task = Task.init('test', 'one big experiment')
Then make sure the the scheduler creates the "main" process as subprocess, basically the default behavior)
Then the sub process can call Task.init and it will get the scheduler Task (i.e. it will not create a new task). Just make sure they all call Task init with the same task name and the same project name.
Hi @<1567321739677929472:profile|StoutGorilla30>
Is it necessary to serve keras model using triton engine?
It is not, but it is the most efficient way to serve keras models, and this is why by default clearml-serving is using Nvidia Triton (we are talking 10x factors)
I would start with the keras example, see that it works and then work your way into your example (notice you always need to provide the layers form the in/out of the model)
[None](https://github.com/allegroai/clearml-s...
I ran the test, but there was no result.
what do you mean by no result, no data after the new query?
Hi @<1694157594333024256:profile|DisturbedParrot38>
You mean how to tell the agent to pull only some submodules of your git?
If this is the case you can actually remove them on your git branch, submodule is a file with a soft link. Wdyt?
That is quite neat! You can also put a soft link from the main repo to the submodule for better visibility
I double checked the code it's always being passed 😞
Yes I was thinking a separate branch.
The main issue with telling git to skip submodules is that it will be easily forgotten and will break stuff. BTW the git repo itself is cached so the second time there is no actual pull. Lastly it's not clear on where one could pass a git argument per task. Wdyt?
It will not create another 100 tasks, they will all use the main Task. Think of it as they "inherit" it from the main process. If the main process never created a task (i.e. no call to Tasl.init) then they will create their own tasks (i.e. each one will create its own task and you will end up with 100 tasks)
PanickyMoth78 RC is outpip install clearml==1.6.3rc1
🤞
Hi TrickyRaccoon92
BTW: checkout the HP optimization example, it might make things even easier 🙂 https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
You can disable it with:
Task.init('example', 'train', auto_connect_frameworks={'pytorch': False})
nfs version 3
That's the thing, NFS will automatically set file access and flags based on the mount options you cannot change them post mount.
How about creating a new user just for the agent, it makes sense from security / credentials perspective
SmarmySeaurchin8 what's the mount command you are using?
Hi MiniatureShells8
The torch.save triggers the model creation.
If you are using the same filename, then the same model in the system will be used.
New filenames will create new models.
What exactly is your use case ?
Why? The task should have completed successfully, how is this aborting?
Early stopping by the HPO process, like hyper-band, e.g. this training model is going nowhere let's stop it.
You actually have to login/ssh under said user, have another dedicated mountpoint and spin the agent from that user.
Yes, it could, crontab uses the user it is running from (root if used with sudo)
They all "inherit" the same user / environment from one another
Thanks @<1523703472304689152:profile|UpsetTurkey67>
I'm pretty sure it has!
Let me check how we can merge it into the cleamrl-agent, sounds good?
why would root cause the user to become nobody with group nogroup?
It is exactly the case, they inherit the cron service user (uid/gid) which would look like nobody/nogroup
. Is there any known issue with amazon sagemaker and ClearML
On the contrary it actually works better on Sagemaker...
Here is what I did on sage maker, created:
created a new sagemaker instance opened jupyter notebook Started a new notebook conda_python3 / conda_py3_pytorchIn then I just did "!pip install clearml" and Task.init
Is there any difference ?
DeterminedToad86 I suspect that since it was executed on sagemaker it registered a specific package that is unique for Sagemaker (no to worry installed packages can be edited after you clone/reset the Task)
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
Hmm that is odd, it seemed to missed the fact this is a jupyter notbook.
What's the clearml version you are using ?
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I...
Nicely done DeterminedToad86 🙂
Wasn't this issue resolved by torch?
Hi DeterminedToad86
I just verified on a clean sagemaker instance everything should just work, see here: https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution Yes if you have more than one file (either notebook or python script) than you must have a git repo, in order to run the task using the Agent.