Yes, this is exactly how the ClearML k8s glue works (notice that the resource allocation, spinning nodes up/down, is done by k8s, which can sometimes take a while). If you only need "bare metal" nodes on the cloud, it might be more efficient to use the AWS autoscaler, which essentially does the same thing.
Please feel free to do so (always better to get it from a user not the team behind the product 😉 )
Hi FiercePenguin76
Maybe it makes sense to use
schedule_function
I think you are correct. This means the easiest would be to schedule a function, and have that function do the Task cloning/en-queuing. wdyt?
As a side note, maybe we should have the ability to pass a custom function that returns a Task ID. The main difference is that the Task ID that was created will be better logged / visible (as opposed to the schedule_function, where the fact there was a Task that was created / ...
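Something along these lines, for illustration (the template Task ID, queue name and the exact add_task() arguments are assumptions, so double-check against the TaskScheduler in your clearml version):
```
# rough sketch: the scheduled function clones a template Task and enqueues the clone
from clearml import Task
from clearml.automation import TaskScheduler

def clone_and_enqueue():
    template = Task.get_task(task_id="TEMPLATE_TASK_ID")  # placeholder ID
    cloned = Task.clone(source_task=template)
    Task.enqueue(cloned, queue_name="default")  # placeholder queue

scheduler = TaskScheduler()
# assuming add_task() accepts a schedule_function callable and cron-like timing args
scheduler.add_task(schedule_function=clone_and_enqueue, minute=30)
scheduler.start()
```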
Hi FlutteringWorm14
Is there some way to limit that?
What do you mean by that? are you referring to the Free tier ?
BTW: latest PyCharm plugin with 2022 support was just released:
https://github.com/allegroai/clearml-pycharm-plugin/releases/tag/1.1.0
I cannot test it at the moment, hence my question.
JuicyFox94 any chance you can blindly approve ?
if in the "installed packages" I have all the packages installed from the requirements.txt than I guess I can clone it and use "installed packages"
After the agent finishes installing the "requirements.txt", it will put the entire "pip freeze" back into the "installed packages". This means that later we will be able to fully reproduce the working environment, even if packages change (which will eventually happen, as we cannot expect everyone to constantly freeze versions).
My problem...
not sure what is the "right way" 🙂
But I do pkill -f "trains-agent --gpus 0"
This will kill a process that was started with "trains-agent --gpus 0". Notice it matches the command pattern, so it has to match the way you executed the agent. You can check it with ps -Af | grep trains-agent
Thanks ShortElephant92 ! PR looks good, I'll ask the guys to take a look
I'm running agent inside docker.
So this means venv mode...
Unfortunately, right now I can not attach the logs, I will attach them a little later.
No worries, feel free to DM them if you feel this is to much to post them here
Hi FierceHamster54
Sure, just do:
dataset = Dataset.get(dataset_project="project", dataset_name="name")
This will by default fetch the latest version
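For example (the project / dataset names are placeholders):
```
from clearml import Dataset

# the latest version is returned when no dataset_id / version is specified
dataset = Dataset.get(dataset_project="project", dataset_name="name")
local_path = dataset.get_local_copy()  # cached, read-only local copy of the dataset
print(local_path)
```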
The cool thing about using the trains-agent is that you can change any experiment parameter and automate the process, so you get hyper-parameter optimization out of the box, and you can build complicated pipelines:
https://github.com/allegroai/trains/tree/master/examples/optimization/hyper-parameter-optimization
https://github.com/allegroai/trains/blob/master/examples/automation/task_piping_example.py
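For reference, a rough sketch of the optimizer setup (the base task ID, parameter paths, metric names and queue are all placeholders; the older trains package exposes the same classes under trains.automation):
```
from clearml.automation import (
    HyperParameterOptimizer, UniformParameterRange, DiscreteParameterRange,
)

optimizer = HyperParameterOptimizer(
    base_task_id="TEMPLATE_TASK_ID",  # the experiment to clone and mutate
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[32, 64, 128]),
    ],
    objective_metric_title="validation",  # metric reported by the experiment
    objective_metric_series="loss",
    objective_metric_sign="min",
    execution_queue="default",            # agents listening on this queue run the trials
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```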
Fully automatic, just have them defined and call Task.init; everything else will work out of the box.
Notice the Env will override clearml.conf, so you can have clearml.conf with other default values inside the container, and have the Env override the definition
(not to worry, it is not a must to have clearml.conf, it's just a nice way to add default values)
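For example, something like this (the values are placeholders; normally you would set these in the container environment rather than in code, os.environ is used here just to keep the illustration in one file):
```
import os

# assumption: these are set before Task.init() runs, e.g. via the container env
os.environ["CLEARML_API_HOST"] = "https://api.clear.ml"
os.environ["CLEARML_WEB_HOST"] = "https://app.clear.ml"
os.environ["CLEARML_FILES_HOST"] = "https://files.clear.ml"
os.environ["CLEARML_API_ACCESS_KEY"] = "<access_key>"
os.environ["CLEARML_API_SECRET_KEY"] = "<secret_key>"

from clearml import Task
task = Task.init(project_name="examples", task_name="env override demo")
```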
Hi JitteryCoyote63
I think there is a GitHub issue (feature request) for it; this is not trivial to build (basically you need the agent to first temporarily pull the git repo, apply the changes, build the docker image, remove the temp build, and restart with the new image).
Any specific reason for not pushing a docker image, or using the extra docker bash script on the Task itself?
CheerfulGorilla72
upd: I see NaN in the tensorboard, and 0 in ClearML.
I have to admit, since NaNs are actually skipped in the graph, should we actually log them ?
Hi GloriousPenguin2 , Sorry this is a bit confusing. Let me expand:
When converting into a plotly object (the default), you cannot really control the dimensions of the plot in the UI programmatically; you can however drag the separator and expand the width / height. If you pass the argument report_image=True to report_matplotlib_figure, it will create a static image from the matplotlib figure (as rendered locally) and use that as the figure. This way you get exactly WYSIWYG, but the...
Agreed, MotionlessCoral18 could you open a feature request on the clearml-agent repo please? (I really do not want this feature to get lost, and I'm with you on its importance, let's make sure we have it configured from the outside)
...And I saw that it uploads the notebook itself as a notebook. Is that normal? Is there a way to disable it?
Hi FriendlyElk26
Yes, this is normal: it backs up your notebook as well as converts it into python code (see "Execution - uncommitted changes") so that later the clearml-agent will be able to run it for you on remote machines.
You can also use task.connect({"param": "value"})
to expose arguments to use in the notebook so that later you will be able to change them from the U...
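For example (the parameter names / values are placeholders):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="notebook demo")
params = {"param": "value", "learning_rate": 0.001}
# connect() registers the dict; when the clearml-agent reruns the notebook,
# values edited in the UI are fed back into this dict
params = task.connect(params)
print(params["learning_rate"])
```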
Thanks GorgeousMole24
That is a very good point! passing to product guys
UpsetBlackbird87
pipeline.start()
will launch the pipeline itself on a remote machine (a machine running the services agent).
This is why your pipeline is "stuck": it is not actually running.
When you call start_locally(), the pipeline logic itself runs on your machine and the nodes run on the workers.
Makes sense?
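A minimal sketch of the difference (pipeline / step / queue names are placeholders):
```
from clearml import PipelineController

pipe = PipelineController(name="my pipeline", project="examples", version="1.0")
pipe.add_step(name="step_one", base_task_project="examples", base_task_name="template task")

# Option A: the pipeline logic itself is enqueued and runs on the services agent
# pipe.start(queue="services")

# Option B: the pipeline logic runs on this machine, steps still go to their execution queues
pipe.start_locally()
```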
- Be able to trigger the “pure” function (e.g. train()) locally, without any clearml code running, while driving it from a configuration, e.g. a path to the data.
When you say " without any http://clear.ml code" do mean without the agent, or without using the Clearml.Dataset ?
- Be able to trigger the “clearml decorator” (e.g. train_clearml()) while driving it from configuration, e.g. dataset_id
Hmm I can think of:
```
def train_clearml(local_folder=None, dataset_id=None):
    ...
```
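The snippet above got cut off; purely for illustration, here is roughly what I had in mind (everything beyond train_clearml's signature, including the project / task names and the stand-in train(), is an assumption):
```
from clearml import Dataset, Task

def train(local_folder=None):
    # stand-in for your "pure" training function from the question
    print("training on", local_folder)

def train_clearml(local_folder=None, dataset_id=None):
    # illustrative wrapper: resolve the data location, then call the pure train()
    Task.init(project_name="examples", task_name="train")  # placeholder names
    if local_folder is None and dataset_id is not None:
        # fetch a local copy of the ClearML Dataset when only an ID is given
        local_folder = Dataset.get(dataset_id=dataset_id).get_local_copy()
    train(local_folder=local_folder)
```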
That is odd, can you send the full Task log? (Maybe some oddity with conda/pip ?!)
Hi DeliciousBluewhale87
You can achieve the same results programmatically with Task.create
https://github.com/allegroai/clearml/blob/d531b508cbe4f460fac71b4a9a1701086e7b6329/clearml/task.py#L619
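For example (repo / script values are placeholders; check the Task.create docstring in your version for the full argument list):
```
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="created programmatically",
    repo="https://github.com/user/repo.git",
    branch="main",
    script="train.py",
    packages=["clearml"],
    add_task_init_call=True,  # injects Task.init() if the script does not call it
)
Task.enqueue(task, queue_name="default")  # placeholder queue
```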
…every user in the server has the same credentials, and they don’t need to know them… makes sense?
Makes sense: single credentials for everyone, without the need to distribute them.
Is that correct?
I am actually saving a dictionary that contains the model as a value (+ training datasets)
How are you specifically doing that? pickle?
Because it lives behind a VPN and github workers don’t have access to it
makes sense
If this is the case, I have to admit that combining offline-mode and remote execution makes sense, no?
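For reference, the flow would look roughly like this (paths / names are placeholders):
```
from clearml import Task

# On the machine without server access: record everything locally
Task.set_offline(offline_mode=True)
task = Task.init(project_name="examples", task_name="offline run")
# ... training code ...
task.close()  # writes a local zip of the session; its path is printed in the log

# Later, from a machine that can reach the server:
# Task.import_offline_session(session_folder_zip="/path/to/offline_session.zip")
```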
Hi ZippySheep23
Any ideas what might be happening?
I think you passed the upload limit (2.36 GB) 🙂
So that agents on different nodes will probably require different CUDA-version images.
That makes sense SarcasticSquirrel56
I would edit the helm chart (or deploy manually) based on a selector that will select the different nodes/gpus and assign the correct containers (i.e. matching CUDA versions to the diff GPUs / drivers)
BTW: you can also play around with the k8s glue, which would dynamically spin up pods based on ClearML Tasks.
wdyt?