AbruptHedgehog21 could it be the console log itself is huge?
So if I am not using a remote machine, can I disable this?
yes I think you can, add to your clearml.conf
sdk.development.store_jupyter_notebook_artifact = false
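For reference, that's the dotted form of this nested section in clearml.conf (same setting, just written out HOCON-style):
sdk {
  development {
    # skip uploading the notebook itself as an artifact
    store_jupyter_notebook_artifact: false
  }
}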
BTW: why would you turn it off ?
What do you have in the artifacts of this task id: 4a80b274007d4e969b71dd03c69d504c
Hi TrickyRaccoon92
BTW: checkout the HP optimization example, it might make things even easier 🙂 https://github.com/allegroai/trains/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
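If it helps, a rough sketch of what that example boils down to, using the current clearml.automation names (the linked trains example uses the older trains.automation imports; the task ID, metric names and queue here are placeholders):
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformParameterRange
from clearml.automation.optuna import OptimizerOptuna

# the optimizer itself runs as a controller task
Task.init(project_name="examples", task_name="HP optimizer", task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",      # the experiment to clone and mutate
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
    ],
    objective_metric_title="validation",    # scalar the base task reports
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerOptuna,
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()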
The idea is that it is not necessary: with the trains-agent you can not only launch the experiment on a remote machine, you can also override the parameters, not just cmd line arguments, but any dictionary you connected with the Task or its configuration (see the sketch after the commands below)...
It's dead simple to install:
pip install trains-agent
then you can simply do:
trains-agent execute --id myexperimentid
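To make that concrete, a minimal sketch of the script side (hypothetical parameter names; in newer versions the import is from clearml instead of trains). Whatever you edit in the UI gets injected back into the connected dict when the agent re-executes the task:
from trains import Task  # newer versions: from clearml import Task

task = Task.init(project_name="examples", task_name="param override demo")

# any dict connected to the task shows up in the UI and can be edited there;
# when trains-agent re-runs the task, the edited values replace these defaults
params = {"batch_size": 32, "learning_rate": 0.001}
params = task.connect(params)

print(params["batch_size"], params["learning_rate"])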
Hi BroadMole98 ,
what's the current setup you have? And how do you launch jobs with Snakemake?
yes, that makes sense to me.
What is your specific use case, meaning when/how do you stop / launch the hpo?
Would it make sense to continue from a previous execution and just provide the Task ID? Wdyt?
I did not start with python -m, as a module. I'll try that
I do not think this is the issue.
It sounds like anything you do on your specific setup will end with the same error, which might point to a problem with the git/folder ?
Sounds good to me 🙂
I want to trigger a retrain task when F1
That means that in inference you are reporting the F1 score, correct?
As part of the retraining I have to train all the models and then have to choose best one and deploy it
Are you passing output_uri to Task.init? Are you storing the model as an artifact?
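For example, a minimal sketch (the bucket path is just a placeholder; output_uri tells ClearML where to upload model snapshots automatically):
from clearml import Task

# models saved by the framework get uploaded to this destination automatically
task = Task.init(project_name="retraining", task_name="train", output_uri="s3://my-bucket/models")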
You can tag your model/task with a "best" tag (and untag the previous one). Then in production, look for the "best" task and get its model
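Roughly something like this (a sketch assuming the clearml package and a hypothetical project name; winning_task_id is whatever your comparison logic picked):
from clearml import Task

# after retraining: tag the winner (and untag the previous best if needed)
best_task = Task.get_task(task_id=winning_task_id)
best_task.add_tags(["best"])

# in production: find the task tagged "best" and grab its latest output model
candidates = Task.get_tasks(project_name="retraining", tags=["best"])
model_url = candidates[0].models["output"][-1].url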
Thoughts?
Or you can do:
param={'key': 123}
task.connect(param)
SpotlessFish46 unless all the code is under "uncommitted changes" section, what you have is a link to the git repo + commit id
This is part of a more advanced set of scheduler features, but only available in the enterprise edition 🙂
you mean in the enterprise
Enterprise, with the smarter GPU scheduler. This is an inherent problem of sharing resources: there is no perfect solution. You either have fairness, but then you get idle GPUs, or you have races, where you can get starvation
This only talks about bugs reporting and enhancement suggestions
I'll make sure this is fixed 🙂
JitteryCoyote63 what am I missing?
What are the errors you are getting (with / without the envs)
Yes, hopefully they have a different exception type so we could differentiate ... :) I'll check
Awesome! any way to hear the talk w/o registering for the whole conference?
CloudySwallow27 Anyway we will make sure we upload the talk to the clearml youtube channel after the Talk
So I see this in the build, which means it works and compiles, so what is missing?
Building wheels for collected packages: leap
  Building wheel for leap (setup.py) ... done
  Created wheel for leap: filename=leap-0.4.1-cp38-cp38-linux_x86_64.whl size=1052746 sha256=1dcffa8da97522b2611f7b3e18ef4847f8938610180132a75fd9369f7cbcf0b6
  Stored in directory: /root/.cache/pip/wheels/b4/0c/2c/37102da47f10c22620075914c8bb4a9a2b1f858263021...
What's the error you are getting ?
(open the browser web developer, see if you get something on the console log)
it seems like each task is set up to run on a single pod/node based on attributes like gpu memory, os, num of cores, worker
BoredHedgehog47 of course you can scale on multiple nodes.
The way to do that is to create a k8s YAML with replicas; each pod is actually running the exact same code with the exact same setup. Notice that inside the code itself the DL frameworks need to be able to communicate with one another and b...
Actually this is the default for any multi-node training framework, torch DDP / openmpi etc.
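For reference, a minimal sketch of what that inter-pod communication setup looks like with torch DDP (generic PyTorch, not ClearML-specific; MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE are assumed to be set per pod by the k8s YAML):
import torch.distributed as dist

# each replica/pod reads its rank and the rendezvous address from env vars
dist.init_process_group(
    backend="nccl",
    init_method="env://",  # uses MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE
)
print("rank", dist.get_rank(), "of", dist.get_world_size())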
Exactly !
I think it's inside the container since it's after the worker pulls the image
Oh that makes more sense, I mean it should not build from source, but it makes sense
To avoid the build-from-source issue:
Add the following line to the "Additional ClearML Configuration" section:
agent.package_manager.pip_version: "<21"
You can also turn on venv caching
Add the following line to the "Additional ClearML Configuration" section:
agent.venvs_cache.path: ~/.clearml/venvs-cache
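Putting both together, the "Additional ClearML Configuration" box would look roughly like this (just the two settings from above; the pip version and cache path are up to you):
agent.package_manager.pip_version: "<21"
agent.venvs_cache.path: ~/.clearml/venvs-cache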
I will make sure w...
UnevenDolphin73 are you positive, is this reproducible? What are you getting?