I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
Use the HPO, it is basically doing the same thing with a more sophisticated algorithm (BOHB):
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
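Condensed from the linked example, roughly (the template task ID, parameter names, and queue are placeholders; `OptimizerBOHB` requires `pip install hpbandster`):
```python
from clearml.automation import (
    HyperParameterOptimizer,
    UniformIntegerParameterRange,
    DiscreteParameterRange,
)
from clearml.automation.hpbandster import OptimizerBOHB

optimizer = HyperParameterOptimizer(
    base_task_id="<template-task-id>",  # the task cloned for every trial
    hyper_parameters=[
        UniformIntegerParameterRange("General/layer_1", min_value=128, max_value=512, step_size=128),
        DiscreteParameterRange("General/batch_size", values=[96, 128, 160]),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=OptimizerBOHB,
    max_number_of_concurrent_tasks=4,  # this is the parallelisation knob
    execution_queue="default",
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```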
For example - how would this task-based example be done with pipelines?
Sure, you could do something like:
` from clearml import Pi...
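The snippet above is truncated; a minimal sketch of what a pipeline-based sweep could look like with `PipelineController` (project/task names and the parameter are placeholders):
```python
from clearml import PipelineController

pipe = PipelineController(name="param sweep", project="examples", version="1.0")

# one step per value; steps with no dependencies between them
# are dispatched to the agents in parallel
for i, lr in enumerate([0.1, 0.01, 0.001]):
    pipe.add_step(
        name="train_{}".format(i),
        base_task_project="examples",        # project of the template task
        base_task_name="training template",  # the task cloned per value
        parameter_override={"General/learning_rate": lr},
    )

pipe.start()
```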
CourageousLizard33 so you have a Linux server running Ubuntu VM with Docker inside?
I would imagine that you could just run the docker on the host machine, no?
BTW, I think 8GB is a good recommendation for a VM; it's reasonable enough to start with. I'll make sure we add it to the docs
Probably less secure though :)
Hi SubstantialElk6
ClearML-Serving is already out with a new version; the ETA for the full ClearML-Serving 1.0 (which is the new redesigned version) is the end of May
Hi @<1523715429694967808:profile|ThickCrow29>
Is there a way to specify a callback upon an abort action from the user
You mean abort of the entire pipeline?
Hi MoodyCentipede68, I think I saw something like it, can you post the full log? The triton error is above; also I think it restarted the container automatically and then it worked
NICE! MoodyCentipede68 this is awesome 🙂
MoodyCentipede68 could it be that the model is on one account (workspace) and your credentials (the ones provided to the docker compose) are from another workspace?
The error itself points to the triton helper failing to get the model ID from the backend. The models are uploaded to a specific workspace, and it looks like a mismatch (i.e. the model ID is nowhere to be found), wdyt?
WickedElephant66 this seems like a general network issue, like the docker service missing your company's firewall certificate.
Can you pull any container from Docker Hub?
CooperativeFox72 can you start by checking the latest RC? 🙂 `pip install trains==0.15.2rc0`
in clearml.conf we could have:
```
azure.storage {
    max_connections = 10
    # containers: [
    #     {
    #         account_name: "clearml"
    #         account_key: "secret"
    #         # container_name:
    #     }
    # ]
}
```
Then in `AzureContainerConfigurations`:
```python
class AzureContainerConfigurations(object):
    def __init__(self, container_configs=None, max_connections=None):
        ...

    @classmethod
    def from_config(cls, configuration):
        ...
```
HealthyStarfish45 We are now working on improving the k8s glue (due to be finished next week); after that we can take a stab at Slurm, it should be quite straightforward. Will you be able to help with a bit of testing (setting up a Slurm cluster is always a bit of a hassle 🙂)?
as I also noticed that uploads are sometimes slow, and I see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on `_AzureBlobServiceStorageDriver.upload_object`, plus an additional variable on the `AzureContainerConfigurations` class).
Could you PR a tested draft? We will be able to take it from there.
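For illustration, a rough sketch of the kind of change discussed, assuming the legacy azure-storage SDK (whose `create_blob_from_path` accepts a `max_connections` argument); the attribute names here are illustrative, not the actual clearml internals:
```python
# inside _AzureBlobServiceStorageDriver (sketch only)
def upload_object(self, file_path, container, object_name, **kwargs):
    # pick up the new per-container setting from clearml.conf, default to 2
    max_connections = getattr(container.config, "max_connections", None) or 2
    container.blob_service.create_blob_from_path(
        container.name,
        object_name,
        file_path,
        max_connections=max_connections,  # parallel chunk uploads
    )
```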
ZanyPig66 it sounds like you need to add the docker args for binding; just add the argument `docker_args="-v /mnt/host:/mnt/container"` to `Task.create`.
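For example, a minimal sketch (project/task names and the image are placeholders):
```python
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="training with bind mount",
    docker="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # any image you normally use
    docker_args="-v /mnt/host:/mnt/container",        # bind-mount the host path into the container
)
```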
Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
That's not possible, right?
That's actually what the "start_locally" does, but the missing part is starting it on another machine without the agent (I mean it's totally doable, and if it's important I can explain how, but this is probably not what you are after)
I really need to have a dummy experiment pre-made and have the agent clone the code, set up the env and run everything?
The agent caches everything, and can actually also just skip installing the env entirely, which would mean ...
Hi UptightMouse31
First, thank you 😊
And to your question:
variable in the project is the kpi,
You mean like adding it to the experiment table and getting a kind of leaderboard?
Okay that actually makes sense, let me check I think I know what's going on
Verified, and already fixed with 1.0.6rc2
Hi OutrageousGiraffe8
Does anybody knows why this is happening and is there any workaround, e.g. how to manually report model?
What exactly is the error you are getting? And which clearml version are you using?
Regarding manual Model reporting:
https://clear.ml/docs/latest/docs/fundamentals/artifacts#manual-model-logging
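For reference, a minimal sketch of manual model logging via `OutputModel` (the file name and framework are placeholders):
```python
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual model logging")

# register an existing weights file as an output model of this task
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="model.pt")
```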
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
btw: any specific reason to call current_task after you closed the main Task?
Hi RobustFlamingo1
The ClearML Orchestrator looks interesting. But the website suggests that K8S is required
No, k8s is not a must, only an option 🙂
We have a Linux training box (LambdaBox) where we want to run training. Can we place the ClearML orchestrator agent on the machine without needing K8S?
Yes should be quite easy.
If you intend to use containers, make sure you have docker installed.
Then just `pip install clearml-agent` and configure it:
https://clear.ml/doc...
Only the dictionary keys are returned as the raw nested dictionary, but the values remain casted.
Using which function? `task.get_parameters_as_dict` does not cast the values (the values themselves are stored as strings on the backend); only `task.connect` will cast the values automatically.
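To make the difference concrete, a small sketch (project/task names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="parameter casting demo")

params = {"epochs": 10, "learning_rate": 0.001}
task.connect(params)  # connect() keeps/casts the original Python types

raw = task.get_parameters_as_dict()
# values come back as strings, e.g. raw["General"]["epochs"] == "10"
```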
GiganticTurtle0 adding --stop to the exact daemon execution will stop it (meaning if you have multiple agents on the same machine launched with different parameters, just add the --stop to retire the specific one)
Fixed in pip install clearml==1.8.1rc0
🙂
The current implementation (since 1.6.3 I think) creates the issues in the linked comment (with images to visualize).
Understood, basically the moment we add nested project view to the dataset (and pipelines for that matter, and both are already being worked on), it should solve everything. Is that correct?
Hi UnevenDolphin73
Is there an easy way to add a link to one of the tasks panels? (as an artifact, configuration, info, etc)?
You can add a link as an artifact, that is probably the easiest: `task.upload_artifact(name="just link", artifact_object="<url>")`
EDIT: And follow up regarding the dataset. As discussed somewhere previously, the datasets are now automatically moved to a hidden "sub-project" prefixed with `.datasets`. This creates several annoyances that I...
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Same process?!
Also I'd like to create the queues programmatically, is that possible?
Yes, you can; you can also pass an argument for the agent to create the queue if it does not already exist, just add `--create-queue` to the agent execution command line.
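If it helps, a rough sketch of creating a queue programmatically via the `APIClient` (the queue name is a placeholder):
```python
from clearml.backend_api.session.client import APIClient

client = APIClient()
# create the queue only if it does not exist yet
existing = {q.name for q in client.queues.get_all()}
if "my_new_queue" not in existing:
    client.queues.create(name="my_new_queue")
```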