CooperativeFox72 can you start by checking the latest RC :)
pip install trains==0.15.2rc0
in clearml.conf we could have:
azure.storage {
    max_connections = 10
    # containers: [
    #     {
    #         account_name: "clearml"
    #         account_key: "secret"
    #         # container_name:
    #     }
    # ]
}
Then in AzureContainerConfigurations:
```
class AzureContainerConfigurations(object):
    def __init__(self, container_configs=None, max_connections=None):
        ...

    @classmethod
    def from_config(cls, configuration):
        ...
```
HealthyStarfish45 We are now working on improving the k8s glue (due to be finished next week); after that we can take a stab at SLURM, which should be quite straightforward. Will you be able to help with a bit of testing (setting up a SLURM cluster is always a bit of a hassle 🙂)?
As I also noticed that uploads are sometimes slow, and I see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on _AzureBlobServiceStorageDriver.upload_object, plus an additional variable on the AzureContainerConfigurations class).
Could you PR a tested draft? We will be able to take it from there.
Okay, let's PR this fix?
ZanyPig66 it sounds like you need to add the docker args for binding; just pass Task.create the argument docker_args="-v /mnt/host:/mnt/container"
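For example, a minimal sketch (the project, task, script, and docker image names here are placeholders, not from the original thread):
```
from clearml import Task

# bind-mount /mnt/host on the host into /mnt/container inside the container
task = Task.create(
    project_name="examples",
    task_name="training with host mount",
    script="train.py",
    docker="nvidia/cuda:11.0-base",            # whatever image you normally run with
    docker_args="-v /mnt/host:/mnt/container",
)
```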
Hi CooperativeFox72 trains 0.16 is out, did it solve this issue? (btw: you can upgrade trains to 0.16 without upgrading the trains-server)
That's not possible, right?
That's actually what "start_locally" does, but the missing part is starting it on another machine without the agent (I mean, it's totally doable, and if it's important I can explain how, but this is probably not what you are after)
I really need to have a dummy experiment pre-made and have the agent clone the code, set up the env and run everything?
The agent caches everything, and actually can also just skip installing the env entirely, which would mean ...
Hi UptightMouse31
First, thank you 😊
And to your question:
variable in the project is the kpi,
You mean like adding it to the experiment table and getting a kind of leaderboard?
Okay, that actually makes sense, let me check, I think I know what's going on
Verified, and already fixed with 1.0.6rc2
Hi PompousBeetle71 I'm with SteadyFox10 on this one. Unless you choose a file name based on the epoch or step, you are literally overwriting the model file, which Trains will reflect. If you use the epoch in the filename you will end up with all your models logged by Trains. BTW we are actively working on integration with pytorch ignite, so if you have any suggestions now is the time :)
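For instance, a minimal sketch (the model and loop here are stand-ins, not from the original thread): including the epoch in the checkpoint filename means each save is a new file, so every checkpoint gets logged instead of one repeatedly overwritten model file.
```
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in for your real model
for epoch in range(3):
    # ... training step would go here ...
    torch.save(model.state_dict(), f"model_epoch_{epoch:03d}.pt")
```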
Hi OutrageousGiraffe8
Does anybody knows why this is happening and is there any workaround, e.g. how to manually report model?
What exactly is the error you are getting, and which clearml version are you using?
Regarding manual model reporting:
https://clear.ml/docs/latest/docs/fundamentals/artifacts#manual-model-logging
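A rough sketch of what that looks like with OutputModel (project, task, and file names here are placeholders; see the docs link above for the full flow):
```
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual model report")

# register an existing weights file as this task's output model
output_model = OutputModel(task=task, framework="PyTorch")
output_model.update_weights(weights_filename="model.pt")
```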
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
btw: any specific reason to call current_task after you closed the main Task?
Hi RobustFlamingo1
The ClearML Orchestrator looks interesting. But the website suggests that K8S is required
No, k8s is not a must, only an option 🙂
We have a Linux training box (LambdaBox) where we want to run training. Can we place the ClearML orchestrator agent on the machine without needing K8S?
Yes, that should be quite easy.
If you intend to use containers, make sure you have docker installed.
Then just pip install clearml-agent and configure it:
https://clear.ml/doc...
Only the dictionary keys are returned as the raw nested dictionary, but the values remain casted.
Using which function? task.get_parameters_as_dict does not cast the values (the values themselves are stored as strings on the backend); only task.connect will cast the values automatically.
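A minimal sketch of the difference (project and task names are placeholders; the printed structure is indicative, assuming the parameters land in the default "General" section):
```
from clearml import Task

task = Task.init(project_name="examples", task_name="params casting demo")

params = {"lr": 0.1, "epochs": 10}
task.connect(params)                 # values stay typed (float/int), UI edits are cast back

raw = task.get_parameters_as_dict()  # values come back as strings
print(raw)                           # e.g. {'General': {'lr': '0.1', 'epochs': '10'}}
```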
GiganticTurtle0 adding --stop to the exact daemon execution will stop it (meaning if you have multiple agents on the same machine launched with different parameters, just add the --stop to retire the specific one)
Fixed in pip install clearml==1.8.1rc0
🙂
The current implementation (since 1.6.3 I think) creates the issues in the linked comment (with images to visualize).
Understood, basically the moment we add nested project view to the dataset (and pipelines for that matter, and both are already being worked on), it should solve everything. Is that correct?
Hi UnevenDolphin73
Is there an easy way to add a link to one of the tasks panels? (as an artifact, configuration, info, etc)?
You can add a link as an artifact, that is probably the easiest:
task.upload_artifact(name="just link", artifact_object="")
EDIT: And a follow-up regarding the dataset. As discussed somewhere previously, the datasets are now automatically moved to a hidden "sub-project" prefixed with .datasets. This creates several annoyances that I...
So I'd create the queue in the UI, then update the helm yaml as above, and install? How would I add a 3rd queue?
Same process?!
Also I'd like to create the queues programmatically, is that possible?
Yes, you can. You can also pass an argument for the agent to create the queue if it does not already exist: just add --create-queue to the agent execution command line.
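If you'd rather do it from code, something along these lines should also work (a sketch, assuming the server's queues.create endpoint exposed through the APIClient wrapper; the queue name is a placeholder):
```
from clearml.backend_api.session.client import APIClient

# uses the credentials from your clearml.conf
client = APIClient()
client.queues.create(name="my_third_queue")
```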
Hi GrievingTurkey78
I'm assuming something similar to https://github.com/pallets/click/ ?
Auto-connect and store/override all the parameters?
Was going crazy for a short amount of time yelling to myself: I just installed clear-agent init!
oh noooooooooooooooooo
I can relate so much, it happens to me too often that copy-pasting into bash just uses the unicode character instead of the regular ASCII one
I'll let the front-end guys know, so we do not make ppl go crazy 😉
I could merge some steps, but as I may want to cache them in the future, I prefer to keep them separate
Makes total sense, my only question (and sorry if I'm dwelling too much on it) is how would you pass the data from step 2 to step 3, if this is a different process on the same machine?
Something is off here ... Can you try to run the TB examples and the artifacts example and see if they work?
https://github.com/allegroai/clearml/blob/master/examples/frameworks/tensorflow/tensorflow_mnist.py
https://github.com/allegroai/clearml/blob/master/examples/reporting/artifacts.py
JuicyFox94 maybe you can help here?
How can I make it show progress less often/rewrite?
I'm not sure this is configurable ... you mean the progress reports on the uploads, right? (i.e. a report every 5MB is the default, I think)
while we are at it, maybe we should use tqdm if it is installed
wdyt?