I found "scheduler" on allegroai github, is it something related to the case I want to make?
MoodyCentipede68 it is exactly what you are looking for 🙂
Do note that you need to make sure you have your services queue configured and running for that to work 🙂
Hi AdorableDeer85
I'm sorry, I'm a bit confused here, any chance you can share the entire notebook?
Also, any reason why this is pointing to "localhost" and not the IP/host of the clearml-server? Is the agent running on the same machine?
Wait, so the pipeline step only runs if the pre execute callback returns True? It'll stop if it doesn't run?
Only if you pass a callback function and that callback returns False will the step be skipped (otherwise the step is processed).
Another question, in the parents sequence in pipe.add_step, we have to pass in the name of the step right?
Correct, the step name is a unique identifier for the pipeline
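For illustration, here is a minimal sketch of both points (the project and task names are made up):
```py
from clearml.automation.controller import PipelineController

pipe = PipelineController(name="my_pipeline", project="examples", version="1.0.0")

def maybe_skip(pipeline, node, param_override):
    # returning False skips this step; any other return value lets it run
    return node.name != "step_to_skip"

pipe.add_step(name="step_data", base_task_project="examples", base_task_name="data_prep")
pipe.add_step(
    name="step_train",
    parents=["step_data"],  # parents are referenced by their step names
    base_task_project="examples",
    base_task_name="train",
    pre_execute_callback=maybe_skip,
)
pipe.start()
```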
how would I access the artifact of a previous step within the pre ...
My main query is: do I wait for a sufficient batch size, or do I just send each image to training as soon as it arrives?
This is usually a cost-optimization question. Generally speaking, if GPU uptime is not an issue, the process is stochastic anyhow, so waiting for a full batch or not is not the most important factor (unless you use a batch-norm layer, in which case batching is basically a must).
I would not be able to split the data into train test splits, and that it would be very expensiv...
Perhaps this is something that can be made clearer when updating the docs?
Hmm that is a good point, let's open a git issue and explain there, then update the docs, wdyt?
Where are they stored? I could not find a backend they work with, what am I missing?
Hi DangerousDragonfly8
, is it possible to somehow extract the information about the experiment/task of which status has changed?
From the docstring of `add_task_trigger`:
```py
def schedule_function(task_id):
    pass
```
This means you are getting the Task ID that caused the trigger; you can then get all the info you need with `Task.get_task(task_id)`:
```py
def schedule_function(task_id):
    the_task = Task.get_task(task_id)
    # now we have all the info on the Task tha...
```
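To make that concrete, a minimal sketch of registering such a function with a `TriggerScheduler` (the project name and statuses here are just examples):
```py
from clearml import Task
from clearml.automation import TriggerScheduler

def schedule_function(task_id):
    the_task = Task.get_task(task_id)
    print("triggered by:", the_task.name, the_task.status)

trigger = TriggerScheduler(pooling_frequency_minutes=3)
trigger.add_task_trigger(
    name="on-task-done",
    schedule_function=schedule_function,
    trigger_project="examples",  # hypothetical project name
    trigger_on_status=["completed", "failed"],
)
trigger.start()  # blocks and polls; run it as a service to keep it alive
```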
for example, if I somehow start the execution of an agent task in a specific docker container?)
You mean to specify the container from code? Or to make sure the agent can access a private docker container registry? Or is it for a private PyPI package repository?
Is the agent itself registered on the clearml-server (a.k.a can you see it in the UI?)
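If the goal is specifying the container from code, a minimal sketch (the image and project names are just examples):
```py
from clearml import Task

task = Task.init(project_name="examples", task_name="docker-demo")
# an agent running in docker mode will use this image for the task
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_arguments="--ipc=host",
)
```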
Hi NastyOtter17
"Project" is so ambiguous
LOL yes, this is something GCP/GS is using:
https://googleapis.dev/python/storage/latest/client.html#module-google.cloud.storage.client
ReassuredTiger98 it works on my machine 🙂
WickedGoat98 Same for me, let me ask the UI guys, I think this is a UI bug.
Also maybe before you post the article we could release a fix to both, what do you think?
EDIT:
Never mind 🙂 I just saw the medium link, very cool!!!
These are maybe good features to include in ClearML:
Sure, we should probably add a section to the docs explaining how to do that
Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking I think this is the best approach, it separates the authentication layer from execution ...
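A minimal sketch of such a gateway, assuming FastAPI and PyJWT; the serving URL, secret, and claim names are all hypothetical:
```py
import jwt  # PyJWT
import requests
from fastapi import FastAPI, Header, HTTPException

SERVING_URL = "http://clearml-serving:8080/serve/my_model"  # hypothetical endpoint
JWT_SECRET = "change-me"  # hypothetical shared secret

app = FastAPI()

@app.post("/predict")
def predict(payload: dict, authorization: str = Header(...)):
    # authenticate the tenant before touching the serving endpoint
    try:
        token = authorization.removeprefix("Bearer ")
        claims = jwt.decode(token, JWT_SECRET, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="invalid token")
    # per-tenant routing or rate limits could be applied here via claims["tenant"]
    resp = requests.post(SERVING_URL, json=payload, timeout=30)
    return resp.json()
```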
I wonder if using our own containers, which should have most of the deps, will work better than a simpler container.
Why not? It's transparent, just run in --docker mode and provide a default docker image for when a Task doesn't specify one.
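For example (the image name is just a placeholder):
```
clearml-agent daemon --queue default --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04
```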
Hi RoundSeahorse20
Try the following, let me know if it worked:
```py
import logging

clear_logger = logging.getLogger('clearml.metrics')
clear_logger.setLevel(logging.ERROR)
```
ConvolutedChicken69
, does it take the agent off the queue? does it know it's not available to take tasks?
You mean will it "release" the GPU? (i.e. the agent will pull another Task) ?
If so, then no, it will not. An "Interactive Session" is (from the agent's perspective) a Task that will end at some point, and the agent will continue to monitor and run it until you manually close it. The idea is that you are actually using the GPU, hence no one else can run a job on it.
To shut it down, ...
Hi SmarmySeaurchin8, you can point to any configuration file by setting the environment variable:
```
TRAINS_CONFIG_FILE=/home/user/my_trains.conf
```
well.. having the demo server by default lowers the effort threshold for trying ClearML and getting convinced it can deliver what it promises, and maybe test some simple custom use cases. I
This was exactly what we thought when we set it up in the first place π
(I can't imagine the cost is an issue, probably maintenance/upgrades ...)
There is still support for the demo server, you just need to set the env key:
```
CLEARML_NO_DEFAULT_SERVER=0 python ...
```
In our case, we have a custom YAML instruction `!include`, i.e.
Hmm interesting, in theory this might work since configuration encoding (when passing dicts) is handled with HOCON, which does support referencing.
That said, currently it is not aware of "remote configurations", only ENV variables and local files.
It would be cool to add, do we have a github issue on that? (would you like to see if you can PR such a thing?)
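For reference, HOCON referencing itself looks like this, a minimal sketch using pyhocon (the library behind this encoding):
```py
from pyhocon import ConfigFactory

conf = ConfigFactory.parse_string("""
base_path = /data
train_path = ${base_path}/train
""")
print(conf["train_path"])  # -> /data/train
```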
That said, the arguments are passed inside the executed code (i.e. monkey-patched into the frameworks). This allows it to log and change all the arguments, including the default ones, and allows you to edit them.
Does that make sense ?
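For example, with argparse (a minimal sketch; the project and argument names are made up):
```py
from argparse import ArgumentParser
from clearml import Task

# Task.init hooks argparse, so arguments (defaults included) are logged
# and can be overridden from the UI when executed remotely
task = Task.init(project_name="examples", task_name="argparse-demo")

parser = ArgumentParser()
parser.add_argument("--lr", type=float, default=0.001)
parser.add_argument("--epochs", type=int, default=10)
args = parser.parse_args()
print(args.lr, args.epochs)
```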
Lambdas are designed to be short-lived, I don't think it's a fine idea to run it in a loop TBH.
Yeah, you are right, but maybe it would be fine to launch, have the lambda run for 30-60 sec (i.e. checking idle time for 1 min, stateless, only keeping track inside the execution context), then take it down.
What I'm trying to solve here is (1) a quick way to understand if the agent is actually idling or just between Tasks, and (2) still keep the "idle watchdog" short-lived, so that it can...
can configuration objects refer to one-another internally in ClearML?
Interesting, please explain?
Understood, can you try:
```py
Task.add_requirements("-e path/to/folder/")
```
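Note that `add_requirements` has to be called before `Task.init`. A minimal sketch for context (the project/task names are hypothetical):
```py
from clearml import Task

Task.add_requirements("-e path/to/folder/")  # must come before Task.init
task = Task.init(project_name="examples", task_name="local-package")
```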
Okay, I think this might be a bit of an overkill, but I'll entertain the idea 🙂
Try passing the user as key, and password as secret?
But it should work out of the box ...
Yes it should ....
The user and personal access token are used as is and it propagates down to submodules, since those are simply another git repository.
Can you manually run this successfully:
```
git clone --recursive https://user:token@github.com/company/repo_with_submodules
```
Hmm, you mean like overrides?
Maybe store both before/after resolving?
(Although that might be confusing? As the before-resolve version should actually be read-only)
StickyBlackbird93 the agent is supposed to resolve the correct version of PyTorch based on the CUDA version in the container. It sounds like for some reason it fails? Can you provide the log of the Task that failed? Are you running the agent in docker mode, or inside a docker container?