Hi GreasyPenguin14
clearml-data stores only the difference between versions.
Yes, it is file-level granularity. Meaning if you change a file (regardless of the file type), the new modified file will be stored. Make sense?
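A minimal sketch of what that looks like with the Python Dataset API (project/file names and the parent ID are placeholders):

from clearml import Dataset

# create a new version on top of an existing one
child = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset-v2",
    parent_datasets=["<parent_dataset_id>"],
)
child.add_files(path="data/changed_file.csv")  # only the modified file is uploaded
child.upload()
child.finalize()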
I was wondering what is the use of
PipelineController.create_draft
if you can't use it to clone and run tasks, as we have seen
I think the initial thought was to allow creating a pipeline from a pipeline programmatically. Then once you have the "pipeline" you can manually enqueue it and modify it. Think of a pipeline constructing other pipelines on the fly based on some logic, then launching them in parallel.
Make sense?
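A rough sketch of that pattern (project/task names are placeholders, and the exact enqueue flow may differ between versions):

from clearml.automation import PipelineController

# build a pipeline programmatically
pipe = PipelineController(name="child-pipeline", project="examples", version="1.0.0")
pipe.add_step(
    name="step_one",
    base_task_project="examples",
    base_task_name="train task",
)

# register the pipeline as a draft Task instead of running it;
# it can then be modified and enqueued manually (from the UI or programmatically)
pipe.create_draft()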
Hi CumbersomeSealion22
It starts the pipeline, logs that the first step is started, and then...does nothing anymore.
How many agents do you have running? By default an agent runs one Task at a time (unless executed with --services-mode, which allows it to run an unlimited number of parallel tasks)
I mean test with:
pipe.start_locally(run_pipeline_steps_locally=False)
This actually creates the steps as Tasks and launches them on remote machines
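In other words (a small sketch, assuming a pipe object built as usual with PipelineController and add_step):

# controller logic runs in this process; each step is enqueued for remote agents
pipe.start_locally(run_pipeline_steps_locally=False)

# compare: run everything, steps included, inside the local process
# pipe.start_locally(run_pipeline_steps_locally=True)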
Hi AttractiveWoodpecker16
I think that is the correct channel for that question.
(any chance you can move your thread there?)
Specifically, just email billing@clear.ml and they will cancel (no need to worry about the beginning of the month; just explain and they will not charge for Nov)
EDIT: I know they are working on making it a one-click action in the UI; the main limitation is what happens with data that was stored above the free-tier threshold. Anyhow, I think the next version will sort that out as well.
Hi UpsetBlackbird87
I might be wrong, but it seems like ClearML does not monitor GPU pressure when deploying a task to a worker, but rather relies only on its configured queues.
This is kind of accurate: the way the agent works is that you allocate a resource for the agent (specifically a GPU), then set the queues (plural) it listens to (by default priority ordered). Each agent then individually pulls jobs and runs them on the allocated GPU.
If I understand you correctly, you want multiple ...
GreasyPenguin14 makes total sense.
In that case I would say variants of the accuracy metric make sense to me. I would suggest:
title='trains', series='accuracy/day'
and title='trains', series='accuracy/night'
Regarding hierarchy: from the implementation perspective, a unique identifier is always the combination of title/series (or in other words metric/variant); introducing another level is a system-wide change.
This means it might be more challenging than expected ...
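A quick sketch of reporting those two variants (project/task names, values and iteration are placeholders):

from clearml import Task

task = Task.init(project_name="examples", task_name="accuracy variants")
logger = task.get_logger()

# same title, two series -> two variants on the same scalar graph
logger.report_scalar(title="trains", series="accuracy/day", value=0.91, iteration=1)
logger.report_scalar(title="trains", series="accuracy/night", value=0.87, iteration=1)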
Hi ReassuredTiger98
Are you running the agent in venv mode ?
DefiantHippopotamus88 you can create a custom endpoint and do that, but it will be running in the same instance. Is this what you are after? Notice that Triton actually supports it already; you can check the pytorch example
Hi DisturbedWalrus17
This is a bit of a hack, but it will work:
from clearml.backend_interface.metrics.events import UploadEvent
UploadEvent._file_history_size = 10
Maybe we should expose it somewhere, what do you think?
Yes 🙂 https://discuss.pytorch.org/t/shm-error-in-docker/22755
add either "--ipc=host" or "--shm-size=8g" to the docker args (on the Task, or globally in the clearml.conf extra_docker_args)
notice the 8g depends on the GPU
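For example, setting it per Task (a sketch; in recent clearml versions Task.set_base_docker takes docker_arguments, and the global conf key is, I believe, agent.extra_docker_arguments):

from clearml import Task

task = Task.init(project_name="examples", task_name="shm demo")  # placeholder names
# ask the agent to add these arguments when spinning the docker container
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",  # placeholder image
    docker_arguments=["--ipc=host"],  # or ["--shm-size=8g"]
)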
Finally managed; you keep saying "all projects" but you meant the "All Experiments" project instead. That's a good start
Thanks!
Yes, my apologies you are correct: "all experiments"
As I suspected, from your log:
agent.package_manager.system_site_packages = false
Which is exactly the problem of the missing tensorflow (basically it creates a new venv inside the docker, and without this flag on it does not inherit the docker's preinstalled packages)
This flag should have been true.
Could it be that the clearml.conf you are providing for the glue includes this value?
(basically you should only have the sections that are either credentials or missing from the default, there...
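Something like this is what the flag should look like in the conf (a sketch of the standard agent section layout):

agent {
    package_manager {
        # inherit the docker image's preinstalled packages into the new venv
        system_site_packages: true
    }
}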
Hi ZippyAlligator65
You mean like env vars?
I am running from a notebook and the cell has returned
Well the Task will close when you shut down the notebook 🙂
Hmm, so the concept of company-wide configuration is supported in the enterprise version.
I'm trying to think of a "hack" to just pass these env/conf ...
How are you spinning the agent machines?
GiganticTurtle0 My apologies, I made a mistake, this will not work 😞
In the example above, "step_two" is executed "instantaneously", meaning it just launches the remote task; it does not actually wait for it.
This means an exception will not be raised in the "correct" context (it will actually be raised in a background thread).
That means I think we have to have a callback function, otherwise there is no actual way to catch the failed pipeline task.
Maybe the only re...
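A sketch of what such a callback could look like (I believe add_step accepts a post_execute_callback; treat the exact signature and the project/task names as assumptions):

def step_completed(pipeline, node):
    # node.job holds the launched Task wrapper; check whether it failed
    if node.job and node.job.is_failed():
        print(f"step {node.name} failed")

pipe.add_step(
    name="step_two",
    base_task_project="examples",  # placeholder
    base_task_name="step two",  # placeholder
    post_execute_callback=step_completed,
)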
btw, I launch the agent daemon outside docker (with --docker), that's the way it is supposed to work, right?
Yep that should work
is it?
They could; the problem is that by the time you set them, they have already been read into the variables.
Maybe we should make it lazily loaded; that would also speed up the import.
Hi StrangeStork48
- Agreed,
- Notice this user/pass is only used for the initial authentication; after that, all authentication is done via a signed JWT token.
How about a GitHub issue with the feature request? If there is enough interest (or someone jumps in offering an implementation) we can push it forward. What do you think?
I am not sure what switching back will solve; here the wheel should have been correct, it's just that the architecture of the card is incompatible
So I tested the "old" code that did the parsing and matching, and it did resolve to the correct wheel (i.e. found that there is no 117, only 115, and installed that one)
I think we should switch back, and have a configuration option to control which mechanism the agent uses. wdyt?