
I do it to get the project name
you can still get it from the task object (even after closing it)
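A minimal Python sketch of that pattern (the project/task names are placeholders):
```python
from clearml import Task

# placeholder project/task names
task = Task.init(project_name="my_project", task_name="example")

# ... actual work ...

task.close()

# the in-memory Task object still exposes its project, even after close()
print(task.get_project_name())
```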
another place I was using it was to check whether I am in a pipeline task
Yes that makes sense, this is one of the use cases (to get access to the Task that is currently running). The bug itself will only happen after closing the Task (it needs to clear the OS variable).
You can either upgrade to 1.0.6rc2, or you can hack/fix it with:
os.environ.pop('CLEARML_PROC_MASTER_ID', None)
os.envi...
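For reference, a hedged sketch of that workaround; the snippet above is truncated, so only the variable it shows is popped here:
```python
import os

# remove the leftover master-process variable from the closed Task
# (the original snippet also pops a second, truncated variable)
os.environ.pop('CLEARML_PROC_MASTER_ID', None)
```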
agentservice...
Not related, the agent-services job is to run control jobs, such as pipelines and HPO control processes.
ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again let me know, maybe we will better understand how it happened (i.e. the "master" trains-agent gets stuck for some reason)
This is the clearml python client, no need to change the server
we concluded that we don't want to run it through ClearML after all, so we ran it standalone
out of curiosity, what was the conclusion and why?
GaudyPig83
I think there is some mismatch between the code creating the pipeline and the actual Task?! Could that somehow be the case? "relaunch_on_instance_failure" is a missing argument somehow
can you try to launch the entire Pipeline with the latest RC?
pip3 install clearml==1.7.3rc0
Hi JitteryCoyote63
Is this close ?
https://github.com/allegroai/clearml/issues/283
I ran the test, but there was no result.
what do you mean by no result, no data after the new query?
Hi ThankfulOwl72, check out the TrainsJob object. It should essentially do what you need:
https://github.com/allegroai/trains/blob/master/trains/automation/job.py#L14
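A rough usage sketch, assuming the constructor/launch signatures shown here (the task ID, queue name, and parameter key are placeholders; double-check against job.py above):
```python
from trains.automation.job import TrainsJob

# clone an existing (template) task, override a parameter, and push it to a queue
job = TrainsJob(
    base_task_id='<template_task_id>',
    parameter_override={'General/learning_rate': 0.01},
)
job.launch(queue_name='default')
print(job.status())
```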
It may have been killed or evicted or something after a day or 2.
Actually the ideal setup is to have a single "services" Pod running all these services, with clearml-agent --services-mode. This Pod should always be on and pull jobs from a dedicated queue.
Maybe a nice way to do that is to have the single Task serialize itself, then have a Pod run the Task every X hours and spin it down
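As a very rough sketch of that idea, swapping the periodic Pod for the Task re-enqueuing a clone of itself at the end of each run (the queue name is a placeholder):
```python
from clearml import Task

task = Task.current_task()

# ... the service's periodic work ...

# re-schedule by cloning this task and pushing the clone back to the queue
next_run = Task.clone(source_task=task, name=task.name)
Task.enqueue(next_run, queue_name='services')
```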
So I would like to know what it sends to the server to create the task/pipeline, ...
Task.force_requirements_env_freeze()
This might be very brittle if users are running on a different OS or different Python versions...
I would actually go with:
- If you like poetry, update your lock file in git.
- If you do not use poetry, work on your own branch and delete the poetry lock file.
wdyt?
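For reference, a minimal sketch of where the force_requirements_env_freeze() call mentioned above typically goes (before Task.init; the names are placeholders):
```python
from clearml import Task

# freeze the exact pip environment into the task's installed packages;
# needs to be called before Task.init
Task.force_requirements_env_freeze()

task = Task.init(project_name="my_project", task_name="frozen_requirements")
```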
Hi @<1607909176359522304:profile|UnevenCow76>
I followed the documentation below to implement ClearML monitoring using Prometheus and Grafana
Did you try following this example? It includes both deploying a model and adding Grafana metrics:
None
How did you add the args? Is it argparse? If so, the help is automatically picked up so you can see it in the UI. BTW, the ability to provide a list of options is a really cool feature to have, I'll make sure to pass it to product 😀
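A small sketch of the argparse pattern being referred to (argument names are just examples); the help string is what shows up next to the parameter in the UI:
```python
import argparse
from clearml import Task

task = Task.init(project_name="examples", task_name="argparse_help")

parser = argparse.ArgumentParser()
# the argument and its help text are picked up automatically and shown in the UI
parser.add_argument("--mode", choices=["train", "eval"], default="train",
                    help="Run mode: train a new model or evaluate an existing one")
args = parser.parse_args()
```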
Okay, found the issue. To disable SSL verification globally, add the following env variable:
CLEARML_API_HOST_VERIFY_CERT=0
(I will make sure we fix the actual issue with the config file)
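For example, one way to set it from Python before ClearML is used (exporting it in the shell works just as well; setting it before the import is simply a conservative choice):
```python
import os

# disable SSL certificate verification for the ClearML API client
os.environ["CLEARML_API_HOST_VERIFY_CERT"] = "0"

from clearml import Task  # imported after the variable is set
```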
The docker crashes and I want to be able to debug it exactly as it is run by the agent
On your machine (any machine)
pip install clearml-agent
clearml-agent build --id <taskID> --docker "local_mydocker_name"
docker run -it local_mydocker_name bash
I hope it can run on the same day too.
Fix should be in the next RC 🙂
DM me the entire log, I would assume this is something with the configuration
SmilingFrog76 this is not a weird mechanism at all, this is a proper HPC scheduler 🙂
trains-agent is not actually aware of other nodes; it is responsible for launching a Task on its own hardware (with whatever configuration it was set up with). What can be done is to use the trains-agent inside a 3rd-party scheduler and have the scheduler allocate the node and trains-agent spin up the experiment. There is a k8s example here: basically pulling jobs from the trains-server queue and pushing ...
Hi ReassuredTiger98
An agent's queue priority translates to the order in which the agent will pull jobs from its queues.
Now let's assume we have two agents with priorities A,B for one and B,A for the other. If we only push a Task to queue A, and both agents are idle (implying queue B is empty), there is no guarantee which one will pull the job.
Does that make sense ?
What is the use-case you are trying to solve/optimize for ?
Where can I find information about that? I'd love to join!
This is awesome, we have a few things in mind that we would love to improve. Do you have a lot of experience working with Trains? If you do, what would be most appealing to you?
Is this per Task or for all the Tasks always ?
Could it be someone deleted the file? This is inside the temp venv folder, but it should not get there.
ERROR: Error checking for conflicts. ... AttributeError: _DistInfoDistribution__dep_map
Seems like a pip package install issue of some sort
Yes, albeit not actually "intercept", as the user will be able to directly put Tasks in queues B_machine_a/B_machine_b; but any time the user pushes Tasks into queue B, this service will pull them and push them to the individual machine's queue.
what do you think?
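A very rough sketch of what that relay service could look like; the queue-polling helper is hypothetical (a real version would list queue B's entries via the ClearML API), and using Task.dequeue/Task.enqueue to move the Task between queues is my assumption:
```python
import time
from clearml import Task

MACHINE_QUEUES = ["B_machine_a", "B_machine_b"]


def next_task_id_in_queue(queue_name):
    """Hypothetical helper: return the next queued task ID in `queue_name`."""
    raise NotImplementedError


while True:
    task_id = next_task_id_in_queue("B")
    if task_id:
        task = Task.get_task(task_id=task_id)
        Task.dequeue(task)                                    # pull it out of queue B
        target = MACHINE_QUEUES[hash(task_id) % len(MACHINE_QUEUES)]
        Task.enqueue(task, queue_name=target)                 # push to a per-machine queue
    else:
        time.sleep(30)
```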
I think it's inside the container since it's after the worker pulls the image
Oh that makes more sense, I mean it should not build it from source, but it makes sense
To avoid building from source:
Add the following line to the "Additional ClearML Configuration" section:
agent.package_manager.pip_version: "<21"
You can also turn on venv caching:
Add the following line to the "Additional ClearML Configuration" section:
agent.venvs_cache.path: ~/.clearml/venvs-cache
I will make sure w...
ReassuredTiger98
(for some reason it kind of jumps over PyTorch, but then installs torchvision?!)
Could you run the latest version with --debug?
We just added it, but you will have to install from git:
pip3 install git+
Then run with --debug:
clearml-agent --debug daemon ...