It all depends on how we store the metadata on the performance. You could actually retrieve it from, say, the val metric and deduce the epoch based on that
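For instance, a minimal sketch of pulling the reported scalars back and deducing the epoch from them (the task id and the metric/series names here are assumptions, not from this thread):
from clearml import Task

task = Task.get_task(task_id='aabbcc')  # hypothetical task id
scalars = task.get_reported_scalars()   # {title: {series: {'x': [...], 'y': [...]}}}
series = scalars['val']['loss']         # assumed metric title / series name
# index of the best (lowest) validation value; its x is the iteration/epoch
best = min(range(len(series['y'])), key=series['y'].__getitem__)
print('best epoch:', series['x'][best])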
Is there GPU support?
That basically depends on your template YAML resources; you can have multiple of those, each one "connected" with a different glue pulling from a different queue. This way the user can enqueue a Task in a specific queue, say single_gpu
, then the glue listens on that queue and for each ClearML Task it creates a k8s job with the single GPU as specified in the pod template YAML.
Yes, TrickySheep9, use the k8s glue from here:
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
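As a rough sketch of the idea (the constructor/method arguments here are assumptions based on the example script and may differ between versions, so check the script itself), one glue instance per queue could look like:
from clearml_agent.glue.k8s import K8sIntegration

# assumption: one glue per queue, each with its own pod template YAML
k8s = K8sIntegration(template_yaml='single_gpu_pod.yaml')
k8s.k8s_daemon('single_gpu')  # listen on the 'single_gpu' queue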
and you have clearml v0.17.2 installed on the "system" packages level, and 0.17.5rc6 installed inside the pyenv venv?
Hmm, that is odd, could it be you are changing the sys.path?
(What I'm assuming is happening is that it detects the packages in the PYTHONPATH and for some reason the order is different so it finds the "system" package before the "venv" package, hence the incorrect version)
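A quick way to check, as a sketch: print where the package is actually resolved from:
import sys
import clearml

print(sys.path)             # import order: first match wins
print(clearml.__file__)     # actual location of the imported package
print(clearml.__version__)  # version actually being imported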
SmallBluewhale13
And Task.init registers 0.17.2, even though it prints (while running the same code from the same venv) 0.17.5rc6?
SmallBluewhale13 in your code what are you getting when you print the version:
from clearml import __version__
print(__version__)
Hmmm, are you running inside PyCharm, or similar?
Hi, what is host?
The IP of the machine running the ClearML server
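For reference, a sketch of how that host typically appears in clearml.conf (assuming the default ClearML server ports; replace the IP with your own):
api {
    web_server: http://192.168.1.2:8080
    api_server: http://192.168.1.2:8008
    files_server: http://192.168.1.2:8081
}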
I hope it can run on the same day too.
Fix should be in the next RC 🙂
Hi @<1523701868901961728:profile|ReassuredTiger98>
Could you send the full log? Also, what's the clearml-agent version?
I'm looking into the savefig issue, meanwhile you can disable the popup by adding at the top of your code the following:
import matplotlib
matplotlib.rcParams['backend'] = 'agg'
import matplotlib.pyplot
matplotlib.pyplot.switch_backend('agg')
This is a horrible setup, it means no authentication will pass, it will literally break every JWT authentication scheme
Sure SharpDove45,
from clearml import Model
model = Model('model_id_aabbcc')
model.system_tags += ['archived']
You mean one machine with multiple clearml-agents?
(A worker is a unique ID of an agent, so you cannot have two agents with the exact same worker name.)
Or do you mean two agents pulling from the same queue? (That is supported.)
Not really; the OS will almost never allow for that, since scheduling is actually based on fairness and priority. We can set all the agents to the same low priority; then the OS will always hand over CPU when needed (most of the time it won't need to), and the agents will split the CPUs among them, so no one gets starved 🙂 With GPUs it is a different story: there is no actual context switching or fairness mechanism like there is for CPUs.
The use case I have is to allow people from my team to run their workloads on a set of servers without stepping on each other.
So does that mean CPU only workloads?
Also, are we worried about fairness? (i.e. someone "taking" all the CPUs for themselves)
PompousParrot44 with pleasure. If during your search for a solution you come across something that solves it, and might integrate to the agent, do not hesitate to suggest it :)
Hi PompousParrot44
Well, this kind of control is tricky. If you don't mind processes "fighting over CPU" you can just spin up two trains-agents in cpu-mode. It will work as long as they have a different TRAINS_WORKER_NAME
The other option (might be a bit of an overkill) is to use K8s, which will set the CPU % for the entire agent.
What do you think?
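If it helps, a minimal sketch of the first option (launched from Python purely for illustration; queue name, worker names, and the daemon flags are assumptions, so check trains-agent's --help):
import os
import subprocess

# spin up two cpu-mode agents on the same machine,
# each with a unique TRAINS_WORKER_NAME so they don't collide
for name in ('cpu-worker-1', 'cpu-worker-2'):
    env = dict(os.environ, TRAINS_WORKER_NAME=name)
    subprocess.Popen(
        ['trains-agent', 'daemon', '--queue', 'default', '--cpu-only'],
        env=env,
    )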
PompousParrot44 now that I think about it, you might be able to limit the cpu affinity, would that help?
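As a sketch of the affinity idea (Linux only; the core ids are placeholders):
import os

# pin the current process (e.g. an agent) to cores 0 and 1,
# leaving the remaining cores for the other agent
os.sched_setaffinity(0, {0, 1})
print(os.sched_getaffinity(0))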
CharmingBeetle38 try adding "General/" before the arguments. This means batch_size becomes General/batch_size. This is only because we are accessing the parameters externally; when the task is executed, it is resolved automatically.
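As a sketch (the task id is a placeholder), setting the parameter externally:
from clearml import Task

# setting a parameter from outside the running code requires the section prefix
task = Task.get_task(task_id='aabbcc')  # hypothetical id
task.set_parameters({'General/batch_size': 64})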
Does it handle 2FA if my repo lies on GitHub and my account needs 2FA to sign in?
It does not 😞
BTW: from the instance name it seems like it is a VM with preinstalled PyTorch. Why don't you enable system site packages, so the venv will inherit all the preinstalled packages? It might also save some space 🙂
DeterminedToad86 see here:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L55
Change it in the agent's conf file to:
system_site_packages: true
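In context, the relevant section of clearml.conf looks roughly like this (a sketch; see the linked line for the exact location):
agent {
    package_manager {
        # inherit the preinstalled system packages inside the venv
        system_site_packages: true
    }
}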
LOL, okay, I'm not sure we can do something about that one.
You should probably increase the storage on your instance 🙂
This is exactly what I did here, and it is working 😞
https://demoapp.demo.clear.ml/projects/0e919ea1cc5c499b99e1ab85004b6e97/experiments/887edef09d4549e88b829a34c87d4d5b/output/execution
Basically create a token and use it as user/password
EDIT:
With read-only permissions 🙂
But then there is no need for 2FA for cloning the repo
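A sketch of how that token goes into the agent's clearml.conf (the values are placeholders):
agent {
    # personal access token used instead of the account password,
    # so 2FA is never triggered when the agent clones the repo
    git_user: "my-github-username"
    git_pass: "ghp_xxxxxxxxxxxxxxxx"
}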
Is it possible in ClearML to somehow allocate resources so that, after running a number of Alice's tasks, Bob's tasks get processed (maybe in round-robin fashion)?
Hi DeliciousBluewhale87
A few options here:
1. Set the agent with high / low priority queues. Make sure Alice pushes into low priority (aka HPO), then Bob can push into high priority when he needs to. This makes a lot of sense when you have automation processes spinning many experiments (see the sketch after this list).
2. Expanding (1), you could set differe...
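For option (1), a minimal sketch (queue names and the task id are placeholders; the agent pulls from the queues in the order given, so the first one acts as high priority):
# agent side (run once on the machine):
#   clearml-agent daemon --queue high_priority low_priority

from clearml import Task

# Alice's automation enqueues into the low-priority queue;
# Bob enqueues into 'high_priority' and gets served first
task = Task.get_task(task_id='aabbcc')  # hypothetical id
Task.enqueue(task, queue_name='low_priority')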