Hi SubstantialElk6
We can't seem to find a way for the end user to pass in their git credentials when they run their codes in both agent and non-agent scenarios. Any advice here?
The bottom line is the agent needs to have read-only access to all the repositories so it can launch any Task. I would recommend to create an agent git user with read-only credentials and configure the agent to use it. wdyt?
Hi RipeGoose2
What exactly is being uploaded ? Are those the actual model weights or intermediate files ?
Hmmm, are you running inside pycharm, or similar ?
this issue on when trying to set up on our remote machines
You mean setting up the trains-server on remote machine?
Instead you can do: TRAINS_WORKER_NAME = "trains-agent":$DYNAMIC_INSTANCE_ID
Then the Worker ID will running instance appended to the worker name. This means that even if you use the same $DYNAMIC_INSTANCE_ID twice, you will not have two agent registering on the same name.
MysteriousBee56 I see...
So yes, you can with the APIClient you have full RESTful access to the backend.
I think there was a similar discussion https://allegroai-trains.slack.com/archives/CTK20V944/p1593524144116300
HandsomeCrow5 how did you end up solving it? I think you had a similar use case?!
Thanks MuddyCrab47 !!!
I found it!
It turns out the artifact upload will always upload from stream (aka no multi-upload). I will make sure we fix it in the next RC (I think the plan is to have it out this weekend)
2023-02-15 12:49:22,813 - clearml - WARNING - Could not retrieve remote configuration named 'SSH'
This is fine, it means it uses the default identity keys
The thing is - when I try to connect with normal SSH there are no issues
Now I'm lost, so when exactly do you see the issue ?
CourageousLizard33 Are you using the docker-compose to setup the trains-server?
SteadyFox10 I suspect you are correct ๐
CourageousLizard33 see also section (4) here:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md#launching-the-trains-server-docker-in-linux-or-macos
So I think there are two bugs here?
--args overrides="key=value" does not work request: add --hydra to override hydra arguments (and if this is added the first one is not needed)Is that correct?
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server)echo "vm.max_map_count=262144" > /tmp/99-trains.conf sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf sudo sysctl -w vm.max_map_count=262144 sudo service docker restart
Did you try the above, and you are still getting the same error ?
Lambdaโs are designed to be short-lived, I donโt think itโs a fine idea to run it in a loop TBH.
Yeah, you are right, but maybe it would be fine to launch, have the lambda run for 30-60sec (i.e. checking idle time for 1 min, stateless, only keeping track inside the execution context) then take it down)
What I'm trying to solve here, is (1) quick way to understand if the agent is actually idling or just between Tasks (2) still avoid having the "idle watchdog" short lived, to that it can...
Hi VivaciousWalrus21 I tested the sample code, and the gap was evident in Tensorboard as well. This is not clearml generating this jump this is internal (like the auto de/serialization and continue of the code base)
So as you say, it seems hydra kills these
Hmm let me check in the code, maybe we can somehow hook into it
Many thanks! I'll pass on to technical writers ๐
CourageousLizard33 VM?! I thought we are talking fresh install on ubuntu 18.04?!
Is the Ubuntu in a VM? If so, I'm pretty sure 8GB will do, maybe less, but I haven't checked.
How much did you end up giving it?
I think that clearml should be able to do parameter sweeps using pipelines in a manner that makes use of parallelisation.
Use the HPO, it is basically doing the same thing with some more sophisticated algorithm (HBOB):
https://github.com/allegroai/clearml/blob/master/examples/optimization/hyper-parameter-optimization/hyper_parameter_optimizer.py
For example - how would this task-based example be done with pipelines?
Sure, you could do something like:
` from clearml import Pi...
CourageousLizard33 so you have a Linux server running Ubuntu VM with Docker inside?
I would imagine that you could just run the docker on the host machine, no?
BTW, I think 8gb is a good recommendation for a VM it's reasonable enough to start with, I'll make sure we add it to the docs
Hi YummyFish22
Looks like the task does not have "Task.init" call on the main script (or an import of clearml)? could that be the case?
Probably less secure though :)
Hi SubstantialElk6
ClearML-Serving is already out with a new version, the ETA for the next ClearML-serving full 1.0 (which is the new redesign version) is the end of May
Hi @<1523715429694967808:profile|ThickCrow29>
Is there a way to specify a callback upon an abort action from the user
You mean abort of the entire pipeline?
None
Hi MoodyCentipede68 , I think I saw something like it, can you post the full log? The triton error is above, also I think it restarted the container automatically and then it worked
NICE! MoodyCentipede68 this is awesome ๐
MoodyCentipede68 could it be that the model is on one account (workspace) and your credentials (the ones provided to the docker compose) are from another workspace?
The error itself point to the triton helper failing to get the model ID from the backend. The models are uploaded to a a specific workspace, and it looks like a mismatch (I.e. the model Id is nowhere to be found) wdyt?
WickedElephant66 this seems like a general network issue, like the docker service is missing your companies firewall certificate.
Can you pull any container from docker hub ?