Hi @<1600661423610925056:profile|StrongMouse81>
Using the serving base URL, and also another model endpoint we added with:
clearml-serving model add
we get the attached response:
And other model endpoints are working for you?
I want to be able to delete only the logs since they are taking a lot of space in my case.
I see... I do not think this is possible 😞
You can disable the auto logging though ... pass auto_connect_streams=False to Task.init
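For reference, a minimal sketch of what that call could look like (assuming a clearml SDK version that supports the auto_connect_streams argument; project/task names are made up):

from clearml import Task

# Disable automatic capture of stdout/stderr/logging into the task's console log
task = Task.init(
    project_name="examples",          # hypothetical project name
    task_name="no console logging",   # hypothetical task name
    auto_connect_streams=False,
)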
Then in theory (since the backend is python based) you just need to find a base docker image to build it on.
and those env variables are credentials for ClearML. Since they are taken from k8s secrets, they are the same for every user.
Oh ...
I can create secrets for every new user and set env variables accordingly, but perhaps you see a better way out?
So the thing is, if a User spins the k8s job, the user needs to pass their credentials (so the system knows who it is)... You could just pass the user's key/secret (not nice, but probably not a big issue, as everyone is an Admin anyhow,...
RoughTiger69 what's the clearml version you are using ?
btw: you are running it locally, then enqueuing and running it remotely via the agent ?
which part of the code?
the main script?!
but it is not part of the package
is the repo itself a package ?
Ok, I think I figured it out.
Nice!
ClearML doesn't add all the imported packages needed to run the task to the Installed Packages
It does (but not derivative packages that are used by the required packages; the derivative packages are added when the agent runs it, because it creates a new clean venv, installs the required packages, then updates back with everything in pip freeze, because that now represents all the packages the Task needs)
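As a side note, if a specific package is not detected from the imports, something like this should let you add it explicitly (a sketch, assuming the Task.add_requirements helper in the clearml SDK; the package name/version are made up):

from clearml import Task

# Explicitly add a requirement that is not detected from the imports
# (must be called before Task.init)
Task.add_requirements("pandas", "1.5.3")
task = Task.init(project_name="examples", task_name="explicit requirements")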
Two questions:
Is t...
MuddySquid7
are you saying that for some reason the models pick up the artifacts? Is that reproducible? (they are two different things)
Can you see the df.pkl on the Models section of the Task (in the UI) ?
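For context, artifacts and models are registered through different calls, roughly like this (a sketch; the DataFrame and names are made up):

import pandas as pd
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact vs model")

df = pd.DataFrame({"a": [1, 2, 3]})

# This stores df.pkl as an *artifact* of the task, not as a model
task.upload_artifact(name="df", artifact_object=df)

# Models, by contrast, are normally registered when a framework save call
# (e.g. torch.save / joblib.dump) is auto-logged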
SuccessfulKoala55 please post here once the code is available in your pytorch_ignite 🙂
Basically if I pass an arg with a default value of False, which is a bool, it'll run fine originally, since it just accepted the default value.
I think this is the nargs="?", is that right ?
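For illustration, the kind of argparse definition being discussed probably looks something like this (a sketch; the argument name is made up):

import argparse

parser = argparse.ArgumentParser()
# Boolean argument with a default of False; with nargs="?" the flag can be
# passed without a value, in which case `const` is used instead of the default
parser.add_argument("--use-feature", type=bool, nargs="?", const=True, default=False)
args = parser.parse_args()
print(args.use_feature)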
PompousBeetle71 just making sure, and changing the name solved it?
Hi DrabCockroach54
Do we know if gpu_0_mem_usage and gpu_0_mem_used_gb both show current GPU usage?
the first is the percentage used (memory % used at any specific moment) and the second is the memory used in GiB, both for the video memory
How do I know from this how much GPU is reserved for the task if the task is in progress?
What do you mean by how much is reserved ? Are you running with an agent?
CourageousLizard33 if the two series are on the same graph, just click on the series in the legend, you can enable/disable it, and the scale will adjust automatically.
Regarding grouping, this is a feature that can be turned off. The idea is that we split the tag into title/series... So if you have the same prefix, the TF scalars are grouped on the same graph; otherwise they end up on graphs with different titles. That said, you can force it to have a series per graph like in TB. Makes sense?
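For example, explicit scalar reporting follows the same title/series split (a sketch; titles and series names are made up):

from clearml import Task

task = Task.init(project_name="examples", task_name="scalar grouping")
logger = task.get_logger()

for i in range(10):
    # Same title ("loss"), two series -> both series share one graph
    logger.report_scalar(title="loss", series="train", value=1.0 / (i + 1), iteration=i)
    logger.report_scalar(title="loss", series="validation", value=1.5 / (i + 1), iteration=i)
    # Different title -> a separate graph
    logger.report_scalar(title="accuracy", series="train", value=i / 10.0, iteration=i)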
Where are you seeing this message?
Hmm CourageousLizard33 seems you stumbled on a weird bug,
This piece of code only tries to get the username of the current UID, but since you are running inside a docker container (and probably set the environment UID), there is no "actual" user with that UID in /etc/passwd, so it cannot resolve it.
I'm attaching a quick fix, please let me know if it solved the problem.
I'd like to make sure we have it in the next RC as soon as possible.
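Roughly, the failing lookup is of this form (a sketch to illustrate the failure mode, not the actual trains code):

import os
import pwd

try:
    # Inside the container the process runs with a UID that has no matching
    # entry in /etc/passwd, so this lookup raises KeyError
    username = pwd.getpwuid(os.getuid()).pw_name
except KeyError:
    # A quick fix is to fall back to something safe, e.g. the raw UID
    username = str(os.getuid())
print(username)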
CourageousLizard33 specifically section (4) is the issue (and it's related to any elastic docker, nothing specific to trains-server):
echo "vm.max_map_count=262144" > /tmp/99-trains.conf
sudo mv /tmp/99-trains.conf /etc/sysctl.d/99-trains.conf
sudo sysctl -w vm.max_map_count=262144
sudo service docker restart
Did you try the above, and you are still getting the same error ?
Probably less secure though :)
CourageousLizard33 Are you using the docker-compose to setup the trains-server?
:) yes, on your gateway/firewall set http://demoapi.trains.allegro.ai to resolve to 127.0.0.1. That's always good practice ;)
This doesn't seem to be running inside a container...
What's the clearml-agent launch command you are using ? (i.e. do you have --docker flag)
I understand, but how do you launch the clearml-agent itself:
clearml-agent daemon --detached --queue default --docker
SmallDeer34 No worries, I'm happy to hear the issue disappeared 🙂
SmallDeer34
I think this is somehow related to the JIT compiler torch is using.
My suspicion is that JIT cannot be initialized after something happened (like a subprocess, or a thread).
I think we managed to get around it with 1.0.3rc1.
Can you verify ?
BoredHedgehog47 you need to configure the clearml k8s glue to spin up pods (instead of statically allocating agents per pod), does that make sense ?
Would you have an example of this in your code blogs to demonstrate this utilisation?
Yes! I definitely think this is important, and hopefully we will see something there 🙂 (or at least in the docs)