
help_models is a dir in the git
And the git repo is registered on the experiment correctly?
This is odd. And it is marked as failed?
Are all the Tasks marked failed, or is it just this one?
Thanks @<1527459125401751552:profile|CloudyArcticwolf80>! Let me see if we can reproduce it
Hi MagnificentSeaurchin79
Yes this is a bit confusing 🙂
Datasets are stored as delta changes from parent versions.
A dataset contains a list of files and a list of artifacts in which these files are stored. This means that if we create a new dataset version from a parent and want to add a file, we add a link to the file and create a new artifact containing just the delta (i.e. the new file) relative to the parent version. When you delete a file you just remove the link...
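A minimal sketch of that flow with the Dataset API (project/dataset names are hypothetical; the parent link is what keeps the new version down to just the delta):

```python
from clearml import Dataset

# New version based on a parent: only the added file becomes a new artifact
parent = Dataset.get(dataset_project="examples", dataset_name="my-data")
child = Dataset.create(
    dataset_name="my-data",
    dataset_project="examples",
    parent_datasets=[parent.id],
)
child.add_files("new_file.csv")      # adds a link + a delta artifact
child.remove_files("old_file.csv")   # only removes the link, no new artifact
child.upload()                       # uploads just the delta
child.finalize()
```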
Sure. JitteryCoyote63, so what was the problem? Can we fix something?
Hi FancyChicken53
This is a noble cause you are after 😉
Could you be more specific about what you had in mind? I'll try to find the best example once I have more understanding...
Out of curiosity, what ended up being the issue?
Hi @<1618056041293942784:profile|GaudySnake67>
Task.create is designed to create an external Task, not one from the currently running process. Task.init is for creating a Task from your current code, and this is why it has all the auto_connect parameters. Does that make sense?
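To make the distinction concrete, a minimal sketch (all names are hypothetical; repo/script are just illustrative arguments Task.create accepts):

```python
from clearml import Task

# Task.init: instruments the *current* process, with auto-logging controls
task = Task.init(
    project_name="examples",
    task_name="in-process",
    auto_connect_frameworks={"tensorboard": True},
)

# Task.create: registers an *external* task from a repo/script; it does not
# instrument the code that is currently running
external = Task.create(
    project_name="examples",
    task_name="external",
    repo="https://github.com/user/project.git",
    script="train.py",
)
```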
The only port configurations that will work are 8080 (web), 8008 (API), and 8081 (file server)
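For reference, that maps to the three endpoints in the client-side clearml.conf (a sketch; replace <server> with your host):

```
api {
    web_server: http://<server>:8080     # web UI
    api_server: http://<server>:8008     # REST API
    files_server: http://<server>:8081   # artifacts / file storage
}
```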
So the naming is a byproduct of the many TB event files created (one per experiment); if you add different naming to the TB files, then this is what you'll be seeing in the UI. Make sense?
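For example, with the standard torch.utils.tensorboard API (names here are illustrative; the log_dir / filename_suffix naming is what ends up in the event files and hence in the UI):

```python
from torch.utils.tensorboard import SummaryWriter

# Two writers with distinct names -> two distinctly named TB event files
train_writer = SummaryWriter(log_dir="runs/train")
val_writer = SummaryWriter(log_dir="runs/val", filename_suffix=".val")
train_writer.add_scalar("loss", 0.5, global_step=1)
val_writer.add_scalar("loss", 0.7, global_step=1)
```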
Hmm so you are saying you have to be logged out to make the link work? (I mean pressing the link will log you in and then you get access)
Agent works when I am running it from a virtual environment, but it gets stuck in the same place every time when I use Docker
Can you please provide a log? I'm not sure what "stuck" means here
agentservice...
Not related; the agent-services' job is to run control jobs, such as pipelines and HPO control processes.
Hmm, let me check; there is a chance the level is dropped when manually reporting (it might be reserved for internal critical reports). Regardless, I can't see any reason we could not allow controlling it.
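For context, this is the kind of manual reporting being discussed, a minimal sketch (project/task names are hypothetical; report_text accepts a standard logging level):

```python
import logging
from clearml import Task

task = Task.init(project_name="examples", task_name="log-levels")
# Manually report a console line with an explicit severity level
task.get_logger().report_text("debug details", level=logging.DEBUG)
```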
Hi NaughtyFish36
C++ module fails to import, anyone have any insight? The required C++ compilers seem to be installed in the docker container.
Can you provide log for the failed Task?
BTW: if you need build-essential,
you can add it to the Task startup script: apt-get install -y build-essential
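One way to wire that startup script in from code, a minimal sketch (the image is just an example, and docker_setup_bash_script is the argument name in recent clearml versions; verify against yours):

```python
from clearml import Task

task = Task.init(project_name="examples", task_name="needs-compiler")
# These lines run inside the container before the task's code starts
task.set_base_docker(
    docker_image="nvidia/cuda:11.8.0-runtime-ubuntu22.04",
    docker_setup_bash_script=[
        "apt-get update",
        "apt-get install -y build-essential",
    ],
)
```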
Ohh, that cannot be pickled... how would you suggest storing it in a file?
Hi TrickyFox41
is there a way to cache the docker containers used by the agents
You mean for the apt-get install part, or for the venv?
(the apt packages themselves are cached on the host machine)
For the venv I would recommend turning on the cache here:
https://github.com/allegroai/clearml-agent/blob/76c533a2e8e8e3403bfd25c94ba8000ae98857c1/docs/clearml.conf#L131
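For convenience, the relevant block from that sample clearml.conf (uncommenting path is what actually enables the cache):

```
agent {
    venvs_cache: {
        # maximum number of cached venvs
        max_entries: 10
        # minimum required free space for a cache entry
        free_space_threshold_gb: 2.0
        # unmark to enable virtual environment caching
        # path: ~/.clearml/venvs-cache
    }
}
```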
The 'on-premise' server fails to connect to the ClearML server because of the VPN I think
I think you are correct.
You can quickly test it: try to run curl http://local-server:8008 and see if that works
We have tried to manually restart tasks, reloading all the scalars from a dead task and loading the latest saved torch model.
Hi ThickKitten19
How did you try to restart them? How are you monitoring dying instances? Where / how are they running?
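For the restart itself, a minimal sketch of one approach (hypothetical names; continue_last_task re-attaches to the previous task instead of starting a new one):

```python
from clearml import Task

# Re-attach to the previous (dead) task and keep appending to its scalars
task = Task.init(
    project_name="examples",
    task_name="train",
    continue_last_task=True,
)

# Grab the latest checkpoint the dead run managed to save
last_ckpt = task.models["output"][-1].get_local_copy()
```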
Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
Actually I am as well; this is Kubernetes doing the resource scheduling, and Kubernetes decided it is okay to run two pods on the same GPU, which is cool, but I was not aware NVIDIA had already added this feature (I know it was in beta for a long time)
https://developer.nvidia.com/blog/improving-gpu-utilization-in-kubernetes/
I also see they added dynamic slicing and Memory Protection:
Notice you can control ...
P.S. Any chance you can get me the NVIDIA driver version? I can't seem to find the one for v22 on Amazon
For model upload and registration, should I pass something like 'xgboost': False, or 'xgboost': False, 'scikit': False?
Exactly! Which framework are you using?
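That maps to the auto_connect_frameworks argument of Task.init, e.g. (project/task names are hypothetical):

```python
from clearml import Task

# Disable automatic model upload/registration for specific frameworks only
task = Task.init(
    project_name="examples",
    task_name="no-auto-models",
    auto_connect_frameworks={"xgboost": False, "scikit": False},
)
```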
About 2: I refer to the names of the models.
Hmm, that is a good point to test. Usually this is based on the Task name (I think), so if the Task name contains the HPO params, it should be the same in the model name. Do you see the HPO params in the Task name? Should we open a Gi...
SmallDeer34 in theory there is no reason it will not work with it.
If you are doing a single node (from Ray's perspective), this should just work; the challenge might be multi-node ray+clearml, as you will have to use clearml to set up the environment and ray as the messaging layer (think openmpi etc.)
What did you have in mind?
I see. If you are creating the task externally (i.e. from the controller), you should probably call task.close(); it will return when everything is in order (including artifacts uploaded, and other async stuff).
Will that work?
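A minimal sketch of that controller-side flow (names are hypothetical):

```python
from clearml import Task

# Controller side: the task is created externally, not from this process
task = Task.create(project_name="examples", task_name="external-task")
# ... enqueue it, monitor it, attach artifacts ...

# close() returns only once uploads and other async operations have finished
task.close()
```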
This is a horrible setup; it means no authentication will pass, and it will literally break every JWT authentication scheme
Hi GrittyCormorant73
When I archive the pipeline and go into the archive and delete the pipeline, the artifacts are not deleted.
Which clearml-server version are you using? The artifact delete was only recently added
Nothing that can't be worked around, but for automation I don't think creating a TriggerScheduler with an existing name should be allowed
DangerousDragonfly8 I think I understand: basically you are saying that the fact a user can create two triggers with the same name can create some confusion?
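For reference, a minimal sketch of the flow in question (all names and IDs are placeholders; note nothing currently prevents reusing an existing trigger name):

```python
from clearml.automation import TriggerScheduler

scheduler = TriggerScheduler(pooling_frequency_minutes=3.0)
scheduler.add_task_trigger(
    name="retrain-on-failure",     # not checked for uniqueness
    trigger_project="examples",
    trigger_on_status=["failed"],
    schedule_task_id="aabbccdd",   # placeholder: task to clone & enqueue
    schedule_queue="default",
)
scheduler.start()
```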
It also sucks a bit that each TriggerScheduler will run in its own pod in Kubernetes.
Actually this depends on how you spin it; you can actually spin a services agent running multiple...
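Presumably services mode, where a single agent runs several control tasks concurrently; a sketch of how such an agent is typically launched:

```
clearml-agent daemon --queue services --docker --services-mode
```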