Yup, in this case it wasn't root. Removing that USER directive and the -u flag in pip solves the problem. However, in our production images, we are required to remove root access.
```
FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive
RUN apt-get update && apt-get install -y \
    python3-opencv ca-certificates python3-dev git wget sudo ninja-build
RUN ln -sv /usr/bin/python3 /usr/bin/python

# create a non-root user
ARG USER_ID=1000
RUN useradd -m --no-log-init --system --uid ${USER_ID} a...
```
Hi CostlyOstrich36, thanks. I will check with the Enterprise team then.
Do you mean that you want to be able to seamlessly deploy models that were tracked with the ClearML experiment manager using ClearML Serving?
Ideally, yes. Imagine that I used spaCy (among other frameworks) and I just need to add the one or two lines of ClearML code to my Python scripts to get experiment tracking. Then when it comes to deployment, I don't have to worry about spaCy having a model format that Triton doesn't recognise.
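(For context, a minimal sketch of what those "one or two lines" look like; the project and task names here are placeholders, not anything from this thread:)
```python
from clearml import Task

# The only addition needed in an existing training script; ClearML
# then auto-logs console output, git info, and installed packages.
task = Task.init(project_name="nlp", task_name="spacy-train")
```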
Do you want clearml serving ...
Hi AgitatedDove14, that's what I am trying to figure out as well. The task has nothing to do with torch, and the requirements.txt doesn't have any torch packages either.
Can I dig into MongoDB or ES to pull this data?
That didn't work either...
Unfortunately it's not. The problem previously encountered with the docker method surfaced again. In this case, the BASE DOCKER IMAGE
`nvidia/cuda:10.1-runtime-ubuntu18.04 --env GIT_SSL_NO_VERIFY=true`
is not taking effect with the k8s glue.
Hi AgitatedDove14, I was referring to
```
task.set_base_docker("nvcr.io/nvidia/tensorflow:19.11-tf2-py3 --env TRAINS_AGENT_GIT_USER=git_username_here --env TRAINS_AGENT_GIT_PASS=git_password_here")
```
The above gives the error:
```
skipping docker argument TRAINS_AGENT_GIT_USER=git_username_here (only -e --env supported) TRAINS_AGENT_GIT_PASS=git_username_here (only -e --env supported)
```
The first stage is a rank0 PyTorch script. The downstream stages are rankN scripts; they are waiting for the IP address of the first stage. But the first stage doesn't return, it simply waits for the rankN scripts to connect to it. In this case, though, the rankN scripts don't start. So it's probably necessary to have just a single stage.
If I were to start a single rank0 task and subsequent rankN tasks, it would be rather messy on the ClearML Dashboard. Best to have either a single clearml application...
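(A minimal sketch of the handshake being described, assuming the standard torch.distributed TCP rendezvous; the port and env var names are illustrative:)
```python
import os
import torch.distributed as dist

# rank 0 opens the rendezvous endpoint and blocks; init_process_group()
# only returns once every rank in world_size has connected, which is
# why the first stage never "returns" until the rankN tasks start.
dist.init_process_group(
    backend="nccl",  # assumes GPU workers; use "gloo" for CPU-only
    init_method=f"tcp://{os.environ['MASTER_ADDR']}:29500",
    rank=int(os.environ["RANK"]),
    world_size=int(os.environ["WORLD_SIZE"]),
)
```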
Thanks TimelyPenguin76, is there an env var for the S3 connection as well?
I did notice that in the tmp folder, .clearml_agent.xxxxx.cfg does not exist.
Ok. That brings me back to the spawned pod. At this point, clearml-agent and its config would be a contributing factor. Is the absence of /tmp/.clearml_agent.xxxxxx.cfg an issue?
Sorry AgitatedDove14, I missed your reply. So this means that in the community version, when I have an experiment using ClearML and it uses the ClearML Datasets SDK, the dataset id that was used will not be reflected on the ClearML experiment in any way, thus making it impossible to establish data lineage/provenance (e.g. linking the data used to the experiment). This feature is however available in the Enterprise version as HyperDatasets. Am I correct?
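(A hedged workaround sketch for that lineage gap: record the dataset id on the task yourself. The project/task names and dataset id below are placeholders, and this only makes the link searchable; it isn't the HyperDatasets feature:)
```python
from clearml import Task, Dataset

task = Task.init(project_name="demo", task_name="train")
ds = Dataset.get(dataset_id="<dataset-id>")  # placeholder id
# Manually stamp the dataset id onto the experiment so the
# data-to-experiment link is at least recorded and searchable.
task.set_parameter("dataset_id", ds.id)
local_path = ds.get_local_copy()
```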
Code example:
```
from clearml import Task, Logger
tas...
```
This would be solved if `--env GIT_SSL_NO_VERIFY=true` were passed to the k8s pod that's spawned to run the job. Currently it's not.
I think a related question is: ClearML relies heavily on Triton (a good thing), but Triton only supports a few frameworks out of the box. So this 'engine' needs to make sure it can work with Triton and use all its wonderful features such as request batching, GPU reuse, etc.
Hi.
We tried as advised above and it still didn't work.
Host: `http://ecs.ai:443`
`output_uri = S3://ecs.ai:443/bucketname`
This time round the client gave this error:
`botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: 'http://ecs.ai/bucketname/.clearml.test'.`
It's quite apparent that whatever ClearML passes to boto3 ends up as an http call instead of https, which is wrong.
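(A hedged way to isolate this: hit the same endpoint directly with boto3 over explicit https, bypassing ClearML. The endpoint and bucket are taken from above; credentials are assumed to be in the usual AWS env vars or config:)
```python
import boto3

# If this succeeds while ClearML fails, the bucket is reachable and
# the problem is ClearML building an http:// URL for the endpoint.
s3 = boto3.client("s3", endpoint_url="https://ecs.ai:443")
s3.head_bucket(Bucket="bucketname")
```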
Having the same issues. Looks like Google DNS can't resolve the DNS at all.
```
% nslookup app.clear.ml - 8.8.8.8
Server:   8.8.8.8
Address:  8.8.8.8#53

** server can't find app.clear.ml: SERVFAIL
```
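(For a quick cross-check from Python using the machine's default resolver rather than 8.8.8.8, a minimal sketch:)
```python
import socket

# Raises socket.gaierror if the local resolver also fails to
# resolve the host; prints the resolved addresses otherwise.
print(socket.getaddrinfo("app.clear.ml", 443, proto=socket.IPPROTO_TCP))
```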
Ok thanks, that explains a lot. We have been doing this wrongly the whole time, thinking that the clearml.conf on the client side would be acknowledged by the remote agent execution. In reality, only the api section is used.
Hi, so you meant I need to install virtualenv in my base image?
AgitatedDove14, will these be fixed?
- Passing env via the code
- Passing env via template yaml
Yeah that'll cover the first two points, but I don't see how it'll end up as a dataset catalogue as advertised.
The server is running only the ClearML components. Could you advise on the ELB part: how should we optimise it?
Thanks SuccessfulKoala55. I can try my hand at a patch. But the pod spinning is handled by the k8s glue, which has no link to the client side. How should the client pass the key over to the k8s glue at runtime via the ClearML server?
Ah ok, so if I see Jax's workspace on https://app.community.clear.ml/dashboard , then I'm on the right track? How regularly does this reset, then?
I also see this in my logs, noting that the config is read in but it's still printing the supposedly hidden keys in the logs and UI:
```
agent.hide_docker_command_env_vars.enabled = true
agent.hide_docker_command_env_vars.extra_keys.0='TRAINS_AGENT_GIT_USER'
.....
docker_cmd=harbor.ai/public/detectron2:v3 --env TRAINS_AGENT_GIT_USER=gituser
```
Ok, that seems clearer, thanks.
Hi, by deployment strategies I meant canary, blue-green, etc. I figured this should be handled by clearml-serving, and maybe Seldon as well.
Thanks. Which brings me to the question: how does ClearML deal with all the CVEs? What is your process for responding to them?