Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Referencing your suggestion, we can configure output_uri on task.set_base_docker(), but how should we do this for the credentials?
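A minimal sketch of what I mean, assuming the legacy single-string form of set_base_docker and that per-user credentials can be injected as container environment variables (the image, bucket, key, and secret values are all placeholders):
```python
from clearml import Task

# output_uri points artefact uploads at our S3-compatible store (placeholder bucket).
task = Task.init(
    project_name="my-project",
    task_name="my-task",
    output_uri="s3://ecs.ai:443/bucketname",
)

# Hypothetical: pass each user's storage credentials as docker env vars,
# so boto3 inside the k8s glue pod picks them up instead of relying on the
# agent machine's clearml.conf. Values are placeholders, not real keys.
task.set_base_docker(
    "nvidia/cuda:10.2-devel-ubuntu18.04"
    " -e AWS_ACCESS_KEY_ID=<user-key>"
    " -e AWS_SECRET_ACCESS_KEY=<user-secret>"
)
```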
Hi, I changed it, but it still points to https://files.pythonhosted.org/packages .
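For reference, a sketch of the kind of change being discussed, assuming agent.package_manager.extra_index_url in clearml.conf is the relevant setting; note that extra_index_url only adds an index, so pip can still fetch wheels from the default files.pythonhosted.org host unless the default index itself is overridden (mirror URL is a placeholder):
```
agent {
    package_manager {
        # Hypothetical local mirror; this *adds* an index,
        # it does not replace the default PyPI index.
        extra_index_url: ["https://my-local-mirror.example/simple"]
    }
}
```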
Ok thanks. This would mean increasing the disk space for my ClearML server is the only option, as we are not at liberty to delete anything.
Hi, clearml-agent==0.17.2rc3 did work. I'm on a 1.19 k8s cluster, and got this error when a task was pulled. Is the glue not compatible with 1.19?
```
Pulling task 3a90802d1dfa4ec09fbccba0beffbaa8 launching on kubernetes cluster
Pushing task 3a90802d1dfa4ec09fbccba0beffbaa8 into temporary pending queue
Kubernetes scheduling task id=3a90802d1dfa4ec09fbccba0beffbaa8
kubectl output:
Flag --replicas has been deprecated, has no effect and will be removed in the future.
Flag --generator has been depre...
```
Hi AgitatedDove14, that's what I am trying to figure out as well. The task has nothing to do with torch, and the requirements.txt doesn't have any torch packages either.
AlertBlackbird30, actually the log says 10.2: `docker_cmd = nvidia/cuda:10.2-devel-ubuntu18.04 -e GIT_SSL_NO_VERIFY=true`
I meant the dataset id.
Ok thanks.
Congrats on v1.0. 🎉
So the clearml-agent daemon needs higher privileges?
I managed to find out why. The Docker image I'm using does not run as the root user, hence the error. But I'm wondering why this is the case, as Docker best practices do indicate we should use a non-root user in production images.
Clearing the cache entirely works. Thanks.
```
Executing task id [228caa5d25d94ac5aa10fa7e1d02f03c]:
repository = https://192.168.50.88:18443/tkahsion/pytorchmnist
branch = master
version_num = cfb833bcc70f3e10d3b6a96cfad3225ed682382b
tag =
docker_cmd = nvidia/cuda:10.1-runtime-ubuntu18.04
entry_point = pytorch_mnist.py
working_dir = .
Warning: could not locate requested Python version 3.9, reverting to version 3.6
Using base prefix '/usr'
New python executable in /root/.clearml/venvs-builds/3.6/bin/python3.6
Also creating executable i...
```
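A possible fix for the Python 3.9 fallback above, assuming the agent's clearml.conf python_binary setting applies and that the docker image actually ships a 3.9 interpreter (the path is a placeholder):
```
agent {
    # Hypothetical path; point the agent at the Python 3.9 binary inside
    # the image so it stops reverting to the system Python 3.6.
    python_binary: "/usr/bin/python3.9"
}
```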
No, I didn't mention this particular issue on the GitHub issue. Only the apply template.yml problem is on the issue.
Nice, what are the names of the talks?
Hi, thanks. How about the Agent: does its docker mode or k8s mode require docker.sock to be exposed?
It would make sense on a very large resource cluster. Unfortunately we have fewer than 50 GPUs to share. A multi-tenant SaaS would cut the resources into even smaller clusters and would not help with efficiency. Or would you have a suggestion?
Hi.
We tried as advised above and it still didn't work.
Host: http://ecs.ai:443
output_uri = S3://ecs.ai:443/bucketname
This time round the client gave this error:
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: 'http://ecs.ai/bucketname/.clearml.test'
It's quite apparent that whatever ClearML passes to boto3 ends up as an HTTP call instead of HTTPS, which is wrong.
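A sketch of one way to force HTTPS, assuming the sdk.aws.s3.credentials section of clearml.conf and its secure flag apply to this S3-compatible endpoint (host and bucket taken from above; key and secret are placeholders):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # S3-compatible endpoint from above; secure: true should
                    # make boto3 use https instead of http.
                    host: "ecs.ai:443"
                    key: "<access-key>"      # placeholder
                    secret: "<secret-key>"   # placeholder
                    multipart: false
                    secure: true
                }
            ]
        }
    }
}
```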
Going back to the open source version, I think that adding the credentials as part of the source code might allow the credentials to auto-populate as part of the remote execution, wdyt?
Not sure how this will work when I can't supply the credentials to ClearML programmatically.
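For illustration, a minimal sketch of the "credentials in source code" idea, assuming Task.set_credentials is called before Task.init so the remote run doesn't depend on the agent machine's clearml.conf (all hosts and keys below are placeholders):
```python
from clearml import Task

# Hypothetical: inject per-user server credentials from code before Task.init,
# instead of relying on the agent machine's clearml.conf.
Task.set_credentials(
    api_host="https://api.clearml.example.com",
    web_host="https://app.clearml.example.com",
    files_host="https://files.clearml.example.com",
    key="<user-access-key>",
    secret="<user-secret-key>",
)

task = Task.init(project_name="my-project", task_name="my-task")
```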
Ah ok. So it will be fixed in the ClearML Server web UI as well? (See screenshots.)
Do you have more info on vault?
Actually it only makes sense if the entire department or organisation is saving their models in a common repo. In our case this is not possible due to client security (e.g. training data from clients could potentially be 'reverse engineered' from trained models in the future), so each department, and even each project, will need its own repo.
Ok thanks, that explains a lot. We have been doing this wrong the whole time, thinking that the clearml.conf on the client side would be honoured by the remote agent execution. In reality, only the api section is used.
The first line is to make sure kubectl is connected to k8s.
Ok, that seems clearer, thanks.
Does the bash script need clearml-agent to be able to communicate with the HTTPS clearml-server first? If yes, there's a chicken-and-egg problem here.
Hi,
Basically I run this block first and end the script:
```
task = Task.init(project_name="afro-nmt", task_name=args.taskname, continue_last_task=args.taskid)
Logger.current_logger().report_scalar(title="BLEU", series="JW300", value=args.jwbleu, iteration=args.lastiter)
```
Then I run another script, with a different series:
```
task = Task.init(project_name="afro-nmt", task_name=args.taskname, continue_last_task=args.taskid)
Logger.current_logger().report_scalar(title="BLEU", series="SS900", value=arg...
```
The first is probably done using pipeline controllers, and the second using Datasets or HyperDatasets. It's not very clear how the last one is achieved, especially the searchable data catalogs.
The agent is running on a disconnected server in docker mode. I have a client that runs clearml-session, and I saw from the agent's logs that the installation of VS Code fails.