Yeap. Thanks - I've already done that and faced another issue - https://github.com/allegroai/clearml/issues/740
Now model stays local and is not uploaded to s3.
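Until that issue is resolved, one workaround is to set a default output URI in `clearml.conf` so every task uploads its models to the bucket. A minimal sketch - the host and bucket name here are placeholders for your MinIO setup:

```
sdk {
    development {
        # hypothetical MinIO endpoint and bucket - adjust to your setup
        default_output_uri: "s3://mydomain.com:9000/clearml-models"
    }
}
```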
Hi SweetBadger76
So - I have turned off SSL for minio and tried a test script for uploading those two artifacts.
The result is that it works - the file got uploaded to a bucket.
Although it has taken a long time to finish the upload, and the files are less than 1 MB:
```
$ python3 test.py
ClearML Task: overwriting (reusing) task id=72e7c0b098e14197a9ffe82d7444337f
ClearML results page:
2022-06-10 14:14:00,894 - clearml.Task - INFO - Waiting to finish uploads
2022-06-10 14:14:11,888 - clearml.T...
```
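For reference, a minimal sketch of such a test script - the project name, task name, artifact contents, and bucket URI are all placeholders, and it assumes a reachable ClearML server with the MinIO credentials configured in `clearml.conf`:

```python
def upload_test_artifacts(bucket_uri):
    """Upload two small artifacts to the given S3/MinIO bucket."""
    # Imported lazily so the sketch can be read without clearml installed.
    from clearml import Task

    task = Task.init(project_name="minio-tests",
                     task_name="artifact upload test",
                     output_uri=bucket_uri)
    task.upload_artifact(name="config", artifact_object={"threshold": 0.5})
    task.upload_artifact(name="data_file", artifact_object="data.csv")
    task.flush(wait_for_uploads=True)  # block until uploads complete
    return task
```

Calling `upload_test_artifacts("s3://mydomain.com:9000/clearml-artifacts")` should then make the two files visible in the bucket.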
Thanks Jake SuccessfulKoala55 !
I used to have problems with ClearML agents and multi-GPU training - I've put that on hold.
Now my problem is with ClearML serving.
I have managed to run a demo https://clear.ml/docs/latest/docs/clearml_serving/clearml_serving_tutorial
But had problems:
```
clearml-serving --id c605bf64db3740989afdd9bee87e6353 model add --engine sklearn --endpoint "test_model_sklearn" --preprocess "examples/sklearn/preprocess.py" --name "initial model training" --p...
```
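For what it's worth, the `--preprocess` script is expected to expose a `Preprocess` class mapping requests to model inputs and back. A sketch of the general shape only - the field names are invented, and the exact signatures should be checked against the clearml-serving examples:

```python
class Preprocess:
    """Hypothetical request/response mapping for the sklearn endpoint."""

    def preprocess(self, body, state, collect_custom_statistics_fn=None):
        # Map the incoming JSON body to the model's feature vector.
        return [[body["x0"], body["x1"]]]

    def postprocess(self, data, state, collect_custom_statistics_fn=None):
        # Wrap the raw model output in a JSON-serializable dict.
        return {"y": data.tolist() if hasattr(data, "tolist") else data}
```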
```
$ clearml-agent --version
CLEARML-AGENT version 1.2.3
```
Yeap. It is configured this way:
```
force_git_ssh_protocol: true
```
But I don't see the mount of .ssh
One thing though - my container is running on behalf of a non-root user.
Here are my docker mounts:
```
docker_internal_mounts {
    sdk_cache = /clearml_agent_cache
    # apt_cache = /var/cache/apt/archives
    ssh_folder = /home/testuser/.ssh
    pip_cache = /home/testuser/.cache/pip
    poetry_cache = /home/testuser/.cache/pypoetry
    vcs_cache = /home/testuser/.clearml/vcs-cache
    venv_build =...
```
This time it runs smoothly - here's the output:
```
Local file not found [torch @ file:///home/testuser/.clearml/pip-download-cache/cu113/torch-1.11.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Local file not found [torchvision @ file:///home/testuser/.clearml/pip-download-cache/cu113/torchvision-0.12.0%2Bcu113-cp39-cp39-linux_x86_64.whl], references removed
Adding venv into cache: /home/nino/.clearml/venvs-builds/3.9
Running task id [b15553c045ab4c3283bbdb040ec19f1f]:
[src/models]...
```
I am using
WebApp: 1.5.0-186
Server: 1.5.0-186
API: 2.18
On client side:
clearml==1.4.1
clearml-agent==1.2.3
It was some issue with the server - after restarting all seems to work.
AgitatedDove14 thanks - that makes it clear.
In particular there are some optional dependencies - in my case I am using `pandas.read_excel`, and it requires `openpyxl`.
Meanwhile I have changed a script and inserted an explicit import - now agent makes this install.
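Instead of the explicit import, `Task.add_requirements` can also register the package for the agent; it has to be called before `Task.init`. A sketch - the project and task names are placeholders:

```python
def init_task_with_optional_deps():
    # Imported lazily so the sketch can be read without clearml installed.
    from clearml import Task

    # Must be called before Task.init so the requirement is recorded
    # for the agent, even though the script never imports it directly.
    Task.add_requirements("openpyxl")
    return Task.init(project_name="my-project", task_name="xlsx processing")
```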
Are you running in docker mode? The venv inside the docker inherits all the installed packages - how come it is missing?
My fault - I was in pip mode...
Hi David. Sorry, I got stuck with the agent in docker mode training on multiple GPUs. Will get that sorted and finish the MinIO stuff.
So I have switched back to ssl to give a try to the script again - and it works with ssl now.
I even have tried it with big files - still works.
SweetBadger76 thanks for giving a hand - don't know what was the issue but now that works.
thanks a bunch - that has worked just fine
Yeap. That's an Arch thing, but in the case of Arch `--gpus` is not enough.
One thing though - I am running agent on behalf of a regular user.
Made an upgrade to the latest version from 1.5 and have stumbled upon an issue with the webserver:
I am saving all artefacts to a custom S3 server. It used to work fine - saving and downloading them from the webserver. Now I cannot download anything that resides on S3 - I'm getting the following errors in the browser console:
```
Unable to parse "https None " as a whatwg URL.
ERROR EndpointError: Custom endpoint https [None](//storage.yandexcloud.net) was not a valid URI
```
Back at 1....
Well actually I have tried a different approach and it works.
```python
task = Task.init(project_name=args['cml_project_name'],
                 task_type=TaskTypes.data_processing,
                 task_name=f'Dataset for {os.path.basename(OBJECT_NAME)}',
                 tags=args['cml_tags'].split(','),
                 output_uri=args['cml_output_uri'],
                 auto_connect_frameworks=True)
dataset = Dataset.create(
    dataset_n...
```
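A fuller sketch of that approach past the truncation - the argument values are placeholders, and it assumes a reachable ClearML server:

```python
def create_dataset(name, project, files_path, output_uri):
    """Create, fill, and finalize a ClearML dataset version."""
    # Imported lazily so the sketch can be read without clearml installed.
    from clearml import Dataset

    dataset = Dataset.create(dataset_name=name,
                             dataset_project=project,
                             output_uri=output_uri)
    dataset.add_files(path=files_path)
    dataset.upload()    # push the file contents to output_uri
    dataset.finalize()  # mark this dataset version as immutable
    return dataset
```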
and this is inside a container to check that the package is installed:
```
$ docker run -it --rm torch2022 pip show torch
Name: torch
Version: 1.11.0
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page:
Author: PyTorch Team
Author-email: packages@pytorch.org
License: BSD-3
Location: /opt/conda/lib/python3.8/site-packages
Requires: typing_extensions
Required-by: torchmetrics, pytorch-lightning, torchvision, torchtext, torchelastic
```
I build my own ...
SweetBadger76 thanks for looking into this. Here's a screenshot displaying files in ClearML that should be available in MinIO. I can see them in ClearML (I refer to this as ClearML metadata), but when I press the link it redirects me to MinIO and shows that the file is not there. Also, when I explore MinIO with the console, I don't see those files there. But notebooks and datasets get uploaded just fine.
Currently I have the following config re S3:
```
aws {
    s3 {
        # default, used for any bucket not specified below
        key: ""
        secret: ""
        region: ""
        credentials: [
            {
                # This will apply to all buckets in this host (unless key/value is specifically provided for a given bucket)
                host: "mydomain.com:9000"
                key: "minio"
                secret: "secret data"
                ...
```
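One way to check those credentials outside the web UI is `StorageManager`, which uses the `aws.s3` section of `clearml.conf` for the matching host. A sketch - the file and bucket paths are placeholders:

```python
def check_bucket_access(local_file, remote_uri):
    """Round-trip a file through the bucket to verify the S3 credentials."""
    # Imported lazily so the sketch can be read without clearml installed.
    from clearml import StorageManager

    uploaded = StorageManager.upload_file(local_file=local_file,
                                          remote_url=remote_uri)
    # If credentials or the endpoint are wrong, this is where it fails.
    return StorageManager.get_local_copy(remote_url=uploaded)
```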
Hi Martin. Sorry - missed your reply.
Yeap, I am aware that `docker_internal_mounts` is inside the agent section.
Here is the actual docker command from the log:
```
INFO Executing: ['docker', 'run', '-t', '--gpus', '"device=0"', '-v', '/tmp/ssh-XXXXXXnfYTo5/agent.8946:/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-e', 'SSH_AUTH_SOCK=/tmp/ssh-XXXXXXnfYTo5/agent.8946', '-l', 'clearml-worker-id=agent-gpu:gpu0', '-l', 'clearml-parent-worker-id=agent-gpu:gpu0', '-e', 'CLEARML_WORKER_ID=agent-gpu:gpu0', '-e', 'CLEARM...
```
Hi AgitatedDove14
Thanks for the update.
Well, it's a pain... I use specifically the PyTorch docker image, and the agent will still download it?
My image is built based on `FROM pytorch/pytorch:1.11.0-cuda11.3-cudnn8-devel`
And a portion of the agent log on top of that image:
```
Package(s) not found: torch
Torch CUDA 113 download page found
Found PyTorch version torch==1.11.0 matching CUDA version 113
Package(s) not found: torchvision
Found PyTorch version torchvision==0.12.0 matching CUDA version 113
Co...
```
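One agent-side setting that can help here is letting the task venv reuse the packages already baked into the image, so torch is found instead of re-downloaded. A hedged sketch of the relevant `clearml.conf` fragment:

```
agent {
    package_manager {
        # let the task venv see the torch/torchvision already in the docker image
        system_site_packages: true
    }
}
```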
Hi David,
In my case I have a remote minio with ssl enabled - do you want me to run a local one with HTTP to test if all works fine in that config?
Thanks @<1523701087100473344:profile|SuccessfulKoala55>
I've looked into the docker-compose and found a new image, `async_delete`.
Not sure what it does or whether I should include it in the upgraded installation.
If I do - there is a parameter `CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"`
I guess I should set it to the fileserver in the case of a single docker-compose?
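In a single docker-compose, that parameter would presumably land as an environment entry on the new service. A sketch only - the service definition is abbreviated and should be checked against the shipped docker-compose.yml:

```yaml
  async_delete:
    # ...image, networks, depends_on etc. as in the shipped docker-compose.yml...
    environment:
      CLEARML__services__async_urls_delete__fileserver__url_prefixes: "[${CLEARML_FILES_HOST:-}]"
```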
Thanks @<1523703436166565888:profile|DeterminedCrab71>
I've tried that - it does not work - I have a valid endpoint in the settings but a missing colon in the JS console.
Waiting for a fix 🙏
Hi! any update on that fix? @<1523703436166565888:profile|DeterminedCrab71>
maybe it is not present in 1.8 and will just use that version?
ClearML is awesome - all works fine now! Will test the rest.