I am a bit confused because I can see configuration sections for Azure storage in the clearml.conf files, but these are on the client PC and the clearml-agent compute nodes.
So do these parameters have to be set on the clients and compute nodes individually, or is it something that can be set on the server?
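For anyone finding this later, the section I mean in clearml.conf is along these lines (placeholder values, not my real credentials):
` sdk {
    azure.storage {
        containers: [
            {
                account_name: "my-storage-account"
                account_key: "my-account-key"
                container_name: "my-container"
            }
        ]
    }
}
`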
I think I failed in explaining myself. I meant that instead of multiple CUDA versions installed on the same host/docker, wouldn't it make sense to just select a different out-of-the-box docker with the right CUDA, directly from the public NVIDIA Docker Hub offering? (This is just another argument on the Task that you can adjust.) Wouldn't that be easier for users?
Absolutely aligned with you there AgitatedDove14. I understood you correctly.
My default is to work with native VM images, a...
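For reference, picking a CUDA-matched base image per Task is roughly a one-liner in the SDK (a sketch only; the project/task names and image tag are just examples):
` from clearml import Task

task = Task.init(project_name="examples", task_name="cuda-base-image-demo")

# Ask an agent running in docker mode to execute this task inside an
# off-the-shelf NVIDIA image with the matching CUDA version.
task.set_base_docker("nvidia/cuda:11.1.1-cudnn8-runtime-ubuntu20.04")
`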
I should say, at the company I am working for, Malvern Panalytical, we are developing an internal MLOps capability, and we are starting to develop a containerized deployment system for developing, training and deploying machine learning models. Right now we are at the early stages of development, and our current solution is based on using Azure MLOps, which I personally find very clunky.
So I have been tasked with investigating alternatives to replace the training and model deployment side of thing...
AgitatedDove14 I would love to help the project.
I am just about to move house, which is stressful enough without a global pandemic(!), so until that's completed I won't commit to anything. However, once settled in the new place, and I have a bit more time, I would very much welcome contributing.
I think so.
I am doing this with one hand tied behind my back at the moment because I am waiting to get an Azure AD App and Services policy set up, to enable the autoscaler to authenticate with the Azure VMSS via the Python SDK.
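The kind of authentication I mean would look roughly like this with the Azure Python SDK (a sketch only; assumes azure-identity and azure-mgmt-compute, with placeholder subscription and resource names):
` from azure.identity import DefaultAzureCredential
from azure.mgmt.compute import ComputeManagementClient

# Credentials come from the AD App registration (or a managed identity)
credential = DefaultAzureCredential()
compute_client = ComputeManagementClient(credential, "<subscription-id>")

# List the instances in the scale set the autoscaler would manage
for vm in compute_client.virtual_machine_scale_set_vms.list(
    resource_group_name="my-resource-group",
    virtual_machine_scale_set_name="my-vmss",
):
    print(vm.name, vm.provisioning_state)
`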
AgitatedDove14
Just compared two uploads of the same dataset, one to Azure Blob and the other to local storage on clearml-server.
The local storage didn't report any statistics, so it might be confined to the cloud storage method, and specifically Azure.
Oh cool!
So when the agent fires up it gets the hostname, which you can then get from the API, and pass it back to take down a specific resource if it is deemed idle?
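Something along these lines is what I had in mind (just a sketch, assuming the worker id reported to the API embeds the hostname):
` import socket
from clearml.backend_api.session.client import APIClient

hostname = socket.gethostname()  # e.g. "ecm-clearml-compute-gpu-001"
client = APIClient()

# Worker ids typically look like "<hostname>:<slot>", so they can be matched
# back to the VM that should be spun down once it is deemed idle.
for worker in client.workers.get_all():
    if hostname in worker.id:
        print("This machine is registered as worker:", worker.id)
`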
I dip in and out of Docker, and that one gets me almost every time!
SuccessfulKoala55 I am not that familiar with AWS. Is that essentially a port forwarding service, where you have a secure endpoint that redirects to the actual server?
Hi AgitatedDove14 ,
Thanks for your points.
I have updated https://github.com/allegroai/clearml-agent/issues/66 relating to this issue.
For completeness, you were right about it being an issue with PyTorch: there is a breaking issue with PyTorch 1.8.1 and CUDA 11.1 when installed via pip.
It's recommended that PyTorch be installed with Conda to circumvent this issue.
I also think clearing out the venv directory before executing the change of package manager really helped.
Is there a helper function or option at all that lets you flush the clearml-agent workspace automatically, or on command?
I have created GitHub issue #3 on the clearml-serving repo.
Yup, I can confirm that's the case.
I have just literally installed the latest commit via the master branch and it works.
I have changed the configuration file created by Certbot to listen on port 8080 instead of port 80, however, when I restart the NGINX service, I get errors relating to bindings.
` server {
    listen 8080 default_server;
    listen [::]:8080 ipv6only=on default_server;
Restarting the service results in the following errors:
` ● nginx.service - A high performance web server and a reverse proxy server
Loaded: loaded (/lib/systemd/system/nginx.service; enabled; vendor preset: ...
This is very cool, any reason for not using dockers for the multiple CUDA versions?
AgitatedDove14 my inexperience in using them much until recently. I can see how that is a better solution, and it's something I am actively trying to improve my understanding and use of.
I am now relatively comfortable with producing a Dockerfile for example, although I've not got as far as making any docker-compose related things yet.
I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.
The following code is the training script that was used to set up the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the clearml code).
` from __future__ import print_function, division
import os, pathlib

# ClearML experiment
from clearml import Task, StorageManager, Dataset

# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults
#...
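In case it is useful, the ClearML-specific calls in a script like this are typically along these lines (a sketch with placeholder project/dataset names, not the exact truncated code):
` from clearml import Task, Dataset

# Register the run as an experiment (placeholder names)
task = Task.init(project_name="CUB200", task_name="ignite-training")

# Pull a local copy of the dataset version registered with ClearML Data
dataset_path = Dataset.get(
    dataset_project="CUB200", dataset_name="cub200_2011"
).get_local_copy()
`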
EnviousStarfish54 we are at the beginning phases of exploring potential solutions to MLOps. So I have only been playing with the tools, including the dataset side of things. However, I think that an integral part of capturing a model in its entirety is being able to make sure that you know what went into making it. So I see being able to version and difference datasets as just as important as the code, or the environment in which it is run.
The error after the first iteration is as follows:
` [INFO] Executing model training...
1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001 TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger...
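For what it's worth, this is the quick check I would run on the compute node to see which CUDA/cuDNN build PyTorch is actually using (a sketch):
` import torch

# Sanity-check the CUDA/cuDNN build PyTorch is running against
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA build:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
`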
Ah ok, so it's the query string you use with the SAS box. Great.
SuccessfulKoala55 WearyLeopard29 could this be a potential idea?
It appears that the setup there is for apps on different ports, which seems to me to be exactly the clearml problem?
So could we extrapolate and put in an API app and a FILESERVER app description with the correct ports?
https://gist.github.com/apollolm/23cdf72bd7db523b4e1c
` # the IP(s) on which your node server is running. I chose port 3000.
upstream app_geoforce {
server 127.0.0.1:3000;
}
upstream app_pcodes{
server 12...
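Extrapolating from that gist, a rough sketch of what it might look like for clearml (assuming the default ports of 8008 for the API server and 8081 for the fileserver, and placeholder server names):
` upstream clearml_api {
    server 127.0.0.1:8008;
}
upstream clearml_fileserver {
    server 127.0.0.1:8081;
}

server {
    listen 443 ssl;
    server_name api.clearml.example.com;
    location / {
        proxy_pass http://clearml_api;
    }
}

server {
    listen 443 ssl;
    server_name files.clearml.example.com;
    location / {
        proxy_pass http://clearml_fileserver;
    }
}
`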
Does "--ipc=host" make it a dynamic allocation then?
This one got me the first time I set it up as well.
The default settings are taken from the environment variables, the docker-compose file and the code itself.
You only need to add configuration items that you want to change from the default.
FYI, I am training the model again, this time in a project which is not nested, just to rule out any funnies with regards to issues with nested projects.
The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.
AnxiousSeal95
I think I can definitely see value in that.
I found that once you go beyond the easy examples, where you are largely using datasets that are curated as part of a Python package, it took a bit of effort to get my head around the dataset tools.
Likewise with the deployment side of things, and the Triton inference engine, there are certain aspects of that which I am relatively new to, so to go from the simple Keras example, to getting a feeling that the tool will cover the use ...
AgitatedDove14 that started out a lot shorter, and I read it twice, but I think it answers your question..... 😉