I'll just take a screenshot from my company's daily standup of data scientists and software developers..... that'll be enough!
If I did that, I am pretty sure that's the last thing I'd ever do...... 🤣
Pffff security.
Data scientist be like....... 😀
Network infrastructure person be like ...... 😱
Like AnxiousSeal95 says, clearml server will version a dataset for you and push it to a unified storage place, as well as make it diff-able.
I’ve written a workshop on how to train image classifiers for the problem of bird species identification and recently I’ve adapted it to work with clearml.
There is an example workbook on how to upload a dataset to the clearml server, in this case a directory of images. See here: https://github.com/ecm200/caltech_birds/blob/master/notebooks/clearml_add...
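For anyone following along, here is a minimal sketch of how a directory of images can be registered with the clearml Dataset API (the dataset/project names and local path below are placeholders of mine, not taken from the notebook):

```python
from clearml import Dataset

# Create a new dataset (version) on the ClearML server
ds = Dataset.create(
    dataset_name="caltech_birds_train",   # placeholder name
    dataset_project="Caltech Birds",      # placeholder project
)

# Register a local directory of images and push it to the configured storage
ds.add_files(path="data/images/train")    # placeholder local path
ds.upload()
ds.finalize()
print("Dataset registered with id:", ds.id)
```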
This appears to confirm it as well.
https://github.com/pytorch/pytorch/issues/1158
Thanks AgitatedDove14, you're very helpful.
Hi AgitatedDove14,
Thanks for your points.
I have updated https://github.com/allegroai/clearml-agent/issues/66 relating to this issue.
For completeness, you were right about it being an issue with PyTorch: there is a breaking issue with PyTorch 1.8.1 and CUDA 11.1 when installed via pip.
It's recommended that PyTorch be installed with Conda to circumvent this issue.
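A quick way to see which CUDA/cuDNN build actually got installed (my own sanity check, not something from the PyTorch issue thread):

```python
import torch

# Report the build that pip/conda actually installed
print("torch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```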
The following code is the training script that was used to set up the experiment. This code has been executed on the server in a separate conda environment and verified to run fine (minus the clearml code).
```python
from __future__ import print_function, division
import os, pathlib

# Clear ML experiment
from clearml import Task, StorageManager, Dataset

# Local modules
from cub_tools.trainer import Ignite_Trainer
from cub_tools.args import get_parser
from cub_tools.config import get_cfg_defaults
# ...
```
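For context, the clearml-specific part of a training script like this typically boils down to something along the lines of the sketch below (my simplification with illustrative names, not the actual workshop code):

```python
from clearml import Task, Dataset

# Register the run with the ClearML server so an agent can later reproduce it
task = Task.init(project_name="Caltech Birds", task_name="train_classifier")  # illustrative names

# Pull a local copy of the versioned image dataset registered earlier
dataset = Dataset.get(dataset_project="Caltech Birds", dataset_name="caltech_birds_train")
data_dir = dataset.get_local_copy()

# The cub_tools trainer is then pointed at data_dir as usual
```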
AgitatedDove14 yes that's great.
I finally got the clearml-agent working correctly using CONDA to create the environment, and indeed, it used PIP when it couldn't find the package on CondaCloud. This is analogous to how I create python environments manually.
The following was reported by the agent during the setup phase of the compute environment on the remote compute resource:
Log file is attached.
The error after the first iteration is as follows:
```
[INFO] Executing model training...
1621437621593 ecm-clearml-compute-gpu-001:0 DEBUG Epoch: 0001 TrAcc: 0.296 ValAcc: 0.005 TrPrec: 0.393 ValPrec: 0.000 TrRec: 0.296 ValRec: 0.005 TrF1: 0.262 ValF1: 0.000 TrTopK: 0.613 ValTopK: 0.026 TrLoss: 3.506 ValLoss: 5.299
Current run is terminating due to exception: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED
You can try to repro this exception using the following code snippet. If that doesn't trigger...
```
I have also tried training a variety of network architectures from a number of libraries (Torchvision, pytorchcv, TIMM), as well as a simple VGG implementation from scratch, and come across the same issues.
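For anyone wanting to reproduce this outside the full training loop, a bare-bones forward/backward through a single conv layer is usually enough to trip the same cuDNN error on a broken build (my own sketch, not the snippet PyTorch prints in the traceback):

```python
import torch
import torch.nn as nn

# A single conv forward/backward on the GPU exercises cuDNN directly
conv = nn.Conv2d(3, 8, kernel_size=3).cuda()
x = torch.randn(4, 3, 224, 224, device="cuda")
y = conv(x)
y.mean().backward()
print("cuDNN conv forward/backward OK:", tuple(y.shape))
```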
This one got me the first time I set it up as well.
The default settings are taken from the environment variables, the docker-compose file and the code itself.
You only need to add configuration items that you want to change from the defaults.
I also think clearing out the venv directory before executing the change of package manager really helped.
Is there a helper function or option at all that lets you flush the clearml-agent working space automatically, or on command?
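For reference, clearing it out by hand amounts to something like this (a sketch assuming the agent's default build location of ~/.clearml/venvs-builds; adjust if agent.venvs_dir is overridden in clearml.conf):

```python
import shutil
from pathlib import Path

# Assumed default clearml-agent build area; change if agent.venvs_dir is set differently
venvs_dir = Path.home() / ".clearml" / "venvs-builds"

if venvs_dir.exists():
    shutil.rmtree(venvs_dir)  # forces the agent to rebuild environments on the next run
    print(f"Removed {venvs_dir}")
```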
Does "--ipc=host" make it a dynamic allocation then?
Oh it's a load balancer, so it does that and more.
But I suppose the point still holds: it provides an end-point for external locations and then handles the routing to the correct resources.
I love the new design of the site.
When is clearml-deploy coming to the open source release?
Or is this a commercial only part?
Looking at the _resolve_base_url() method of the StorageHelper class, I can see that it is using furl to handle the path splitting for getting at the Azure storage account and container names.
Replicating the commands, the first one to get the Storage Account seems to have worked ok:
```python
f = furl.furl(uri)
account_name = f.host.partition(".")[0]
```
Replicating the above manually seems to give the same answer for both, and it looks correct to me:
```python
>>> import furl
>>> f_a = furl.fu...
```
AgitatedDove14 in this remote session on the compute node, where I am manually importing the clearml sdk, what's the easiest way to confirm that the Azure credentials are being imported correctly?
I assume from our discussions yesterday on the dockers that, when the orchestration agent daemon is run with a given clearml.conf, I can see that the docker run command has various flags being used to pass certain files and environment variables from the host operating system of the co...
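One way I can think of to poke at this (a hedged sketch on my part; the account, container and blob path below are placeholders, not the real ones): try pulling a known blob through StorageManager and see whether the credentials from clearml.conf are honoured.

```python
from clearml import StorageManager

# Placeholder URI; point it at a blob that actually exists in the storage account
uri = "azure://myaccount.blob.core.windows.net/artefacts/models/model.pkl"

# If the Azure credentials in clearml.conf are being read, this downloads the blob
# to the local cache and returns its path; otherwise it errors or returns None
local_path = StorageManager.get_local_copy(remote_url=uri)
print("local copy:", local_path)
```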
AgitatedDove14 Ok I can do that.
I was just thinking it through.
Would this be best if it were executed in the Triton execution environment?
Ok I think I managed to create a docker image of the Triton instance server, just putting the kids to bed, will have a play afterwards.
I don’t have a scooby doo what that pickle file is.
Thanks for the last tip, "easy mistaker to maker"
So, moving on to the container name.
The original code has the following calls:
```python
if not f.path.segments:
    raise ValueError(
        "URI {} is missing a container name (expected "
        "[https/azure]://<account-name>.../<container-name>)".format(uri)
    )
container = f.path.segments[0]
```
Repeating the same commands locally results in the following:
```python
>>> f_a.path.segments
['artefacts', 'Caltech Birds%2FTraining', 'TRAIN...
```
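Putting the two extractions side by side and running them locally against a made-up URI of the same shape (placeholder account/container names, not the real ones), they behave as I'd expect:

```python
import furl

# Placeholder URI of the same shape as the real one
uri = "azure://myaccount.blob.core.windows.net/artefacts/models/resnet50.pkl"

f = furl.furl(uri)
account_name = f.host.partition(".")[0]   # -> "myaccount"
container = f.path.segments[0]            # -> "artefacts"
print(account_name, container)
```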
Fixes and identified issues can be found in these github comments.
Closing the discussion here.
Just another thought, this couldn't be caused by using a non-default location for clearml.conf?
I have a clearml.conf in the default location which is configured for training agents, and I created a separate one for the inference service and put it in a subfolder of my home dir. The agent on the default queue to be used for inference serving was executed using clearml-agent daemon --config-file /path/to/clearml.conf
When I run the commands you suggested above on the compute node, but on the host system within the conda environment I installed to run the agent daemon from, I get the same issues as we appear to have seen when executing the Triton inference service.
```
(py38_clearml_serving_git_dev) edmorris@ecm-clearml-compute-gpu-002:~$ python
Python 3.8.10 (default, May 19 2021, 18:05:58)
[GCC 7.3.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
...
```
AgitatedDove14
So can you verify it can download the model?
Unfortunately it's still falling over, but then I got the same result for the credentials using both URI strings, the original, and the modified version, so it points to something else going on.
I note that the StorageHelper.get() method has a call which modifies the URI prior to it being passed to the function which gets the storage account and container name. However, when I run this locally, it doesn't seem to do a...
Mr AgitatedDove14 Good spot sir!
Sounds like a good candidate, I will test now and report back.