
That is odd, can you send the full Task log? (Maybe some oddity with conda/pip ?!)
Notice Triton (and therefore clearml-serving) needs the PyTorch model to be converted into TorchScript, so that the Triton backend can load it.
The trains-agent RC (which they tell me will be out tomorrow) will have a switch to do that, just so it is easier
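For reference, the conversion itself is roughly this (a minimal sketch; the toy model, input shape, and file name below are just placeholders):
`
import torch
import torch.nn as nn

# toy model purely for illustration; replace with your trained model
model = nn.Sequential(nn.Linear(10, 5), nn.ReLU(), nn.Linear(5, 2))
model.eval()

example_input = torch.randn(1, 10)                 # sample input with the expected shape
scripted = torch.jit.trace(model, example_input)   # or torch.jit.script(model)
scripted.save("model.pt")                          # TorchScript file the Triton backend can load
`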
TroubledHedgehog16
but doesn't run when I deploy it using clearml. Here's the log of the error:
...
My guess is that clearml is reimporting keras somewhere, leading to circular dependencies.
It might not be circular, but I would guess it does have something to do with the order of imports. I'm trying to figure out what the difference would be between a local run and running with an agent
Is it the exact same TF version?
Although I didn't understand why you mentioned
torch
in my case?
Just a guess, other frameworks do multi-process as well,
I would guess it relates to parallelization of Tasks execution of the
HyperParameterOptimizer
class?
Yes that might be it, it's basically a by-product of using the python "Process" class for multiprocessing. We are working on a fix, not a trivial one unfortunately
However I'm quite confident that plots and scalars are not visible online only when 'git diff too large to store' appears.
These should be unrelated, are you seeing console outputs ?
but I still think the same should be possible using the Task.init
This is the part that I find confusing: Task.init(..., output_uri=True)
is working for me, what is the setup that caused this line to "fail"?
Now that we have the free tier (a.k.a community server) we might change the default behavior.
The idea is always to allow an easy way to on-board and test the system.
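For completeness, this is roughly what that call looks like in a full snippet (a minimal sketch; project and task names are placeholders):
`
from clearml import Task

# output_uri=True uploads model checkpoints/artifacts to the configured files server
# (you can also pass an explicit destination, e.g. "s3://my-bucket/models")
task = Task.init(
    project_name="examples",       # placeholder
    task_name="output_uri demo",   # placeholder
    output_uri=True,
)
`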
ReassuredTiger98
BTW: what's the scenario where your machine reverted to the default configuration (i.e. no configuration file) ?
Hmm I just noticed:
'--rm', '', 'bash'
This is odd, this is an extra argument passed as "empty text". How did that end up there? Could it be you did not provide any docker image or default docker container?
MysteriousBee56 Okay, let's try this one:
docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && python3 -m pip install trains-agent && echo done"
hmm this might help:
https://pip.pypa.io/en/stable/topics/configuration/#environment-variables
basically you might be able to define:
PIP_NO_USE_PEP517=1
MotionlessCoral18 so did it solve the issue ?
ERROR: torch-1.12.0+cu102-cp38-cp38-linux_x86_64.whl is not a supported wheel on this platform
TartBear70 could it be you are running on a new Mac M1/2 ?
Also quick question, any chance you can test with the latest RC?
pip3 install clearml-agent==1.3.1rc6
Hmm this is odd. When you press on the parent dataset in the UI and go to full-details, then the INFO tab, can you copy everything here?
ERROR: Could not install packages due to an EnvironmentError:
[Errno 28] No space left on device
BTW: @<1523703080200179712:profile|NastySeahorse61> this sounds like docker running out of space on the main disk `/var/` where it stores all the images and temp file systems.
This will cause your code to fail, as any runtime change to the container file system will raise this out-of-disk-space error.
Hi @<1846360404628869120:profile|HelpfulBadger74>
Is pixi a drop-in replacement for pip? Is it like uv?
Hi UnevenDolphin73
This differentiable storage - does it only work on file additions/removal, or also on intra-file changes?
This is on a file level, meaning if you change a single byte in a file, the entire file will be packaged in the new version.
Make sense ?
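To make the versioning flow concrete, a rough sketch of creating a new dataset version on top of a parent (project, dataset names, and the data folder are placeholders):
`
from clearml import Dataset

# placeholders for the existing (parent) dataset
parent = Dataset.get(dataset_project="examples", dataset_name="my-dataset")

# new version referencing the parent
child = Dataset.create(
    dataset_project="examples",
    dataset_name="my-dataset",
    parent_datasets=[parent.id],
)
child.add_files(path="data/")   # only files whose content changed are re-uploaded,
child.upload()                  # but a single-byte change re-packages that whole file
child.finalize()
`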
These are maybe good features to include in ClearML: ... or ...
Sure, we should probably add a section into the doc explaining how to do that
Another approach is creating my own API on top of the clearml-serving endpoints, where I control each tenant's authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking I think this is the best approach, it separates the authentication layer from execution ...
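Just to make the idea concrete, a rough sketch of such a gateway (FastAPI and requests are my assumptions here; the serving URL and token store are hypothetical, and the token check is only a placeholder for real JWT validation):
`
import requests
from fastapi import FastAPI, Header, HTTPException

SERVING_URL = "http://clearml-serving:8080/serve/my_model"   # hypothetical endpoint
VALID_TOKENS = {"tenant-a-token": "tenant_a"}                # placeholder auth store

app = FastAPI()

@app.post("/predict")
def predict(payload: dict, authorization: str = Header(None)):
    # authenticate the tenant before touching the serving endpoint
    tenant = VALID_TOKENS.get(authorization)
    if tenant is None:
        raise HTTPException(status_code=401, detail="unknown tenant token")
    # forward the validated request to the serving endpoint
    resp = requests.post(SERVING_URL, json=payload, timeout=30)
    return resp.json()
`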
Hi ZanyPig66
I used tensorboard as clearml claims to auto-capture tensorboard outputs, but it was a no go.
The auto TB logging should work out of the box, where is it failing ?
Also, task = Task.current_task()
Why aren't you using Task.init in the original script?
The idea is that you run your code on your machine (where the environment works), ClearML auto detects code + python packages + args etc.
Then you clone it in the UI and launch it on a remote machine.
What am I missing ...
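Something like this is what I would expect at the top of the original script (a minimal sketch; project/task names and the dummy loop are placeholders):
`
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

# Task.init at the start of the original training script enables the auto-logging
task = Task.init(project_name="examples", task_name="tb auto-logging")

writer = SummaryWriter()   # anything reported here should be auto-captured
for step in range(10):
    writer.add_scalar("loss", 1.0 / (step + 1), step)
writer.close()
`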
Is there a way to do this all elegantly?
Oh yes there is, this is how the TaskB code will look:
` task = Task.init(..., 'task b')
param = {'TaskA': 'TaskA ID HERE'}
task.connect(param)
# grab the last output model of TaskA and load it locally
taska_model = Task.get_task(param['TaskA']).models['output'][-1]
model = torch.load(taska_model.get_local_copy())
# ... train ...
torch.save(model, 'model_b.pt') `I might have missed something there, but generally speaking this will let you:
Select TaskA as a parameter of the TaskB training process. Will automagically register Task A's...
throw an error when running without
clearml.conf
which tells the user to run clearml-init first?
I would like potential users to be able to just run the example code and get the experience, or even integrate with their code, without the need to run a single configuration
(Basically to alleviate as many potential hurdles from getting users on board clearml)
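One way around the configuration file, if I remember correctly, is injecting the credentials in code (treat the exact call and the values below as an assumption/placeholders; the equivalent CLEARML_API_* environment variables should work similarly):
`
from clearml import Task

# placeholders: use your own server URLs and credentials
Task.set_credentials(
    api_host="https://api.clear.ml",
    web_host="https://app.clear.ml",
    files_host="https://files.clear.ml",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)
task = Task.init(project_name="examples", task_name="no conf file")
`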
Could not find a version that satisfies the requirement open3d==0.15.2 .. from versions: 0.10.0.0, 0.11.0, 0.11.1, 0.11.2, 0.12.0, 0.13.0)
This points to the agent installing with a different python version than the one you used to run the original code, I would guess python3.6
JitteryCoyote63 any chance you have a log of the failed torch 1.7.0 ?
with ?
multipart: false
secure: false
If so, can you post here your aws.s3 section of the clearml.conf? (of course replacing the actual sensitive information with *s)
Hi DefeatedCrab47
You should be able to change the Web server port, but the API port (8008) cannot be changed. If you can login to the web app and create a project it means everything is okay. Notice that when you configure trains ( trains-init ) the port numbers are correct
the storage configuration appears to have changed quite a bit.
Yes, I think this is part of the cloud-ready effort.
I think you can find the definitions here:
https://artifacthub.io/packages/helm/allegroai/clearml
Hi DepressedChimpanzee34 , took me a while but I think there is a solution:
In your docker file, replace:
https://github.com/allegroai/clearml-server/blob/a64c4d264d00eadd2d11818b37151d3cc6266d99/docker/docker-compose.yml#L5
with:
entrypoint: /bin/bash
command: -c "mkdir -p /var/log/clearml && cd /opt/clearml/ && python3 -m apiserver.apierrors_generator && gunicorn -w 4 -t 600 --bind=0.0.0.0:8008 apiserver.server:app"
Hi SubstantialElk6
Generically, we would 'export' the preprocessing steps, setup an inference server, and then pipe data through the above to get results. How should we achieve this with ClearML?
We are working on integrating the OpenVino serving and Nvidia Triton serving engines into ClearML (they will both be available soon)
Automated retraining
In cases of data drift, retraining of models would be necessary. Generically, we pass newly labelled data to fine...
PanickyMoth78
and I would definitely prefer the command executing_pipeline to not kill the process that called it.
I understand why it would be odd from a notebook perspective; the issue is that the actual code is being "sent" to the backend to be executed on a remote machine. It is important to understand that this is the end of the current process. Does that make sense ?
(not saying we could not add an argument for that, just trying to ...
Hi ItchyJellyfish73
This seems aligned with the scenario you are describing, it seems the api server is overloaded with simultaneous connections.
Add an additional apiserver instance to the docker-compose and nginx as a load balancer:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L4
`
apiserver:
  command:
    - apiserver
  container_name: clearml-apiserver
  image: allegroai/clearml:latest
  restart: unless-sto...