
Hi DeliciousBluewhale87
This is the latest clearml-serving (stable release at GTC at the end of the month):
https://github.com/allegroai/clearml-serving/tree/dev
Generally speaking, clearml-serving provides the control plane, preprocessing, and ML inference, with Nvidia Triton for DL inference (fully transparent).
It allows you to spin up an entire fully dynamic & scalable serving stack on top of a k8s cluster. Once you spin up the base containers, you can configure them live with a CLI; this includes adding new en...
Hi SubstantialElk6 I believe you just need to use clearml 1.0.5, and make sure you are passing the correct OS environment to the agent.
I mean clone the Task in the UI (right click → Clone), then go to the Execution tab, to the "Installed packages" section, then click Edit, find the torchvision http link, and replace it with torchvision == 0.7.0
and save.
Then enqueue the Task (to the default queue) and see if the Agent can run it.
DeterminedToad86 Make sense?
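For example, the "Installed packages" edit would replace the direct wheel link with a pinned version (the http link below is just a hypothetical example):
```
# before (hypothetical direct-link entry):
torchvision @ https://download.pytorch.org/whl/cu102/torchvision-0.7.0%2Bcu102-cp36-cp36m-linux_x86_64.whl
# after:
torchvision == 0.7.0
```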
Hi ConvolutedSealion94
Yes this seems like the correct curl
How did you spin up the clearml-serving containers? Was it with the docker-compose or with the helm chart? (I remember there are some pitfalls with the helm chart; I would actually start with the local docker-compose to debug it.)
I think it fails because it tries to install trains twice. Could you remove the trains package and test? I'm also curious how you ended up with both installed?!
Hi BrightGoat74
So merging general purpose plotly plots is very hard (i.e. putting both on the same graph)
But if you report using logger.report_scatter2d(...) the UI will merge the ROC curves into the same graph, wdyt?
https://clear.ml/docs/latest/docs/guides/reporting/scatter_hist_confusion_mat_reporting#2d-scatter-plots
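For instance, a minimal sketch of that idea (the fpr/tpr arrays below are synthetic placeholders for real evaluation output); reporting both curves under the same title merges them into one plot:
```python
import numpy as np
from clearml import Task

task = Task.init(project_name="examples", task_name="roc curves")
logger = task.get_logger()

# synthetic placeholder curves standing in for real ROC data
fpr = np.linspace(0.0, 1.0, 50)
tpr_a = np.sqrt(fpr)
tpr_b = fpr ** 0.3

# same title + different series -> the UI puts both curves on one plot
for name, tpr in [("model A", tpr_a), ("model B", tpr_b)]:
    logger.report_scatter2d(
        title="ROC",
        series=name,
        iteration=0,
        scatter=np.stack([fpr, tpr], axis=1),
        xaxis="False Positive Rate",
        yaxis="True Positive Rate",
        mode="lines",
    )
```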
Hi GreasyPenguin14
This is what I did, but I could not reproduce the hang, how is this different from your code?
```python
from multiprocessing import Process

import numpy as np
from matplotlib import pyplot as plt
from clearml import Task, StorageManager

class MyProcess(Process):
    def run(self):
        # in another process
        global logger
        # Create a plot
        N = 50
        x = np.random.rand(N)
        y = np.random.rand(N)
        colors = np.random.rand(N)
        # truncated in the original; completed per the standard matplotlib scatter demo
        area = (30 * np.random.rand(N)) ** 2
        plt.scatter(x, y, s=area, c=colors, alpha=0.5)
        plt.show()

if __name__ == "__main__":
    # placeholder project/task names
    task = Task.init(project_name="examples", task_name="multiprocess plot")
    logger = task.get_logger()
    p = MyProcess()
    p.start()
    p.join()
```
I should manually copy it to the remote services agents?
The code itself needs to run somewhere, and currently this has to be your machine: either you manually run the AWS autoscaler, or an agent runs it for you. Make sense?
If the only issue is this line: task.execute_remotely(..., exit_process=True)
It has to finish the static analysis of the entire repository (which usually happens in the background, but now we have to wait for it). If the repo is large this could actually take ~20 sec (depending on the CPU/drive of the machine itself).
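For reference, a minimal sketch of that call (project, task, and queue names are placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="remote run")
# ... set up configuration/arguments locally ...
# stop the local process and re-launch this task on an agent listening on "default"
task.execute_remotely(queue_name="default", exit_process=True)
```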
Hi SubstantialElk6
32 CPU cores, 64GB ram
Should be plenty, this sounds like a network bottleneck issue; I can't imagine the server is actually CPU bound.
We are using k8s glue to spawn the job. ...
I think this is actual network latency, nothing to do with the jobs, could it be the server is very far away?
What happens when you manually start a Task from your machine ?
Is the latency fixed? Is it just when starting a new Task?
Hi MagnificentSeaurchin79
This means tensorflow was not directly imported in the repository (which is odd; it might point to the auto package analysis failing to find the package, if this is the case please let me know).
Regardless, if you need to make sure a package is listed in the requirements, either import it or use Task.add_requirements('tensorflow')
or Task.add_requirements('tensorflow', '2.3.1')
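A minimal sketch of how these calls fit together; note that add_requirements has to be called before Task.init (project/task names are placeholders):
```python
from clearml import Task

# must be called before Task.init()
Task.add_requirements('tensorflow', '2.3.1')
task = Task.init(project_name="examples", task_name="requirements demo")
```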
PompousBeetle71 I think what you saw as tags in previous versions was actually system tags; now we also have user tags (i.e. .tags). If you still want to access the system tags, can you try: InputModel('aabbcc')._get_base_model().data.system_tags
https://clear.ml/docs/latest/docs/references/sdk/task#mark_stopped
Maybe we should add an argument so you could do: mark_stopped(force=False, message='it was me who stopped it')
And we would automatically add the user name as well?
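For reference, a sketch of calling the method as it exists today (the force/message arguments above are only a proposal; the task id is a placeholder):
```python
from clearml import Task

task = Task.get_task(task_id="aabbcc")  # placeholder task id
task.mark_stopped()  # manually mark the task as stopped
```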
GrittyKangaroo27 any chance you can open a GitHub issue so this is not forgotten ?
(btw: I think 1.1.6 is going to be released later today; then we will have a few RCs with improvements on the pipeline, and I will make sure we add that as well)
Hi @<1576381444509405184:profile|ManiacalLizard2>
If you make sure all server access is via a host name (i.e. instead of IP:port, use host_address:port), you should be able to replace it with a cloud host on the same port.
Basically create a token and use it as user/password
EDIT:
With read-only permissions 🙂
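If it helps, a minimal clearml.conf sketch assuming the token here is a git personal access token used by the agent (all values are placeholders):
```
agent {
    # token used in place of a password, created with read-only scope
    git_user: "my-user"
    git_pass: "my-read-only-token"
}
```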
I think you can force it to be started, let me check (I'm pretty sure you can on an aborted Task).
no need for it actually
Thank you DilapidatedDucks58 for the ping!
totally slipped my mind 😞
Hi @<1727497172041076736:profile|TightSheep99>
I think you are correct! It will use the internal per-file upload retry, but it does not let you control it.
Could you please open a GitHub issue so that we do not forget to add it?
JitteryCoyote63 I think this only holds for the conda distribution.
(Actually quite interesting, I wonder what happens if you already installed cudatoolkit...)
CrookedWalrus33 I found the issue, this is only failing with Python 3.6.
Let me check something
So if I am not using remote machine can I disable this?
Yes, I think you can, add to your clearml.conf:
sdk.development.store_jupyter_notebook_artifact = false
BTW: why would you turn it off ?