But from the log it seems that:
you are not running as root in the docker? Python3.8 is installed (and not python 3.6 as before)
Hi BitterStarfish58
What's the clearml version you are using ?
dataset upload both work fine
Artifacts / Datasets are uploaded correctly ?
Can you test if it works if you change "http://files.community.clear.ml" to "http://files.clear.ml"?
Click on the "k8s_schedule" queue, then on the right hand side, you should see your Task, click on it, it will open the Task page. There click on the "Info" Tab, there look for "STATUS MESSAGE" and "STATUS REASON". What do you have there?
SubstantialElk6 this is odd, how are they passed ? what's the exact setup ?
Hmm yes this is exactly what should not happen 🙂
Let me check it
MoodyCentipede68 is diagram 2 a batch processing workflow?
DeterminedToad86
Yes I think this is the issue, on SageMaker a specific compiled version of torchvision was installed (probably part of the image)
Edit the Task (before enqueuing) and change the torchvision URL to: torchvision==0.7.0
Let me know if it worked
Parent makes sense if you are changing the data of the parent version, but some data is preserved. Which will make the delta-based storage only store the diff.
If everything is different, and you call sync for example, then it will not reference any previous "snapshot", so there will be no redundancy in storage, but you still get a pointer to the "parent" version.
Make sense ?
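To make the parent/diff idea above concrete, here is a minimal sketch in plain Python. It is not ClearML's actual implementation; `file_digest`, `compute_delta`, and the sample file names are hypothetical and only illustrate how a child version can end up storing just the files that differ from its parent.

```python
import hashlib


def file_digest(content: bytes) -> str:
    """Content hash used to decide whether a file changed between versions."""
    return hashlib.sha256(content).hexdigest()


def compute_delta(parent: dict, child: dict) -> dict:
    """Compare two {path: bytes} snapshots and list what the child must store."""
    parent_hashes = {path: file_digest(data) for path, data in parent.items()}
    child_hashes = {path: file_digest(data) for path, data in child.items()}
    added = sorted(p for p in child_hashes if p not in parent_hashes)
    modified = sorted(
        p for p in child_hashes
        if p in parent_hashes and child_hashes[p] != parent_hashes[p]
    )
    removed = sorted(p for p in parent_hashes if p not in child_hashes)
    return {"added": added, "modified": modified, "removed": removed}


# Hypothetical sample data: one file unchanged, one modified, one new.
parent_version = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
child_version = {"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"7,8,9"}
delta = compute_delta(parent_version, child_version)
print(delta)
```

In this sketch only "b.csv" and "c.csv" would need to be stored for the child, while "a.csv" is referenced from the parent. If every file differs, the delta simply covers everything, which matches the "no redundancy, but still a pointer to the parent" case.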
cuda 10.1, I guess this is because no wheel exists for torch==1.3.1 and cuda 11.0
Correct
how can I enforce a specific wheel to be installed?
You mean like specific CUDA wheel ?
you can simply put the http link to the wheel in the "installed packages", it should work
Hi JumpyPig73
Funny enough this is being fixed as we speak 🙂
The main issue is that as you mentioned, ClearML does not "detect" the exit code when os.exit() is called, and this is why it is "missing" the failed test (because as mentioned, all exceptions are caught). This should be fixed in the next RC
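For context on why a hard exit is easy to miss, here is a small standard-library sketch, assuming the call in question is Python's `os._exit()` (Python has `sys.exit()` and `os._exit()`, but no `os.exit()`). It says nothing about how ClearML itself hooks process exit; it only shows that `os._exit()` skips `atexit` handlers while `sys.exit()` does not.

```python
import subprocess
import sys

# Child script: registers an atexit hook, then exits in one of two ways.
CHILD = """
import atexit, os, sys
atexit.register(lambda: print("cleanup ran", flush=True))
if sys.argv[1] == "sys":
    sys.exit(3)   # raises SystemExit, the interpreter shuts down normally
else:
    os._exit(3)   # terminates immediately, atexit handlers are skipped
"""


def run_child(mode: str):
    """Run the child script and return (exit code, captured stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD, mode],
        capture_output=True,
        text=True,
    )
    return result.returncode, result.stdout.strip()


sys_exit_result = run_child("sys")
os_exit_result = run_child("os")
print(sys_exit_result)
print(os_exit_result)
```

Both children return exit code 3, but only the `sys.exit()` one runs its cleanup hook, so any library that relies on exit hooks to record the final status never sees the `os._exit()` case.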
correct on both.
notice that with upload you can specify any storage (S3/GS/Azure etc.)
Hi ElegantCoyote26
sometimes the agents load an earlier version of one of my libraries.
I'm assuming some internal package that is installed from a wheel file not a direct git repo+commit link ?
How can I make it show progress less often/rewrite?
I'm not sure this is configurable ... you mean like reports on the uploads right? (i.e. report every 5mb I think is the default)
while we are at it, maybe we should use tqdm if it is installed
wdyt?
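As a rough illustration of "report every N MB", here is a minimal sketch of a throttled progress reporter. The class `ThrottledProgress` and its `report_every` parameter are hypothetical and are not part of the ClearML API; the 5 MB default just mirrors the figure guessed above.

```python
class ThrottledProgress:
    """Collect a progress message only once per `report_every` bytes."""

    def __init__(self, total_bytes, report_every=5 * 1024 * 1024):
        self.total_bytes = total_bytes
        self.report_every = report_every
        self.sent = 0
        self._next_report = report_every
        self.messages = []

    def update(self, chunk_size):
        """Register an uploaded chunk and report if a threshold was crossed."""
        self.sent += chunk_size
        if self.sent >= self._next_report or self.sent >= self.total_bytes:
            percent = 100.0 * self.sent / self.total_bytes
            self.messages.append(
                f"uploaded {self.sent / 2**20:.1f} MB ({percent:.0f}%)"
            )
            # Move the threshold past the current position.
            while self._next_report <= self.sent:
                self._next_report += self.report_every


# Simulate a 12 MB upload sent in 1 MB chunks.
progress = ThrottledProgress(total_bytes=12 * 2**20)
for _ in range(12):
    progress.update(2**20)
print("\n".join(progress.messages))
```

Raising `report_every` makes the log quieter; the final chunk is always reported so the log still ends at 100%.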
ReassuredTiger98 after 20 hours, was it done uploading ?
What do you see in the Task resource monitoring? (notice there is network_tx_mbs metric that should be according to this, 0.152)
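A quick sanity check on that number, as a sketch only: assuming network_tx_mbs is reported in megabytes per second (an assumption, check the units in your own monitoring), a steady 0.152 over 20 hours comes to roughly 10.7 GB. `uploaded_gb` is a hypothetical helper for this arithmetic.

```python
def uploaded_gb(rate_mb_per_s: float, hours: float) -> float:
    """Total data sent at a steady rate, in GB (1 GB = 1024 MB)."""
    seconds = hours * 3600
    return rate_mb_per_s * seconds / 1024


# 0.152 MB/s sustained for 20 hours (units are an assumption, see above).
uploaded = uploaded_gb(0.152, 20)
print(f"{uploaded:.2f} GB")
```

Comparing that figure with the dataset size gives a quick idea of whether the upload could plausibly have finished.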
These are maybe good features to include in ClearML:
or
.
Sure, we should probably add a section into the doc explaining how to do that
Other approach is creating my own API on the top of clearml-serving endpoints and there I control each tenant authentication.
I have to admit that to me this is a much better solution (than my/bento integrated JWT option). Generally speaking I think this is the best approach, it separates authentication layer from execution ...
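A minimal sketch of what such a separate authentication layer could look like, in plain Python. Everything here is hypothetical (the `TENANTS` table, the key values, the endpoint names, and `is_authorized`); it is not part of clearml-serving, and a real gateway would sit in front of the serving endpoints and forward only authorized requests.

```python
import hashlib
import hmac

# Hypothetical tenant table: tenant id -> SHA-256 of that tenant's API key,
# plus the serving endpoints the tenant is allowed to call.
TENANTS = {
    "tenant_a": {
        "key_hash": hashlib.sha256(b"key-a").hexdigest(),
        "endpoints": {"model_a"},
    },
    "tenant_b": {
        "key_hash": hashlib.sha256(b"key-b").hexdigest(),
        "endpoints": {"model_b"},
    },
}


def is_authorized(tenant_id: str, api_key: str, endpoint: str) -> bool:
    """Return True only if the key matches and the endpoint is allowed."""
    tenant = TENANTS.get(tenant_id)
    if tenant is None:
        return False
    presented = hashlib.sha256(api_key.encode()).hexdigest()
    # Constant-time comparison avoids leaking key information via timing.
    if not hmac.compare_digest(presented, tenant["key_hash"]):
        return False
    return endpoint in tenant["endpoints"]


print(is_authorized("tenant_a", "key-a", "model_a"))
print(is_authorized("tenant_a", "key-a", "model_b"))
```

The serving containers never see tenant credentials in this design, which is the separation of authentication from execution described above.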
Hi FancyBaldeagle86
You mean in the UI? i.e. clone an experiment, hover over the Configuration / Hyperparameter section and click edit?
that does make more sense 🙂
Thanks ScantChimpanzee51 !
Let me see what I can find, should be easy enough to fix now 🙂
DilapidatedDucks58 You might be able to, check the links, they might be embedded into the docker, so you can map a different png file from the host 😛
BTW: what would you change the icons to?
Yey! BTW: what's the setup you are running it with? Does it include "manual" tasks? Do you also report on completed experiments (not just failed ones)? Do you filter by iteration numbers?
DilapidatedDucks58
all our workers went down after starting the slack bot, is it expected?)
Oh dear... I can't see any connection... What is the last log you have there?
No, an old experiment changed, nothing was rerun
ohh, that is odd. I think the max iteration value is stored on the DB, which is odd if it changed after an update.
BTW: just making sure, could it be these Tasks were imported ? (i.e. offline execution + import)
Hi PerplexedCow66
I would like to know how to serve a model, even if I do not use any serving engine
What do you mean no serving engine, i.e. custom code?
Besides that, how can I manage authorization between multiple endpoints?
Are you referring to limiting access to all the endpoints?
How can I manage API keys to control who can access my endpoints?
Just to be clear, accessing the endpoints has nothing to do with the clearml-server credentials, so are you asking how to...
ConvolutedSealion94 Let me try to explain how it works, I hope this will help in debugging.
There are two different entities here
Clearml-server: In this context clearml server acts as a control-plane, it stores configuration on the different endpoints, models, preprocessing code etc. It does not perform any compute or serving.
clearml-serving-inference https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L77 . This ...
and about a month later for some reason the initial iteration seems to have changed to 0
Hmm, I see your point. Just so I fully understand, you are not saying old experiments were changed, but new experiments (running the same code-ish) have a totally different max iterations value. Is this correct ?
Hi WearyLeopard29
Yes 🙂 this is exactly how it should work
2,3 ) the question is whether the serving is changing from one tenant to another, does it?
Besides that, what are your impressions on these serving engines? Are they much better than just creating my own API + ONNX or even my own API + normal Pytorch inference?
I would separate ML frameworks from DL frameworks.
With ML frameworks, the main advantage is multi-model serving on a single container, which is more cost effective when it comes to multiple model serving. As well as the ability to quickly update models from the clearml model repository (just tag + publish and the end...