but it fails during env setup due to trying to install an obscure version of pytorch. Been trying to solve this for three days!
AdventurousButterfly15 it tries to resolve the correct pytorch version based on the CUDA version inside the container
ERROR: torch-1.12.1+cu116-cp310-cp310-linux_x86_64.whl is not a supported wheel on this platform.
seems like it is trying to install pytorch for python 3.10 with cuda 11.6 support, this seems reasonable, no?
Hi AbruptHedgehog21
can you send the two models' info pages (i.e. the original and the updated one)?
do you see the two endpoints?
BTW: --version would add a version to the model (i.e. create a new endpoint with version "endpoint/{version}")
Hi DeliciousBluewhale87
This is the latest clearml-serving (stable release at GTC at the end of the month)
https://github.com/allegroai/clearml-serving/tree/dev
Generally speaking, clearml-serving is a control plane: preprocessing, ML inference, with Nvidia Triton for DL inference (fully transparent).
It allows you to spin up an entire fully dynamic & scalable serving solution on top of a k8s cluster. Once you spin up the base containers, you can configure them live with a CLI, this includes adding new en...
I know there is an aux cfg with key value pairs but how can I use it in the python code?
This is actually for helping to configure Triton services; you cannot (I think) easily access it from the code
Nested in the UI is not possible I think?
Yes, but the next version will have nested projects, that's something 🙂
I mean that it is possible to start the subtask while the main task is still active.
You cannot call another Task.init while a main one is running.
But you can call Task.create and log into it, that said the autologging is not supported on the newly created Task.
Maybe the easiest solution is just to do the "sub-tasks" and close them. That means the main Task i...
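For context, a minimal sketch of the Task.create approach described above (project/task names are placeholders, and remember autologging does not apply to the created task):
```python
from clearml import Task

# the main task, with autologging as usual
main_task = Task.init(project_name="examples", task_name="main")

# a sub-task created on the side; autologging does NOT apply here,
# so everything must be reported explicitly through its logger
sub_task = Task.create(project_name="examples", task_name="sub-task")
sub_task.get_logger().report_scalar(
    title="loss", series="train", value=0.42, iteration=1)
sub_task.close()
```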
Hi ObnoxiousStork61
but unfortunately I can't fetch them from my local computer,
is this intended?
By default ClearML will only log the weights files.
It can also automatically upload them, if you pass a destination for storage at Task.init.
For example, to store on the files server: Task.init(..., output_uri=True)
To store on S3 (sub folders will be created automatically based on the Task ID): Task.init(..., output_uri='…')
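Putting the two options together, a sketch (the bucket and folder below are placeholders, not from the thread):
```python
from clearml import Task

# option 1: upload weight files to the ClearML files server
task = Task.init(project_name="examples", task_name="train",
                 output_uri=True)

# option 2: upload to S3 instead; sub-folders are created per Task ID
# ("my-bucket/models" is a placeholder destination)
# task = Task.init(project_name="examples", task_name="train",
#                  output_uri="s3://my-bucket/models")
```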
Yey 🙂 !
So now you can add some logic based on the model object passed as the second argument (see WeightsFileHandler.ModelInfo).
The easiest is based on the model name, see model.local_model_path
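A sketch of what that logic could look like as a pre-callback (only model_info.local_model_path is confirmed above; the filtering rule itself is just an example):
```python
from clearml.binding.frameworks import WeightsFileHandler

def filter_models(operation_type, model_info):
    # model_info is the WeightsFileHandler.ModelInfo object mentioned above;
    # returning None skips this weights file, returning model_info keeps it
    if "checkpoint" in model_info.local_model_path:
        return None
    return model_info

WeightsFileHandler.add_pre_callback(filter_models)
```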
Hi MuddySquid7 issue is verified, v1.1.1 will be released in a few hours with a fix.
Thank you for noticing!
If Task.init() is called in an already running task, don't reset auto_connect_frameworks? (if I am understanding the behaviour right)
Hmm we might need to somehow store the state of it ...
Option to disable these in the clearml.conf
I think this will be too general, as this is code specific, no?
hmmm, somehow I have a bad feeling about it... Could you check the log, it should say something like "Collecting torch==1.6.0.dev20200421+cu101 from https://"
It should be right at the top of the installation. What do you have there?
I have to admit, mounting it to a different drive is a good reason to bring this feature back. The reasoning was that it means the agent needs to make sure it manages them (e.g. multiple agents running on the same machine)
Hi @<1695969549783928832:profile|ObedientTurkey46>
Why do tags only show on a version level, but not on the dataset-level? (see images)
Tags of datasets are tags on "all the dataset versions", i.e. to help someone locate datasets (think locating projects as an analogy). Dataset version tags are tags on a specific version of the dataset, helping users to locate a specific version of the dataset. Does that make sense?
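As an illustration, a sketch using the Dataset API (names are placeholders, and add_tags as the version-level tagging call is an assumption on my side):
```python
from clearml import Dataset

# each Dataset object is a single version; tags added here mark that version
ds = Dataset.create(dataset_project="examples", dataset_name="my-dataset")
ds.add_tags(["baseline-v1"])  # assumed call: tags this specific version only
```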
task.update({'script': {'version_num': 'my_new_commit_id'}})
This will update it to a specific commit id; you can pass an empty string '' to make the agent pull the latest from the branch
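In context, the two variants described above, reusing the call quoted from the thread (the task lookup is just for illustration):
```python
from clearml import Task

task = Task.get_task(task_id="<task_id>")  # placeholder task ID

# pin the task to a specific commit id
task.update({'script': {'version_num': 'my_new_commit_id'}})

# or pass an empty string so the agent pulls the latest from the branch
task.update({'script': {'version_num': ''}})
```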
We are working hard on release 1.7; once that is out we will push an RC for review (I hope) 🙂
PompousParrot44 unfortunately not yet 🙂
But the gist is :
MongoDB stores experiment data (i.e. execution parameters, git ref, etc.)
ElasticSearch stores results (i.e. metrics, console logs, debug image links, etc.)
Does that help?
This should have worked with the latest clearml RC.
And you verified it is not working?
Hi SubstantialElk6
You are uploading an artifact; a good use case for a numpy artifact would be a feature table.
If you want to upload an image, use either report_media or report_image, or upload the PIL image as an artifact.
What do you think?
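For instance, a minimal sketch of both options (the file name is a placeholder):
```python
from clearml import Task
from PIL import Image

task = Task.init(project_name="examples", task_name="report-image")
img = Image.open("sample.png")  # placeholder image file

# report as a debug image (shows under the task's debug samples)
task.get_logger().report_image(
    title="samples", series="input", iteration=0, image=img)

# or upload the PIL image as an artifact
task.upload_artifact(name="sample_image", artifact_object=img)
```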
DilapidatedDucks58 by default if you continue to execution, it will automatically continue reporting from the last iteration. I think this is what you are seeing
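If you want the same behaviour when re-running manually, a sketch (whether continue_last_task fits your exact flow is an assumption):
```python
from clearml import Task

# resume the previously aborted/closed task instead of creating a new one,
# so scalars keep counting from the last reported iteration
task = Task.init(project_name="examples", task_name="train",
                 continue_last_task=True)
```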
well from 2 to 30sec is a factor of 15, I think this is a good start 🙂
Hmm, #790 should be solved in 1.7.2
Yes, I always see the "model uploaded completed" for such stuck tasks.
Any chance this is reproducible?
How many processes do you see running (i.e. ps -Af | grep python)?
What is the training framework? Is it multiprocess? How are you launching the process itself? Is it Linux OS? Is it running inside a specific container?
And other question is clearml-serving ready for serious use?
Define serious use? KFserving support is in the pipeline, if that helps.
Notice that clearml-serving is basically a control plane for the serving engine, not to neglect the importance of it, the heavy lifting is done by Triton 🙂 (or any other backend we will integrate with, maybe Seldon)
I assume every fit starts reporting from step 0, so they override one another. Could it be?
So how can I temporarily fix it?
Try: task.output_uri = task.get_output_destination()
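In context, a sketch of that workaround from inside the running task:
```python
from clearml import Task

task = Task.current_task()
# re-apply the task's configured output destination as its output_uri
task.output_uri = task.get_output_destination()
```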
It can also work by running on multiple known nodes.
Horovod sits on top of OpenMPI, which needs SSH to open multiple nodes. I'm not sure how one would connect it without passing the SSH keys from one node to the other and making sure they can communicate directly. (Not saying it is not possible, just a few things to configure before it works; the enterprise edition removes the need for the direct SSH connection between the nodes.)
How would I add a glue for multinode?
Basic...
but it still is not able to run any task after I abort and rerun another task
When you "run" a task you are pushing it to a queue, so how come a queue is empty? what happens after you push your newly cloned task to the queue ?
Could you send me the console log of both tasks, the failing and the passing one?
Multi-threaded, multi-processes, multi-nodes 🙂
one of the two experiments for the worker that is running both experiments
So this is the actual bug? I need some more info on that; what exactly are you seeing?