I think you are correct 😞 Let me make sure we add that (docstring and documentation)
SmarmyDolphin68, all looks okay to me...
Could you verify you still get the plot on debug samples as an image with the latest trains RC?
pip install trains==0.16.4rc0
worker nodes are bare metal and they are not in k8s yet
By default the agent will use 10022 as an initial starting port for running the sshd that will be mapped into the container. This has nothing to do with the Host machine's sshd. (I'm assuming agent running in docker mode)
Hi JuicyFox94
you pointed to exactly the issue 🙂
In your trains.conf
https://github.com/allegroai/trains/blob/f27aed767cb3aa3ea83d8f273e48460dd79a90df/docs/trains.conf#L94
Do people use ClearML with huggingface transformers? The code is std transformers code.
I believe they do 🙂
There is no real way to differentiate between, "storing model" using torch.save
and storing configuration ...
Great ascii tree 🙂
GrittyKangaroo27 assuming you are doing:
@PipelineDecorator.component(..., repo='.')
def my_component():
    ...
The function my_component
will be running in the repository root, so in theory it could access the packages 1/2
(I'm assuming here directory "project" is the repository root)
Does that make sense ?
BTW: when you pass repo='.'
to @PipelineDecorator.component
it takes the current repository that exists on the local machine running the pipel...
ElegantCoyote26 what is the model input layer definition? This implies the data format to pass to the serve endpoint
BTW:
Error response from daemon: cannot set both Count and DeviceIDs on device request.
Googling it points to a docker issue (which makes sense considering):
https://github.com/NVIDIA/nvidia-docker/issues/1026
What is the host OS?
Does it work if you remove the Task.init call?
Ok, the doc needs a fix
suggestion?
Is this reproducible? I tried to run the same example code on my machine, and it started training ...
Do you have issues with other pytorch examples? Could you try the simple reporting example:
https://github.com/allegroai/clearml/blob/master/examples/reporting/scalar_reporting.py
Also btw, is this supposed to be a screenshot from the community version?
Hmm, seems like a screenshot from an enterprise version, I'll ask them to update 🙂
I also don't understand how clearml-serving is doing the versioning for models in triton.
Basically you have two Tasks, one is the "controller" checking model changes and updating itself.
The other is the engine, checking on the "controller" Task, which models it needs to download/configure and replaces them.
This way you can ha...
Where do you store those ?
GreasyPenguin14 yes there is 🙂
https://github.com/allegroai/clearml/issues/209
Set environment variable CLEARML_NO_DEFAULT_SERVER=1
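If you prefer to set it from inside the script rather than the shell, a minimal sketch could look like the following. The assumption here (not confirmed in the thread) is that the variable has to be in the environment before ClearML reads its configuration, so it is set at the very top, before any ClearML import.

```python
import os

# Assumption: set this before importing or initializing ClearML,
# so the setting is already in place when the configuration is loaded.
os.environ["CLEARML_NO_DEFAULT_SERVER"] = "1"

# Read it back, only to show what the process (and any child process) now sees.
no_default_server = os.environ.get("CLEARML_NO_DEFAULT_SERVER")
```

Exporting the variable in the shell that launches the script has the same effect and avoids touching the code.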
Maybe this is part of the paid version, but would be cool if each user (in the web UI) could define their own secrets,
Very cool (and actually how it works), but at the end someone needs to pay for salaries 😉
The S3 bucket credentials are defined on the agent, as the bucket is also running locally on the same machine - but I would love for the code to download and apply the file automatically!
I have an idea here, why not use the "docker bash script" argument for that ?...
Hi RoughTiger69
How about using the pipeline decorator as a way to run this logic?
https://github.com/allegroai/clearml/blob/master/examples/pipeline/pipeline_from_decorator.py
I think I'm missing the context of where the code is executed....
btw: you can now set the configuration_objects directly when calling add_step 🙂
https://clearml.slack.com/archives/CTK20V944/p1633355990256600?thread_ts=1633344527.224300&cid=CTK20V944
I'm not sure if https will work because I want to use ssh keys for creds.
BTW: I was not aware GitHub provides a PyPI-like artifactory, do they?
Regarding SSH keys, they are passed from the host machine (i.e. in venv mode it will use the SSH keys from the user running the agent, and in docker mode, they are automatically mapped into the container)
Is there still an issue? Could it be the browser cannot access the file server directly?
MortifiedCrow63 , hmmm can you test with manual upload and verify ?
(also what's the clearml version you are using)
HarebrainedBear62 this is what I have.
clearml-data will store all the files for you and version the entire thing, making it a breeze to abstract the dataset from the code. Querying data is available using Apache Drill (though currently it is still not built into the platform, but we are planning to get there soon). Since this is image-based data/meta-data, I know the paid tier of ClearML has an additional dedicated data management solution specifically for images, with full ability to query m...
The quickest workaround would be, In your final code just do something like:
my_params_for_hpo = {'key': omegaconf.key}
task.connect(my_params_for_hpo, name='hpo_params')
call_training_with_value(my_params_for_hpo['key'])
This will initialize the my_params_for_hpo
with the values from OmegaConf, and allow you to override them in the hyperparameter section (task.connect is two-way: in manual mode it stores the data on the Task, in agent mode it takes the values from the Task and puts them ba...
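To make the two-directional behaviour concrete, here is a toy stand-in written with plain dictionaries. It is not the ClearML API: `connect`, `task_store` and the `remote` flag are hypothetical names used only to illustrate the idea that a manual run records the local values, while an agent run overrides them with whatever is stored on the Task.

```python
def connect(params, stored, remote):
    """Toy model of a two-way connect (hypothetical, not ClearML code)."""
    if remote:
        # Agent mode: values stored on the Task override the local defaults.
        params.update(stored)
    else:
        # Manual mode: the local values are recorded on the Task.
        stored.clear()
        stored.update(params)
    return params


# Manual run: the defaults coming from the config are recorded.
task_store = {}
manual_params = connect({"key": 0.1}, task_store, remote=False)

# Someone edits the stored value (simulating an override in the UI / HPO).
task_store["key"] = 0.5

# Agent run: the same code starts from the default but ends up with the override.
agent_params = connect({"key": 0.1}, task_store, remote=True)
```

The training call then only ever reads from the connected dictionary, which is why the same script works unchanged in both modes.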
SmarmyDolphin68 okay, what's happening is the process exits before the actual data is being sent (report_matplotlib_figure is an async call, and data is sent in the background)
Basically you should just wait for all the events to be flushed:
task.flush(wait_for_uploads=True)
That said, quickly testing it, it seems it does not wait properly (again I think this is due to the fact we do not have a main Task here, I'll continue debugging)
In the meantime you can just do:
sleep(3.0)
And it wil...
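For anyone wondering why an async report can get lost when the process ends early, here is a small self-contained sketch of the general pattern. It does not use ClearML at all: `BackgroundReporter` is a hypothetical class with a daemon thread that "sends" events, and `flush()` blocks until the queue is drained, which is the role a blocking flush plays before exit.

```python
import queue
import threading
import time


class BackgroundReporter:
    """Hypothetical reporter: report() returns at once, a daemon thread sends later."""

    def __init__(self):
        self._events = queue.Queue()
        self.sent = []
        # Daemon thread: it is killed when the main thread exits,
        # so anything still queued at that point is lost.
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            event = self._events.get()
            time.sleep(0.05)  # pretend network latency
            self.sent.append(event)
            self._events.task_done()

    def report(self, event):
        self._events.put(event)

    def flush(self):
        # Block until every queued event has been marked as done.
        self._events.join()


reporter = BackgroundReporter()
reporter.report("figure-1")
reporter.report("figure-2")
reporter.flush()  # without this, the script could exit before anything is sent
sent_after_flush = list(reporter.sent)
```

A fixed sleep works for the same reason, it just guesses the time instead of waiting on the queue.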
- Suppose that the serving project A is serving some model version 1, then a new model is trained and it starts serving model version 2, but at runtime, for some reason, we need to revert to model version 1. What would be the best way to achieve that?
If you archive the model, then clearml-serving will pick the "latest" non-archived model, essentially reverting to the previous version. Also notice that it supports multiple versions on a single endpoint (again also a feat...
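The archive-based revert described above boils down to "serve the newest model that is not archived". A minimal sketch of that selection rule, using hypothetical names (`ModelEntry`, `pick_served`) rather than any real serving API:

```python
from dataclasses import dataclass


@dataclass
class ModelEntry:
    """Hypothetical registry record, for illustration only."""
    version: int
    archived: bool = False


def pick_served(models):
    """Return the highest-version model that is not archived, or None."""
    live = [m for m in models if not m.archived]
    return max(live, key=lambda m: m.version) if live else None


models = [ModelEntry(version=1), ModelEntry(version=2)]
served_before = pick_served(models).version  # version 2 is the newest

# Archiving version 2 makes version 1 the newest non-archived model again.
models[1].archived = True
served_after = pick_served(models).version
```

Un-archiving version 2 later would make it the served model again under the same rule.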
Hi SubstantialElk6
I can't see that it was removed, could you send the full log?
Additionally, I found that the clearml==1.0.5 package is able to find these partial changes, while newer versions find nothing at all; maybe it's because it's always comparing against remote
Hmm it was always from remote...
it is actually doing the following:
git rev-parse --abbrev-ref --symbolic-full-name @{u}
Then with the branch name output,
git diff --submodule=diff <add_branch_name_here>
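If you want to reproduce that comparison yourself, the two commands can be chained with `subprocess`. This is only a sketch of the two steps quoted above, not the code ClearML runs internally; the helper names `diff_cmd` and `diff_against_upstream` are made up for the example, and it assumes the current branch has an upstream configured.

```python
import subprocess

# Step 1: resolve the upstream (remote tracking) branch of the current branch.
UPSTREAM_CMD = ["git", "rev-parse", "--abbrev-ref", "--symbolic-full-name", "@{u}"]


def diff_cmd(upstream):
    """Step 2: diff the working tree (including submodules) against that branch."""
    return ["git", "diff", "--submodule=diff", upstream]


def diff_against_upstream(repo_dir="."):
    """Run both steps in repo_dir; raises CalledProcessError if no upstream is set."""
    upstream = subprocess.run(
        UPSTREAM_CMD, cwd=repo_dir, check=True, capture_output=True, text=True
    ).stdout.strip()
    return subprocess.run(
        diff_cmd(upstream), cwd=repo_dir, check=True, capture_output=True, text=True
    ).stdout
```

Calling `diff_against_upstream()` inside the repository should print the same text you would get by running the two commands by hand, which makes it easy to check what "comparing against remote" actually produces.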
SmilingFrog76
there is no internal scheduler in Trains
So obviously there is a scheduler built into Trains, this is the queues (order / priority)
What is missing from it is multi node connection, e.g. I need two agents running the exact same job working together.
(as opposed to, I have two jobs, execute them separately when a resource is available)
Actually my suggestion was to add a SLURM integration, like we did with k8s (I'm not suggesting Kubernetes as a solution for you, the op...
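To illustrate what "queues (order / priority)" gives you as a scheduler, and what it does not, here is a toy model. It is not Trains code: `QueueScheduler` and the queue names are hypothetical. Jobs are FIFO within a queue, and a worker always drains higher-priority queues first. Each job is handed to a single worker, so there is no notion of two agents cooperating on one job, which is the missing piece mentioned above.

```python
from collections import deque


class QueueScheduler:
    """Toy model: named queues served in a fixed priority order, FIFO within each."""

    def __init__(self, priority_order):
        self._order = list(priority_order)
        self._queues = {name: deque() for name in self._order}

    def enqueue(self, queue_name, job):
        self._queues[queue_name].append(job)

    def next_job(self):
        # Scan the queues from highest to lowest priority.
        for name in self._order:
            if self._queues[name]:
                return self._queues[name].popleft()
        return None  # nothing pending


scheduler = QueueScheduler(["high", "default"])
scheduler.enqueue("default", "train-a")
scheduler.enqueue("high", "train-b")
scheduler.enqueue("default", "train-c")

# A single worker pulling three times.
pulled = [scheduler.next_job() for _ in range(3)]
```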