Ohh StraightCoral86 did you check clearml-task ? This is exactly what it does
(this is the CLI, from code you basically call Task.create & Task.enqueue)
Will this solve it ?
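Roughly what that looks like in code (a minimal sketch; the project, repo and script names are placeholders):
```python
from clearml import Task

# build a Task from an existing repo + script (what clearml-task does under the hood)
task = Task.create(
    project_name="examples",                   # placeholder project
    task_name="remote run",                    # placeholder name
    repo="https://github.com/user/repo.git",   # placeholder repository
    script="train.py",
)
# push it to a queue for a clearml-agent to execute
Task.enqueue(task, queue_name="default")
```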
HollowPeacock58 seems like an internal issue copying this object config.model
This is a complex object, and it seems that for some reason it fails to copy.
As a workaround just do not connect this object, it seems you cannot pickle it / copy it (see GH issue)
well I do not think you set your pytorch lightning to use cuda:
GPU available: True (cuda), used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/code/.venv/lib/python3.9/site-packages/lightning/pytorch/trainer/setup.py:176: PossibleUserWarning: GPU available but not used. Set `accelerator` and `devices` using `Trainer(accelerator='gpu', devices=1)`.
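i.e. when constructing the trainer, something along these lines (a minimal sketch):
```python
from lightning.pytorch import Trainer

# explicitly request the GPU so Lightning does not silently fall back to CPU
trainer = Trainer(accelerator="gpu", devices=1)
```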
PompousBeetle71 , basically resetting an experiment will clear all the outputs, but the input model is, well, input, so it is not cleared. In the next execution it will be overridden. There is actually a way to change it from the UI and override the initial model weights.
Hi UpsetWalrus59
All correct, with the exception of " ...or 1GB Metric": this is a limit, since metrics (and metadata) are always stored on the clearml-server, so they are metered. There is also an API limit, basically anti-abuse, which of course resets every month, but if you are running tens of experiments at the same time you will hit this limit. Make sense ?
TenseOstrich47 it's based on the first free "index", so the first index not in use will be captured, but if you remove agents the order will change, e.g. if you take down worker #1, the next worker you spin up will be #1 because it is not taken.
BattyLion34 I have a theory: I think that any Task on the "default" queue will fail if a Task is running on the "service" queue.
Could you create a toy Task that just prints ".", sleeps for 5 seconds, and then prints again?
Then while that Task is running, from the UI launch the Task that passed, on the "default" queue. If my theory holds it should fail, then we will be getting somewhere 🙂
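Something like this would do (a sketch, project/task names are placeholders):
```python
import time

from clearml import Task

# toy task: prints a dot every 5 seconds, forever
task = Task.init(project_name="debug", task_name="toy sleep task")
while True:
    print(".")
    time.sleep(5)
```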
Hmmm, yes we should definitely add --debug (if you can, please add a GitHub issue so it is not forgotten).
FiercePenguin76 Specifically are you able to ssh manually to <external_address>:<external_ssh_port> ?
Hi SubstantialElk6
ClearML-Data doesn't actually "load" the data, it brings it locally and returns a folder with all your data files; from that point onward, it's up to your code to load it into the framework. Make sense ?
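For example (the dataset names here are placeholders):
```python
from clearml import Dataset

# fetch (or reuse a cached copy of) the dataset; returns a local folder path
dataset = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
local_folder = dataset.get_local_copy()
# from here on, your own code reads the files and feeds them to the framework
print(local_folder)
```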
DefiantHippopotamus88 you can create a custom endpoint and do that, but it will be running in the same instance. Is this what you are after? Notice that Triton actually supports it already, you can check the pytorch example
maybe worth updating the main Readme.md on GitHub.. if someone tries to follow the instructions there it breaks
Hmm, I thought we already did. Yes, you are absolutely correct, I'll make sure we do
WackyRabbit7 basically starting with v1.1, if you are running code without any configuration file you will get an error (in contrast to previous versions, where it defaulted to the demo-server)
Hi NonchalantSeaanemone34
pipeline decorator where lambda function calls another function (say xyz ) and during pipeline execution, error is thrown that xyz is not defined?
Each pipeline function becomes a standalone "script", so I assume that if the lambda function is defined outside of the decorated pipeline component function, it would throw an undefined error.
My suggestion would be to define the lambda function as a nested function inside the pipeline component, so it gets packaged with it.
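A minimal sketch of what I mean (the component and pipeline names are made up):
```python
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["result"])
def step(data):
    # defined inside the component, so it is included
    # in the component's standalone script
    xyz = lambda v: v * 2
    return [xyz(v) for v in data]

@PipelineDecorator.pipeline(name="example", project="examples", version="1.0")
def pipeline():
    print(step([1, 2, 3]))
```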
because a step can be constructed with multiple sub-components, but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node we want that to also appear in the UI graph (and have its own Task / metrics etc.), correct?
ohh, the copy paste thing when you generate credentials ?
BTW: there is a full Pipeline class that does everything for you, example here:
https://github.com/allegroai/clearml/tree/master/examples/pipeline
Hi ThickSeaurchin47
Specifically I'm getting the error "could not access credentials"
Put your minio credentials here:
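In your ~/clearml.conf it would look something like this (the host and keys are placeholders):
```
sdk {
    aws {
        s3 {
            credentials: [
                {
                    # your minio endpoint, including the port
                    host: "my-minio-host:9000"
                    key: "minio-access-key"
                    secret: "minio-secret-key"
                    multipart: false
                    secure: false
                }
            ]
        }
    }
}
```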
corporate firewall... let's start with http 🙂
That sounds like an issue with "working dir" , check the "Execution" "Working Directory" field.
'.' means the root of the git repository
'subfolder' means run the script from that subfolder, etc. Also make sure that the script path is adjusted accordingly.
btw: Trains should have filled in all the correct paths... If you have time, get the latest trains (0.14.3) and run again to see if the problem persists, we should probably fix that bug 🙂
BattyLion34
if I simply clone nntraining stage and run it in default queue - everything goes fine.
When you compare the Task you clone manually and the Task created by the pipeline , what's the difference ?
are you referring to extra_docker_shell_script ?
Correct
the thing is that this runs before you create the virtual environment, so then in the new environment those settings are no longer there
Actually that is better, because this is what we need to set up pip before it is used. So instead of passing --trusted-host
just do:
```
extra_docker_shell_script: ["echo \"[global] \n trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S...
```
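For reference, my guess at the complete entry; the pip.conf path and the YOUR_SERVER host are assumptions on my part:
```
extra_docker_shell_script: [
    "echo -e \"[global]\ntrusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_SERVER\" > /etc/pip.conf"
]
```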
Yes, there is no real limit, I think the only requirement is docker v19+
WackyRabbit7
I do 'pkill -f trains' but it's the same...
If you need to debug and test, run with --foreground and just hit Ctrl-C to end the process (it will never switch to background...). Helps?
ItchyJellyfish73
Unfortunately this needs backend support, and only available in the enterprise version, what is your use case for it? (It was designed to allow out of the box bare-metal multi gpu dynamic allocation, think DGX with 8 GPUs that instead of spinning down agents when you want to change the queue->num-gpu mapping you can do it on the fly)
GiganticTurtle0 is it just --stop that throws this error ?
btw: if you add --queue default to the command line I assume it will work. The thing is, without --queue it will look for any queue with the "default" tag on it, and since there are none, we get the error.
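e.g. (a sketch of the invocation):
```
clearml-agent daemon --queue default --stop
```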
regardless that should not happen with --stop
I will make sure we fix it
Just so we do not forget, can you please open an issue on clearml-agent github ?