but is there any other way to get env vars / any value or secret from the host to the docker of a task?
if this is docker -e/--env as argument would do the same-e VAR=somevalue
this
from fastai.callbacks.tensorboard import LearnerTensorboardWriter
doesnβt exist anymore in fastai2
Hmm we should definitely update the example to fastai2 API
maybe the fastai bindings in clearml package are outdated
Are you getting any scalars reported to clearml?
they also appear to be relying on the tensorboard callback which seems not to work on distributed training
Yes that is correct, usually the way it works all nodes report back to "master...
Hi @<1523701079223570432:profile|ReassuredOwl55> let me try ti add some color here:
Basically we have to parts (1) pipeline logic, i.e. the code that drives the DAG, (2) pipeline components, e.g. model verification
The pipeline logic (1) i.e. the code that creates the dag, the tasks and enqueues them, will be running in the git actions context. i.e. this is the automation code. The pipeline components themselves (2) e.g. model verification training etc. are running using the clearml agents...
can you tell me what the serving example is in terms of the explanation above and what the triton serving engine is,
Great idea!
This line actually creates the control Task (2)clearml-serving triton --project "serving" --name "serving example"
This line configures the control Task (the idea is that you can do that even when the control Task is already running, but in this case it is still in draft mode).
Notice the actual model serving configuration is already stored on the crea...
it certainly does not use tensorboard python lib
Hmm, yes I assume this is why the automagic is not working π
Does it have a pythonic interface form the metrics ?
Hmm, how does your preprocessing code looks like?
what do you see in the console when you start the trains-agent , it should detect the cuda version
Out of interest, is there a reason these are read-only?
Yes, we should probably change that... they are designed to be pre-populated, but there should not be any reason you could not remove them
The code for these tasks is on github right?
Correct
Actually that is less interesting, as it is quite straight forward
We should probably change it so it is more human readable π
Once a model is saved and published, it should be downloadable right
Well that depends if you configured CLearML to autoupload it (by default it will just log the "local location").
To auto-upload add output_uri=True to Task.Init (or specify a destination with output_uri= ` s3://bucket/ )
You can also configure it as default here:
https://github.com/allegroai/clearml/blob/65f1c0baa124efb05fb7894a5386f0dd52c0536b/docs/clearml.conf#L163
AttractiveCockroach17 I verified this is an issue with hypeparemeters with "." or section names with ".", thank you for noticing!
I will make sure I pass it along, should be part of the next version (ETA a week) π
Hi AstonishingSwan80 , what do you mean by "ec2 API"?
odd message though ... it should have said something about boto3
Hi Martin, of course not,
Smart!
I was just wondering if it has been patched yet and if not what is the expected timeline for patching it
Yes, I believe the target is a patch version 1.15.1 to be released in a couple of weeks. This is not a major issue but it's always better to have have it fixed. (btw: the enterprise version never had this issue to being with, because it is of course authenticated, as well as it has additional RBAC layer on top.)
So the agent installed okay. It's the specific Task that the agent is failing to create the environment for, correct?
if this is the case, what do you have in the "Installed Packages" section of the Task (see under the Execution tab)
is it consistent ? (the error), meaning it happens on other integer values ?
This is odd, what is the parameter?
I assume it needs sorting and one time this is Integer, and the next it is a String, so the server cannot sort based on it. Could that be ?
the latter is an ec2 instance
and the agent fails to install on the ec2 machine ?
CrookedWalrus33 from the log it seems the code is trying to use "kwcoco" but it is not listed under any "Installed packages" nor do you see any attempt to install it. Can you confirm ?
GreasyPenguin66 Nice !!!
Very cool setup, and kudos on making it work with multiple users!
Quick question, shouldn't the JUPYTERHUB_API_TOKEN env variable be enough to gain access to the server? Why did you need to add it to the 'nbserver-x.json' as well?
Hi QuaintPelican38 can you manually access the machine based on the IP it registered
(Look under the DevOps project, you'll see a running Task "interactive session" under the configuration tab, user properties you should find the IP
docstring ?
Usually the preferred way is StorageManager
https://clear.ml/docs/latest/docs/references/sdk/storage
https://clear.ml/docs/latest/docs/integrations/storage
Hi @<1576381444509405184:profile|ManiacalLizard2>
You can also use env vars, it might be easier, I'm assuming this is kind of CI/CD process
'''
export CLEARML_API_ACCESS_KEY="your-public-key"
export CLEARML_API_SECRET_KEY="your-private-secret"
export CLEARML_API_HOST=" https://api.clear.ml "
export CLEARML_WEB_HOST=" https://app.clear.ml "
export CLEARML_FILES_HOST=" https://files.clear.ml "
'''
[https://clear.ml/do...
@<1571308003204796416:profile|HollowPeacock58> seems like an internal issue copying this object config.model
This is a complex object, and it seems that for some reason
None
As a workaround just do not connect this object. it seems you cannot pickle it / copy it (see GH issue)
Hi VexedCat68
can you supply more details on the issue ? (probably the best is to open a github issue, and have all the details there, so we have better visibility)
wdyt?
We have tried to manually restart tasks reloading all the scalars from a dead task and loading latest saved torch model.
Hi ThickKitten19
how did you try to restart them ? how are you monitoring dying instances ? where . how they are running?
Hmm is this similar to this one https://allegroai-trains.slack.com/archives/CTK20V944/p1597845996171600?thread_ts=1597845996.171600&cid=CTK20V944
(as i see the services worker is only in the services-queue, and not my default queue (where my other servers/workers are)
So basically the service-mode is just a flag passed to the agent, and the services queue is the name of the queue it will pull from.
If i want a normal worker also
You can just add another section to the docker-compose, or run it manually after you spin the docker-compose.
LazyFox65 wdyt ?