Are there any references (vlog/blog) on deploying a real-time model and running a continuous training pipeline in ClearML?
Something along the lines of this one?
https://clear.ml/blog/creating-a-fully-automatic-retraining-loop-using-clearml-data/
Or this one?
https://www.youtube.com/watch?v=uNB6FKIi8Wg
Yes, that should work. The only thing is you need to call Task.init on the master process (and make sure you call Task.current_task() on the subprocesses if you want the automagic to kick in). That said, usually there is no need, since they are supposed to report everything back to the main one anyhow.
basically:
```python
@call_parse
def main(
    gpus: Param("The GPUs to use for distributed training", str) = 'all',
    script: Param("Script to run", str, opt=False) = '',
    args: Param("Args to pass to script", nargs=...
```
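For reference, a minimal sketch of that master/subprocess pattern (project/task names are made up, and it assumes fork-based multiprocessing on Linux so the subprocess inherits the master's task):
```python
from multiprocessing import Process
from clearml import Task

def worker(rank):
    # re-attach to the Task created by the master process,
    # so anything reported here lands on the same Task
    task = Task.current_task()
    task.get_logger().report_scalar("worker", "rank", value=rank, iteration=0)

if __name__ == "__main__":
    # Task.init is called only on the master process
    task = Task.init(project_name="examples", task_name="distributed master")
    workers = [Process(target=worker, args=(r,)) for r in range(2)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()
```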
Hi NonsensicalSparrow35,
however for the remote file it always creates the name with the following pattern:
{filename_prefix}checkpoint{n}.pt
..
Is this the main issue?
Notice that the model name (i.e. the entry on the Task itself) is not directly connected with the stored file name on the target file server (or S3)
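To illustrate (a sketch, not your exact setup; the names are made up, and it assumes the checkpoint file already exists locally):
```python
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="model naming demo")

# the model entry name as it appears on the Task...
model = OutputModel(task=task, name="my-readable-model")

# ...is independent of the file name stored on the file server / S3;
# target_filename controls the remote name regardless of the local one
model.update_weights(
    weights_filename="filename_prefix_checkpoint3.pt",
    target_filename="checkpoint.pt",
)
```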
BroadMole98 thank you for noticing!
I'll make sure it is fixed (a few other properties are also missing there, not sure why, I'll ask them to take a look)
I would recommend reading this blog post, it should give you a glimpse of what can be built 🙂
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
Hi RoughTiger69
A. Yes, makes total sense. Basically you can use Task.export_task / Task.import_task to achieve this (notice we assume the dataset artifact links are accessible from both servers, which is usually the case).
B. The easiest way would be to use Process: one subprocess exports from dev, with the credentials and configuration passed via OS environment variables, and another subprocess imports into the prod server (again with the OS environment pointing to the prod server). Make sense?
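A rough sketch of the idea (server URLs and keys are placeholders; it assumes Task.export_task / Task.import_task and the standard CLEARML_API_* environment variables):
```python
import json
import os
import subprocess
import sys

# each snippet runs in its own subprocess with its own server credentials
EXPORT_CODE = (
    "import json, sys; from clearml import Task; "
    "json.dump(Task.get_task(task_id=sys.argv[1]).export_task(), open(sys.argv[2], 'w'))"
)
IMPORT_CODE = (
    "import json, sys; from clearml import Task; "
    "Task.import_task(json.load(open(sys.argv[1])))"
)

def run_with_server(code, args, server_env):
    # credentials and configuration are passed via the os environment
    env = {**os.environ, **server_env}
    subprocess.run([sys.executable, "-c", code, *args], env=env, check=True)

# export from the dev server ...
run_with_server(EXPORT_CODE, ["<dev-task-id>", "task.json"], {
    "CLEARML_API_HOST": "https://api.dev.example.com",
    "CLEARML_API_ACCESS_KEY": "<dev-key>",
    "CLEARML_API_SECRET_KEY": "<dev-secret>",
})
# ... then import into the prod server
run_with_server(IMPORT_CODE, ["task.json"], {
    "CLEARML_API_HOST": "https://api.prod.example.com",
    "CLEARML_API_ACCESS_KEY": "<prod-key>",
    "CLEARML_API_SECRET_KEY": "<prod-secret>",
})
```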
We suddenly have a need to set up our logging after every task.close()
Hmm, that gives me a handle on things. Any chance it is easily reproducible?
JitteryCoyote63 I think I found the bug in clearml-task: it adds it at the end instead of before everything else.
Hi LovelyHamster1,
You mean totally ignore the "installed packages" section and only use the requirements.txt?
Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
LethalCentipede31 I think seaborn uses matplotlib under the hood, so it should just work:
https://github.com/allegroai/clearml/blob/6a91374c2dd177b7bdf4c43efca8e6fb0d432648/examples/frameworks/matplotlib/matplotlib_example.py#L48
Thanks!
I think this one will cover both cases (the issue is with files on the root of the dataset):
if not (fnmatch(k, path) and fnmatch(k if '/' in k else '/{}'.format(k), '*/' + wildcard))}
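In context, the condition behaves roughly like this (a standalone sketch; the file names and wildcard are made up):
```python
from fnmatch import fnmatch

path, wildcard = '*', '*.txt'
files = {'root_file.txt': 1, 'sub/dir_file.txt': 2, 'image.png': 3}

# root-level entries get a leading '/' so the '*/' + wildcard pattern
# matches them exactly like files inside sub-directories
filtered = {k: v for k, v in files.items()
            if not (fnmatch(k, path)
                    and fnmatch(k if '/' in k else '/{}'.format(k), '*/' + wildcard))}

print(filtered)  # {'image.png': 3}
```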
Hi DilapidatedDucks58
eg, we want max validation accuracy and all other metric values for the corresponding epoch
Is this the equivalent of a nested sort?
Wouldn't you get the requested behavior if you add all metric columns but sort based on the "accuracy" column ?
Okay, make sure that in your trains.conf on all the trains-agent machines you add the following:
agent.extra_docker_arguments: ["-v", "/etc/hosts:/etc/hosts",]
EcstaticGoat95 any chance you have an idea on how to reproduce? (even 1 out of 6 is a good start)
With pleasure, I'll make sure we officially release RC1 soon :)
I'll make sure we have conda ignore git:// packages, and pass them to the second pip stage.
Check the log to see exactly where it downloaded torch from. Just making sure it used the right repository and did not default to pip, where it might have gotten a CPU version...
PunySquid88 RC1 is out with a fix:
pip install trains-agent==0.14.2rc1
Try adding this environment variable:
export TRAINS_CUDA_VERSION=0
See the last package in the package list:
- wget~=3.2
- trains~=0.14.1
- pybullet~=2.6.5
- gym-cartpole-swingup~=0.0.4
- //github.com/ajliu/pytorch_baselines
Hi CourageousWhale20
Most documentation is here https://allegro.ai/docs
PunySquid88 do you want to test a fix?
Change add_missing_installed_packages to False here, and see if you end up with the git diff:
https://github.com/allegroai/clearml/blob/1f82b0c4010799be6157f5c845c7f6ac48e71c0c/clearml/backend_interface/task/populate.py#L158
See if this helps
Hi RipeGoose2
Yes, the "services-mode" of an agent will take multiple Tasks, that said, these are "service" i.e. light CPU tasks, think pipeline controllers etc.