Are you sure you added the pytorch channel in clearml.conf ?
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L64
you should have something like 192.168... or 10.0 ....
command line
cmd.exe / bash
One last question: Is it possible to set the pip_version task-dependent?
no... but why would it matter on a Task basis ? (meaning what would be a use case to change the pip version per Task)
NaughtyFish36
No module named 'leap.learn.data_tools.merge_data.merge_data'
This seems to be the error, but I cannot see leap in the installed packages. Notice that if the Task has an "Installed Packages" section, the agent will use that, not the "requirements.txt". Only if this section is empty will it revert to the "requirements.txt" in the repo.
How did you create the Task in the first place?
I see that you added "leap" into the initial bash script; actually you should add i...
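For reference, a minimal sketch of forcing an extra package into the Task's recorded requirements from code (assuming the missing module is provided by a pip-installable package called leap; swap in the real package name / version):
```python
from clearml import Task

# Must be called before Task.init so the package lands in the
# "Installed Packages" section the agent later installs from.
# "leap" is a placeholder - use the real package name (and optionally a version).
Task.add_requirements("leap")

task = Task.init(project_name="examples", task_name="merge data step")
```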
ExcitedFish86 this is a general "dummy agent" that pulls Tasks and executes them (no env created, no code cloned, as you suggested)
how does this work with HPO?
The HPO clones Tasks, changes arguments, pushes them into a queue, and monitors the metrics in real time. The missing part (from my understanding) was that the execution of the Tasks themselves requires setup, and that you wanted multiple-machine support; to overcome that, I posted a dummy agent that just runs the Tasks.
(Notice...
Hi LudicrousParrot69
A bit of background:
A Task is a job executed in the system (sometimes it is a training experiment, sometimes a controller like the pipeline). Basically every process can be a Task.
Specifically, the pipeline controller itself (i.e. the process running the Bayesian optimization) is a Task in the system (i.e. a job running). What it does (using the HyperParameterOptimizer) is clone previously executed Tasks (e.g. training experiments), change their parameters and moni...
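For context, a minimal HyperParameterOptimizer sketch along these lines (shown with the current clearml package names; the base task ID, parameter names and objective metric are placeholders):
```python
from clearml import Task
from clearml.automation import (
    HyperParameterOptimizer, RandomSearch,
    UniformParameterRange, DiscreteParameterRange,
)

# The controller itself is also a Task in the system
task = Task.init(project_name="examples", task_name="HP optimizer",
                 task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    # previously executed training Task to clone (placeholder ID)
    base_task_id="<base_training_task_id>",
    # parameters to override on each clone (placeholder names)
    hyper_parameters=[
        UniformParameterRange("General/learning_rate", min_value=1e-4, max_value=1e-1),
        DiscreteParameterRange("General/batch_size", values=[16, 32, 64]),
    ],
    # metric reported by the training Tasks that the optimizer monitors
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    # clones are pushed into this queue; agents pull and execute them
    execution_queue="default",
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop()
```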
If I call explicitly
task.get_logger().report_scalar("test", str(parse_args.local_rank), 1., 0)
, this will log as expected one value per process, so reporting works
JitteryCoyote63 and do prints get logged as well (from all processes) ?
Because of that, I cannot create a task in this project programmatically locally because it tries to access the bucket and fails. And there is no easy way to change the default output location (not in the web UI, not in the sdk)
JitteryCoyote63 hmm that is a pickle ...
let me check the code ...
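For reference, a per-task override of the output destination at Task.init time might work around the project default here (assuming the SDK's output_uri argument covers this case; the path below is a placeholder):
```python
from clearml import Task

# Per-task override of the output destination, set at init time instead of
# relying on the project-level default that points at the inaccessible bucket.
task = Task.init(
    project_name="my_project",
    task_name="local debug run",
    output_uri="/tmp/clearml_debug_output",  # placeholder local folder
)
```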
A few implementation / design details:
When you run code with Trains (and call init) it will record your environment (python packages, git code, uncommitted changes etc.). Everything is stored on the Task object in the trains-server. When you clone a task you literally create a copy of the Task object (i.e. a second experiment). On the cloned experiment, you can edit everything (parameters, git, base docker image etc.). When you enqueue a Task you add its ID to the execution queue list; a trains-a...
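For reference, a minimal sketch of that clone / edit / enqueue flow from code (shown with the current clearml import; the task ID, parameter name and queue name are placeholders):
```python
from clearml import Task

# Clone an existing (recorded) experiment - this copies the full Task object
cloned = Task.clone(source_task="<original_task_id>", name="cloned experiment")

# Edit anything on the clone before it runs, e.g. a hyperparameter
cloned.set_parameters({"General/learning_rate": 0.01})

# Enqueue it; an agent watching this queue will pull and execute it
Task.enqueue(cloned, queue_name="default")
```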
Okay so my thinking is, on the PipelineController / decorator we will have: abort_all_running_steps_on_failure=False (if True, when a step fails it will abort all running steps and leave)
Then per step / component decorator we will have: continue_pipeline_on_failure=False (if True, when a step fails, the rest of the pipeline DAG will continue)
GiganticTurtle0 wdyt?
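A pseudocode sketch of how the proposed flags might look on the decorator API (both flag names are only the proposal above, not existing arguments of the current PipelineDecorator interface):
```python
# Pseudocode sketch - the two failure-handling flags below do not exist yet,
# they are the names proposed in the message above.
from clearml.automation.controller import PipelineDecorator


@PipelineDecorator.component(
    return_values=["data"],
    continue_pipeline_on_failure=False,  # proposed: if True, the rest of the DAG keeps running if this step fails
)
def load_data():
    return 42


@PipelineDecorator.pipeline(
    name="failure handling example",
    project="examples",
    abort_all_running_steps_on_failure=False,  # proposed: if True, a failing step aborts all running steps
)
def pipeline_logic():
    data = load_data()
    print(data)
```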
` Collecting inplace-abn==1.0.12
Downloading inplace-abn-1.0.12.tar.gz (137 kB)
ERROR: Command errored out with exit status 1:
command: /home/ubuntu/.clearml/venvs-builds/3.8/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"'; __file__='"'"'/tmp/pip-install-xf3qf6et/inplace-abn_15b6998cb4af4199a7692be5d3a3538f/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f...
I call
Task.init
after I import tensorflow (and thus tensorboard?)
That should have worked...
Can you manually add a TB report before calling opennmt function ?
(I want to verify that Task.init is indeed catching the TB calls; my theory is that somewhere inside opennmt we lose the TB)
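For example, a minimal manual TB report placed before the opennmt call (TF2 summary API; log dir and scalar name are arbitrary), something like:
```python
import tensorflow as tf
from clearml import Task

task = Task.init(project_name="examples", task_name="opennmt debug")

# Manual TB scalar before calling into opennmt - if Task.init hooked the TB
# writers correctly, this should show up under the Task's scalars.
writer = tf.summary.create_file_writer("/tmp/tb_debug")
with writer.as_default():
    tf.summary.scalar("manual_check", 1.0, step=0)
writer.flush()
```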
I failed to update the "STARTED AT" and the "COMPLETED AT" attributes in the "INFO" tab.
I'm not sure this can actually be overridden...
Hi LazyTurkey38
Support for configuring these folders will be pushed later today
Basically you'll have in your clearml.conf
` agent {
docker_internal_mounts {
sdk_cache: "/clearml_agent_cache"
apt_cache: "/var/cache/apt/archives"
ssh_folder: "/root/.ssh"
pip_cache: "/root/.cache/pip"
poetry_cache: "/root/.cache/pypoetry"
vcs_cache: "/root/.clearml/vcs-cache"
venv_build: "/root/.clearml/venvs-builds"
pip_download: "/root/.clearml/p...
DM me the entire log, I would assume this is something with the configuration
I have to admit that mounting it to a different drive is a good reason to bring this feature back. The reasoning for removing it was that it means the agent needs to make sure it manages them (e.g. multiple agents running on the same machine).
Hmm, not a bad idea
Could you please open a Git Issue, so it will not get forgotten ?
(btw: I'm not sure how trivial it is to implement, nonetheless obviously possible)
Anything that can be done?
Thanks MinuteGiraffe30 , fix will be pushed later today
Hi @<1628565287957696512:profile|AloofBat92>
Yeah the name is confusing, we should probably change that. The idea is that it is a low-code / high-code platform: train your own LLM and deploy it. Not really a 1:1 comparison with ChatGPT; more like GenAI for enterprises. Make sense?
Hmm yes, this is exactly what should not happen
Let me check it
Are these experiments logged too (with the train-valid curves, etc)?
Yes, every run is logged as a new experiment (with its own set of HP). Do notice that the execution itself is done by the "trains-agent". Meaning the HP process creates experiments with a new set of HP and puts them into the execution queue, then trains-agent pulls them from the queue and starts executing them. You can have multiple trains-agents on as many machines as you like with specific GPUs etc. each one ...
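If it helps, a sketch of pulling those logged runs back out programmatically (assuming an optimizer object like the sketch above; the parameter name is a placeholder):
```python
# Assuming `optimizer` is a running / finished HyperParameterOptimizer instance
top_tasks = optimizer.get_top_experiments(top_k=3)

for t in top_tasks:
    # Each clone is a full experiment: its scalars (train/valid curves etc.)
    # are stored on the Task and can be fetched programmatically as well
    scalars = t.get_reported_scalars()
    params = t.get_parameters()
    print(t.id, params.get("General/learning_rate"), list(scalars.keys()))
```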
@<1539780258050347008:profile|CheerfulKoala77> make sure the AMI id matches the zone of the EC2 machine
JitteryCoyote63 next week is the Trains next release with upgrade to ES 7, do you want to wait or sort a solution for this one ?
(BTW: I think that you can mount a license file or delete one, and it should be okay, I'll ask the backend guys regardless)
i'm Jax, not Manoj! lol.
I know, I just mentioned that this issue is being actively discussed
so that one app I am using inside the Task can use the python packages installed by the agent and I can control the packages using clearml easily
That's the missing part for me. You have all the requirements on the Task (which you can fully control), and the agent sets up a brand new venv for each Task inside a container (the venv is cached, and you can also make the agent just use the default python without installing anything). The part where I'm lost is why would you need the path to t...
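If the question is how an app launched inside the Task can find the interpreter / venv the agent built, a minimal sketch (plain Python, nothing ClearML-specific):
```python
import os
import sys

# Inside the running Task, sys.executable already points at the interpreter
# of the venv the agent built (or the default python if installation was skipped),
# so a subprocess / embedded app can be pointed at the same environment.
venv_python = sys.executable
venv_root = os.path.dirname(os.path.dirname(venv_python))
print("interpreter:", venv_python)
print("venv root:", venv_root)
```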
My plan is to have an AWS Step Functions state machine (DAG) that treats running a ClearML job as one step (task) in the DAG.
...
Yep, that should work
That said, after you have that working, I would actually check pipelines + clearml aws autoscaler, easier setup, and possibly cheaper on the cloud (Lambda vs EC2 instance)
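A rough sketch of what the ClearML step in such a state machine might do, i.e. clone a template Task, enqueue it and block until it finishes (the task ID and queue name are placeholders):
```python
from clearml import Task

# Clone a template Task, push it to a queue, and wait for it to finish -
# roughly what a single Step Functions state (or the Lambda it calls) would do.
task = Task.clone(source_task="<template_task_id>", name="step-functions run")
Task.enqueue(task, queue_name="default")

# Block until the agent finishes executing the Task
task.wait_for_status()
print("final status:", task.get_status())
```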
If this works, we might be able to fully replace Metaflow with ClearML!
Can't wait for your blog post on it