which part of the code?
the main script?!
but it is not part of the package
is the repo itself a package?
Hi CooperativeFox72
Sure 🙂 `task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)`
CharmingStarfish14 can you check something in the code, just to see if this would solve the issue?
Yes the "epoch_loss" is the training epoch loss (as expected I assume).
I thought that was just the loss reported at the end of the training epoch via tf
It is, isn't that what you are seeing ?
what do you mean? the same env for all components ? if they are using/importing exactly the same packages, and using the same container, then yes it could
are you referring to `extra_docker_shell_script` ?
Correct
the thing is that this runs before you create the virtual environment, so then in the new environment those settings are no longer there
Actually that is better, because this is exactly what we need in order to set up pip before it is used. So instead of passing `--trusted-host`
just do:
`extra_docker_shell_script: ["echo '[global] \n trusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_S...`
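For context, a sketch of the idea with the full line spelled out; the final form and the host list are assumptions on my part, and `YOUR_HOST` is a placeholder for your internal index host. Writing `/etc/pip.conf` from the shell script makes pip trust those hosts before the virtual environment is ever created:

```
agent {
    # runs inside the docker container before the venv is created
    extra_docker_shell_script: [
        "echo -e '[global]\ntrusted-host = pypi.python.org pypi.org files.pythonhosted.org YOUR_HOST' > /etc/pip.conf"
    ]
}
```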
Yes, that sounds like a good start. DilapidatedDucks58, can you open a GitHub issue with the feature request?
I want to make sure we do not forget
Hi GrittyKangaroo27
Maybe check the `TriggerScheduler`, and have a function trigger something on k8s every time you "publish" a model?
https://github.com/allegroai/clearml/blob/master/examples/scheduler/trigger_example.py
the only thing that is missing is some plots on the clearml server (app). When I go to the details of the training I cannot see the confusion matrix, for example (but it exists on the bucket)
How do you report the "confusion matrix"? (I might have an idea on what's the difference)
The pod has an annotation with an AWS role which has write access to the s3 bucket.
So assuming the boto environment variables are configured to use the IAM role, it should be transparent, no? (I can't remember the exact env var names, but Google will probably solve it 🙂)
AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_SESSION_TOKEN. I was expecting clearml to pick them up by default from the environment.
Yes it should, the OS env will always override the configuration file sect...
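For reference, these are the standard boto3 credential variables (the values below are obviously placeholders); if they are present in the agent's environment they take precedence over the matching keys in `clearml.conf`:

```shell
# Standard AWS credential env vars picked up by boto3 (placeholder values)
export AWS_ACCESS_KEY_ID="AKIA_PLACEHOLDER"
export AWS_SECRET_ACCESS_KEY="secret-placeholder"
export AWS_SESSION_TOKEN="session-token-placeholder"  # only needed for temporary (IAM role) credentials
```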
Or am I forced to do a get, check if the latest version is finalized,
A Dataset must be finalized before using it. The only situation where it is not is when you are still in the "upload" state.
, then increment the version and create my new version?
I'm assuming there is a data processing pipeline pushing new data?! How do you know you have new data to push?
Hi @<1560798754280312832:profile|AntsyPenguin90>
The image itself is uploaded in a background process; flush just triggers the start of the process.
Could it be that it is showing a few seconds after?
Ok, but when `nvcc` is not available, the agent uses the output from `nvidia-smi`, right? On one of my machines, `nvcc` is not installed, and in the experiment logs of the agent running there, `agent.cuda =` is the version shown with `nvidia-smi`
Already added to the next agent's version 😉
This is something that we do need if we are going to keep using ClearML Pipelines, and we need it to be reliable and maintainable, so I don’t know whether it would be wise to cobble together a lower-level solution that has to be updated each time ClearML changes its serialisation code
Sorry if I was not clear, I do not mean for you to do unstable low-level access; I meant that pipelines are designed to be editable externally, they always deserialize themselves.
The only part that is mi...
is it displaying that it is running anything?
Would that go under `arguments` ?
yes 🙂
Also what is the base path where the git repo is cloned? So if my repo is called myProject.git, what would the full path be?
For example https://github.com/<user>/myProject.git
btw: how come you do not have this field auto-populated from running the code locally or using the `clearml-task` CLI?
SmugOx94
after having installed `numpy==1.16` in the first case, or `numpy==1.19` in the second case. Is it correct?
Correct
the reason is simply that I'd like to setup an MLOps system where
I see the rationale here (obviously one would have to maintain their requirements.txt)
The current way `trains-agent` works is that if there is a list of "installed packages" it will use it, and if it is empty it will default to the requirements.txt
We cou...
OK, so if I've got, like, 2x16GB GPUs ...
You could do: `clearml-agent daemon --queue "2xGPU_32gb" --gpus 0,1`
Which will always use the two gpus for every Task it pulls
Or you could do: `clearml-agent daemon --queue "1xGPU_16gb" --gpus 0` and `clearml-agent daemon --queue "1xGPU_16gb" --gpus 1`
Which will have two agents, one per GPU (with 16gb per Task it runs)
Or: `clearml-agent daemon --queue "2xGPU_32gb" "1xGPU_16gb" --gpus 0,1`
Which will first pull Tasks from the "2xGPU_32gb" qu...
ExcitedFish86
How do I set the config for this agent? Some options can be set through env vars but not all of them
Hmm okay, if you are running an agent inside a container and you want it to spin "sibling" containers, you need to set the following:
mount the docker socket to the container running the agent itself (as you did), basically adding `--privileged -v /var/run/docker.sock:/var/run/docker.sock` Allow the host to mount cache and configuration from the host into the siblin...
ReassuredTiger98 if this user passes to the task as docker args the following, it might work:
`-e CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1`
UnevenDolphin73 go to the profile page, I think at the bottom right corner you should see it
(Also ctrl-F5 to reload the web application, if you upgraded the server 🙂 )
Hi @<1523701523954012160:profile|ShallowCormorant89>
This means the system did not detect any "iteration" reporting (think scalars) and it needs a time-series axis for the monitoring, so it just uses seconds from start
VictoriousPenguin97 I'm not sure there is an easy solution, basically you have to edit both MongoDB (artifacts) and Elastic (think debug samples) 😞
HurtWoodpecker30 currently in the open source only AWS is supported, I know the SaaS pro version supports it (I'm assuming enterprise as well).
You can however manually spin an instance on GCP and launch an agent on the instance (like you would on any machine)
Give me a minute, I'll check something