Hi ClumsyElephant70
Is there a way to run all pipeline steps, not in isolation but consecutively in the same environment?
You mean as part of a real-time inference process?
Yes, that means the nvidia drivers are present (as you mentioned the GPU seems to be detected).
Could you check you have libnvidia-ml.so.1 inside the container?
For example in /usr/lib/nvidia-XYZ/
Setting the credentials on the agent machine means the users cannot use their own credentials, since a k8s glue agent serves multiple users.
Correct, I think "vault" option is only available on the paid tier 😞
but how should we do this for the credentials?
I'm not sure how to pass them; wouldn't it make sense to give the agent all-access credentials?
is this a config file on your side, or something I could change if we had the enterprise version?
Yes, this is one of the things you can configure
Scheduled training is what I’m looking forward to
I'll try to remember to update here once we've pushed into the GitHub repo, feedback is always appreciated 🙂
If in the next two weeks you hear nothing, please ping here to make sure I did not forget 😉
ConfusedPig65 could you send the full log (console) of this execution?
Is there an option to do this from a pipeline, from within the add_step method? Can you link a reference to cloning and editing a task programmatically?
Hmm, I think there is an open GitHub issue requesting a similar ability, let me check on the progress ...
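In the meantime, here's a minimal sketch of cloning and editing a task programmatically (the task ID, queue name and parameter key below are placeholders, not from your setup):
from clearml import Task

# Grab the template task and create a copy of it (a second, draft experiment)
template = Task.get_task(task_id="<template_task_id>")
cloned = Task.clone(source_task=template, name="cloned experiment")

# Edit the draft copy before execution, e.g. override a hyperparameter
# (the "General/learning_rate" key is just an example section/name)
cloned.set_parameters({"General/learning_rate": 0.001})

# Push the edited clone into an execution queue for an agent to pick up
Task.enqueue(cloned, queue_name="default")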
nope, it works well for the pipeline when I don't choose to continue_pipeline
Could you send the full log please?
EnchantingWorm39 you have great timing ;)
New version will contain much more advanced search (including all the task fields)
are there any more fields in this function with partial matching? for example project? tags?
Yes they can all be filtered (basically everything you see in the UI)
notice: tags are strings (you can provide a list of tags), project is the ID of the project
(Use Task.get_project_id, I think)
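For example, a quick sketch with Task.get_tasks (project name, partial task name and tags below are placeholders; as far as I remember task_name supports partial/regexp matching):
from clearml import Task

# Filter tasks by project, partial name and tags (all values are placeholders)
tasks = Task.get_tasks(
    project_name="examples",
    task_name="train",
    tags=["best-model"],
)
for t in tasks:
    print(t.id, t.name)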
Hi @<1523702969063706624:profile|PoisedShark13>
However, the INSTALLED PACKAGES of my task misses many of the installed packages (any idea why?)
It automatically detects the directly imported packages, literally analyzing your code base and looking for imports
The derivative packages (i.e. the ones that any of the "main" packages need) will be listed after the first time the agent installs everything
If something specific is missing, you can manually add it with:
Task.add_requiremen...
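For example, a minimal sketch, assuming the method referenced above is Task.add_requirements (package name and version are placeholders; call it before Task.init):
from clearml import Task

# Manually add a requirement the automatic analysis missed
# (package name/version below are placeholders)
Task.add_requirements("pandas", "1.3.5")

task = Task.init(project_name="examples", task_name="manual requirements")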
@<1523703080200179712:profile|NastySeahorse61> / @<1523702868694011904:profile|AbruptCow41>
Is there a way to avoid each task creating a new environment?
You can just define CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1
it will just use whatever you have there (notice it will totally ignore requirements.txt and "installed packages" on the Task)
BTW I would recommend turning on the venv caching, this is per docker/python/packages caching so the next time you are using the exact requi...
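For reference, a hedged sketch of spinning an agent with that variable set (the queue name is a placeholder, and you can just as well export the variable in your shell before running clearml-agent):
import os
import subprocess

# Start a clearml-agent worker that skips creating a new python environment
# and reuses whatever is already installed on the machine
env = dict(os.environ, CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL="1")
subprocess.run(["clearml-agent", "daemon", "--queue", "default"], env=env)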
Hi ItchyJellyfish73
The behavior should not have changed.
"force_repo_requirements_txt" was always a "catch all option" to set a behavior for an agent, but should generally be avoided
That said, I think there was an issue with v1.0 (clearml-server) where when you cleared the "Installed Packages" it did not actually clear it, but set it to empty.
It sounds like the issue you are describing.
Could you upgrade the clearml-server
and test?
Hi SubstantialElk6
No need for that, you can use the helm chart (or spin them once with kubectl), then they take care of scheduling by themselves.
You can also use the k8s glue (basically spinning kubernetes pods automatically for you, based on the Tasks that you push into the ClearML queue)
https://github.com/allegroai/clearml-agent/blob/master/examples/k8s_glue_example.py
In short, two possible deployments:
Static k8s pod running the agent (then the agent runs all the experiments inside t...
A few implementation / design details:
- When you run code with Trains (and call init) it will record your environment (python packages, git code, uncommitted changes etc). Everything is stored on the Task object in the trains-server.
- When you clone a task you literally create a copy of the Task object (i.e. a second experiment). On the cloned experiment, you can edit everything (parameters, git, base docker image etc).
- When you enqueue a Task you add its ID to the execution queue list; a trains-a...
That sounds like an internal tritonserver error.
https://forums.developer.nvidia.com/t/provided-ptx-was-compiled-with-an-unsupported-toolchain-error-using-cub/168292
Okay ConfusedPig65 I found the problem. For some reason the latest TF keras load_model / save_model is not tracked.
I'll make sure we push a fix later today
Thanks @<1523701868901961728:profile|ReassuredTiger98>
From the log this is what conda is installing, it should have worked
/tmp/conda_env1991w09m.yml:
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- blas~=1.0
- bzip2~=1.0.8
- ca-certificates~=2020.10.14
- certifi~=2020.6.20
- cloudpickle~=1.6.0
- cudatoolkit~=11.1.1
- cycler~=0.10.0
- cytoolz~=0.11.0
- dask-core~=2021.2.0
- decorator~=4.4.2
- ffmpeg~=4.3
- freetype~=2.10.4
- gmp~=6.2.1
- gnutls~=3.6.13
- imageio~=2.9.0
-...
BattyLion34 are you saying you do not have the "APP CREDENTIALS" section in the profile page?
LOL EnormousWorm79 you should have a "do not show again" option, no?
Long story short, work in progress.
BTW: are you referring to manual execution or trains-agent?
Hi
The Squash operation copies all the data and is no longer linked to previous commits?
Yes, basically the idea is: if you have a data version that relies on many parents that need to be merged, the squash will create a merged copy and push it all as a single version, and then yes, the parent versions are no longer needed
I thought this operation is like git squash but it seems to me
yeah... we did not want to actually delete the parents because unlike git, the operation is done ...
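For reference, a minimal sketch of the squash flow (dataset name and parent IDs are placeholders, and the exact signature may differ between versions, so check the Dataset.squash docstring):
from clearml import Dataset

# Merge multiple parent versions into a single flattened version
squashed = Dataset.squash(
    dataset_name="my_dataset_squashed",
    dataset_ids=["<parent_id_1>", "<parent_id_2>"],
)
print(squashed.id)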
BTW: Full REST API reference here
https://allegro.ai/clearml/docs/rst/references/clearml_api_ref/index.html
Hi Guys, just curious here, what was the final issue?
Also out of curiosity, what does that mean? "1.12.2 because some bug that make fastai lag 2x" ?
Hi @<1570220858075516928:profile|SlipperySheep79>
Is there a way to specify the working dir from the decorator
not directly, but why would that change anything? I mean the component code will be created in the git root, and you can still access files inside the subfolders
from .subfolder import something
what am I missing?
Okay this is a bit hacky but will work
@PipelineDecorator.component(...)
def step(...):
    import sys
    import os
    sys.path.append(os.path.join(os.path.abspath(os.path.dirname(__file__)), "projects", "main"))
    from file import something
"sub nodes" inside pipeline, in my opinion, makes them much more useful, in sense that all the steps are visible.
Yeah I really like this idea... continuing this thread, would it also make sense to have a Task object per "sub-node" and run the sub-nodes as subprocesses of the parent Node? I'm thinking this sounds like a combination of both local pipeline execution and remote pipeline execution.
wdyt?
This seems to only work for a single file (weights_path implies a single file, not multiple ones). Is that the case?
See update_weights_package, it actually packages an entire folder as zip and will do the extraction when you get it back (check the function docstring, I think you can also specify wildcard etc if needed)
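A rough sketch of the folder round-trip (paths and the model ID are placeholders, and the retrieval call is from memory, so double-check the docstrings):
from clearml import Task, OutputModel, InputModel

task = Task.init(project_name="examples", task_name="weights package demo")

# Store an entire folder as a single zipped weights package
output_model = OutputModel(task=task)
output_model.update_weights_package(weights_path="/path/to/checkpoint_folder")

# Later, retrieve it; as far as I remember get_weights_package downloads
# the zip and extracts it locally (check the docstring for the exact return value)
model = InputModel(model_id="<model_id>")
local_files = model.get_weights_package()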
Why do you see this as preferred to the dataset method we have now?
So it answers a few requirements that you raised
It is fully visible as part of the project and se...
Hi VivaciousPenguin66
Seems like a CUDA/CUDNN issue.
Your agent is configured to work in venv mode, which means it will pull the correct pytorch version based on the detected CUDA driver support. Specifically you can see in the log "agent.cuda_version = 111" which means CUDA 11.1, and from the log it found the correct pytorch version:
Torch CUDA 111 download page found
Found PyTorch version torch==1.8.1 matching CUDA version 111
Found PyTorch version torchvision==0.9.1 matching CUDA version 1...
feature is however available in the Enterprise Version as HyperDatasets. Am I correct?
Correct
BTW you could do:
datasets_used = dict(dataset_id="83cfb45cfcbb4a8293ed9f14a2c562c0")
task.connect(datasets_used, name='datasets')
from clearml import Dataset
dataset_path = Dataset.get(dataset_id=datasets_used['dataset_id']).get_local_copy()
This will ensure that not only do you have a new section called "datasets" on the Task's configuration, but you will also be able to replace the datase...
it will only if oom killer is enabled
true, but you will still get OOM (I believe). I think the main issue is the even from inside the container, when you query the memory, you see the entire machine's memory... I'm not sure what we can do about that