In Windows, setting `system_site_packages` to `true` allowed all stages in the pipeline to start, but it doesn't work on Linux.
Notice that it will inherit from the system packages, not from the venv the agent is installed in.
I've deleted tfrecords from the master branch and committed the removal, and set the tfrecords folder to be ignored in .gitignore. I'm trying to find out which changes are considered uncommitted.
you can run git diff
it is essentially...
Okay, now I get it!
Let me think about it for an hour or two 😄
I assume ClearML has some period of time after which it shows this message. Am I right?
Yes you are 🙂
is this configurable?
It is 🙂 `task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)`
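For context, a minimal sketch of where that call would go (project/task names here are just placeholders):
```python
from clearml import Task

task = Task.init(project_name="examples", task_name="resource monitor timeout")
# give the resource monitor up to 30 minutes to see the first reported iteration
# before it falls back to seconds-from-start reporting (and prints that message)
task.set_resource_monitor_iteration_timeout(seconds_from_start=1800)
```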
ElegantKangaroo44 I tried to reproduce the "services mode" issue with no success. If it happens again let me know, maybe we will better understand how it happened (i.e. the "master" trains-agent gets stuck for some reason).
Feel free to open an issue on GitHub so this is not forgotten.
Yes please, just to verify my hunch.
I think that somehow the docker mounts the agent is creating are (for some reason) messing it up.
Basically you can just run the following, it will do everything automatically (replace <TASK_ID_HERE> with the actual Task ID):
` docker run -it --gpus "device=1" -e CLEARML_WORKER_ID=Gandalf:gpu1 -e CLEARML_DOCKER_IMAGE=nvidia/cuda:11.4.0-devel-ubuntu18.04 -v /home/dwhitena/.git-credentials:/root/.git-credentials -v /home/dwhitena/.gitconfig:/root/.gitconfig ...
Hi ExcitedFish86
In Pytorch-Lightning I use DDP
I think a fix for PyTorch multi-node / process distribution was committed to 1.0.4rc1, could you verify it solves the issue? (rc1 should fix this specific issue)
BTW: no problem working with clearml-server < 1
AbruptHedgehog21 the bucket and the full link are registered on the model object itself, you can see them in the UI, under the Models tab. The only thing you actually need to pass inside is the credentials. Make sense?
So I checked the code, and the Pipeline constructor internally calls Task.init, which means that after you construct the pipeline object, Task.current_task() should return a valid object....
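Something like this minimal sketch should demonstrate it (pipeline/project names are placeholders):
```python
from clearml import Task, PipelineController

# constructing the controller calls Task.init internally,
# so current_task() should already return a valid Task right after
pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
print(Task.current_task())  # expected: the controller's Task object, not None
```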
let me know what you find out
IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows that you are only using the selected GPU, not all of them).
There is nothing else in the code that will change the OS environment.
Could you print `os.environ['CUDA_VISIBLE_DEVICES']` while running the code to verify?
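For example, something as simple as this (just a sanity check, nothing ClearML-specific):
```python
import os

# print what the running process actually sees
print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<not set>"))
```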
GloriousPenguin2 Hmm, the UI might strip it?! I mean, in most cases it should not be there in the first place. Maybe we need to make sure that, if provided, the web UI uses the stored Plotly definition. If that's the case, we need to make sure that by default we do not store it, so in most cases the UI can use it to improve the layout. wdyt?
So the main difference is that Kedro pipelines are function-based steps (I might be over-simplifying, so please take it with a grain of salt), while a ClearML pipeline step is a Job, i.e. it needs its own environment and runs for longer than a few seconds (as opposed to a single function).
Try adding this environment variable: `export TRAINS_CUDA_VERSION=0`
One example is a node that resizes the images; this node receives a Dataset as input, iterates over each image, resizes it, and outputs a new Dataset, which is used in the next node downstream in the Pipeline.
I agree, this sounds like a "function" rather than a job, so better suited for Kedro.
organization structure
and see for yourself (this pipeline has two nodes, `train_model` and `predict`)
Interesting! Let me dive into that and ...
GrotesqueDog77 this should just work, decorate the functions with `@PipelineDecorator.component` and call the functions one after the other: `paths = step_one()`, then `step_two(paths)`
ClearML will make sure it serializes the strings and passes them to step two (of course step two should actually run on a machine with access to the same folder, but that is another issue 🙂)
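A minimal sketch of the full flow (the names, paths, and return values here are illustrative only):
```python
from clearml import PipelineDecorator

@PipelineDecorator.component(return_values=["paths"])
def step_one():
    # the returned strings are serialized and passed to the next step
    return ["/data/a.txt", "/data/b.txt"]

@PipelineDecorator.component()
def step_two(paths):
    print("received:", paths)

@PipelineDecorator.pipeline(name="example pipeline", project="examples", version="1.0.0")
def pipeline_logic():
    paths = step_one()
    step_two(paths)

if __name__ == "__main__":
    PipelineDecorator.run_locally()  # run everything in the local process for testing
    pipeline_logic()
```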
BTW: CloudyHamster42 I think this issue was discussed on GitHub, and the final "verdict" was that we should have an option to split/combine graphs on the UI side (i.e. similar to the "smoothing" or wall-time axis options, etc.)
I was expecting the remote experiment to behave similarly, why do I need to import pandas there?
The only problem is that the remote code did not install pandas; once the package is there, we can read the artifacts
(this is in contrast to the local machine where pandas is installed and so we can create/read the object)
Does that make sense?
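One way to make sure the agent installs pandas on the remote machine (a sketch; the task id and artifact name are placeholders):
```python
from clearml import Task

# must be called before Task.init so pandas is added to the Task's requirements
Task.add_requirements("pandas")
task = Task.init(project_name="examples", task_name="read artifact remotely")

source_task = Task.get_task(task_id="<TASK_ID_WITH_ARTIFACT>")
df = source_task.artifacts["data"].get()  # deserializing a DataFrame needs pandas installed
```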
Maybe that's the issue:
https://github.com/googleapis/python-storage/issues/74#issuecomment-602487082
It actually started executing your code, but it did not capture it correctly:
/root/.clearml/venvs-builds/3.10/bin/python -u /root/.clearml/venvs-builds/3.10/code/colab_kernel_launcher.py
Which I assume means the actual Task had bad code.
What do you have under the Task's Execution tab in the UI (the one you were launching, i.e. enqueueing)?
parser.add_argument( "--dataset_mean", type
=
float, nargs
=
"+", default
=
0.5)
I think providing `nargs='+'` assumes the type is a list. Nonetheless we should be able to support it. Could you please add a GitHub issue so we do not forget?
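For reference, a minimal repro sketch of that argument definition (project/task names are placeholders):
```python
import argparse
from clearml import Task

task = Task.init(project_name="examples", task_name="nargs argument")
parser = argparse.ArgumentParser()
parser.add_argument("--dataset_mean", type=float, nargs="+", default=0.5)
args = parser.parse_args()
# when passed on the command line, nargs="+" yields a list of floats,
# while the default here is a plain float
print(args.dataset_mean)
```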
On a side note, is there any way to automatically give more meaningful names to the running docker containers?
What do you mean by that? Running where? And where will you see them?
No, by definition the agent will only execute one Task at a time, but you can spin up a second agent on the same GPU :)
Just wanted to know how many people are actively working on clearml.
probably 30+ 🙂
ReassuredTiger98 are you worried about a lack of support? Or are you offering some (it is always welcome)?
Hi SteadyFox10
I promised to mention here once we start working on ignite integration, you can check it here:
https://github.com/jkhenning/ignite/tree/trains-integration
Feel free to provide insights / requests 🙂
As for the model upload: the default behavior is that torch.save() calls will only be logged, nothing more. But if you pass the output_uri field to Task.init, then all your models will be uploaded automatically. For example:
` task = Task.init('examples', 'model upload test', o...
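The snippet above is cut off; a complete sketch of the same idea (the bucket path is a placeholder, any storage URI or shared folder works):
```python
import torch
from clearml import Task

task = Task.init(
    project_name="examples",
    task_name="model upload test",
    output_uri="s3://my-bucket/models",  # placeholder destination
)
model = torch.nn.Linear(10, 2)
# with output_uri set, this checkpoint is captured and uploaded automatically
torch.save(model.state_dict(), "model.pt")
```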
I just set the git credentials in the `clearml.conf` and it works out of the box
git has issues with passing the user/token from the main repo to the submodules, hence my surprise that it is working out-of-the-box.
Do notice that if you are using an ssh-key this is a non-issue.
Nope, no `.netrc` defined anywhere, ...
If this is the case, can you try adding the following to your "extra_vm_bash_script":
` echo machine example.com > ~/.netrc && echo log...
Essentially the example provided just prints out IDs to the log file,
What do you mean?
Hi @<1724960468822396928:profile|CumbersomeSealion22>
As soon as I refactor my project into multiple folders, where on top-level I put my pipeline file, and keep my tasks in a subfolder, the clearml agent seems to have problems:
Notice that you need to specify the git repo for each component. If you have a process (step) with more than a single file, you have to have those files inside a git repository, otherwise the agent will not be able to bring them to the remote machine
Yes, I do have my files in the git repo. Although I have not quite understood which part it takes from the remote git repo, and which part it takes from my local system.
it will do "git pull" on the remote machine and then apply any uncommitted changes it has stored in the Task
It seems that one also needs to explicitly hand in the git repo in the pipeline and task definitions via PipelineController,
Correct, unless the pipeline logic and the steps are in the same git repo, you can...
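A hedged sketch of what that can look like with function steps (the repo URL, names, and the helper function are placeholders, and it assumes a clearml version where add_function_step accepts a repo argument):
```python
from clearml import PipelineController

def prepare_data():
    # placeholder step logic; the real code lives in the git repo referenced below
    return "/data/prepared"

pipe = PipelineController(name="my-pipeline", project="examples", version="1.0.0")
pipe.add_function_step(
    name="prepare",
    function=prepare_data,
    function_return=["dataset_path"],
    repo="https://github.com/<org>/<repo>.git",  # placeholder repo
    repo_branch="main",
)
pipe.start()
```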
Hi CurvedHedgehog15
User aborted: stopping task (3)
?
This means "someone" externally aborted the Task, in your case the HPO aborted it (the sophisticated HyperBand Bayesian optimization algorithms we use, both Optuna and HpBandster) will early stop experiments based on their performance and continue if they need later