This is assuming you can just run two copies of your code, and they will become aware of one another.
IntriguedRat44 If the monitoring only shows a single GPU (the selected one), it means it reads the correct CUDA_VISIBLE_DEVICES (this is how it knows you are only using the selected GPU and not all of them).
There is nothing else in the code that will change the OS environment.
Could you print os.environ['CUDA_VISIBLE_DEVICES'] while running the code to verify?
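For example, a one-liner you could drop anywhere in the training script (just a quick check, nothing ClearML-specific):
import os
# shows the GPU mask the process actually sees; unset means all GPUs are visible
print('CUDA_VISIBLE_DEVICES =', os.environ.get('CUDA_VISIBLE_DEVICES', '<not set>'))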
WickedGoat98
I will try to collect the installation steps in a document and share it with the community once ready
Thank you! This will be awesome!
We're here if you need anything 🙂
This is odd, can you send the full log of the failed Task and, if possible, the code?
@<1571308003204796416:profile|HollowPeacock58> seems like an internal issue copying this object config.model
This is a complex object, and it seems that for some reason
None
As a workaround, just do not connect this object; it seems it cannot be pickled / copied (see the GitHub issue)
Right so this is checksum based?
correct
Are there plans to only store delta changes for files (i.e. store the changed byte instead of the entire file)?
Long story short, no 😞
Basically, delta changes are not scalable and only work for text-based files (see git); they break down very quickly when large files are involved (see the fun of git-lfs ...)
Does that make sense? Is there a specific reason you are thinking about byte granularity?
JitteryCoyote63 if this is simulating an agent, the assumption is that the Task was already created, hence the task ID.
If I am working with Task.set_offline(True)
How would the two combine? I mean, offline is by definition not executed by an agent, what am I missing?
Which part of the code?
The main script?!
But it is not part of the package
Is the repo itself a package?
HappyLion37 did you check the https://github.com/allegroai/trains/tree/master/examples/services/hyper-parameter-optimization ?
You can very quickly get it distributed as well
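To give a rough idea, that example boils down to something like this (a minimal sketch using the clearml package names, with older trains versions the imports live under trains.*; the base task id, parameter range, metric names and queue are all placeholders you would replace with your own):
from clearml import Task
from clearml.automation import HyperParameterOptimizer, UniformIntegerParameterRange

# controller task that will hold the optimization results
task = Task.init(project_name='examples', task_name='HPO controller', task_type=Task.TaskTypes.optimizer)

optimizer = HyperParameterOptimizer(
    base_task_id='base_task_id_here',  # the experiment to clone and optimize (placeholder)
    hyper_parameters=[
        UniformIntegerParameterRange('Args/batch_size', min_value=16, max_value=128, step_size=16),
    ],
    objective_metric_title='validation',   # scalar title reported by the base task
    objective_metric_series='accuracy',
    objective_metric_sign='max',
    execution_queue='default',             # queue the cloned experiments are pushed into
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()    # blocks until the optimization is done
optimizer.stop()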
Hmm, what does your preprocessing code look like?
Yes, you are too quick for the resource monitoring 🙂
The issue itself is changing the default user.
USER appuser
WORKDIR /home/appuser
Any reason for it?
The only port configurations that will work are 8080 / 8008 / 8081
Hi RipeGoose2
Can you try with the latest from git?
pip install -U git+
Hi ConfusedPig65
Any Keras model will be automatically uploaded if you pass an upload URL to the Task init:
task = Task.init('examples', 'keras upload test', output_uri=" ")
(You can also pass output_uri="s3://bucket/folder", or change the default output_uri in the clearml.conf file)
After this line any keras model will be automatically uploaded (you will see it under the Artifacts Tab)
Accessing models from executed tasks:
trains_task = Task.get_task('task_uid_here')
last_check...
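If it helps, fetching the last output model looks roughly like this with the current clearml SDK (the task id is a placeholder, and the models property assumes a reasonably recent SDK version):
from clearml import Task

prev_task = Task.get_task(task_id='task_uid_here')   # placeholder id
last_model = prev_task.models['output'][-1]          # last model registered by the task
local_weights = last_model.get_local_copy()          # downloads the weights file locally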
ReassuredTiger98 I can verify the code snippet reproduces the issue with packages missing from "installed packages".
If you feel this is important, please open a GitHub issue.
Also, you can manually add packages:
Task.add_requirements('package_name_here', 'optional version here')
So when you manually load the package you can make sure it will be listed; do remember to call it before the Task.init call.
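i.e. roughly (the package name and version spec are placeholders):
from clearml import Task

# must be called before Task.init so the package ends up in "installed packages"
Task.add_requirements('some_package', '>=1.2')
task = Task.init(project_name='examples', task_name='manual requirements')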
Hi PunyGoose16 ,
next release includes it (eta after this weekend 😉 )
That is a good question. Usually the CUDA version is automatically detected, unless you override it with the conf file or OS env. What's the setup? Are you using conda as the package manager? (conda actually installs CUDA drivers, so if the original Task was executed on a machine with conda, it will take the CUDA version automatically; the reason is to match the CUDA/Torch/TF versions)
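For reference, the override I had in mind sits in the agent section of clearml.conf; if I remember correctly it looks roughly like this (version numbers are placeholders), and there should be matching CUDA_VERSION / CUDNN_VERSION OS environment variables as well:
agent {
    # force a specific CUDA / cuDNN version instead of auto-detection
    cuda_version: "11.2"
    cudnn_version: "8.0"
}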
@<1545216070686609408:profile|EnthusiasticCow4>
Is there currently a way to bind the same GPU to multiple queues? I believe the agent complained last time I tried (which was a while ago)
Run multiple agents on the same GPU:
CLEARML_WORKER_NAME=host-gpu0a clearml-agent daemon --gpus 0
CLEARML_WORKER_NAME=host-gpu0b clearml-agent daemon --gpus 0
FierceHamster54 are you sure you have write permissions?
SuperiorDucks36 you mean to manually set up an experiment (and the dummy Task is just a way to have an entry to configure), do I understand you correctly?
Following on that, we are thinking of doing it all for you with a CLI that will basically create a task from code/a repo you already have on your machine. What do you think?
Yes you can 🙂 (though not on the open-source version)
JitteryCoyote63 try to add the prefix to the parameter name, e.g. instead of "artifact_name" use "Args/artifact_name"
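For example, if you are overriding it programmatically on a cloned Task, it would look roughly like this (the task id, value and queue name are placeholders):
from clearml import Task

cloned = Task.clone(source_task='task_uid_here')            # placeholder id
cloned.set_parameter('Args/artifact_name', 'my_artifact')   # note the "Args/" section prefix
Task.enqueue(cloned, queue_name='default')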
Queues can have multiple workers, and that implies multiple instances of a task can run concurrently.
@<1533619716533260288:profile|SmallPigeon24> as long as these are the exact same instances you can have them running simultaneously (think multi-node training); that said, each one should "know" not to report over the others, because of course it would overwrite the reports.
Back to your point on multiple agents:
You cannot have two Tasks in the same queue, that means that a single agen...
Hi SubstantialElk6
Yes you are correct the glue only needs to change the yaml and it will work.
When you say "Dev end" , what do you mean? I was thinking adding additional glue for multi node and just adding queues , for example add 4nodes queue and attach a glue to it, wdyt?
Regarding Horovod: Horovod spins up its own nodes, so integration with k8s is not trivial (regardless of ClearML). That said, I know they do have support for Horovod in the Enterprise edition, but I'm not sure ...
Could it be the code is not in a git repository? clearml supports either a single script or a git repository, but not a collection of standalone files. wdyt?
It should have been:
output_uri="s3://company-clearml/artifacts/bethan/sales_journeys/artifacts/examples/load_artifacts.f0f4d1cd5eb54795b11508dd1e739145/artifacts/filename.csv.gz/filename.csv.gz"
This is by design; they cannot use the exact same venv because if the code starts creating/changing files, it happens inside the venv and might cause them to crash.
That said, if you are running with venv cache, the first one will create the venv and the second one will create a copy from the cache.
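In case it helps, venv caching is enabled in the agent section of clearml.conf; roughly like this (the path and limits are placeholders, check the commented-out defaults in your own conf for the exact keys):
agent {
    venvs_cache: {
        # adjust to enable/tune venv caching
        path: ~/.clearml/venvs-cache
        max_entries: 10
        free_space_threshold_gb: 2.0
    }
}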