This task is picked up by the first agent; it runs the DDP launch script for itself and then creates clones of itself with task.create_function_task(), passing its own address as an argument to the function
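The address handoff described above can be sketched as a small helper. This is a minimal illustration only; the helper name, port, and world size are assumptions, not part of the original messages — the master task would pass something like this environment to each cloned worker so it can join the DDP process group:

```python
def build_ddp_env(master_addr, master_port, rank, world_size):
    """Build the environment a cloned worker task would use to join the
    PyTorch DDP process group (hypothetical helper, for illustration)."""
    return {
        "MASTER_ADDR": master_addr,     # address of the first (master) agent
        "MASTER_PORT": str(master_port),
        "RANK": str(rank),              # this worker's rank in the group
        "WORLD_SIZE": str(world_size),  # total number of nodes
    }

# The master (rank 0) would pass its own address to each clone:
worker_env = build_ddp_env("10.0.0.5", 29500, rank=1, world_size=4)
```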
Hi UnevenHorse85
Interesting use case! Just so I understand: the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct?
passes its address as argument to the function
This seems like a great solution.
the queu...
Yes, I basically plan to use ClearML as a user-friendly cluster manager
and it is 🙂
I think the main "drawback" is that you cannot "reserve" nodes for the multi-node training. The easiest solution is to have a high-priority queue that is never otherwise used, and then have the DDP master process push into that high-priority queue, which will ensure these are the next Tasks to be executed (now the only thing that is missing is preemption of running Tasks, but this automation policy is unfortunate...
link to the line please 🙂
Thanks EnviousStarfish54
Let me check if I can reproduce it
I think your use case is the original idea behind "use_current_task" option, it was basically designed to connect code that creates the Dataset together with the dataset itself.
I think the only caveat in the current implementation is that it should "move" the current Task into the dataset project / set the name. wdyt?
Yep, the automagic only kicks in with Task.init... The main difference, and the advantage of using a Dataset object, is that the underlying Task resides in a specific structure that is used when searching based on project/name/version; other than that, it should just work
Hmm interesting...
of course you can do: dataset._task.connect(...)
But maybe it should be public?!
How are you using that (I mean in the context of a Dataset)?
I just think that the create function should expect dataset_name to be None in the case of use_current_task=True (or allow the dataset name to differ from the task name)
I think you are correct, at least we should output a warning that it is ignored... I'll make sure we do 🙂
EnviousStarfish54 quick update: regardless of the logging.config.dictConfig issue, I will make sure that even when the logger is removed, the clearml logging will continue to function 🙂
The commit will be synced after the weekend
Guys FYI: params = task.get_parameters_as_dict()
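For anyone unfamiliar with the call: get_parameters_as_dict() returns the task's hyperparameters as nested dicts keyed by section, with values as strings. A minimal illustration with made-up sample values (the parameter names here are assumptions; "Args" is the section ClearML uses for argparse parameters):

```python
# Illustrative shape of what task.get_parameters_as_dict() returns
# (sample values, for demonstration only)
params = {"Args": {"lr": "0.01", "batch_size": "32"}}

# Values come back as strings, so cast before use
lr = float(params["Args"]["lr"])
batch_size = int(params["Args"]["batch_size"])
```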
FYI: These days TB has become the standard even for PyTorch (it being a standalone package); you can actually import it from torch.
There is an example here:
https://github.com/allegroai/trains/blob/master/examples/frameworks/pytorch/pytorch_tensorboard.py
HealthyStarfish45 did you manage to solve the report_image issue ?
BTW: you also have
https://github.com/allegroai/trains/blob/master/examples/reporting/html_reporting.py
https://github.com/allegroai/trains/blob/master/examples/reporting/...
DepressedChimpanzee34 I cannot find cfg.py here
https://github.com/allegroai/clearml/tree/master/examples/frameworks/hydra/config_files
(or anywhere else)
This works.
great!
So it is still in master and should be included in 1.0.5?
correct, RC will be released soon with this fix included
Is this a logging issue, or a clearml issue?
@<1615519322766053376:profile|DrainedOctopus19> if your code is a single file (which was stored on the clearml server), then it is stored on the Task:
from clearml import Task

task = Task.get_task("task UID here")
# this should be your entire code
print(task.data.script.diff)
PompousBeetle71 a few questions:
is this like using PyTorch distributed, only manually? Why don't you call trains.init
in all the sub-processes? We had a few threads on that; it seems like a recurring question, so I'll make sure we have an example on GitHub. Basically trains will take care of passing the arg-parser commands to the sub-processes, and also of the torch node settings. It will also make sure they all report to the same experiment. What do you think?
Also, can the image not be pulled from dockerhub but used from the local build instead?
If you have your docker configured to pull from a local artifactory, then the agent will do the same 🙂 (it is calling the docker command just like you do)
agent.default_docker.arguments: "--mount type=bind,source=$DATA_DIR,target=/data"
Notice that you are using the default docker arguments in the example
If you want the mount to always be there, use extra_docker_arguments:
https://github.com/...
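For reference, a clearml.conf fragment along those lines — the mount itself is just the example from above, and the exact key path is something to verify against your agent version:

```
agent {
    # arguments always appended to the docker run command
    extra_docker_arguments: ["--mount", "type=bind,source=$DATA_DIR,target=/data"]
}
```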
Hi JitteryCoyote63
Is this close ?
https://github.com/allegroai/clearml/issues/283
Hi StaleHippopotamus38
I imagine I could make the changes specified in the warning to /etc/security/limits.conf
Yep, seems like an Elasticsearch memory issue, but I think the helm chart takes care of it,
You can see a reference in the docker compose:
https://github.com/allegroai/clearml-server/blob/09ab2af34cbf9a38f317e15d17454a2eb4c7efd0/docker/docker-compose.yml#L41
We should probably change it so it is more human readable 🙂
Does the clearml module parse the python packages?
Yes, it analyzes the installed packages based on the actual imports you have in the code.
If I'm using a private pypi artifact server, would I set the PIP_INDEX_URL on the workers so they could retrieve those packages when that experiment is cloned and re-ran?
Correct 🙂 the agent basically calls pip install on those packages, so if you configure it with PIP_INDEX_URL it should just work like any other pip install
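Concretely, that would be set in the worker's environment before the agent starts — the URL below is a placeholder, not a real endpoint:

```
# set on the worker machine before starting the agent (placeholder URL)
export PIP_INDEX_URL=https://pypi.internal.example/simple
```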
and the agent default runtime mode is docker correct?
Actually the default is venv mode; to run in docker mode, add --docker to the command line
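A typical invocation would look like this (the queue name here is an assumption):

```
clearml-agent daemon --queue default --docker
```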
So I could install all my system dependencies in my own docker image?
Correct, inside the docker it will inherit all the preinstalled packages, but it will also install any missing ones (based on the Task requirements, i.e. the "installed packages" section)
Also, what is the purpose of the aws block in the clearml.c...
How does a task specify which docker image it needs?
Either in the code itself with task.set_base_docker, or with the CLI, or set it in the UI when you clone an experiment (everything becomes editable)
Hi VirtuousFish83 ,
Is it throwing an exception? Are you seeing the plot in the UI but the title is incorrect?
Then the type hints are not removed from helper and the code immediately crashes when being run
Oh yes I see your point, that does make sense (btw removing the type hints will solve the issue)
regardless let me make sure this is solved
That is a good question... let me check 🙂