Reputation
Badges 1
25 × Eureka!When you set the pod make sure you mount the clearml local cache folder to the PV
basically /root/.clearml/cache/
Thank you @<1719524641879363584:profile|ThankfulClams64> for opening the GI, hopefully we will be able to reproduce it and fox ot quickly
We already redesigned the implementation so it should be quite easy to extend to GCP and Azure, what are you planning ?
I am trying to see if the user can submit a list of resource requirements (e.g 4GPUs, 12 cores, 100GB diskspace)
This will be quite easy to implement using the cleamrl k8s glue, just use user-properties and change the template based on it. I can point to where you need to modify the code
SuperficialGrasshopper36 regrading the codeartifact
I think the easiest will be to have a bash script authenticating the codeartifact with the aws command at the beginning of each docker spin. This can be done by adding it to:
https://github.com/allegroai/clearml-agent/blob/81edd2860fbc09e2a179985d8315ffaba851dcd7/docs/clearml.conf#L136
For example:extra_docker_shell_script: ["apt-get install -y aws_cli_or_something", "aws cli authenticate me command"]
wdyt?
You should have metric :monitor:gpu
variant gpu_0_utilization
Since I see you have none of those, that points to no GPU driver ...
Could that be ?
(So it re-reads the configuration file)
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode
to allow for multiple pipeline components on the same machine
Hi RobustGoldfish9 ,
I'd much rather just have trains-agent just automatically build the image defined there than have to build the image separately and make it available for all the agents to pull.
Do you mean there is no docker image in the artifactory built based on your Dockerfile ?
This task is picked up by first agent; it runs DDP launch script for itself and then creates clones of itself with task.create_function_task() and passes its address as argument to the function
Hi UnevenHorse85
Interesting use case, just for my understanding, the idea is to use ClearML for the node allocation/scheduling and PyTorch DDP for the actual communication, is that correct ?
passes its address as argument to the function
This seems like a great solution.
the queu...
VexedCat68 yes π you can also pass the parent folder and it will zip the entire subfolders into a single artifact
Hi PompousBeetle71
I remember it was an issue, but it was solved a while ago. Which Trains version are you using?
Hmm let me check something
Hi SkinnyPanda43
Let's say that I install the shared libs with pip in editable mode on my development evironment, how does the clearml-agent will handle those libraries if I submit a job
So installing packages from local folders with "-e" is in general ill-advised.
But using a full git path should work out of the box. for example if you install pip install
https://github.com/user/repo/repo.git then the agent will be able to install it on the remote machine. The main challenge...
Hi OutrageousSheep60
AS-IS
- without compressing or breaking it up into chunks.
So for that I would suggest to manually archive it, and upload as external link?
Or are you saying you want to control the compression used by Dataset class ?
https://github.com/allegroai/clearml/blob/72d9b22e0d27f317a364acfeacbcf5c70f852e8c/clearml/datasets/dataset.py#L603
Oh, then no, you should probably do the opposite π
What is the flow like now? (meaning what are you using kubeflow for and how)
VivaciousWalrus99 any chance the original Task was executed with python2 ?
what do you have for:ls -la /cs/usr/gal.hyams/.trains/venvs-builds/3.7/bin/
Hmm @<1523701279472226304:profile|SoreHorse95> this is a good point, I think you are correct we need to fix that,
- Could you open a GitHub issue so this is not forgotten ?
- As a workaround I would use clone=True, then after the call I would call task.close() on the original task, wdyt?
Hi LazyLeopard18 ,
So long story short, yes it does.
Longer version, to really accomplish full federated learning with control over data at "compute points" you need some data abstraction layer. Without data abstraction layer, federated learning is just averaging derivatives from different location, this can be easily done with any distributed learning framework, such as horovod pr pytorch distributed or TF distributed.
If what you are after is, can I launch multiple experiments with the sam...
IrritableGiraffe81 could it be the pipeline component is not importing pandas inside the function? Notice that a function decorated with pipeline component become a stand-alone, this means that if you need pandas you need to import inside the function. The same goes for all the rest of the packages used.
When you are running with run_loclly or debug_pipeline you are using your local env , as opposed to the actual pipeline where a new env is created inside the repo.
Can you send the Entire p...
Hi @<1556450111259676672:profile|PlainSeaurchin97>
Is there any simple way to use
argparse
to pass a clearml task name?
need to call
args = task.connect(args)
.
noooo π there is no need to do that, the arguments are automatically detected
see for yourself
args = parse_args()
task = Task.init(task_name=args.task_name)
Follow up: I see that if I move an Experiment to a new project, it does not copy the associated model files and must be done manually.Β Once I moved the models to the new project, the query works as expected.
Correct π
Nice catch!
Hi DepressedFish57
In my case download each part takes ~5 second, and unzip ~15.
We run into that, and the new version will employ multithreading approach for the unzip (meaning the unzipping will happen in the background)
` from time import sleep
from clearml import Task
import tqdm
task = Task.init(project_name='debug', task_name='test tqdm cr cl')
print('start')
for i in tqdm.tqdm(range(100)):
sleep(1)
print('done') `The above example code will output a line every 10 seconds (with the default console_cr_flush_period=10) , can you verify it works for you?
This points to the wrong cu117 / driver - could that be?
Hi JitteryCoyote63 a few implementation details on the services-mode, because I'm not certain I understand the issue.
The docker-agent (running in services mode) will pick a Task from the services queue, then it will setup the docker for it spin it and make sure the Task starts running inside the docker (once it is running inside the docker you will see the service Task registered as additional node in the system, until the Task ends) once that happens the trains-agent will try to fetch the...
pytorch DDP
with what backend ? gloo ? nvcc ? openmpi ?
Hi CleanPigeon16
Put the specific git into the "installed packages" section
It should look like:... git+
...
(No need for the specific commit, you can just take the latest)