you can also specify additional packages on the decorator:
```
@PipelineDecorator.component(..., packages=["tqdm>=2.1", "scikit-learn"])
def step_one(...):
    # code here
```
Maybe permissions?!
you can test it manually by installing pynvml and running:
```
from pynvml.smi import nvidia_smi

nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
```
Can you post here the docker-compose.yml you are spinning? Maybe it is the wrong one?
Step 4 here:
https://github.com/thepycoder/asteroid_example#deployment-phase
The latest TAO doesn't use python for fine tuning, rather it uses the CLI entirely
It's a good question, but I think the CLI actually just runs python code (the CLI is their interface). Generally speaking I'm pretty sure it will not be complicated to convert the TLT integration to support TAO (Nvidia helps with that, and I think we had a similar process with Nvidia Clara/MONAI)
BTW: how are you using Nvidia TAO ?
So is there any tutorial on this topic?
Dude, we just invented it 🙂
Any chance you feel like writing something in a github issue, so other users know how to do this ?
Guess I'll need to implement job scheduling myself
You have a scheduler, it will pull jobs from the queue by order, then run them one after the other (one at a time)
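For reference, a minimal sketch of that flow (assuming the clearml-era API; the queue name and task ID are placeholders): enqueue a task from code, then have a single agent service the queue so jobs run one at a time.
```
from clearml import Task

# A single agent servicing the queue pulls jobs in order and runs
# them one after the other:
#   clearml-agent daemon --queue default
task = Task.get_task(task_id='<your-task-id>')
Task.enqueue(task, queue_name='default')
```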
SoreDragonfly16 the torchvision warning has nothing to do with the Trains warning.
The Trains warning means that somehow someone changes the state of the Task from running (in_progress) to "stopped" (aborted). Could it be one of the subprocesses raised an exception ?
I simplified the code, just so I could test it, this one seems to work, feel free to add the missing argparser parts :)
```
from argparse import ArgumentParser
from trains import Task

model_snapshots_path = 'mnt/trains'

task = Task.init(project_name='examples', task_name='test argparser',
                 output_uri=model_snapshots_path)
logger = task.get_logger()


def main(args):
    print('Got args: %s' % args)


if __name__ == '__main__':
    parent_parser = ArgumentParser(add_help=False)
    parent_parser....
```
My experience so far is that there is only print output in the console but no training graphs
Yes, Nvidia TLT needs to actually use TensorBoard for ClearML to catch it and display it.
I think that in the latest version they added that. TimelyPenguin76 might know more
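As a minimal sketch of the mechanism (the project/task names are placeholders): once `Task.init()` is called, anything written through TensorBoard is captured and shown in the UI automatically.
```
from clearml import Task
from torch.utils.tensorboard import SummaryWriter

task = Task.init(project_name='examples', task_name='tb capture')

# ClearML hooks the TensorBoard writer, so these scalars show up
# in the web UI without any extra reporting code
writer = SummaryWriter()
writer.add_scalar('train/loss', 0.5, global_step=1)
writer.close()
```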
JitteryCoyote63 I think I found the bug in clearml-task
it adds it at the end instead of before everything else
I want to keep the above setup; the remote branch that will track my local one will be on `fork`, so it needs to pull from there. Currently it recognizes `origin`, so it doesn't work because the agent then can't find the commit.
So you do not want to push the change set ?
You can basically add the entire change set (uncommitted changes) since the last pushed commit.
In your clearml.conf, set `store_code_diff_from_remote: true`
https://github.com/allegroai...
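For reference, a sketch of the relevant clearml.conf section (I believe the flag sits under `sdk.development`; treat the exact nesting as an assumption and check the linked reference conf):
```
sdk {
    development {
        # store the uncommitted diff against the last *pushed* commit
        # instead of against the local HEAD
        store_code_diff_from_remote: true
    }
}
```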
Sorry my bad:
```
config_obj['sdk']['stuff']['here'] = value
```
SmarmySeaurchin8 what do you think?
https://github.com/allegroai/trains/issues/265#issuecomment-748543102
`task.connect_configuration`
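If it helps, a minimal usage sketch (the dict contents are placeholders):
```
from clearml import Task

task = Task.init(project_name='examples', task_name='config demo')

# connect_configuration returns the configuration, with any values
# edited in the UI applied when the task is executed remotely
config = {'lr': 0.001, 'batch_size': 32}
config = task.connect_configuration(configuration=config, name='my_config')
```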
@<1560074028276781056:profile|HealthyDove84> if you want you can PR a fix, it should be very simple basically:
None
```
elif np_dtype == str:
    return "STRING"
elif np_dtype == np.object_ or np_dtype.type == np.bytes_:
    return "BYTES"
return None
```
Hmm that makes sense. I "think" the enterprise offering has a solution for that as well (i.e. full separation over a static cluster), but probably the best way to pursue this avenue is to talk to Sales (I'm assuming they'll set up a call to discuss the details)
Going back to the open source: I think that adding the credentials as part of the source code might allow "credentials" to auto-populate as part of the remote execution, wdyt?
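Something like this already works today, if I'm not mistaken (all values are placeholders; `set_credentials` has to run before `Task.init`):
```
from clearml import Task

# placeholder hosts/keys; in real code pull these from a vault or env vars
Task.set_credentials(
    api_host='https://api.clear.ml',
    web_host='https://app.clear.ml',
    files_host='https://files.clear.ml',
    key='<access_key>',
    secret='<secret_key>',
)
task = Task.init(project_name='examples', task_name='remote run')
```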
ConvolutedSealion94 Let me try to explain how it works, I hope this will help in debugging.
There are two different entities here:
- clearml-server: in this context the clearml server acts as a control-plane. It stores configuration on the different endpoints, models, preprocessing code etc. It does Not perform any compute or serving.
- clearml-serving-inference: https://github.com/allegroai/clearml-serving/blob/e09e6362147da84e042b3c615f167882a58b8ac7/docker/docker-compose-triton-gpu.yml#L77 . This ...
Also, on the ClearML dashboard, I can see the `clearml-agent` log:
Is the clearml-agent running in docker mode ?
and then in Preprocess:
```
self.model = get_model(task_id=os.environ['TASK_ID'], model_name=os.environ['MODEL_NAME'])
```
That's the part I do not get: Models have their own entity (with a UID), in contrast to artifacts, which are only stored on Tasks.
The idea is that when you are registering a model with clearml-serving, you can specify the model ID; this should replace the need for the TASK_ID+model_name in your code, and clearml-serving will basically bring it to you
Basically this fun...
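Roughly something like this (the service ID, endpoint, model ID and preprocess file are placeholders, and the exact flags may differ between versions):
```
clearml-serving --id <serving-service-id> model add \
    --engine triton \
    --endpoint "my_model" \
    --model-id <model-id> \
    --preprocess preprocess.py
```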
You do not need the cudatoolkit package, this is automatically installed if the agent is using conda as package manager. See your clearml.conf for the exact configuration you are running
https://github.com/allegroai/clearml-agent/blob/a56343ffc717c7ca45774b94f38bd83fe3ce1d1e/docs/clearml.conf#L79
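i.e. a sketch of the relevant section (see the linked clearml.conf for the full set of options):
```
agent {
    package_manager {
        # with conda as the package manager, the agent resolves and
        # installs cudatoolkit automatically based on the detected CUDA
        type: conda
    }
}
```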
Hi EagerOtter28
The agent knows how to do the http->ssh conversion on the fly; in your clearml.conf (on the agent's machine) set `force_git_ssh_protocol: true`
https://github.com/allegroai/clearml-agent/blob/42606d9247afbbd510dc93eeee966ddf34bb0312/docs/clearml.conf#L25
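i.e. a sketch of the relevant agent section:
```
agent {
    # convert any http(s) git repository links to ssh on the fly,
    # so the agent clones with the machine's ssh keys
    force_git_ssh_protocol: true
}
```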
Yey! okay let me make sure we add this feature to the Task.init arguments so one can control it from code 🙂
100% of things with `task_overrides` would be the most convenient way
I think the issue is that you have to pass the project ID, not the project name (the project unique ID is the property that is actually stored on the Task)
@<1523707653782507520:profile|MelancholyElk85> can you check the following works:
```
pipe.add_task(..., task_overrides={'project': Task.get_project_id(project_name='examples')})
```
Or you want to generate it from a previously executed run?
So as you say, it seems hydra kills these
Hmm let me check in the code, maybe we can somehow hook into it
What's strange is that for the remote jobs, as soon as they are launched, if I compare their configs while in the pending state, they all have the right (different) configs, but later, while running,
Wait, I think I found it. Since usually with hydra you configure everything from overrides / config files, when launched remotely it reads them by default. But with the launch plugin it should be overwritten with the Task
```
task = Task.init(...)
task.set_parameter(name="Hydra/_allow_omegaconf_ed...
```
Hi NastyFox63
What do you mean not all of them are shown?
Do they have diff series/titles, are they plots or scalars ? How are you reporting them ?
EnviousStarfish54 good news, this is fully reproducible
(BTW: for some reason this call will pop the logger handler clearml installs, hence the lost console output)
Good question 🙂
```
from clearml import Task

Task.init('examples', 'test')
```
curl seems okay, but this is odd https://<IP>:8010
it should be http://<IP>:8008
Could you change and test?
(meaning change the trains.conf and run `trains-agent list`)
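For reference, a sketch of the `api` section in trains.conf, assuming the default server ports (replace `<IP>` with your server's address):
```
api {
    # default trains-server ports: web 8080, api 8008, files 8081
    api_server: http://<IP>:8008
    web_server: http://<IP>:8080
    files_server: http://<IP>:8081
}
```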
Awesome, PRs are always welcome, and we try to help with any request and feature coming from users. We just added audio support (RC releasing in a few days) based solely on user requests.
https://github.com/allegroai/trains/issues/120