store_code_diff_from_remote doesn't seem to change anything in regard to this issue
Correct, it is always from remote
I'll be using update_task, that worked just fine, thanks
Sure thing.
ShakyJellyfish91, I took a quick look at the diff between the versions. Can you hack a non-working version (preferably the latest) and verify the issue for me?
- Yes, Task.init should be called on each subprocess (because torch forks them before they are patched); see the sketch after this list
- I think the main issue is that we patch the argparse on the subprocess (this is assuming you did not manually parse non-argv arguments)
- If you can create a mock test I think we can work around the issue, as long as the way you spin it up is the standard pytorch distributed way
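A minimal sketch of the per-subprocess Task.init, assuming the standard torch.multiprocessing spawn pattern (project/task names and the worker body are placeholders):
` import torch.multiprocessing as mp
from clearml import Task

def worker(rank):
    # each spawned subprocess calls Task.init; ClearML detects the existing
    # master Task and reports into it instead of creating a new one
    task = Task.init(project_name='examples', task_name='dist test')
    # ... actual training code for this rank ...

if __name__ == '__main__':
    mp.spawn(worker, nprocs=2) `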
Hi @<1556450111259676672:profile|PlainSeaurchin97>
Is there any simple way to use argparse to pass a clearml task name? Do I need to call args = task.connect(args)?
noooo 🙂 there is no need to do that, the arguments are automatically detected
see for yourself
from clearml import Task
args = parse_args()  # your own argparse parsing
task = Task.init(task_name=args.task_name)
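For completeness, a runnable sketch of the same idea, assuming a --task_name argparse flag (the flag name and default value are placeholders):
` import argparse
from clearml import Task

parser = argparse.ArgumentParser()
parser.add_argument('--task_name', default='my experiment')
args = parser.parse_args()

# the parsed arguments are logged automatically, no task.connect() needed
task = Task.init(project_name='examples', task_name=args.task_name) `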
Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the ...
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example?
Hi ConfusedPig65
Any keras model will be automatically uploaded if you pass an upload url to the Task init:
task = Task.init('examples', 'keras upload test', output_uri=" ")
(You can also pass output_uri="s3://bucket/folder" or change the default output_uri in the clearml.conf file)
After this line any keras model will be automatically uploaded (you will see it under the Artifacts Tab)
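As a hedged, self-contained sketch (the bucket path and model architecture are placeholders, assuming TensorFlow/Keras is installed):
` from clearml import Task
from tensorflow import keras

task = Task.init('examples', 'keras upload test', output_uri='s3://bucket/folder')

model = keras.Sequential([keras.Input(shape=(4,)), keras.layers.Dense(1)])
model.compile(optimizer='adam', loss='mse')
# any keras model saved from here on is uploaded to output_uri
# and listed under the Artifacts tab
model.save('my_model.keras') `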
Accessing models from executed tasks:
` trains_task = Task.get_task('task_uid_here')
last_check...
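The snippet above is truncated; a hedged sketch of the same idea with the current clearml API (the task id is a placeholder):
` from clearml import Task

prev_task = Task.get_task(task_id='task_uid_here')
# 'output' lists the models the executed task registered, newest last
last_model = prev_task.models['output'][-1]
local_path = last_model.get_local_copy()  # download a local copy of the weights `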
Hmm yeah I can see why...
Now that I think about it, at least in theory the second process that torch creates should inherit from the main one, and as such Task.init is basically "ignored"
Now I wonder why your first version of the code did not work?
Could it be that we patched the argparser on the subprocess and that we should not have?
ImmensePenguin78 this is probably for a different python version ...
Ok, but it must be somewhere in the bst class
It is the XGBoost callback feature, basically just reporting everything xgboost reports:
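A minimal sketch of what that auto-reporting looks like in practice (synthetic data; parameter values are placeholders):
` import numpy as np
import xgboost as xgb
from clearml import Task

task = Task.init('examples', 'xgboost metrics')

dtrain = xgb.DMatrix(np.random.rand(100, 4), label=np.random.randint(2, size=100))
# every metric xgboost prints per boosting round is also logged as scalars on the task
bst = xgb.train({'objective': 'binary:logistic'}, dtrain,
                num_boost_round=10, evals=[(dtrain, 'train')]) `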
It is deployed on an on premise, secured network that has no access to the outside world.
Is it password protected or something of that nature?
Perhaps we could find a different solution or workaround, rather than solving a technical issue.
Solving it means allowing the python code to ask the JupyterLab server for the notebook file
However, once working with ClearML and using a venv (and not the default python kernel), ...
Are you saying on your specific setup (i.e. OpenShif...
@<1535793988726951936:profile|YummyElephant76> oh you mean like jupyter server was running, then inside the notebook you would start a new venv, and in that venv the "notebook" package was missing, hence it failed to detect the notebook?
btw, I looked deeper into the log:
File "/tmp/tmpfa8ifmka.py", line 80, in <module>
model.train(data='coco128.yaml',epochs=20)
I'm assuming this all starts here. I think the pipeline is not running the code from the same folder, and you are just missing the 'coco128.yaml'. Try to pass a full path, wdyt?
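A hedged sketch of the suggested fix, assuming the ultralytics YOLO API from the traceback (the paths and model file are placeholders):
` from ultralytics import YOLO

model = YOLO('yolov8n.pt')
# an absolute path, so the pipeline step finds the file regardless of its working directory
model.train(data='/abs/path/to/coco128.yaml', epochs=20) `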
PompousParrot44 That should be very easy to do, basically a service-mode script that clones a base task and puts it into a queue:
This should more or less do what you need :)
` from trains import Task
task = Task.init('devops', 'daily train', task_type='controller')
# stop the local execution of this code, and put it into the services queue,
# so we have a remote machine running it
task.execute_remotely('services')
while True:
    a_task = Task.clone(base_task_id='aaabb111')
    Task.enqueu...
Sure, this is basically a REST query 🙂
` from clearml.backend_api.session.client import APIClient
client = APIClient()
models = client.models.get_all(name='regexp', tags=['demo'], project=['project_id'])
print(models) `
files_server: ://genuin-ai/
should be:
files_server:
Make sure you have the S3 credentials in your agent's clearml.conf:
https://github.com/allegroai/clearml-agent/blob/822984301889327ae1a703ffdc56470ad006a951/docs/clearml.conf#L210
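For reference, a hedged sketch of the relevant section (the key names follow the linked clearml.conf template; the values are placeholders):
` sdk {
    aws {
        s3 {
            # default credentials, used for all buckets unless overridden per-bucket
            key: "AWS_ACCESS_KEY_ID"
            secret: "AWS_SECRET_ACCESS_KEY"
            region: "us-east-1"
        }
    }
} `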
I think the non-master processes are trying to log something, but have no Logger instance because they have no Task instance.
Hmm, is your code calling Logger.current_logger() directly?
Do the logs in the master process include all training history, or do I need to concatenate logs from different nodes somehow?
So the main problem is that you need to pass the TASK ID that the master node creates to the second node, so it can report to the same Task.
I know that the enterprise version of ClearML support...
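A hedged sketch of that hand-off (MASTER_CLEARML_TASK_ID is a hypothetical variable you would have to set and propagate yourself):
` import os
from clearml import Task

# on the second node: attach to the Task the master node created
task = Task.get_task(task_id=os.environ['MASTER_CLEARML_TASK_ID'])  # hypothetical env var
task.get_logger().report_scalar('loss', 'node-1', value=0.5, iteration=10) `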
Hi @<1569496075083976704:profile|SweetShells3>
Try to do:
import torch.distributed as dist
if dist.get_rank() == 0:
    task = Task.init(...)
This will make sure only the "master" process is logged
or
import os
if int(os.environ.get('RANK', '0')) == 0:
    task = Task.init(...)
BTW: why use the CLI? The idea of clearml is that it becomes part of the code, even in the development process. This means adding "Task.init(...)" at the beginning of the code; this creates the Tasks and logs them as part of the development, which means that executing them is essentially cloning and enqueuing in the UI. Of course you can also automate it directly as part of the code.
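A hedged sketch of the "automate it from code" part (the task id and queue name are placeholders):
` from clearml import Task

# clone a previously logged task and push the copy into an execution queue
cloned = Task.clone(source_task='aaabb111')
Task.enqueue(cloned, queue_name='default') `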
Hi SmugLizard24
The question is what is the reason for the issue?
That is a good question, could it be out of memory? (trying to compress or send the file in one chunk?)
BTW: we are now adding "dataset chunks" for more efficient large dataset storage
but I don't get to this line, because my task is already of type data_processing
Ohh I see now, it should have added the Tag regardless, you are correct.
Send the full Task log, you can DM it if that is easier
Hmm I think you are correct:
:param auto_create: Create new dataset if it does not exist yet
It should have created it, this seems like a bug, I'll make sure to pass it along 🙂
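For reference, a hedged sketch of the call being discussed (project/dataset names are placeholders):
` from clearml import Dataset

# get-or-create: with auto_create=True a missing dataset should be created
ds = Dataset.get(dataset_project='examples', dataset_name='my_dataset', auto_create=True) `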
/home/npuser/.clearml/venvs-builds/3.7/task_repository/commons-imagery-models-py
Yep I see it now, could you simulate it locally (i.e. have the other folders in the path as well)?
Could it be you also have a file somewhere that is called sfi or imagery or models or chip_classifier that it accidentally tries to import first?
Depends on what you want to do. What do you want to do?
GleamingGrasshopper63 can you ping your api server?
ping api.server.here
Also, what's the api server you configured? (ip:8008 ?)
Any chance this is a local machine, i.e. the colab machine cannot get back into the clearml server running locally?
DepressedChimpanzee34 I cannot find cfg.py here
https://github.com/allegroai/clearml/tree/master/examples/frameworks/hydra/config_files
(or anywhere else)