Thanks @RoughSeaturtle43
server certificate verification failed. CAfile: none CRLfile: none
Oh I see, this is an HTTPS issue inside the container, you need to mount your self-signed certificate.
Add something like this to your agent.conf:
extra_docker_arguments: ["-v", "/path/to/cert.pem:/etc/ssl/certs/myca.pem"]
Funny enough I’m running into a new issue now.
Sorry, my bad, I thought you knew 😉 yes, it probably should be packages=["clearml==1.1.6"]
BTW: do you have any imports inside the pipeline function itself? If you do not, there is no need to pass "packages" at all, it will just add clearml.
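For reference, a minimal sketch of what that looks like (assuming the decorator-based pipeline API; add_function_step takes the same packages argument):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["n_rows"], packages=["pandas", "clearml==1.1.6"])
def count_rows(csv_path):
    # the import lives inside the step, so it is declared in "packages"
    import pandas as pd
    return len(pd.read_csv(csv_path))
If the function body had no imports of its own, "packages" could be dropped entirely and only clearml would be added.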
So without the flush I got the error apparently at the very end of the script -
Yes... it's a Python thing: at interpreter shutdown, background threads can get killed in a random order, so when one of them needs a background thread that has already died you get this error. It basically means the work needs to be done in the calling thread.
This actually explains why calling Flush solved the issue.
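For reference, a minimal sketch of flushing explicitly from the main thread before the script exits (project/task names here are placeholders):
from clearml import Task

task = Task.init(project_name="examples", task_name="flush-at-exit")
# ... training / reporting code ...
# flush from the calling thread while the background workers are still alive,
# so pending reports and uploads are not lost at interpreter shutdown
task.flush(wait_for_uploads=True)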
Nice!
os.environ['CLEARML_PROC_MASTER_ID'] = ''
Nice catch! (I'm assuming you also called Task.init somewhere before, otherwise I do not think this was necessary)
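Roughly what that looks like, as a sketch (the assumption being that clearing the variable in the subprocess makes Task.init create its own task instead of re-attaching to the master process's task; names are placeholders):
import os
from clearml import Task

# clear the master-process marker so this subprocess is not treated as a child
os.environ['CLEARML_PROC_MASTER_ID'] = ''
task = Task.init(project_name="examples", task_name="subprocess-task")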
I think I solved it by deleting the project and running the base_task once before the hyperparameter optimization
So is it working now? Is everything there?
Hi PompousBeetle71
Try this one, let me know if it helped:
import logging
logging.getLogger('trains.frameworks').setLevel(logging.ERROR)
I want to schedule bulk tasks to run via agents, so I'm running Task.create
I see, that makes sense.
especially when dealing with submodules,
BTW: the submodule diff should always get stored, can you provide some error logs from the failing cases?
Before manually modifying the diff:
If you have local commits (i.e. un-pushed) this might fail the diff apply, in that case you can set the following in your clearml.conf:
store_code_diff_from_remote: true
https://github.com/allegroai/clear...
Hi EnchantingOstrich20
How does ClearML get it there?
At runtime it analyzes the code you are running, looking for imports, then checks the versions you actually used (i.e. in the active venv / python) and lists them there.
You can also override those in code, or edit them after you clone the task and before you enqueue it for remote execution.
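A minimal sketch of overriding a detected package in code (the package name/version and task names below are placeholders; Task.add_requirements has to be called before Task.init):
from clearml import Task

# pin a specific requirement so it replaces the auto-detected entry
Task.add_requirements("torch", "1.13.1")
task = Task.init(project_name="examples", task_name="override-packages")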
Thanks ShakyJellyfish91 this really helps to narrow it down!
Let me see what I can find
Maybe permissions?!
you can test it manually by installing pynvml
and running:
from pynvml.smi import nvidia_smi
nvsmi = nvidia_smi.getInstance()
nvsmi.DeviceQuery('memory.free, memory.total')
it seems it's following the path of the script I'm using for task.create, e.g.:
The folder it should run in is the script path you are passing (i.e. "script=ep_fn,").
A wrong path would imply that it is not finding the correct repository, is that the case?
Can you see the repo itself ? the commit id ?
*Actually looking at the code, when you call Task.create(...) it will always store the diff from the remote server.
Could that be the issue?
To edit the Task's diff:
task.update_task(dict(script=dict(diff='DIFF TEXT HERE')))
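Putting it together, a rough sketch (project, repo URL and script path are placeholders):
from clearml import Task

task = Task.create(
    project_name="examples",
    task_name="bulk-task",
    repo="https://github.com/org/repo.git",
    branch="main",
    script="path/to/entry_point.py",
)
# overwrite (or clear) the stored uncommitted diff before enqueueing
task.update_task(dict(script=dict(diff='')))
Task.enqueue(task, queue_name="default")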
the easiest way possible would be if I could just somehow run the task and let the LSF manage the environment
You mean let the LSF set the conda/venv ? or do you also mean to get the code-base, changes etc ?
can you get the agent to execute the task in the current conda env without setting up a new environment?
Wouldn't that break easily ? Is this a way to avoid dockers, or a specific use case ?
is there any other way to get a task from the queue running locally in the current conda env?
You mean including cloning the code etc. but not installing any python packages ?
(currently I think the implementation expects that if the download completed, it was successful)
Failing when passing the diff to the git command...
It should be the last line (or almost the last) of the log, is it there? Also, from the log it seems you are using trains 0.14.3; try with trains 0.15 and let me know if you are still missing packages.
Hi @SuccessfulRaven86
I'm assuming this relates to the SaaS service.
API calls are a way to measure usage; basically metric reports are bunched into a single call, agent pings / queries are API calls, and so on and so forth.
For how many hours did you have training tasks reporting data? How many agents were running? And so on.
If you create an initial code base maybe we can merge it?
the parameter datatypes are not being changed when loading them up.
These are the auto-logged parameters, inside YOLO, correct?
Just to make sure, you can actually see the value None in the UI, is that correct? (if everything works as expected, you should see an empty string there)
Are Kwargs supported in functions decorated as a pipeline component?
They are, but I think the main issue is the casting; without prior knowledge, everything will be a string.
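A sketch of working around that, under the assumption that type hints on the component signature are used for casting (anything arriving only through **kwargs has no declared type, so expect strings):
from clearml.automation.controller import PipelineDecorator

@PipelineDecorator.component(return_values=["total"])
def scale(value: float = 1.0, repeat: int = 2, **kwargs):
    # annotated arguments can be cast back; kwargs values are cast manually
    factor = float(kwargs.get("factor", 1))
    return value * repeat * factor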
The issue is upload progress reporting for http uploads (object storage will report upload progress). Basically the http upload is a POST with urllib that does not support upload callbacks for progress reporting. If you have an idea here, we will gladly add it (as you mentioned, it can be quite annoying to have to open the network manager to verify the upload is progressing).
ModelCheckpoint('best_model', save_best_only=True)
That worked for me now, what's the diff
Hi StickyWhale51
I think this issue is due to some internal race condition, anyhow I think we have an RC out solving it, can you try with:
pip install clearml==1.2.0rc2
Hi SmallDeer34
Can you see it in TB ? and if so where ?