Hi VexedCat68
txt file or pkl file?
If this is a string, it is just stored as-is (not as a file; this is considered a "link"):
https://github.com/allegroai/clearml/blob/12fa7c92aaf8770d770c8ed05094e924b9099c16/clearml/binding/artifacts.py#L521
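For example, a quick sketch of the two cases (names and paths are made up):

from pathlib import Path
from clearml import Task

task = Task.init(project_name="examples", task_name="artifact test")

# a plain string is stored as-is and shown as a "link" artifact (nothing is uploaded)
task.upload_artifact(name="my_link", artifact_object="s3://bucket/some/object")

# a Path object points to an actual file, so the file itself gets uploaded
task.upload_artifact(name="my_file", artifact_object=Path("data.pkl"))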
Hi RipeGoose2
So http://app.community.clear.ml already contains it.
The next release of the standalone server (a.k.a. clearml-server) will include it as well.
I think the ETA is the end of the year (i.e. 2 weeks), but I'm not sure about the exact timeframe.
Sounds good?
You can run md5 on the file as stored in the remote storage (NFS or S3).
S3 is implementation specific (i.e. MinIO, Weka, Wasabi, etc. might not support it), and I'm actually not sure regarding NFS (I mean you can run it, but it actually means you are reading the data; that said, NFS by definition should be relatively fast to access).
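A rough sketch of what that check could look like (bucket and paths are made up; boto3 for S3-compatible storage, a plain read for an NFS mount):

import hashlib
import boto3  # assuming boto3 access to the S3-compatible storage

def md5_of_stream(stream, chunk_size=8 * 1024 * 1024):
    # compute md5 incrementally so we never hold the whole file in memory
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

# S3 (or MinIO etc.): note this reads the object body to compute the hash
obj = boto3.client("s3").get_object(Bucket="my-bucket", Key="data/file.bin")
print(md5_of_stream(obj["Body"]))

# NFS mount: same idea, reading the file directly
with open("/mnt/nfs/data/file.bin", "rb") as f:
    print(md5_of_stream(f))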
wdyt?
Yeah, the docstring is always the most up to date 🙂
TrickySheep9
you are absolutely correct 🙂
Hi OddShrimp85
If you pass output_uri=True to Task.init, it will upload the model automatically; or, as you said, you can do it manually with the OutputModel class.
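Something like this (a sketch; project/task/file names are placeholders):

from clearml import Task, OutputModel

# output_uri=True uploads model snapshots to the default files server automatically
task = Task.init(project_name="examples", task_name="train", output_uri=True)

# or register/upload the weights manually with the OutputModel class
output_model = OutputModel(task=task)
output_model.update_weights(weights_filename="model.pt")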
Hi SmoggyGoat53
There is a storage limit on the file server (basically a 2GB per-file limit); this is the cause of the error.
You can upload the 10GB to any S3-like solution (or a shared folder). Just set the output_uri on the Task (either at Task.init or with task.output_uri = "s3://bucket").
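For example (bucket name is a placeholder):

from clearml import Task

# point all artifact/model uploads at your S3 (or S3-compatible) bucket
task = Task.init(project_name="examples", task_name="big artifacts",
                 output_uri="s3://my-bucket/clearml")

# or set it on an existing task object
task.output_uri = "s3://my-bucket/clearml"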
It uses only one CPU core; could I use multiprocessing somehow?
Hi EcstaticMouse10
Hmm, yes, it should be multi-core:
https://github.com/allegroai/clearml/blob/a9774c3842ea526d222044092172980ae505e24f/clearml/datasets/dataset.py#L1175
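Roughly, this is the usage that goes through that code path (dataset names are placeholders):

from clearml import Dataset

# get_local_copy() downloads the dataset chunks; the linked code should
# fan the downloads out over multiple workers
ds = Dataset.get(dataset_project="examples", dataset_name="my_dataset")
local_path = ds.get_local_copy()
print(local_path)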
wdyt?
ReassuredTiger98 no, but I might be missing something.
What do you mean by project-specific?
I can verify the behavior; I think it has to do with the way the subparser was set up.
This was the only way for me to get it to run:
script.py test blah1 blah2 blah3 42
When I passed specific arguments (for example --steps) it ignored them...
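To illustrate, a minimal subparser setup matching that invocation (argument names are made up):

import argparse

parser = argparse.ArgumentParser()
subparsers = parser.add_subparsers(dest="command")

# "test" subcommand with free positionals plus an optional flag
test_parser = subparsers.add_parser("test")
test_parser.add_argument("values", nargs="*")  # blah1 blah2 blah3 42
test_parser.add_argument("--steps", type=int, default=1)

args = parser.parse_args()
print(args.command, args.values, args.steps)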
Done 🙂
Test it on your local setup (I would hate to push a broken fix)
Is that possible?
So can you verify it can download the model?
Sure. JitteryCoyote63, so what was the problem? Can we fix something?
I try to add it to ClearML Serving, but it calls the "forward" method by default
If this is the case, then the statement above is odd to me; if this is a custom engine, who exactly is calling "forward"?
(in your code example you specifically call generate, as you should)
- Correct. Basically the order is: REST API body dictionary -> preprocess -> process -> postprocess -> REST API dictionary return.
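In code terms, the custom engine module looks roughly like this (a sketch based on the clearml-serving examples; treat the exact signatures as an assumption):

from typing import Any

class Preprocess:
    # entry points called by clearml-serving, in the order listed above

    def preprocess(self, body: dict, state: dict, collect_custom_statistics_fn=None) -> Any:
        # REST API body dictionary -> model input
        return body["data"]

    def process(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> Any:
        # this is where you would call generate() on your custom engine
        return data

    def postprocess(self, data: Any, state: dict, collect_custom_statistics_fn=None) -> dict:
        # model output -> REST API dictionary return
        return {"result": data}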
Thanks ShakyJellyfish91 this really helps to narrow it down!
Let me see what I can find
pip install clearml==1.0.6rc2
Did not work?!
Okay, this points to an issue with the k8s glue; I think it somehow failed to launch the pod. Can you send me the log of the clearml-k8s-glue?
PompousBeetle71 you can check this example:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_torch_distributed.py
I think it should help; if you want a more manual approach, you can check the Popen subprocesses here:
https://github.com/allegroai/trains/blob/master/examples/distributed/example_subprocess.py
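The manual approach boils down to something like this (a sketch; worker.py is a placeholder script):

import subprocess
import sys

# spawn one worker subprocess per rank and wait for all of them to finish
workers = [
    subprocess.Popen([sys.executable, "worker.py", "--rank", str(rank)])
    for rank in range(4)
]
for worker in workers:
    worker.wait()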
JitteryCoyote63, just making sure, does refreshing fix the issue?
Hi HollowFish37
I think I have good news for you: the clearml-agent only communicates with the API endpoint, so as long as that is secure, you should be fine. Do notice that the default files server endpoint should be secured as well, as by default it will allow any upload/download.
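Concretely, these are the endpoints the agent's clearml.conf points at; all three should sit behind https (hosts below are placeholders):

api {
    web_server: https://app.example.com
    api_server: https://api.example.com
    files_server: https://files.example.com
}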
A single query will return whether the agent is running anything, and for how long, but I do not think you can get the idle time...
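If you want to run that query yourself, something like this should work (a sketch using the APIClient; I'd double-check the exact fields on the returned worker entries):

from clearml.backend_api.session.client import APIClient

client = APIClient()
for worker in client.workers.get_all():
    # each worker entry reports the task it is currently running (if any)
    print(worker.id, getattr(worker, "task", None))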
PipelineController works with the default image, but it incurs an overhead of 4-5 min
You can try to spin up the "services" queue agent without docker support; if there is no need for containers, it will speed up the process.
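i.e. launching the services agent without the --docker flag:

clearml-agent daemon --queue services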
Repository cloning failed: Command '['git', 'fetch', '--all', '--recurse-submodules']' returned non-zero exit status 1.
This error is about failing to clone the pipeline code repo; how is that connected to changing the container?!
Can you provide the full log?
Where are they stored? I could not find a backend they work with; what am I missing?
The agent cannot use another user (it literally has no way of getting credentials). I suspect this is all a byproduct of the actual mount point.