I see, so it's a path. Another question: as far as I can tell, clearml-data will download the entire dataset before starting training. This isn't ideal when we are dealing with datasets with billions of samples (e.g. we might want to download a subset at a time, send it to the GPU for training, and then use the CPU to concurrently pull another subset). Any comments on this?
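Roughly the kind of overlap I have in mind (just a sketch on my side; fetch_subset and train_on are hypothetical placeholders, not ClearML calls):

```python
# Sketch: prefetch the next subset on the CPU while the GPU trains on
# the current one. fetch_subset / train_on are placeholders, not ClearML APIs.
from concurrent.futures import ThreadPoolExecutor

def fetch_subset(idx):
    """Download subset `idx` of the dataset (placeholder)."""
    ...

def train_on(subset):
    """Run one training pass on the GPU (placeholder)."""
    ...

def train_all(num_subsets):
    with ThreadPoolExecutor(max_workers=1) as pool:
        future = pool.submit(fetch_subset, 0)
        for idx in range(num_subsets):
            subset = future.result()
            if idx + 1 < num_subsets:
                # start pulling the next subset while we train on this one
                future = pool.submit(fetch_subset, idx + 1)
            train_on(subset)
```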
Hi, any idea if I can achieve this? I just need a list of usernames.
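If it helps, something along these lines with the backend APIClient is what I was hoping for (just a sketch; I haven't verified the exact response fields):

```python
# Sketch: list user names via the ClearML backend API client.
# The .name field on the returned user objects is an assumption.
from clearml.backend_api.session.client import APIClient

client = APIClient()  # picks up credentials from clearml.conf
users = client.users.get_all()
print([user.name for user in users])
```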
I did another test by running kubectl exec pod-name -- echo $PIP_INDEX_URL and it returned nothing. So the env variables are not passed to the container at all.
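One caveat with my check above: $PIP_INDEX_URL gets expanded by the local shell before kubectl ever runs, so to read it from inside the container the command should be something like this (pod-name is a placeholder):

```
# expand the variable inside the container, not in the local shell
kubectl exec pod-name -- printenv PIP_INDEX_URL
# or
kubectl exec pod-name -- sh -c 'echo $PIP_INDEX_URL'
```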
Hi TimelyPenguin76 ,
If you notice in the last screenshot, it states the bucket name to be http://ecs.ai . It then tries to open http://s3.amazonaws.com/ecs.ai/clearml-models/artifact/uploading_file?X-Amz-Algorithm= ....
Sorry AgitatedDove14 , can you bump me to that thread?
Hi FriendlySquid61 , AgitatedDove14 , the issue and a possible fix are described in the issue I raised: https://github.com/allegroai/clearml-agent/issues/51
It also stopped taking in tasks from the queue after that.
Hi AgitatedDove14 , I changed everything to CUDA 10.1 and tried again, but got the same error. The section is as follows. I made sure torch==1.6.0+cu101 and torchvision==0.8.2+cu101 are in the PyPI repo, but the same error still came up.
```
# Python 3.6.9 (default, Oct 8 2020, 12:12:24) [GCC 8.4.0]
boto3 == 1.14.56
clearml == 0.17.4
numpy == 1.19.1
torch == 1.6.0
torchvision == 0.7.0

Detailed import analysis
**************************
IMPORT PACKAGE boto3
clearml.storage: 0
IMPORT PACKAG...
```
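One workaround I'm considering is forcing the CUDA 10.1 wheel from the task side before Task.init, roughly like this (not sure this is the intended approach, so treat it as a sketch; project/task names are placeholders):

```python
# Sketch: pin the cu101 wheel in the task requirements before Task.init,
# so the agent tries to resolve torch==1.6.0+cu101 from our PyPI repo.
from clearml import Task

Task.add_requirements("torch", "1.6.0+cu101")
task = Task.init(project_name="my_project", task_name="cuda101_training")
```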
Hi.
We tried as advised above and it still didn't work.
Host: http://ecs.ai:443
output_uri = S3://ecs.ai:443/bucketname
This time round the client gave this error.
botocore.exceptions.ConnectionClosedError: Connection was closed before we received a valid response from endpoint URL: ' http://ecs.ai/bucketname/.clearml.test '.
It's quite apparent that whatever clearml passes to boto3 ends up as an HTTP call instead of HTTPS, which is wrong.
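For reference, this is roughly how the endpoint is configured in our clearml.conf (keys redacted; the secure flag is what we expect to force HTTPS for the non-AWS endpoint, please correct us if that assumption is wrong):

```
sdk {
  aws {
    s3 {
      credentials: [
        {
          # non-AWS, S3-compatible endpoint
          host: "ecs.ai:443"
          key: "REDACTED"
          secret: "REDACTED"
          multipart: false
          secure: true  # expected to force HTTPS
        }
      ]
    }
  }
}
```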
Hi AgitatedDove14 , I was referring to task.set_base_docker("nvcr.io/nvidia/tensorflow:19.11-tf2-py3 --env TRAINS_AGENT_GIT_USER=git_username_here --env TRAINS_AGENT_GIT_PASS=git_password_here") . The above gives the error: skipping docker argument TRAINS_AGENT_GIT_USER=git_username_here (only -e --env supported) TRAINS_AGENT_GIT_PASS=git_username_here (only -e --env supported)
Hi, currently the ClearML SDK only supports Python. If I want to run my ML in other languages, can I use an SDK in that language? Or is there some other means, such as Web API calls, that does the same as the SDK?
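For example, something along these lines is what I'd hope to be able to do from another language (the endpoint paths and auth flow here are my own guess, so please correct me):

```
# 1) exchange the app credentials for a token
curl -u ACCESS_KEY:SECRET_KEY https://api.clear.ml/auth.login

# 2) call an endpoint with the returned token
curl -H "Authorization: Bearer <token>" \
     -H "Content-Type: application/json" \
     -d '{}' \
     https://api.clear.ml/tasks.get_all
```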
Thanks, could you share the URL to this full API documentation?
Ah ok. So it will be fixed on the ClearML server web UI as well? (See screenshots).
Previously we had similar issues when we switched the images used in the agent. Might want to check on that.
Ah ok, so if I see Jax's workspace on https://app.community.clear.ml/dashboard , then I'm on the right track? How regularly does this reset, then?
Hi, is this currently not working? http://app.community.clear.ml ? I noticed that the ClearML UI will cache in the browser, and if the backend is not running, it's not clear to the user that something is wrong (except for broken pages).
It would make sense on a very large resource cluster. Unfortunately we have fewer than 50 GPUs to share across. A multi-tenant SaaS would cut the resources into even smaller clusters and not help with efficiency. Or would you have a suggestion?
Yes it is! But ClearML doesn't support multi-node training out of the box in a way that streamlines the process, so we are trying to figure out a way to do it.
Thanks SuccessfulKoala55 , how might I do this cleanup? Does this increase with more use of ClearML? And to add, we save all artifacts onto a remote S3 server.
Yes, it's on purpose; each user would have their own AWS credentials for default_output_uri.
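i.e. each person keeps something like this in their own clearml.conf (the bucket here is just an example value):

```
sdk {
  development {
    # per-user destination for models and artifacts
    default_output_uri: "s3://ecs.ai:443/bucketname"
  }
}
```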
What feature on this paid roadmap are you referring to? I am indeed communicating with Noem on paid features.
Hi SuccessfulKoala55 , thanks. Opened an issue on the ClearML Agent GitHub at https://github.com/allegroai/clearml-agent/issues/67
```
python k8s_glue_example.py --queue gpu --namespace default
Traceback (most recent call last):
  File "k8s_glue_example.py", line 86, in <module>
    main()
  File "k8s_glue_example.py", line 80, in main
    namespace=args.namespace,
  File "/home/administrator/clearml-agent-k8s/venv/lib/python3.6/site-packages/clearml_agent/helper/base.py", line 239, in __call__
    cls._instances[cls] = super(Singleton, cls).__call__(*args, **kwargs)
TypeError: __init__() got an unexpected keyword argument 'base_pod...
```
The hackathon is 3 days.
What type of pipeline steps are you running? From task, decorator or function?
We were trying with 'from task' at the moment. But the question applies to all methods.
If they're all running on the same container, why not make them the same task and do things in parallel?
The tasks were created by different teams, and the tasks' content is rather independent and modular. Usage of them is usually optional. For example, task1 performs 'image whitening' and task2 performs 'image resize'.
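To make it concrete, the 'from task' version we are testing looks roughly like this (assuming a recent clearml version; project/task names below are placeholders):

```python
# Sketch of a 'from task' pipeline: each step clones an existing task
# owned by another team. Names below are placeholders.
from clearml import PipelineController

pipe = PipelineController(
    name="preprocessing_pipeline",
    project="examples",
    version="0.1",
)
pipe.add_step(
    name="image_whitening",
    base_task_project="examples",
    base_task_name="task1 - image whitening",
)
pipe.add_step(
    name="image_resize",
    base_task_project="examples",
    base_task_name="task2 - image resize",
    parents=["image_whitening"],
)
pipe.start(queue="gpu")
```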