Would it suffice to provide the git credentials ...
That should be enough, basically this is where they should be:
https://github.com/allegroai/clearml-agent/blob/0462af6a3d3ef6f2bc54fd08f0eb88f53a70724c/docs/clearml.conf#L18
Hi JollyChimpanzee19
What are the versions (clearml , TF , PT), also could you add one more line from the stack (I.e. which call triggered the exception)
What's the trains-server version ?
Now I'm just wondering if I could remove the PIP install at the very beginning, so it starts straightaway
AbruptCow41 CLEARML_AGENT_SKIP_PYTHON_ENV_INSTALL=1 does exactly that. BTW, I would just set the venv cache, which means it will be able to restore the entire thing (even if you have changed the requirements)
https://github.com/allegroai/clearml-agent/blob/077148be00ead21084d63a14bf89d13d049cf7db/docs/clearml.conf#L115
It seems like you are correct, everything should just work. Are you still getting the error? What's the clearml agent version?
Hi TrickyRaccoon92 , TB is automatically collected and converted into data stored on the system. The UI uses plotly to display the data itself (in your web browser).
You still have the original TB protobuf file, if you want to dive deeper and debug the data (it is not automatically uploaded, but some users do upload it as additional artifact on the experiment)
Make sense ?
hmm that would explain it failing
Three options:
1. In your code: Task.init(..., output_uri='s3://.../')
2. Configure a default output_uri to be used by all tasks: https://github.com/allegroai/clearml/blob/64042f6c4fdaaf15b6c5f816f2fbf50f89c313e2/docs/clearml.conf#L156
3. In the UI, after you clone a Task, under the Execution tab: "output" "destination"
In all cases output_uri can be:
/mnt/share/folder (if you have a shared folder between all machines)
http://trains-server:8081/
gs://bucket
azure://bucket/
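To make the first option concrete, here is a minimal sketch in Python. The helper names `check_output_uri` and `start_task`, and the project and task names, are made up for illustration; only `Task.init(..., output_uri=...)` and the destination forms listed above come from the answer. The scheme check is just a local sanity check, not something ClearML requires.

```python
from urllib.parse import urlparse

# Destination schemes mentioned in the answer above.
KNOWN_SCHEMES = {"s3", "gs", "azure", "http"}


def check_output_uri(uri: str) -> str:
    """Return the kind of destination, or raise ValueError (hypothetical helper)."""
    scheme = urlparse(uri).scheme
    if not scheme:
        # No scheme: treat an absolute path as a shared folder.
        if uri.startswith("/"):
            return "shared-folder"
        raise ValueError(f"not an absolute path or URL: {uri!r}")
    if scheme not in KNOWN_SCHEMES:
        raise ValueError(f"unexpected scheme {scheme!r} in {uri!r}")
    return scheme


def start_task(output_uri: str):
    """Create a Task whose outputs are uploaded to output_uri."""
    # Imported here so the helper above works without clearml installed.
    from clearml import Task

    check_output_uri(output_uri)
    # Project and task names are placeholders.
    return Task.init(
        project_name="examples",
        task_name="output_uri demo",
        output_uri=output_uri,
    )
```

Calling `start_task("/mnt/share/folder")` would need a configured ClearML setup; `check_output_uri` can be tried on its own.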
This is because we have a pub-sub architecture that we already use, it can handle retries, etc. also we will likely want multiple systems to react to notifications in the pub sub system. We already have a lot of setup for this.
How would you integrate with your current system? Do you have a REST API or similar to trigger an event?
but I was hoping ClearML had a straightforward way to somehow represent ALL ClearML events as JSON so we could land them in our system.
Not sure I'm followi...
JitteryCoyote63 I think that with 0.17.2 we stopped mounting the venv build to the host machine. Which means it is all stored inside the docker.
JitteryCoyote63
IAM role to the web app could access
you mean the web client key/secret to access S3 data ?
Hi @<1556812486840160256:profile|SuccessfulRaven86>
I'm assuming this relates to the SaaS service.
API calls are a way to measure usage: basically metric reports are bunched into a single call, each agent ping / query is an API call, and so on.
How many hours did you have training tasks reporting data? How many agents were running, and so on?
Sorry @<1798525199860109312:profile|IntriguedGoldfish14> just noticed your reply
Yes, two inference containers, running simultaneously on the cluster. As you said, each one with its own environment (assuming here that the requirements of the models collide)
Make sense
BTW: you will be losing the comments
An easier fix for now will probably be some kind of warning to the user that a task is created but not connected
That is a good point, maybe if you do not have a "main" Task, then we print the warning (with some flag to disable the warning) ?
Hi @<1534706830800850944:profile|ZealousCoyote89>
We'd like to have pipeline A trigger pipeline B
Basically a Pipeline is a Task (of a specific Type), so you can have pipeline A function clone/enqueue the pipelineB Task, and wait until it is done. wdyt?
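A rough sketch of that idea in Python, to be called from a step or function inside pipeline A. The function name `trigger_pipeline`, the clone name, and the `task_cls` parameter are hypothetical (the last one only exists so the flow can be tried with a stand-in class); the calls themselves are `Task.get_task`, `Task.clone`, `Task.enqueue` and `wait_for_status`. It assumes you already know the task ID of pipeline B.

```python
def trigger_pipeline(pipeline_task_id, queue_name, task_cls=None):
    """Clone an existing pipeline Task, enqueue the clone, and wait for it."""
    if task_cls is None:
        # Default to the real ClearML Task class (needs clearml installed).
        from clearml import Task as task_cls

    # A pipeline is a Task, so it can be fetched like any other Task.
    template = task_cls.get_task(task_id=pipeline_task_id)

    # Clone it so the original pipeline B Task is left untouched.
    run = task_cls.clone(
        source_task=template,
        name="pipeline B (triggered by pipeline A)",
    )

    # Send the clone to a queue that an agent is listening on.
    task_cls.enqueue(run, queue_name=queue_name)

    # Block until the clone reaches a final state.
    run.wait_for_status()
    return run.id
```

The queue name is whatever queue your agents serve for pipeline controllers; that part depends on your setup.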
Hi DeliciousBluewhale87
I think we had a docker that does exactly that, and then you would spin the docker as a k8s service , is this what you are referring to?
This would be my only improvement, otherwise awesome!!!
output_model.update_weights(weights_filename=os.path.join(training_data_path, 'runs', 'train', 'yolov5s6_results', 'weights', 'best.onnx'))
How did you add the args? Is it argparse? If so, the help is automatically picked up, so you can see it in the UI. BTW, the ability to provide a list of options is a really cool feature to have, I'll make sure to pass it to product
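For reference, a small sketch of what that could look like with argparse. The argument names, defaults, and project/task names are invented for illustration. The `help` strings are what the answer says gets picked up; the `choices` list is enforced by argparse itself when parsing, and nothing here assumes the UI shows it as a list of options.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with help text on every argument (illustrative names)."""
    parser = argparse.ArgumentParser(description="Training script")
    parser.add_argument(
        "--lr", type=float, default=0.001,
        help="Learning rate",
    )
    parser.add_argument(
        "--optimizer", choices=["sgd", "adam"], default="adam",
        help="Optimizer to use",
    )
    return parser


def main():
    # Imported here so build_parser() can be tried without clearml installed.
    from clearml import Task

    # Task.init is called before parsing so the arguments are logged.
    Task.init(project_name="examples", task_name="argparse demo")
    args = build_parser().parse_args()
    print(args.lr, args.optimizer)
```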
Wtf? can you try with = (notice single not double)?
channels:
- defaults
- conda-forge
- pytorch
dependencies:
- cudatoolkit=11.1.1
- pytorch=1.8.0
Hi GreasyPenguin14
Could you tell me what the differences are and why we should use ClearML data?
The first difference is in the approach itself, DVC ties the data with the code (i.e. git repo), where we (ClearML - but not just us) actually think data should be abstracted from the Code-Base and become a standalone argument, allowing users to build/execute against different dataset/versions. ClearML Data becomes part of the workflow as it is visible from the UI including the abili...
VictoriousPenguin97 I'm not sure there is an easy solution, basically you have to edit both MongoDB (artifacts) and Elastic (think debug samples)
Hi CluelessElephant89
Hi guys, if I spot issue with documentations, where should I post them?
The best way, from our perspective, is to PR the fix; this is why we put it on GitHub
But it's running in docker mode and it is trying to ssh into the host machine and failing
It is not sshing to the machine, it is sshing directly into the container.
Notice the port it is sshing to is 10022, which is mapped into the container
Hi OutrageousGrasshopper93
Are you working with venv or docker mode?
Also notice that if you need all gpus you can pass --gpus all
these are being repeated as well for a single task (this is training a t5_model with transformers):
Seems like someone is storing lots of files with torch.save that ClearML automatically logs.
You can disable the autolog:
task = Task.init(..., auto_connect_frameworks={'pytorch': False})
Lol yeah Hydra is great. Notice you still have the ability to override Hydra from the UI, so you really have the best of both worlds