there is no internal scheduler in Trains
So there actually is a scheduler built into Trains: the queues (order / priority).
What is missing from it is multi node connection, e.g. I need two agents running the exact same job working together.
(as opposed to, I have two jobs, execute them separately when a resource is available)
Actually my suggestion was to add a SLURM integration, like we did with k8s (I'm not suggesting Kubernetes as a solution for you, the op...
Hi ReassuredTiger98
I do not want to create extra queues for this since this will not be able to properly distribute tasks.
Queues are the way to abstract different resources into "compute capabilities". They create a simple interface for users on the one hand, and allow you to control the compute on the other. Agents can listen to multiple queues with priority. This means an RTX agent can pull from an RTX queue, and if that is empty, it will pull from the "default" queue. Would that work for ...
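For example, on the user side, enqueuing a Task into a specific queue is a single call. A minimal sketch (the task ID and the "rtx" queue name are just placeholders):
from clearml import Task

# fetch an existing task (the ID here is a placeholder)
task = Task.get_task(task_id="aabbccdd12345678")
# push it into the "rtx" queue; an agent listening on ["rtx", "default"]
# would pick it up from "rtx" before falling back to "default"
Task.enqueue(task, queue_name="rtx")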
, I generate some more graphs with a file called graphs.py and want to attach/upload to this training task
Makes total sense to use Task.get_task, I just want to make sure that you are aware of all the options, so you pick the correct one for you :)
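For reference, a rough sketch of what reporting from a separate graphs.py into the existing training task could look like (the project/task names here are placeholders):
from clearml import Task
import matplotlib.pyplot as plt

# re-connect to the existing training task (names are placeholders)
task = Task.get_task(project_name="examples", task_name="training")
logger = task.get_logger()

# build a figure in graphs.py and report it into the training task
fig = plt.figure()
plt.plot([1, 2, 3], [4, 5, 6])
logger.report_matplotlib_figure(title="extra graphs", series="graphs.py", figure=fig, iteration=0)
logger.flush()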
What do you mean by "modules first and find a way to install that package" ?
Are those modules already in wheels? Are they part of a git repository?
(the pipeline component can also start inside a git repository it clones)
Hmm we might need more detailed logs ...
When you say there is a lag, what exactly does that mean? If you have enough apiserver instances answering the requests, the bottleneck might be the mongo or the elastic?
SmallBluewhale13
And the Task.init registers 0.17.2, even though it prints (while running the same code from the same venv) 0.17.2?
Hi @<1523704152130064384:profile|SmallGiraffe94>
Yes it is possible!
You can set the User Properties of a dataset when creating the Dataset, with the snippet below. A bit hackish, but it should work:
from clearml import Dataset, Task

# create the dataset and set a user property on its backing task
dataset = Dataset.create(dataset_project="project", dataset_name="name")
dataset._task.set_user_properties(key="value")

# later, query the dataset tasks (datasets live under "<project>/.datasets/<name>")
dataset_ids = Task.query_tasks(
    project_name=["project/.datasets/name"],
    task_filter=dict(
        type=[str(Task.TaskTypes.data_processing)],
        exact_match_regex_flag=False,
        ...
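And if it helps, reading the property back could look roughly like this (just a sketch, double-check get_user_properties against your clearml version):
# fetch the first matching dataset task and read its user properties back
dataset_task = Task.get_task(task_id=dataset_ids[0])
print(dataset_task.get_user_properties())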
, I thought there would be some hooks for deploying where the integration with k8s was also taken care of automatically.
Hi ObedientToad56
Yes, you are correct. Basically right now you have a docker-compose spinning up everything (although you can also, for example, spin up a standalone container, mostly for debugging).
We are working on a k8s Helm chart so the deployment is easier; it will be based on this docker-compose:
https://github.com/allegroai/clearml-serving/blob/main/docker/docker-comp...
CLI? Code ?
I can't seem to figure out what the names should be from the pytorch example - where did INPUT__0 come from
This is actually the layer name in the model:
https://github.com/allegroai/clearml-serving/blob/4b52103636bc7430d4a6666ee85fd126fcb49e2e/examples/pytorch/train_pytorch_mnist.py#L24
Which is just the default name Pytorch gives the layer
https://discuss.pytorch.org/t/how-to-get-layer-names-in-a-network/134238
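For example, a quick way to see the default names PyTorch assigns is to iterate over the model's modules; a small stand-alone sketch (the model below is just a stand-in, not the one from the example):
import torch.nn as nn

# a stand-in model, only to illustrate inspecting layer names
model = nn.Sequential(nn.Linear(28 * 28, 128), nn.ReLU(), nn.Linear(128, 10))

# print the default names PyTorch gives each layer
for name, module in model.named_modules():
    print(name or "<root>", "->", module.__class__.__name__)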
it appears I need to convert it into TorchScript?
Yes, this ...
This looks exactly like the timeout you are getting.
I'm just not sure what the difference is between the Model auto-upload and the manual upload.
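For context, by "manual upload" I mean something along these lines (a sketch; the project/task names and weights file are placeholders):
from clearml import Task, OutputModel

task = Task.init(project_name="examples", task_name="manual model upload")

# explicitly register a weights file and upload it to the task's output storage
output_model = OutputModel(task=task)
output_model.update_weights(weights_filename="model.pt")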
ScantMoth28 it should work. I think the default deployment also has an NGINX reverse proxy on it, switching from " http://clearml-server.domain.com/api " to " http://api.clearml-server.domain.com "
Hi StrangePelican34
What exactly is not working? Are you getting any TB reports?
Hi MortifiedCrow63
saw ...
By default ClearML will only log the exact local place where you stored the file, I assume this is it.
If you pass output_uri=True to the Task.init it will automatically upload the model to the files_server, and then the model repository will point to the files_server (you can also have any object storage as model storage, e.g. output_uri=s3://bucket).
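A quick sketch (the project/task names and bucket path are placeholders):
from clearml import Task

# upload models to the files_server automatically
task = Task.init(project_name="examples", task_name="training", output_uri=True)

# or point model storage at your own object storage instead
# task = Task.init(project_name="examples", task_name="training", output_uri="s3://my-bucket/models")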
Notice yo...
Thank you!
one thing I noticed is that it's not able to find the branch name on >=1.0.6x, while on 1.0.5 it can
That might be it! let me check the code again...
I was able to successfully enqueue the task but only the entrypoint script is visible to it and nothing else.
So you passed a repository link but it did not show on the Task?
What exactly is missing, and how was the Task created?
ShakyJellyfish91 can you check if version 1.0.6rc2 can find the changes?
Thanks ShakyJellyfish91 ! please let me know what you come up with, I would love for us to fix this issue.
ldconfig from /etc/profile which is put there by the interactive_session_task
LackadaisicalOtter14 are you sure? Maybe this is done as part of the installation the interactive session runs?
Could that be the issue?
apt-get update && apt-get install -y openssh-server
BTW: I tested the code you previously attached, and it showed the plot in the "Plots" section
(Tested with latest trains from GitHub)
Could you amend the original snippet (or verify that it also produces plots in debug samples) ?
(Basically I need something that I can run)
Do you have any experience and things to watch out for?
Yes, for testing start with cheap node instances.
If I remember correctly everything is preconfigured to support GPU instances (aka nvidia runtime).
You can take one of the templates from here as a starting point:
https://aws.amazon.com/blogs/compute/running-gpu-accelerated-kubernetes-workloads-on-p3-and-p2-ec2-instances-with-amazon-eks/
I see, let me check something.
Hi LackadaisicalOtter14
However, whenever we spin up a session, ... always gets run and overwrites our configs
What do you mean by that?
Which configs are being overwritten? (Generally speaking, it just adds the OS environment it needs for the setup process.)
I want to schedule bulk tasks to run via agents, so I'm running create
I see, that makes sense.
especially when dealing with submodules,
BTW: submodule diff should always get stored, can you provide some error logs on fail cases?
Before manually modifying the diff:
If you have local commits (i.e. un-pushed) this might fail the diff apply; in that case you can set the following in your clearml.conf:
store_code_diff_from_remote: true
https://github.com/allegroai/clear...
Hi GrievingTurkey78
Can you test with the latest clearml-agent RC? (I remember a fix just for that)
pip install clearml-agent==1.2.0rc0
I do not think this is the upload timeout; it makes no sense to me for the GCP package (we do not pass any timeout, it's their internal default for the argument) to include a 60 sec timeout for upload...
I'm also not sure where the timeout originates (I'm assuming the initial GCP handshake connection could not actually time out, as the response should be relatively quick, so 60 sec is more than enough)