WickedGoat98 did you set up a machine with trains-agent pulling from the "default" queue ?
Hi VexedCat68
Could it be you are trying to update a committed dataset?
The experiment finished completely this time again
With the RC version or the latest ?
so 78000 entries ...
wow, a lot! would it make sense to do 1GB chunks? any reason for the initial 1MB chunk size?
Hi @<1692345677285167104:profile|ThoughtfulKitten41>
Is it possible to trigger a pipeline run via API?
Yes! A pipeline is, at the end of the day, a Task; you can take the pipeline ID, then clone and enqueue it:
from clearml import Task
# Clone the existing pipeline Task, then enqueue the clone for execution
pipeline_task = Task.clone(source_task="pipeline_id_here")
Task.enqueue(pipeline_task, queue_name="services")
You can also monitor the pipeline with the same Task interface.
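For example, a minimal sketch (the status-polling calls below are my assumption of what you might use here, not something specific to pipelines):
# Block until the cloned pipeline reaches a final state, then print it
pipeline_task.wait_for_status()
print(pipeline_task.get_status())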
wdyt?
Hi DrabOwl94
I think, if I understand you correctly, you have a lot of chunks (which translate to a lot of links to small 1MB files, because this is how you set up the chunk size). Now apparently you have reached the maximum number of chunks per specific Dataset version (at the end this meta-data is stored in a document with limited size, specifically 16MB).
How many chunks do you have there?
(In other words what's the size of the entire dataset in MBs)
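If it comes to recreating the dataset version, a minimal sketch of uploading with larger chunks (the names/paths here are hypothetical; chunk_size is in MB):
from clearml import Dataset
# Hypothetical example: new dataset version uploaded in ~1GB chunks
ds = Dataset.create(dataset_name="my_dataset", dataset_project="my_project")
ds.add_files("/path/to/data")
ds.upload(chunk_size=1024)  # chunk size in MB, so 1024 == ~1GB per chunk
ds.finalize()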
Hi VexedCat68
What type of data is it? And what type of annotations?
Streaming data into the training process is great, but is it post quality control?
Hi @<1543766544847212544:profile|SorePelican79>
You want the pipeline configuration itself, not the pipeline component, correct?
from clearml import Task
# From inside a pipeline step, fetch the parent Task (the pipeline itself)
pipeline = Task.get_task(Task.current_task().parent)
conf_text = pipeline.get_configuration_object(name="config name")
conf_dict = pipeline.get_configuration_object_as_dict(name="config name")
To be honest, I'm not sure I have a good explanation on why ... (unless on some scenarios an exception was thrown and caught silently and caused it)
Hi, I was expecting to see the container rather than the actual physical machine.
It is the container; it should tunnel directly into it (or that's how it should be).
SSH port 10022
According to you the VPN shouldn't be a problem right?
Correct, as long as all parties are on the same VPN it should work; all the connections are always HTTP, so basically trivial communication
Are you running the agent in docker mode ?
Is there a mount to the host machine ?
Hi QuaintPelican38
Assuming you have opened the default SSH port 10022 on the EC2 instance (and assuming the AWS permissions are set so that you can access it), you need to use the --public-ip flag when running clearml-session. Otherwise it "thinks" it is running on a local network and registers itself with the local IP. With the flag on, it gets the public IP of the machine, and then the clearml-session running on your machine can connect to it.
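Something like this (the queue name is just an example, and I'm assuming the flag takes true/false):
clearml-session --queue aws_ec2_queue --public-ip true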
Make sense ?
This means that you guys internally catch the argparser object somehow, right?
Correct 🙂 this is how you get the type checking / casting abilities, and a few other perks
but clearml-agent will still raise the same error
which one?
Yes, the agent's mode is global, i.e. all tasks are either inside docker or in venv. In theory you can have two agents on the same machine, one venv and one docker, listening to two different queues
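For example (queue names are hypothetical):
clearml-agent daemon --queue venv_queue --detached
clearml-agent daemon --queue docker_queue --docker --detached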
Sorry, what I meant is that it is not documented anywhere that the agent should run in docker mode, hence my confusion
This is a good point! I'll make sure we stress it (BTW: it will work with elevated credentials, but probably not recommended)
This is the reason you are getting an error 🙂
Basically the session asks the agent to set up a new SSH server with credentials on the remote machine. This is not an issue inside a container, as it is an isolated environment, but when running in venv mode the user running the agent is not root, hence it cannot spin up/configure an SSH server.
Make sense ?
Sometimes it is working fine, but sometimes I get this error message
@<1523704461418041344:profile|EnormousCormorant39> can I assume there is a gateway at --remote-gateway <internal-ip> ?
Could it be that this gateway has some network firewall blocking some of the traffic ?
If this is all local network, why do you need to pass --remote-gateway ?
worker nodes are bare metal and they are not in k8s yet
By default the agent will use 10022 as an initial starting port for running the sshd that will be mapped into the container. This has nothing to do with the Host machine's sshd. (I'm assuming agent running in docker mode)
Btw it seems the docker runs in network=host
Yes, this is so if you have multiple agents running on the same machine they can find a new open port 🙂
I can telnet the port from my mac:
Okay this seems like it is working
It does not use key auth, instead sets up some weird password and then fails to auth:
AdventurousButterfly15 it SSHs into the container; inside the container it sets up a new daemon with a new random, very long password
It will Not ssh to the host machine (i.e. the agent needs to run in docker mode, not venv mode), make sense ?
I mean if I enter my host machine ssh password it works. But we will disable password auth in future, so it's not an option
To clarify, it should not allow users to ssh into the host machine (if you can do that this means you own it), it only allows users to SSH into the container the host machine spins, make sense ?
hmm can you share the log of the Task? (the clearml-session created Task)
BTW: the agent will resolve pytorch based on the installed CUDA version.
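If you need to pin it, I believe you can set it explicitly in the agent's clearml.conf (the version value here is just an example):
agent {
    # force the CUDA version the agent resolves packages against
    cuda_version: "11.2"
}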
it overwrites the previous run?
It will overwrite the previous run if:
- it is under 72h from the last execution
- no artifact/model was created
You can control it with reuse_last_task_id=False passed to Task.init
Task name itself is not unique in the system, think of it as a short description
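A minimal sketch (project/task names are placeholders):
from clearml import Task
# Force a brand-new Task instead of reusing the previous run
task = Task.init(project_name="examples", task_name="my_experiment", reuse_last_task_id=False)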
Make sense ?
Hi TrickySheep9
Long story short, clearml-session fully supports k8s (using k8s glue)
The --remote-gateway alongside ports mode will basically allow you to set up a k8s service so that every session registers with a specific port; k8s does the ingress for you and routes the SSH connection to the pod itself, and everything else is tunneled over the original SSH connection.
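Something like this on the client side (queue name and gateway address are hypothetical):
clearml-session --queue k8s_sessions --remote-gateway gateway.example.com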
Make sense ?