Since I can't use the torchrun command (from my tests, clearml won't use it on the clearml-agent), I went with the
@<1556450111259676672:profile|PlainSeaurchin97> did you check this example?
None
Run ifconfig
No worries, I would love for us to come up with a nice solution 🙂
Hi MelancholyElk85
However, when I clone the pipeline from the web UI and launch it once again, it works. Is there a way to bypass this?
In both cases, are you seeing a different behavior on the same machine running the agent (i.e. cloning from the UI vs. from code)?
According to you the VPN shouldn't be a problem right?
Correct, as long as all parties are on the same VPN it should work. All the connections are always HTTP, so it is basically trivial communication
Yes they are supposed to be routed there by pytorch dist
(and the TB logs are on the master only anyhow)
Could it be that the credentials are actually incorrect? It seems like you can access the server (I assume you were able to browse to it and generate credentials, right?)
@<1558624430622511104:profile|PanickyBee11> how are you launching the code on multiple machines ?
are they all reporting to the same Task?
Hi PanickyMoth78
You mean like another Task? or maybe Slack message?
I would recommend reading this blog post, it should give you a glimpse of what can be built 🙂
https://medium.com/pytorch/how-trigo-built-a-scalable-ai-development-deployment-pipeline-for-frictionless-retail-b583d25d0dd
I thought this is the issue on the thread you linked, did I miss something ?
PompousParrot44 Enterprise licensing pricing is usually custom tailored to the size of the company and based on usage. If you are interested, feel free to leave details in the "contact us" form on the website, and someone from sales will contact you shortly after.
but the debug samples and monitored performance metric show a different count
Hmm could you expand on what you are getting, and what you are expecting to get
If you are using user/pass for the git (i.e. not ssh key) we are not passing it to the pip install (and come to think about it, we probably should?!)
Hey LethalDolphin75 , when it works, could you PR it?
That is a bit odd, but SSH keys have to have specific chmod flags for them to work (security issues)
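To make the chmod point concrete, here is a minimal sketch, assuming a POSIX system, of checking that a key file is not accessible to group or others and restricting it to owner read/write (the equivalent of chmod 600). The helper names and the throwaway file are hypothetical; point the helpers at your real key path instead.

```python
import os
import stat
import tempfile


def key_permissions_ok(path: str) -> bool:
    """Return True if the file grants no access to group or others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0


def restrict_key(path: str) -> None:
    """Make the file readable and writable by its owner only (chmod 600)."""
    os.chmod(path, stat.S_IRUSR | stat.S_IWUSR)


# Demonstrate on a throwaway file standing in for a private key.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    demo_key = handle.name

os.chmod(demo_key, 0o644)  # deliberately too permissive for a private key
was_ok = key_permissions_ok(demo_key)
restrict_key(demo_key)
now_ok = key_permissions_ok(demo_key)
os.remove(demo_key)
```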
What was the error ?
I'm having another problem now because I am using the OptunaOptimizer.
Hmm let me check a sec
Hi @<1566596960691949568:profile|UpsetWalrus59>
just wondering - shouldn't the job still work if I didn't push the commit yet
How would that work? It would not know which commit to take. It would also fail on git diff apply, no?
Hi @<1556812486840160256:profile|SuccessfulRaven86>
Every clearml-serving session (you can have multiple different "sessions") is assumed to be homogeneous. This means it will serve the same models on as many nodes as possible, supporting multiple models per pod.
In your example I think the easiest is to create two serving sessions one with a node selector for the 24GB node and another for the 16GB node, wdyt?
Hi @<1541954607595393024:profile|BattyCrocodile47>
Did you check None ?
You are not supposed to do 2,3,4
After (1) you should just do
ssh root@localhost -p 8022
and provide the password that is written in the CLI
(Note: pass --public-ip if your remote machine is using a public IP you can access)
SmarmySeaurchin8 could you test with the latest RC?
pip install clearml==0.17.5rc2
Yeah, but I still need to update the links in the clearml server
yes... how many are we talking about here?
As a result, I need to do something which copies the files (e.g. cp -r or StorageManager.upload_folder('b', 'a'))
but this is expensive
You are saying the copy is just wasteful (but you do have the files locally)?
Yes, I mean use the helm chart to deploy the server, but manually deploy the agent glue.
wdyt?
Hi @<1535069219354316800:profile|PerplexedRaccoon19>
On debugging, it looks like indices are corrupt.
ishhhhh, any chance you have a backup?
My bad, there is a mixture in terms.
"configuration object" is just a dictionary (or plain text) stored on the Task itself.
It has no file representation (well, you could get it dumped to a file, but it is actually stored as a blob of text on the Task itself, at the backend side)
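As a rough illustration of "a dictionary stored as a blob of text, dumped to a file only when needed", here is a standard-library sketch. It does not call ClearML; the JSON encoding, the example keys, and the temporary file are all assumptions made for the example.

```python
import json
import os
import tempfile

# A configuration object: just a dictionary (hypothetical example values).
config = {"learning_rate": 0.001, "batch_size": 32}

# What gets stored is a blob of text, not a file.
config_text = json.dumps(config, indent=2, sort_keys=True)

# If some code insists on a file path, dump the text to a temporary file on demand.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as handle:
    handle.write(config_text)
    config_path = handle.name

# Reading the file back gives the same dictionary.
with open(config_path) as handle:
    restored = json.load(handle)
os.remove(config_path)
```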
logger.report_scalar("loss-train", "train", iteration=0, value=100)
logger.report_scalar("loss-test", "test", iteration=0, value=200)
Notice that the title of the graph is its unique id, so if you send scalars with the same "title" they will show on the same graph
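The grouping rule can be pictured with a small stand-alone model. This is only a sketch of the behaviour described in the message, not ClearML's implementation; record_scalar and the graphs dictionary are hypothetical names.

```python
from collections import defaultdict

# title -> series -> list of (iteration, value) points
graphs = defaultdict(lambda: defaultdict(list))


def record_scalar(title, series, iteration, value):
    """Hypothetical stand-in for a scalar report: file the point under its title."""
    graphs[title][series].append((iteration, value))


# Different titles: two separate graphs, one series each.
record_scalar("loss-train", "train", iteration=0, value=100)
record_scalar("loss-test", "test", iteration=0, value=200)

# Same title, different series: one graph holding two lines.
record_scalar("loss", "train", iteration=0, value=100)
record_scalar("loss", "test", iteration=0, value=200)

graph_titles = sorted(graphs)
series_on_loss = sorted(graphs["loss"])
```

So to compare train and test loss on one plot, keep the title identical and vary only the series name.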
Hi @<1562973095227035648:profile|ThoughtfulOctopus83>
The host should be just the host name, no https prefix, I'm assuming that's the issue
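If the value is being copied from a browser address bar, a small helper can strip it down to the bare host name. This is a sketch using only the standard library; bare_host and the example.com addresses are hypothetical, and which setting the host goes into depends on your configuration.

```python
from urllib.parse import urlsplit


def bare_host(value: str) -> str:
    """Return only the host name from a value that may carry a scheme, port, or path."""
    candidate = value.strip()
    # urlsplit only recognises the host when a "//" marker is present,
    # so add one for inputs that are already scheme-less.
    if "://" not in candidate:
        candidate = "//" + candidate
    return urlsplit(candidate).hostname or ""


with_scheme = bare_host("https://files.example.com")
with_port_and_path = bare_host("files.example.com:8081/path")
```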