You can always bake the relevant configurations into the Docker image itself as well. From my understanding, a new version should be released towards the end of the month, and with it the ability to run the autoscaler without requiring a Docker image.
Hi @<1840924578885406720:profile|VictoriousFish46> , how are you uploading the dataset? Did you set output_uri? What is set as the files server in the api section of your clearml.conf?
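For reference, the files server is configured in the api section of clearml.conf; a typical entry looks like the sketch below (the URLs are an assumption for a default local deployment - substitute your own server addresses):

```
api {
    # assumption: default ports for a local ClearML server deployment
    web_server: http://localhost:8080
    api_server: http://localhost:8008
    files_server: http://localhost:8081
}
```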
That or a private docker registry
Hi @<1815919815257231360:profile|UpsetFrog68> , can you provide a standalone code snippet that would reproduce this occasional behaviour?
UnevenDolphin73 , can you please provide a screenshot of the window, the message, and the URL that's visible?
UnevenDolphin73 , sorry for the delay 🙂
Please go to the profile page, hit F12 and do CTRL+F5
In the 'Network' tab there should be a call to server.info. Can you please copy-paste the response here?
Also, can you paste the contents of your docker-compose file here?
Yes, this will cause the code to run inside the container.
if so it won't work as my environment is on the host linux
Not sure I understand this part, can you please elaborate?
Hi @<1708653001188577280:profile|QuaintOwl32> , you can set some default image to use. My default for most jobs is nvcr.io/nvidia/pytorch:23.03-py3
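If you want agents to fall back to a default image automatically, it can also be set in the agent section of clearml.conf. A sketch (using the same image mentioned above):

```
agent {
    default_docker {
        # image used when a task doesn't specify its own
        image: "nvcr.io/nvidia/pytorch:23.03-py3"
    }
}
```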
CrookedWalrus33 , you can set output_uri=True in Task.init. This should upload the models to the fileserver, since by default models are only saved locally.
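The machine-wide equivalent of passing output_uri=True to Task.init is the default_output_uri setting in clearml.conf (the URL below assumes a default local fileserver - replace it with your own storage target):

```
sdk {
    development {
        # upload model artifacts here instead of keeping them local
        default_output_uri: "http://localhost:8081"
    }
}
```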
Now try logging in
Hi @<1710827340621156352:profile|HungryFrog27> , what seems to be the issue?
@<1570583227918192640:profile|FloppySwallow46>
It looks like you're running on different machines and the file your code is looking for is not available on the other machine
Can you add such an attempt and the outputs please?
Hi @<1724960468822396928:profile|CumbersomeSealion22> , what was the structure that worked previously for you and what is the new structure?
DepressedFish57 , Hi 🙂
What do you mean by downloading a previous part of the dataset? get_local_copy fetches the entire dataset if I'm not mistaken. Am I missing something?
Hi @<1858319200146165760:profile|PoisedDeer30> , can you provide a standalone snippet that reproduces this behaviour?
Also do you have a log of this? From where did you delete it?
Hi @<1552101447716311040:profile|SteadySeahorse58> , if the experiment is still in pending mode it means that it wasn't picked up by any worker. Please note that in a pipeline, the controller usually runs on the services queue, while the steps can each run on different queues - depending on what you set.
Hi JumpyRabbit71 , I think each step has its own requirements
Hi @<1731483438642368512:profile|LoosePigeon2> , you need to set the following:
sdk: {
  development: {
    store_code_diff_from_remote: false
    store_uncommitted_code_diff: false
  }
}
On the machine you're running your pipeline from
SmallDeer34 , great, thanks for the info 🙂
Hi @<1631102016807768064:profile|ZanySealion18> , I would suggest using the web UI as a reference. Open developer tools and check what is being sent/received when looking at the workers/queues pages
Can you please open developer tools (F12) and see what is returned in network when you try to do this?
Hi @<1639799308809146368:profile|TritePigeon86> , if I understand you correctly, you're basically looking for a switch in pipelines (per step) to say "even if step failed, continue the pipeline"?
Hi @<1533159639040921600:profile|JoyousReindeer30> , the pipeline controller is currently pending. I am guessing it is enqueued into the services queue. You would need to run an agent on the services queue for the pipeline to start executing 🙂
UnevenDolphin73 , if you're launching the Autoscaler through the apps, you can also add bash init script or additional configs - that's another way to inject env vars 🙂
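For example, the bash init script field in the Autoscaler app runs on the instance before the agent starts, so it can be used to export env vars (the variable values below are placeholders):

```shell
# Runs on the autoscaled instance before the agent starts.
# Values are placeholders - substitute your own credentials.
export CLEARML_AGENT_GIT_USER="my-git-user"
export CLEARML_AGENT_GIT_PASS="my-git-token"
```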
Can you please open a GitHub issue so we can follow up on this?