Hi EnchantingWorm39
Great question!
Regarding data management, I know the enterprise edition has full support for unstructured data, and we plan to have a solution for structured data as part of the open source soon (hopefully within a month).
Regarding model serving, I know you can integrate with TFServing or seldon with very little effort (usually the challenge is creating triggers etc., but in most cases this is custom code anyhow 🙂 )
I do not have experience with Cortex/B...
Seems like something is not working with the server, i.e. it cannot connect with one of the dockers.
May I suggest carefully going through all the steps here, to make sure nothing was missed:
https://github.com/allegroai/trains-server/blob/master/docs/install_linux_mac.md
Especially number (4)
EnchantingWorm39 you have great timing ;)
Hi SmoothSheep78
Do you need to import the previous state of the trains-server, or are you starting from scratch?
JitteryCoyote63 see here: https://stackoverflow.com/questions/55385900/pip3-setup-py-install-requires-pep-508-git-url-for-private-repo
Bottom line, you have to add package@ before the link. But if you do that and the package is already installed, pip will not install it from the git repo; this is an issue with pip. I think that since the agent installs everything from scratch, it should work for you. Wdyt?
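To illustrate the package@ form, here is a minimal sketch that builds a PEP 508 direct reference string for a git-hosted dependency. The helper name, the package name and the repository URL are all made up for illustration; only the `name @ git+URL` shape comes from PEP 508.

```python
def git_requirement(package, repo_url, ref=None):
    """Build a PEP 508 direct reference: '<package> @ git+<url>[@<ref>]'."""
    url = repo_url if repo_url.startswith("git+") else "git+" + repo_url
    if ref:
        url = f"{url}@{ref}"
    return f"{package} @ {url}"


# Hypothetical private repository, for illustration only.
requirement = git_requirement(
    "internal_module",
    "https://github.com/example-org/internal_module.git",
    ref="v1.2.3",
)
print(requirement)
```

A string like this is what would go into install_requires or a requirements file.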
BattyLion34 are you saying you do not have the "APP CREDENTIALS" section in the profile page?
Sure thing 🙂
BTW: ReassuredTiger98 this is definitely an interesting use case, and I think you can actually write some code to solve it if you like.
Basically let's follow up on your setup:
Machine X: agent listening to queues A, B_machine_a *notice we have two agents here
Machine Y: agent listening to queue B_machine_b
Now we (the users) will push our jobs into queues A and B
Now we have a service that does the following:
see if we have a job in queue B
check if machine Y is working...
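The description of the service is cut off here, so the following is only a rough sketch of how such a routing loop might look, using plain in-memory queues instead of real agent queues. The routing rule (send the job to machine Y when it is idle, otherwise to machine X) is a guess and not something stated in the thread, and the function name is hypothetical.

```python
from collections import deque


def route_next_job(queues, machine_y_busy):
    """Move one job out of queue "B" into a per-machine queue.

    Assumed rule (not from the thread): prefer machine Y when it is idle,
    otherwise fall back to machine X's queue.
    """
    if not queues["B"]:
        return None  # nothing waiting in queue B
    target = "B_machine_a" if machine_y_busy else "B_machine_b"
    job = queues["B"].popleft()
    queues[target].append(job)
    return target


queues = {name: deque() for name in ("A", "B", "B_machine_a", "B_machine_b")}
queues["B"].extend(["job-1", "job-2"])

first = route_next_job(queues, machine_y_busy=False)  # machine Y is idle
second = route_next_job(queues, machine_y_busy=True)  # machine Y is now busy
print(first, second)
```

In a real setup the queue contents and the machine state would come from the server rather than from local variables.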
That would be great! Might have to use 2>/dev/null in some of my bash scripts
Feel free to test and PR :)
One other question regarding connecting. We have setup sshd inside the docker image we are using.
Actually the remote session opens port 10022 on the host machine (so it does not collide with the default ssh port)
It actually runs an additional sshd inside the docker, setting its port.
And the clearml-session will ssh directly into the container sshd...
Hi PerplexedGoat65
it appears, in a practical sense, this means to mount the second drive, and then bind them in ClearML’s configuration
Yes, the entire data folder (the reason is, if you lose it, you lose all the server storage / artifacts)
Also, thinking about Docker and slower access speed for Docker mounts and such,
If the host OS is linux, you have nothing to worry about, speed will be the same.
SubstantialElk6 could you try with the latest (just released)?
pip install clearml-agent==0.17.2
Then if possible, could you attach the full log of the agent's execution (Task->results->Console)
Hmm ConvincingSwan15
WARNING - Could not find requested hyper-parameters ['Args/patch_size', 'Args/nb_conv', 'Args/nb_fmaps', 'Args/epochs'] on base task
Is this correct? Can you see these arguments on the original Task in the UI (i.e. Args section, parameter epochs)?
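One way to think about the warning: the requested names are compared against the parameters stored on the base Task. The sketch below shows that comparison in plain Python. It is not the library's actual implementation, the helper name is hypothetical, and the base task parameters are invented for the example.

```python
def missing_hyper_parameters(requested, base_task_params):
    """Return the requested names that are absent from the base task, in order."""
    return [name for name in requested if name not in base_task_params]


requested = ["Args/patch_size", "Args/nb_conv", "Args/nb_fmaps", "Args/epochs"]
# Hypothetical parameters as they might appear on the base Task.
base_task_params = {"Args/epochs": "10", "Args/batch_size": "32"}

missing = missing_hyper_parameters(requested, base_task_params)
print(missing)
```

If every requested name comes back as missing, the section prefix (here Args/) or the spelling of the argument names is a likely place to look.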
DeterminedToad86
So based on the log it seems the agent is installing:
torch from https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl
and torchvision from https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl
See in the log:
Warning, could not locate PyTorch torch==1.6.0 matching CUDA version 110, best candidate 1.7.0
But torchvision is downloaded from the cuda 11 folder...
I...
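To make the mismatch between the two wheel links easy to spot, here is a small sketch that pulls the CUDA version out of each URL. It is only a heuristic based on how these two URLs happen to be laid out (cu102 and cuda-11-0), and the function name is hypothetical.

```python
import re


def cuda_tag(url):
    """Return the CUDA version digits found in a wheel URL, or None."""
    match = re.search(r"cu(?:da-?)?(\d+)(?:-(\d+))?", url)
    if match is None:
        return None
    return "".join(part for part in match.groups() if part)


torch_url = "https://download.pytorch.org/whl/cu102/torch-1.6.0-cp36-cp36m-linux_x86_64.whl"
vision_url = (
    "https://torchvision-build.s3-us-west-2.amazonaws.com/1.6.0/gpu/cuda-11-0/"
    "torchvision-0.7.0a0%2B78ed10c-cp36-cp36m-manylinux1_x86_64.whl"
)

tags = (cuda_tag(torch_url), cuda_tag(vision_url))
print(tags, "mismatch" if tags[0] != tags[1] else "match")
```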
I was able to successfully enqueue the task but only entrypoint script is visible to it and nothing else.
So you passed a repository link and it did not show on the Task?
What exactly is missing, and how was the Task created?
I basically moved the Task.init() call below the imports
Okay that is odd, can you copy-paste the before/after of the import, so we can fix that?!
i had a misconception that the conf comes from the machine triggering the pipeline
Sorry, this one :)
Well, in that case, just change the order and it should solve it (I'll make sure we have that as the default):
conda_channels: ["pytorch", "conda-forge", "defaults", ]
It should solve the issue 🙂
Hi @<1726410010763726848:profile|DistinctToad76>
Why not just report scalars? You can use the x-axis as "iterations" if this is running in real time to collect the prompts.
If this is a summary then just report a scatter plot (you can also specify the names of the axis and the series)
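A minimal sketch of both options, assuming the clearml Logger methods report_scalar and report_scatter2d with the keyword arguments shown. The titles, series names and data points are made up, and RecordingLogger is only a stand-in so the example runs without a server.

```python
def report_prompt_metrics(logger, scores, summary_points):
    """Report per-iteration scalars, then one summary scatter plot."""
    for iteration, score in enumerate(scores):
        logger.report_scalar(
            title="prompts", series="score", value=score, iteration=iteration
        )
    logger.report_scatter2d(
        title="prompt summary",
        series="length vs score",
        scatter=summary_points,
        iteration=0,
        xaxis="prompt length",  # axis names are optional
        yaxis="score",
        mode="markers",
    )


class RecordingLogger:
    """Stand-in that records calls, so the sketch runs without ClearML."""

    def __init__(self):
        self.calls = []

    def report_scalar(self, **kwargs):
        self.calls.append(("scalar", kwargs))

    def report_scatter2d(self, **kwargs):
        self.calls.append(("scatter2d", kwargs))


logger = RecordingLogger()
report_prompt_metrics(
    logger,
    scores=[0.2, 0.5, 0.9],
    summary_points=[(12, 0.2), (40, 0.5), (75, 0.9)],
)
print(len(logger.calls))
```

In a real run you would pass the Task's logger (for example the one returned by task.get_logger()) instead of the stand-in.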
1.
One reason I don't like using the configuration section is that it makes debugging much much harder.
Debugging? Please explain how it relates to the configuration and presentation (i.e. preview)
2.
Yes in theory, but in your case it will not change things, unless these "configurations" are copied on any Task (which is just storage, otherwise no real harm)
3.
I was thinking "zip" file that the Task creates and uploads, and a new configuration type, say "external/zip" , and in the c...
hmm I assume the reason is the cookie / storage changed?
There is some overhead, but it should be negligible.
poetry stores git related data in ... you get an internal package we have with its version, but no git reference, i.e. internal_module==1.2.3 instead of internal_module @H4dr1en
This seems like a bug with poetry (and I think I have run into this one), worth reporting it, no?
Since the error says network error, is it possible because I'm in Taiwan? Like downloading from Asia leads to this kind of issue
Can you download it from the browser? (I mean the file size after download, is it 400mb?)
Although it's still really weird how it was failing silently
Totally agree. I think the main issue was that the agent had the correct configuration, but the container / env the agent was spinning up was missing it.
I'll double check how come it did not print anything
Basically you should not use Task.create to log the current execution. It is used to create a Task externally and then enqueue it for remote execution. Make sense?
ShaggyHare67 notice that the services queue is designed to run CPU based tasks like monitoring etc.
For the actual training you need to run your trains-agent on a GPU machine.
Did you run trains-agent init? It will walk you through the configuration (git user/pass included).
If you want to manually add them, you can see an example of the configuration file in the link below.
You can find it at ~/trains.conf
https://github.com/allegroai/trains-agent/blob/master/docs/tr...
User/pass should be enough,
Could it be the specific commit ID is not pushed?
I think your "files_server" is misconfigured somewhere, I cannot explain how you ended up with this broken link...
Can you check the clearml.conf on the machines, or the env vars?
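As a quick way to run that check on each machine, here is a small sketch that gathers the two places a files_server value could come from. It assumes the CLEARML_FILES_HOST environment variable and the default ~/clearml.conf location. The function name is hypothetical, and the example passes a made-up environment rather than the real one.

```python
import os
from pathlib import Path


def files_server_sources(environ=None, conf_path=None):
    """Collect the places a files_server value could come from."""
    environ = os.environ if environ is None else environ
    conf_path = Path.home() / "clearml.conf" if conf_path is None else Path(conf_path)
    return {
        "CLEARML_FILES_HOST": environ.get("CLEARML_FILES_HOST"),
        "conf_path": str(conf_path),
        "conf_exists": conf_path.is_file(),
    }


# Example with an explicit (made-up) environment instead of the real one.
report = files_server_sources(
    environ={"CLEARML_FILES_HOST": "http://localhost:8081"},
    conf_path="/nonexistent/clearml.conf",
)
print(report)
```

Calling files_server_sources() with no arguments reads the real environment and the default path. An environment variable that is set would be the first thing to compare against the files_server entry in the conf file.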