I am trying to see if the user can submit a list of resource requirements (e.g. 4 GPUs, 12 cores, 100 GB disk space) when queuing a task, and have the agents pick up those tasks only if they have the requested resources. With this, the user need not think about which queue to send the task to; the users just state what they need and the agents do the scheduling for them.
Can I assume we are talking Kubernetes under the hood for the resource allocation?
EmbarrassedSpider34
Sync_folder and upload several times along the code, and then...
Do notice they overwrite one another...
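A minimal sketch of what that means (dataset/folder names are hypothetical):
from clearml import Dataset

ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.sync_folder(local_path="./data_v1")  # dataset content now mirrors data_v1
ds.sync_folder(local_path="./data_v2")  # overwrites: content now mirrors data_v2 only
ds.upload()
ds.finalize()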
Is this information stored anywhere or do I need to explicitly log this data somehow?
On the creating Task, alongside all the other reports.
Basically each model stores its creating Task (Task ID); using the Task ID you can query all the metrics reported by the task.
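A minimal sketch (the model ID is a placeholder):
from clearml import Model, Task

model = Model(model_id="<model-id>")       # placeholder model ID
task = Task.get_task(task_id=model.task)   # the Task that created this model
scalars = task.get_reported_scalars()      # all scalars reported by that task
print(list(scalars.keys()))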
Is the agent itself registered on the clearml-server (a.k.a. can you see it in the UI?)
Hi GreasyPenguin14
Could you tell me what the differences are and why we should use ClearML data?
The first difference is in the approach itself: DVC ties the data with the code (i.e. the git repo), whereas we (ClearML, but not just us) think data should be abstracted from the code base and become a standalone argument, allowing users to build/execute against different datasets/versions. ClearML Data becomes part of the workflow as it is visible from the UI, including the abili...
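As a rough sketch of that decoupling (names/paths are hypothetical):
from clearml import Dataset

# create a standalone dataset version, independent of any git repo
ds = Dataset.create(dataset_name="my_dataset", dataset_project="datasets")
ds.add_files(path="./data")
ds.upload()
ds.finalize()

# any code base can later fetch it by name/project, regardless of git state
local_copy = Dataset.get(dataset_name="my_dataset", dataset_project="datasets").get_local_copy()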
ProudMosquito87 I think this is what you are looking for: https://github.com/allegroai/trains-agent/blob/master/docs/trains.conf#L101
I suppose the same would need to be done for any client PC running clearml such that you are submitting dataset upload jobs?
Correct
That is, the dataset is perhaps local to my laptop, or on a development VM that is not in the clearml system, but from there I want to submit a copy of a dataset; then I would need to configure the storage section in the same way as well?
Correct
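For reference, a minimal sketch of that storage section in clearml.conf (credentials/region are placeholders):
sdk {
    aws {
        s3 {
            key: "ACCESS_KEY"        # placeholder
            secret: "SECRET_KEY"     # placeholder
            region: "us-east-1"
        }
    }
}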
Hi CleanPigeon16
You need to pass the private repository docker credentials to the AWS instance; I would use the custom bash script option of the AWS autoscaler to create the docker credentials file.
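For example, something along these lines in that bash script (registry and variable names are hypothetical, assuming the credentials are made available on the instance):
echo "${DOCKER_PASSWORD}" | docker login -u "${DOCKER_USERNAME}" --password-stdin registry.example.com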
GreasyPenguin14 thank you! That will make our life a lot easier 🙂
SubstantialElk6 on the client side?
I can definitely feel you!
(I think the implementation is not trivial: metrics data size is collected and stored as a cumulative value on the account, so going over it per Task is actually quite taxing for the backend. Maybe it should be an async request? Like "get me a list of the X largest Tasks"? How would the UI present it? Fyi, keeping some sort of bookkeeping per Task is not trivial either, hence the main issue)
BeefyCow3 see this https://allegroai-trains.slack.com/archives/CTK20V944/p1593077204051100 :)
Thanks for the ping ConvolutedChicken69, I missed it 🙂
from what i see in the docs it's only for Jupyter / VS Code, i didn't see anything about pycharm
PyCharm is basically SSH, which is supported 🙂
(Maybe we should mention it in the docs?)
The issue I want to avoid is aborting of the dataset task that these regular tasks update.
HelpfulHare30 could you post pseudo code of the dataset update?
(My point is, I'm not sure the Dataset actually supports updating in place, as it needs to re-upload the previous delta snapshot.) Wouldn't it be easier to add another child dataset and then use dataset.squash (like one would do in git)?
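A minimal sketch, where the names and the parent dataset ID are placeholders:
from clearml import Dataset

# add a child dataset holding only the new delta
child = Dataset.create(
    dataset_name="my_dataset_delta",
    dataset_project="datasets",
    parent_datasets=["<parent-dataset-id>"],
)
child.add_files(path="./new_data")
child.upload()
child.finalize()

# squash the lineage into a single standalone dataset (like squashing commits in git)
squashed = Dataset.squash(dataset_name="my_dataset_squashed", dataset_ids=[child.id])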
clearml-agent daemon --detached --queue manual_jobs automated_jobs --docker --gpus 0
If the user running this command can run "docker run", then you should be fine.
DeliciousBluewhale87 could you restart the pod and ssh to the host, and make sure the folder /opt/clearml/agent exists and there is no *.conf file in it?
Hi DangerousDragonfly8
You mean you want to trigger something when users archive a Task?
WorriedParrot51 I now see ...
Two solutions that I can quickly think of:
1. In the code add:
import sys
sys.path.append('./my_sub_module')
Assuming you always have to add the sub-directories to make the code work, and assuming they are part of the repository, this is probably the stable solution.
2. In the UI, in the Docker base image, add -e PYTHONPATH=/folder
or from code (which is exactly what you did), a cleaner interface:
task.set_base_docker("nvidia/cuda -e PYTHONPATH=/folder")
Hi GrievingTurkey78
task.models['output'][-1] should return the last stored model.
What do you have under task.models['output'][-1].url?
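i.e. a quick sketch to check it (the task ID is a placeholder):
from clearml import Task

task = Task.get_task(task_id="<task-id>")  # placeholder task ID
last_model = task.models["output"][-1]     # last stored output model
print(last_model.url)                      # remote location of the weights file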
Hi FloppyDeer99
What is the meaning of "no real scheduling"?
I think the meaning is that from the moment a k8s job is created, k8s is in charge of actually spinning the container. Since k8s has no real priority/order, the scheduling order is not guaranteed from this point.
The idea of the clearml-k8s-glue is that the glue will launch a job on the k8s cluster only if it is sure there are enough resources to actually spin the job now (as opposed to sometime in the future), this mea...
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers; you just need to make sure the machine has enough memory to support it.
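On a docker-compose deployment this is usually an environment change on the apiserver service; a sketch, assuming your server version supports the CLEARML_USE_GUNICORN / CLEARML_GUNICORN_WORKERS variables (please verify against your version):
services:
  apiserver:
    environment:
      CLEARML_USE_GUNICORN: "1"      # assumption: run the apiserver under gunicorn
      CLEARML_GUNICORN_WORKERS: "8"  # assumption: number of worker processes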
can you tell me what the serving example is in terms of the explanation above and what the triton serving engine is,
Great idea!
This line actually creates the control Task (2):
clearml-serving triton --project "serving" --name "serving example"
This line configures the control Task (the idea is that you can do that even when the control Task is already running, but in this case it is still in draft mode).
Notice the actual model serving configuration is already stored on the crea...
Hi JealousParrot68
do tasks that are created through create_function_task run the entry_script again instead of just the pure function
Basically they will run the code until the create_function_task call, but never after it. We are working on adding a decorator to a function, making it a "standalone" script; is this what you actually need?
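A minimal sketch of that behavior (project/function names are arbitrary):
from clearml import Task

task = Task.init(project_name="examples", task_name="main")

def process(a, b):
    return a + b

# everything above runs as usual; the created function task executes only process()
func_task = task.create_function_task(func=process, func_name="process_task", a=1, b=2)
# code after this call does not run inside the created function task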
Could it be the credentials are actually incorrect? Because it seems like you can access the server? (I assume you were able to browse to it and generate credentials, right?)
🙂 Let me know if it solved the issue 🙂
Hi @<1663354518726774784:profile|CrookedSeal85>
I am trying to optimize storage on my ClearML file server when doing a lot of experiments.
This is not straightforward; you will need to get a list of all the events via
None
filter on image events
and then delete the URL you are getting via the StorageManager.
But to be honest, why not just direct it to S3 or something like that ?
@<1542316991337992192:profile|AverageMoth57> it sounds like you should use SSH authentication for the agent; just set force_git_ssh_protocol: true
None
And make sure you have the SSH keys on the agent's machine.
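i.e. in the agent's clearml.conf, a minimal sketch:
agent {
    force_git_ssh_protocol: true
}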
GrievingTurkey78 notice that when enqueuing an aborted Task, the agent will not delete the previously reported metrics/logs.