As I also noticed that uploads are sometimes slow, and I see here max_connections=2
Makes sense to me, please go ahead and add that as well (basically the same thing on _AzureBlobServiceStorageDriver.upload_object, plus an additional variable on the AzureContainerConfigurations class).
Could you PR a tested draft? We will be able to take it from there.
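Roughly something like this (just a sketch, not the actual ClearML internals — the max_connections plumbing here is an assumption; azure-storage-blob v12 exposes upload parallelism as max_concurrency):

```python
from azure.storage.blob import ContainerClient

def upload_object(container_client: ContainerClient, blob_name: str,
                  local_path: str, max_connections: int = 2) -> None:
    # max_connections would be read from the matching
    # AzureContainerConfigurations entry (hypothetical wiring)
    with open(local_path, "rb") as f:
        container_client.upload_blob(
            name=blob_name,
            data=f,
            overwrite=True,
            max_concurrency=max_connections,  # parallel block uploads
        )
```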
JuicyFox94
NICE!!! this is exactly what I had in mind.
BTW: you do not need to put the default values there; it reads the defaults from the package itself (trains-agent/trains) and uses the conf file as overrides, so this section only needs to contain the parts that matter to you (like cache location, credentials, etc.)
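For example, a minimal conf file could contain just the overrides (values here are placeholders):

```
api {
    credentials {
        access_key: "KEY"
        secret_key: "SECRET"
    }
}
sdk {
    storage.cache.default_base_path: "~/.trains/cache"
}
```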
Local changes are applied before installing requirements, right?
correct
Failed to initialize NVML: Unknown Error
Yeah, this is a driver issue. I think you need to check whether the VM image's drivers match the GPU on that machine.
(Caused by SSLError(SSLError(1, '[SSL: DECRYPTION_FAILED_OR_BAD_RECORD_MAC] decryption failed or bad record mac
Where is the code (agent) running? A GCP instance? Your machine?
If this is how the repo links look like, do not set anything in the clearml.conf
It "should" use the ssh for the ssh links, and http for the http links.
Hi FunnyTurkey96
Which pip version are you using? Basically pip changed the dependency resolver after 20.1.
Change https://github.com/allegroai/clearml-agent/blob/aede6f4bac71c8fc56e7cf982318a48527953a3c/docs/clearml.conf#L57 to pip_version: "<20.2"
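i.e. in your clearml.conf:

```
agent {
    package_manager {
        # pin pip below 20.2 to keep the old dependency resolver
        pip_version: "<20.2"
    }
}
```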
See if that helps
Hi @<1547028074090991616:profile|ShaggySwan64>
I'm guessing just copying the data folder with rsync is not the most robust way to do that since there can be writes into mongodb etc.
Yep
Does anyone have experience with something like that?
Basically you should just back up the 3 DBs (MongoDB, Redis, Elasticsearch), each one based on its own backup workflow. Then just rsync the fileserver data & configuration.
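As a rough sketch (container names assume a default clearml-server docker-compose deployment, and the Elasticsearch snapshot repository "backups" must already be registered — adjust to your setup):

```python
import datetime
import subprocess

import requests

stamp = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")

# MongoDB: consistent dump from inside the mongo container
subprocess.run(
    ["docker", "exec", "clearml-mongo", "mongodump", "--out", f"/backup/mongo-{stamp}"],
    check=True,
)

# Redis: trigger a point-in-time snapshot (writes dump.rdb)
subprocess.run(["docker", "exec", "clearml-redis", "redis-cli", "BGSAVE"], check=True)

# Elasticsearch: use the snapshot API (repo must be registered beforehand)
requests.put(
    f"http://localhost:9200/_snapshot/backups/snap-{stamp}",
    params={"wait_for_completion": "true"},
).raise_for_status()

# Fileserver data + configuration: plain rsync is fine here
subprocess.run(
    ["rsync", "-a", "/opt/clearml/data/fileserver/", f"/backups/fileserver-{stamp}/"],
    check=True,
)
```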
It's the safest way to run multiple processes and make sure they are cleaned up afterwards ...
I think I'm missing the connection between the hash-ids and the txt file; in other words, why does the txt file contain full paths rather than relative paths?
Oh what if the script is in the container already?
Hmm, the idea of clearml is that the container is a "base environment" and the code is "injected"; this makes it easy to reuse the container.
The easiest way is to add an "entry point" script that just calls the existing script inside the container.
You can have this initial Python script on your local machine; when you call clearml-task
it will upload the local "entry point" script directly to the Task, and then on the remote machine...
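e.g. a tiny entry_point.py kept on your local machine (the in-container path /opt/app/train.py is just a placeholder for wherever your script lives in the image):

```python
import subprocess
import sys

# Call the script that is already baked into the container,
# forwarding any command-line arguments (path is a placeholder)
subprocess.run([sys.executable, "/opt/app/train.py", *sys.argv[1:]], check=True)
```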
@<1569858449813016576:profile|JumpyRaven4> fyi clearml-serving was synced 🤞
This smells like a driver/image issue on the instance VM
What are you getting if you add this inside your code?

```python
import os
os.system('nvidia-smi')
```
Hi @<1720249416255803392:profile|IdealMole15>
I'm assuming you mean on a remote machine with clearml-agent running ?
If you do, then you either use clearml-task to create a Task (Job) and specify the container and script, or click "Create New Experiment" in the UI, fill in the git repo / script, and specify the docker image.
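For example (the project / repo / image names are placeholders):

```
clearml-task --project examples --name my-job \
  --repo https://github.com/user/repo.git --branch main \
  --script train.py \
  --docker nvidia/cuda:11.8.0-runtime-ubuntu22.04 \
  --queue default
```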
Make sense?
Is the fact that clearml-agent needs to be installed from the system Python mentioned anywhere in the docs? If not, I suggest it gets added.
You are right, I will check and fix if not 🙂
Thank you so much for helping.
My pleasure
BTW: how did it get there?
CooperativeSealion8 let me know if you managed to solve the issue; also feel free to send the entire trains-server log. I'm assuming one of the docker containers failed to boot...
ComfortableShark77 are you saying you need "transformers" in the serving container?

```
CLEARML_EXTRA_PYTHON_PACKAGES: "transformers==x.y"
```
https://github.com/allegroai/clearml-serving/blob/6005e238cac6f7fa7406d7276a5662791ccc6c55/docker/docker-compose.yml#L97
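i.e. in that docker-compose.yml, under the serving container's environment section (replace x.y with the version you need):

```yaml
environment:
  CLEARML_EXTRA_PYTHON_PACKAGES: "transformers==x.y"
```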
OddAlligator72 FYI, in your current code you can always do:

```python
if use_trains:
    from trains import Task
    Task.init()
```
Might be easier 😉
@<1657918706052763648:profile|SillyRobin38> out of curiosity did you compare performance of tensorrt-llm vs vllm ?
(the jury is still out on that, just wondered if you had a chance)
In the UI you can see all the agents and their IDs
Then you can run:
clearml-agent daemon --stop <agent id>
The easiest way possible would be if I could just somehow run the task and let the LSF manage the environment
You mean let the LSF set up the conda/venv? Or do you also mean to get the code base, changes, etc.?
You mean like for your internal support channel inside your company ?