 
JitteryCoyote63 could you send the log maybe ?
Epochs are still round numbers ...
Multiply by 2?!  😅
You mean to spin up a pod with the agent inside it (daemon in services mode)?
Or connect the services queue to the k8s cluster (i.e. define the pod template that uses CPU with not a lot of RAM)?
Then running by using the ..., am I right?
yep
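For reference, a minimal sketch of what the services-mode daemon launch could look like (the queue name and docker image are placeholders, adjust to your setup):
` clearml-agent daemon --services-mode --queue services --docker ubuntu:18.04 --cpu-only --detached `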
I have put the  --save-period  flag while running YOLOv5, and ClearML does not save the weights per epoch that I have trained. Why does this happen?
But do you still see it in the clearml UI ? do you see the models logged in the clearml UI ?
ohh right, my bad:
` docker run -t --rm nvidia/cuda:10.1-base-ubuntu18.04 bash -c "echo 'Binary::apt::APT::Keep-Downloaded-Packages \"true\";' > /etc/apt/apt.conf.d/docker-clean && apt-get update && apt-get install -y git python3-pip && pip install trains-agent && echo done" `
ReassuredTiger98  I'm trying to debug what's going on, because it should have worked.
Regarding prints ...
` from clearml import Task
from time import sleep


def main():
    task = Task.init(project_name="test", task_name="test")
    d = {"a": "1"}
    print('uploading artifact')
    task.upload_artifact("myArtifact", d)
    print('done uploading artifact')
    # not sure if this helps but it won't hurt to debug
    sleep(3.0)


if __name__ == "__main__":
    main() `
In that case, I think it is stuck on a previous Node, I can't think of any other reason.
Do you have something else on the same PV that was lost ? like api server configuration?
Hi  GrievingTurkey78
How are you getting a different version than what is used at runtime? It analyzes the PYTHONPATH just as python does. How can I reproduce it? Currently you can use  Task.add_requirements(package_name, package_version=None) . This will not force it though, it is a recommendation (used if it fails to find the package itself). Maybe we can add a force option?! What do you think?
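For reference, a minimal sketch of how it can be used (the package name and version are just placeholders); note that it has to be called before  Task.init :
` from clearml import Task

# recommend a specific version to the agent (a hint, not an enforced pin)
Task.add_requirements("torch", package_version="1.13.1")

task = Task.init(project_name="examples", task_name="pinned requirement") `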
agent.cuda_driver_version = ...
agent.cuda_runtime_version = ...
Interesting idea! (I assume for reporting only, not configuration)
... The agent mentioned used output from nvcc (2) ...
The dependencies I shared are not how the agent works, but how Nvidia CUDA works  🙂
Regarding the cuda check with  nvcc , I'm not saying this is a perfect solution, I just mentioned that this is how it is currently done.
I'm actually not sure if there is an easy way to get it from nvid...
Hi  ChubbyLouse32
If I understand correctly, you can relatively easily take a clearml Task and launch it on LSF; an integration would be something like:
` import os
from time import sleep

from clearml import Task
from clearml.backend_api.session.client import APIClient

client = APIClient()
q_id = "<queue_id>"  # the queue this LSF "agent" pulls from

while True:
    result = client.queues.get_next_task(queue=q_id)
    if not result or not result.entry:
        sleep(5)
        continue
    task_id = result.entry.task
    # here is where we create the LSF job, this is just pseudo code
    os.system("lsf-launch-cmd 'clearml...
Ohh, the controller task itself holds the artifacts ?
Not really  😞
Everyone can do everything, the idea is shareability and accessibility.
I do know that in the paid tier they have full access control, roles, SSO etc, but unfortunately it's way too complicated for the open-source.
Basically what I'm saying is trust your fellow colleagues  🙂
That is odd, can you send the full Task log? (Maybe some oddity with conda/pip ?!)
Okay we have something  🙂
To your clearml.conf add:
` agent.docker_preprocess_bash_script = [
    "su root",
    "cp -f /root/*.conf ~/",
] `
Let's see if that works
Hi  @<1694157594333024256:profile|DisturbedParrot38>
Could you attach a full log? This is quite cryptic and does not ring a bell
Are these experiments logged too (with the train-valid curves, etc)?
Yes, every run is logged as a new experiment (with its own set of HP). Do notice that the execution itself is done by the "trains-agent". Meaning the HP process creates experiments with a new set of HP and puts them into the execution queue, then  trains-agent  pulls them from the queue and starts executing them. You can have multiple  trains-agent  on as many machines as you like with specific GPUs etc. each one ...
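For reference, a rough sketch of such an HP process using the optimizer from the SDK (the base task id, parameter range, metric names and queue below are placeholders, adjust to your project):
` from clearml.automation import HyperParameterOptimizer, RandomSearch, UniformIntegerParameterRange

optimizer = HyperParameterOptimizer(
    base_task_id="<template_task_id>",   # the experiment to clone for every HP set
    hyper_parameters=[
        UniformIntegerParameterRange("General/epochs", min_value=5, max_value=20, step_size=5),
    ],
    objective_metric_title="validation",
    objective_metric_series="accuracy",
    objective_metric_sign="max",
    optimizer_class=RandomSearch,
    execution_queue="default",           # the queue the agent(s) pull from
    max_number_of_concurrent_tasks=2,
)
optimizer.start()
optimizer.wait()
optimizer.stop() `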
Hi  SourSwallow36
What do you mean by Log each experiment separately ? How would you differentiate between them?
I am symlinking the .clearml directory to a NAS server and this is perhaps part of the problem.
Yep, that sounds about right, it uses the POSIX file system for internal lock mechanisms (multi-process locks), and my guess is that the NAS for some reason does not support it...
Correct (if this is running on k8s it is most likely passed via env variables, CLEARML_WEB_HOST etc.)
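For reference, these are the kind of environment variables usually involved (the values here are placeholders for a typical in-cluster setup):
` export CLEARML_WEB_HOST=http://clearml-webserver:80
export CLEARML_API_HOST=http://clearml-apiserver:8008
export CLEARML_FILES_HOST=http://clearml-fileserver:8081 `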
YummyMoth34
It tried to upload all events and then killed the experiment
Could you send a log?
Also, what's the trains package version ?
Hi ElegantCoyote26
sometimes the agents load an earlier version of one of my libraries.
I'm assuming some internal package that is installed from a wheel file, not a direct git repo+commit link ?
@<1545216077846286336:profile|DistraughtSquirrel81> shoot an email to "support@clear.ml" and provide all the information you can on the "lost account" (i.e. the one you had the data on), this means email account that created it (or your colleagues emails), and any other information that might help to locate it.
Back to the error:
clearml_agent: ERROR: Failed getting token (error 401 from
): Unauthorized (invalid credentials) (failed to locate provided credentials)
See here:
https://github.com/allegroai/clearml-server/blob/3f2b96266bc51bfce680bd759c7fa9d635ae36d3/docker/docker-compose.yml#L131
You need to provide an access key so it can actually "talk" to the server next to it.
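For reference, the relevant bit of the docker-compose is roughly of this shape (the actual host and keys are whatever you configured for your server):
`   agent-services:
    environment:
      CLEARML_API_HOST: http://apiserver:8008
      CLEARML_API_ACCESS_KEY: <access key created in the UI>
      CLEARML_API_SECRET_KEY: <matching secret key> `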
ShakyOstrich31
I am reusing an old task ...
Which means that the old Task stores the requirements on the Task itself (see the "Installed Packages" section). Notice it also stores the exact git commit to use.
When you are cloning the Task (i.e. in the pipeline), you should probably:
- set the commit / branch to the latest in the branch
- clear the "installed packages" section, which would cause the agent to use the "requirements.txt" stored in the git repo itself (see the sketch below).
As far as I understand this s...
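A rough sketch of what that could look like with the SDK (the task id, branch and queue are placeholders, and  set_packages([])  clearing the section is an assumption, verify against your clearml version):
` from clearml import Task

# clone the old / template task instead of re-using it
cloned = Task.clone(source_task="<template_task_id>", name="cloned task")

# point the clone at the latest commit of the branch (empty commit = branch head, assumed)
cloned.set_script(branch="main", commit="")

# clear the stored "Installed Packages" so the agent falls back to requirements.txt (assumed)
cloned.set_packages([])

Task.enqueue(cloned, queue_name="default") `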