So without the flush I got the error apparently at the very end of the script -
Yes... it's a python thing, background threads might get killed in random order, so when something needs a background thread that already died you get this error, which basically means you need to do the work in the calling thread.
This actually explains why calling Flush solved the issue.
Nice!
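As a small illustration of the workaround above (a sketch, assuming the standard Task API; the exact call in your code may differ):
from clearml import Task

task = Task.init(project_name="examples", task_name="flush demo")  # hypothetical names
# ... reporting / training code ...
task.flush(wait_for_uploads=True)  # force pending work onto the calling thread before the script exits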
yes, so it does exit the local process (at least, the command returns),
What do you mean by "the command returns"? Are you running the script from bash and it returns to bash?
but when the dependencies are installed, the git creds are not taken into account
I have to admit, we missed that use case 😞
A quick fix will be to use git ssh, which is system wide.
but I now want to switch to git auth using a Personal Access Token for security reasons
Smart move 😉
As for the git repo credentials, we can always add them when you are using user/pass. I guess that would be the behavior you are expecting, unless the domain is different...
Hmm there was this one:
https://github.com/allegroai/clearml/commit/f3d42d0a531db13b1bacbf0977de6480fedce7f6
Basically it is always caching steps (hence the skip). You can install from the main branch to verify this is the issue. An RC is due in a few days (it was already supposed to be out but got a bit delayed).
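For example, something along the lines of:
pip install -U git+https://github.com/allegroai/clearml.git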
Does it mean I can use clearml-serving helm chart alone
Unrelated, the clearml-serving can be deployed on k8s or with docker-compose regardless of where/how clearml-server is deployed
DefiantHippopotamus88 HTTPConnectionPool(host='localhost', port=8081):
This will not work, because inside the container of the second docker compose "fileserver" is not defined:
CLEARML_FILES_HOST=""
You have two options:
1. Configure the docker compose to use the host network on all containers (as opposed to the isolated mode they are running in now).
2. Configure all of the CLEARML_* variables to point to the host IP address (e.g. 192.168.1.55), then rerun the entire thing (see the example right after this list).
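For option 2, a minimal sketch (the IP is just the example above, and the ports are the default clearml-server ones, adjust to your deployment):
CLEARML_API_HOST=http://192.168.1.55:8008
CLEARML_WEB_HOST=http://192.168.1.55:8080
CLEARML_FILES_HOST=http://192.168.1.55:8081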
If you are using the "default" queue for the agent, notice you might need to run the agent with --services-mode
to allow for multiple pipeline components on the same machine
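For example, something like:
clearml-agent daemon --queue default --services-mode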
MoodyCentipede68 seems you did not pass any configuration (OS env or conf file), so it does not know how to find the server and authenticate. Make sense?
Notice the configuration parameters:
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L160
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L162
https://github.com/allegroai/clearml/blob/34c41cfc8c3419e06cd4ac954e4b23034667c4d9/examples/services/monitoring/slack_alerts.py#L156
Oh dear, I think your theory might be correct, and this is just the mongo preallocating storage.
Which means the entire /opt/trains just disappeared
Hi WickedGoat98
but is there also a way to delete them, or wipe complete projects?
https://github.com/allegroai/trains/issues/16
Auto cleanup service here:
https://github.com/allegroai/trains/blob/master/examples/services/cleanup/cleanup_service.py
Hi LazyTurkey38
is it possible to have the agents keep a local version and only download the diff of the job commit to speed things up?
This is what it does, it has a local cached copy and it only pulls the latest changes
Hi AstonishingWorm64
I think you are correct, there is no external interface to change the docker.
Could you open a GitHub issue so we do not forget to add an interface for that ?
As a temp hack, you can manually clone the "triton serving engine" task and edit the container image (under the Execution tab).
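If you prefer to do it from code rather than the UI, a rough sketch (assuming the standard Task API; the task lookup and the image below are just placeholders):
from clearml import Task

# clone the serving engine task and point the clone at a different container image
serving_task = Task.get_task(project_name="serving", task_name="triton serving engine")  # hypothetical lookup
cloned = Task.clone(source_task=serving_task, name="triton serving engine - custom image")
cloned.set_base_docker("nvcr.io/nvidia/tritonserver:22.07-py3")  # placeholder image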
wdyt?
I appended python path with /code/app/flair in my base image and execute
the python path is changing since it installs a new venv into the system.
Let me check what's going on with the PYTHONPATH, because it definitely is changed when running the code (the code base root folder is added to it). Maybe we need to make sure that if you had PYTHONPATH pre-defined we restore it.
JitteryCoyote63 Should be quite safe, there is no major change that I'm aware of on the ClearML side that can affect it.
That said, wait until after the weekend, we are releasing a new ClearML package. I remember there was something with the model logging; it might not be directly related to ignite, but it's worth testing on the latest version.
is there a way that i can pull all scalars at once?
I guess you mean from multiple Tasks ? (if so then the answer is no, this is on a per Task basis)
Or, can i get experiments list and pull the data?
Yes, you can use Task.get_tasks to get a list of task objects, then iterate over them. Would that work for you?
https://clear.ml/docs/latest/docs/references/sdk/task/#taskget_tasks
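A rough sketch of that (project name is a placeholder, and I'm assuming a recent clearml version for get_reported_scalars):
from clearml import Task

tasks = Task.get_tasks(project_name="my project")  # hypothetical project
for t in tasks:
    scalars = t.get_reported_scalars()  # {title: {series: {"x": [...], "y": [...]}}}
    print(t.id, list(scalars.keys()))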
Hi DefeatedCrab47
Not really sure, and it smells bad ...
Basically you can always use the TB logger, and call Task.init.
Hi @<1729309120315527168:profile|ShallowLion60>
How did you create those credentials ?
DistressedGoat23 you are correct: since in the end this becomes a plotly object, the extra_layout is for general-purpose layout, but this specific entry sits next to the data. Bottom line, can you open a GitHub issue so we do not forget to fix it? In the meantime you can use the general plotly reporting, as SweetBadger76 suggested (see the sketch below).
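A minimal sketch of the general plotly reporting (figure contents and names are just an example):
import plotly.graph_objects as go
from clearml import Task

task = Task.init(project_name="examples", task_name="plotly report")  # hypothetical names
fig = go.Figure(data=go.Bar(x=["a", "b"], y=[1, 3]))
fig.update_layout(title="full control over the layout here")
task.get_logger().report_plotly(title="my plot", series="bar", iteration=0, figure=fig)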
Worker just installs by name from pip, and it installs not my package!
Oh dear ...
Did you configure additional pip repositories in the Agent's clearml.conf? https://github.com/allegroai/clearml-agent/blob/178af0dee84e22becb9eec8f81f343b9f2022630/docs/clearml.conf#L77 It might be that (1) is not enough, as pip will first try to search for the package in the pip repository, and only then in the private one. To avoid that, in your code you can point directly to an https of your package Ta...
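For reference, the relevant clearml.conf section looks roughly like this (the URL is a placeholder for your private index):
agent {
    package_manager {
        # extra index URLs searched in addition to the default pip repository
        extra_index_url: ["https://my.private.pypi/simple"]
    }
}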
Hi ScantChimpanzee51
Is it possible to run multiple agent on EC2 machines started by the Autoscaler?
I think that by default you cannot,
having the Autoscaler start 1x p3.8xlarge (4 GPU) on AWS might be better than 4x p3.2xlarge (1 GPU) in terms of availability, but then we'd need one Agent per GPU.
I think that this multi-GPU setup is only available in the enterprise tier.
That said, the AWS pricing is linear, it costs the same having 2 instances with 1 GPU as 1 instanc...
I can definitely feel you!
(I think the implementation is not trivial: metrics data size is collected and stored as a cumulative value on the account, and going over it per Task is actually quite taxing for the backend. Maybe it should be an async request, like "get me a list of the X largest Tasks"? How would the UI present it? FYI, keeping some sort of bookkeeping per task is not trivial either, hence the main issue)
BTW: in your code, you should probably replace
dataset_task = Task.get_task(task_id=dataset.id)
with:
dataset_task = dataset._task
Hi RipeGoose2
when I'm using the set_credentials approach does it mean the trains.conf is redundant? if
Yes, this means there is no need for trains.conf; all the important stuff (i.e. server + credentials) you provide from code.
BTW: when you execute the same code (i.e. code with the set_credentials call) the agent's configuration will override what you have there, so you will be able to run the Task later either on-prem or in the cloud without needing to change the code itself 🙂
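A minimal sketch of the set_credentials approach (server addresses and keys are placeholders; assuming a recent clearml/trains SDK):
from clearml import Task

Task.set_credentials(
    api_host="http://localhost:8008",   # placeholder server addresses
    web_host="http://localhost:8080",
    files_host="http://localhost:8081",
    key="YOUR_ACCESS_KEY",
    secret="YOUR_SECRET_KEY",
)
task = Task.init(project_name="examples", task_name="no conf file needed")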
compression=ZIP_DEFLATED if compression is None else compression
wdyt?
It takes 20mins to build the venv environment needed by the clearml-agent
You are Joking?! 😭
it does apt-get install python3-pip , and pip install clearml-agent, how is that 20min?
So I have a task that just loads a model, but I don't see it as an artifact in the UI
You should see it under Artifacts, Input model if you are calling Keras load function (or similar)
I'm assuming these are the only packages that are imported directly (i.e. pandas requires other packages, but the code imports pandas, so this is what's listed).
The way ClearML detects packages: it first tries to determine whether this is a "standalone" script; if it is, then only imports in the main script are logged. If it "thinks" this is not a standalone script, it will analyze the entire repository.
make sense ?
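And if the analysis ever misses a package you need, you can always add it explicitly before Task.init (a sketch; package name and version are placeholders):
from clearml import Task

# force a specific package into the recorded requirements (must be called before Task.init)
Task.add_requirements("pandas", "1.5.3")  # placeholder version
task = Task.init(project_name="examples", task_name="explicit requirements")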
LovelyHamster1 what do you mean by "assume the permissions of a specific IAM Role" ?
In order to spin up an EC2 instance (AWS autoscaler) you have to have correct credentials; to pass those credentials you must create a key/secret pair for the autoscaler. There is no direct support for IAM Role. Make sense?