web-server seems okay, could you send the logs from the api-server?
Also, if you can, send the console logs from your browser when you get the blank screen. Thanks.
You can run md5 on the file as stored in the remote storage (NFS or S3)
S3 support is implementation specific (i.e. MinIO, Weka, Wasabi etc. might not support it), and I'm actually not sure regarding NFS. I mean, you can run it, but it means you are actually reading the data; that said, NFS by definition should be relatively fast access.
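As a rough sketch of what "running md5 on the file" could look like (assuming the remote file is locally readable, e.g. over an NFS mount, or downloaded first), streaming the file in chunks avoids loading the whole thing into memory:

```python
import hashlib


def file_md5(path, chunk_size=8 * 1024 * 1024):
    """Compute the md5 of a file by streaming it in chunks,
    so even very large files never fully load into memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns b"" (EOF)
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Note that for S3 the object ETag sometimes equals the md5 (for non-multipart uploads), but as said above, that's implementation specific.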
wdyt?
Please feel free to do so (always better to get it from a user, not the team behind the product 🙂 )
Hi StickyMonkey98
a very large number of running and pending tasks, and doing that kind of thing via the web-interface by clicking away one-by-one is not a viable solution.
Bulk operations are now supported, upgrade the clearml-server to 1.0.2 🙂
Is it possible to fetch a list of tasks via Task.get_tasks,
Sure:
Task.get_tasks(project_name='example', task_filter=dict(system_tags=['-archived']))
Hi Martin, of course not,
Smart!
I was just wondering if it has been patched yet and if not what is the expected timeline for patching it
Yes, I believe the target is a patch version 1.15.1 to be released in a couple of weeks. This is not a major issue, but it's always better to have it fixed. (btw: the enterprise version never had this issue to begin with, because it is of course authenticated, and it has an additional RBAC layer on top.)
What do you see in the console when you start the trains-agent? It should detect the CUDA version.
WobblyCrab70 sure, put a load-balancer in between. AWS has a solution for that; basically use the AMI from the GitHub repo and ask IT to add HTTPS on the 8080/8008/8081 ports.
OHH nice, I thought it was just some kind of job queue on already up-and-running machines
It's much more than that, it's a way of life 🙂
But seriously now, it allows you to use any machine as part of your cluster and send jobs for execution from the web UI: any machine, even just a standalone GPU machine under your desk, or any cloud GPU instance, and even mixing the two together 🙂
Maybe I need to change something here: apiserver.conf
Not sure, I'm still waiting on an answer...
It manages the scheduling process, so there's no need to package your code or worry about building dockers etc. It also has an AWS autoscaler that spins up EC2 instances based on the amount of jobs you have in the execution queue and the limit of your budget (obviously spinning down machines that are idle).
CooperativeFox72 btw, are you guys running those 20 experiments manually or through trains-agent?
CooperativeFox72 yes, 20 experiments in parallel means that you always have at least 20 connections coming from different machines, and then you have the UI adding on top of it. I'm assuming the sluggishness you feel is the requests being delayed.
You can configure the API server to have more process workers, you just need to make sure the machine has enough memory to support it.
Let me check... I think you might need to docker exec
Anyhow, I would start by upgrading the server itself.
Sounds good?
GrievingTurkey78 short answer no 🙂
Long answer, the files are stored as differential sets (think change-sets from the previous version(s)). The collection of files is then compressed and stored as a single zip. The zip itself can be stored on Google, but on their object storage (not GDrive). Notice that the default storage for clearml-data is the clearml-server; that said, you can always mix and match (even between versions).
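The idea of storing a version as the set of changes from its parent, compressed into a single zip, can be sketched in plain Python. This is just an illustration of the concept, not ClearML's actual implementation; the function names here are made up:

```python
import hashlib
import os
import zipfile


def snapshot_hashes(folder):
    """Map each file's relative path to a hash of its content."""
    hashes = {}
    for root, _, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            rel = os.path.relpath(path, folder)
            with open(path, "rb") as f:
                hashes[rel] = hashlib.sha256(f.read()).hexdigest()
    return hashes


def zip_changes(folder, parent_hashes, zip_path):
    """Store only files that are new or changed vs. the parent
    version, compressed into a single zip (the 'change set')."""
    current = snapshot_hashes(folder)
    changed = [rel for rel, h in current.items()
               if parent_hashes.get(rel) != h]
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for rel in changed:
            zf.write(os.path.join(folder, rel), rel)
    return changed
```

Restoring a version then means extracting the parent version(s) first and the change-set zip on top.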
If the first task failed, then the remaining tasks are not scheduled for execution, which is what I expect.
agreed
I'm just surprised that if the first task is aborted by the user instead,
How is that different from failed? The assumption is that if a component depends on another one, it needs its output; if it does not, they can run in parallel. What am I missing?
Hi CooperativeFox72
I think the upload reporting (files over 5mb) was added post 0.17 version, hence the log.
The default upload chunk reporting is 5MB, but it is not configurable; maybe we should add it to the clearml.conf? wdyt?
CooperativeFox72 I would think the easiest would be to configure it globally in the clearml.conf (rather than add more arguments to the already packed Task.init) 🙂
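If it does land in clearml.conf, it could look something like this (note: the key name below is hypothetical, this setting does not exist yet):

```
sdk {
  development {
    # hypothetical setting: report upload progress every N MB
    # (current hard-coded behavior: 5 MB)
    upload_report_chunk_size_mb: 5
  }
}
```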
I'm with you on 60 messages being way too much...
Could you open a GitHub Issue on it, so we do not forget?
The main reason to add the timeout is because the warning was annoying to users 🙂
The secondary reason was that clearml will start reporting based on seconds from start, then when iterations start it will revert back to iterations. But if the iterations are "epochs", the numbers are lower, so you end up with a graph that does not match the expected "iterations" x-axis. Make sense?
This will set more time before the timeout, right?
Correct.
task.freeze_monitor()
download()
task.defrost_monitor()
Currently there isn't, but that's a good idea.
What would be the argument of using it vs increasing the timeout ?
btw: setting the resource timeout to 99999 basically means that it will wait until the first reported iteration, not that it will just sleep for 99999 sec 🙂
Yes, it is reproducible. Do you want a snippet?
Already fixed 🙂 please ping tomorrow, I think an RC with the fix should be out soon
CooperativeFox72 please see if you can send a code snippet to reproduce the issue. I'd be happy to solve it...
Hi CooperativeFox72
But my docker image has all my code and all the packages it needs; I don't understand why the agent needs to install all of those again?
So based on the docker file you previously posted, I think all your python packages are actually installed on the "appuser" and not as system packages.
Basically remove the "add user" part and the --user from the pip install.
For example:
```
FROM nvidia/cuda:10.1-cudnn7-devel
ENV DEBIAN_FRONTEND noninteractive
RUN ...
```
Maybe we should rename it?! It actually creates a Task but will not auto-connect it...
CooperativeFox72
Could you try to run the docker, and then inside the docker try to do:
su root
whoami
Okay we have something π
To your clearml.conf add:
agent.docker_preprocess_bash_script = [
    "su root",
    "cp -f /root/*.conf ~/",
]
Let's see if that works
I am creating this user
Please explain, I think this is the culprit ...
but I think they did it for a reason, no?
Not a very good one, they just installed everything under the user and used --user for the pip.
It really does not matter inside a docker; the only reason one might want to do that is if you are mounting other drives and you want to make sure they are not accessed with the "root" user, but with user id 1000.
Yes, this is definitely the issue: the agent assumes the docker user is "root".
Let me check something