trains-agent should be deployed to GPU instances, not the trains-server.
The purpose of trains-agent is to let you send jobs to an instance (a GPU instance, at least in most cases).
The "trains-server" is a control plane, basically telling the agent what to run (by storing the execution queue and tasks). Does that make sense?
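To make the split concrete, here is a toy sketch of the idea: a "server" that only stores queues of tasks, and an "agent" that pulls from a queue and does the actual work. `ToyServer`, `ToyAgent` and the queue name "gpu" are made up for illustration; this is not the trains/ClearML API, just the control-plane vs. worker pattern described above.

```python
from collections import deque


class ToyServer:
    """Control plane: only stores queues of pending tasks, runs nothing itself."""

    def __init__(self):
        self.queues = {}

    def enqueue(self, queue_name, task):
        self.queues.setdefault(queue_name, deque()).append(task)

    def dequeue(self, queue_name):
        queue = self.queues.get(queue_name)
        # An unknown or empty queue yields None
        return queue.popleft() if queue else None


class ToyAgent:
    """Worker: lives on the (GPU) machine and executes whatever the server hands out."""

    def __init__(self, server, queue_name):
        self.server = server
        self.queue_name = queue_name
        self.completed = []

    def run_once(self):
        task = self.server.dequeue(self.queue_name)
        if task is None:
            return False
        self.completed.append(task())  # the work happens on the agent side
        return True


server = ToyServer()
server.enqueue("gpu", lambda: "trained model A")
server.enqueue("gpu", lambda: "trained model B")

agent = ToyAgent(server, "gpu")
while agent.run_once():
    pass
print(agent.completed)
```

The point of the sketch is that the server never calls the task itself, so it does not need a GPU; only the machine running the agent does.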
It's only on this specific local machine that we're facing this truncated download.
Yes, that's what the log says, makes sense.
Seems like this still doesn't solve the problem, how can we verify this setting has been applied correctly?
hmm exec into the container? what did you put in clearml.conf?
Hi @SaltySpider22
Basically you need to put all of these files into a repository, which is always a good practice.
The reason is that the pipeline (and for that matter any Task on the system) can store either a single script or a git reference, but not multiple scripts.
Okay we got to the bottom of this. This was actually because of the load balancer timeout settings we had, which was also 30 seconds and confusing us.
Nice!
btw:
in the clearml.conf we put this:
for future reference, you are missing the sdk section:
sdk.http.timeout: 300
Dot notation works as well as {}
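A small sketch of what "dot notation works as well as {}" means: a dotted key is just shorthand for nested sections. `expand_dotted` is a hypothetical helper written for this illustration only; it is not how ClearML actually parses clearml.conf.

```python
def expand_dotted(flat):
    """Expand dotted keys into nested dicts, e.g. 'a.b.c' -> {'a': {'b': {'c': ...}}}."""
    nested = {}
    for dotted_key, value in flat.items():
        node = nested
        *parents, leaf = dotted_key.split(".")
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value
    return nested


# The one-line form from the message above...
dotted = expand_dotted({"sdk.http.timeout": 300})
# ...and the same setting written as nested sections
braces = {"sdk": {"http": {"timeout": 300}}}
print(dotted == braces)
```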
ngrok to connect to the remote server at the office?
That makes sense, I guess this is the equivalent of using a VPN, from that point onward clearml-session can directly access the remote machine, right?
Oh, in that case add --remote-gateway <external_ip>. It will connect to the provided address instead of the local one. (You can also add --public-ip, which will automatically resolve the public IP of the server.)
using the docker-compose file for the clearml-serving pipeline, do we also have to mount it somehow?
oh yes, you are correct, the values are passed using environment variables (easier when using docker compose)
You can in addition add a mount from the host machine to a conf file,
volumes:
- ${PWD}/clearml.conf:/root/clearml.conf
wdyt?
Hi @FranticWhale40
You mean the download just fails on the remote serving node because it takes too long to download the model?
(basically not a serving issue per se but a download issue)
Hi WackyRabbit7
the services agent (i.e. the agent running there) spins up multiple Tasks at once (as opposed to a regular agent, where it is one task at a time).
how can I give this agent git access?
in the docker-compose you can configure the git credentials (user/pass or user/key it is the same).
https://github.com/allegroai/clearml-server/blob/d0e2313a24eb1248ebf0ddf31bf589de0d675562/docker/docker-compose.yml#L137
As I installed ClearML using pip,
Where does clearml-serving run? Usually your configuration file is in ~/clearml.conf
Notice that if it is not there, it means it is using the defaults, so just create a new one and add that line.
Oh...
try to add to your config file:
sdk.http.timeout.total = 300
Hi @MassiveGoldfish6
hmm yeah you need to remove the "hidden" system_tag from the project
from clearml.backend_api.session.client import APIClient
c = APIClient()
print(c.projects.get_by_id("PROJECT_ID_HERE").to_dict())
c.projects.update(project="PROJECT_ID_HERE", system_tags=["test"])
print(c.projects.get_by_id("PROJECT_ID_HERE").to_dict())
Notice you can get the project ID from the URL
`/projects/1974af8ccdac454b836c47349c4e826e/experiments/84...
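If you want to grab the project ID programmatically instead of copying it by hand, something like this could work. It is only a sketch: `project_id_from_url` is a hypothetical helper, the host `clearml.example.com` is a placeholder, and it assumes the ID is the 32-character hex string that follows `/projects/` as in the URL above.

```python
import re


def project_id_from_url(url):
    """Return the 32-char hex ID following '/projects/' in a Web UI URL, or None."""
    match = re.search(r"/projects/([0-9a-f]{32})(?=/|$)", url)
    return match.group(1) if match else None


# Placeholder host; the path shape follows the example URL in the message above
url = "https://clearml.example.com/projects/1974af8ccdac454b836c47349c4e826e/experiments"
print(project_id_from_url(url))
```

The returned string is what you would pass in place of "PROJECT_ID_HERE" in the snippet above.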
Hi @CrabbyKoala94
I wanted to use the Task.reset() or Task.delete() method, however neither of them seems to be able to delete only the logs in the "console" section in the UI.
So Task.reset will reset the entire outputs of the Task (and the status), as you noticed. Why would you want to just remove the logs?
You can disable the auto logs altogether if you really want to, see Task.init [auto_connect_streams](https://github.com/allegroai/cl...
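A minimal sketch of turning the automatic console capture off through `auto_connect_streams`. The project and task names are placeholders, and `init_quiet_task` / `QUIET_STREAMS` are just names made up for this example; the clearml import is kept inside the function so the snippet can be read (and loaded) without a configured server.

```python
# Per-stream switches accepted by auto_connect_streams; passing False instead
# of a dict should disable all of them at once.
QUIET_STREAMS = {"stdout": False, "stderr": False, "logging": False}


def init_quiet_task(project_name, task_name, streams=QUIET_STREAMS):
    """Create a Task whose console output is not captured automatically."""
    from clearml import Task  # requires the clearml package and a configured server

    return Task.init(
        project_name=project_name,
        task_name=task_name,
        auto_connect_streams=dict(streams),
    )


# Usage (placeholder names):
# task = init_quiet_task("my project", "no console logs")
```

Note this only affects what gets logged from that point on; it does not remove console logs already stored for an existing Task.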
DilapidatedDucks58
is there any way to post Slack alerts for the frozen experiments?
The latest RC should solve the PyTorch data loader issue, do you want to test it?
pip install clearml==0.17.5rc2
ReassuredTiger98
(for some reason it kind of jumps over PyTorch, but then installs torchvision?!)
Could you run with the latest with --debug?
We just added it, but you will have to install from git:
pip3 install git+
Then run with --debug:
clearml-agent --debug daemon ...
@HealthyDove84 if you want you can PR a fix, it should be very simple, basically:
elif np_dtype == str:
    return "STRING"
elif np_dtype == np.object_ or np_dtype.type == np.bytes_:
    return "BYTES"
return None
Hi CluelessElephant89
Hi guys, if I spot issues with the documentation, where should I post them?
The best way from our perspective is to PR the fix, this is why we put it on GitHub
Follow up: I see that if I move an Experiment to a new project, it does not copy the associated model files, and this must be done manually. Once I moved the models to the new project, the query works as expected.
Correct
Nice catch!
The trend step artifact is used to keep track of the time of the data, so we know the expected trend of the input data. For example, on the first data point, where trend_step = 1, the trend value is 10; then if trend_step = 10 (the tenth data point), our regressor will predict the trend value for the selected trend_step. This method is still in research, to make it more efficient so it doesn't need to upload an artifact on every request.
Makes sense! I would suggest you add a GitHub issue with a feature request ...
But do consider a sort of a designer's press kit on your page haha
That is a great idea!
Also you can use:
https://2928env351k1ylhds3wjks41-wpengine.netdna-ssl.com/wp-content/uploads/2019/11/Clear_ml_white_logo.svg
You mean like for your internal support channel inside your company?
Why does ClearML hide the dataset task from the main WebUI?
Basically you have the details from the Dataset page, why should it be mixed with the others?
If I specified a project for the dataset, I specifically want it there, in that project, not hidden away in some .datasets hidden sub-project.
This may be a request for a "Dataset" tab under the project. The main question is: why would you need the Dataset Task itself?
Not all dataset objects are equal, and perhap...
I guess it's on me to check whether this slowdown is negligible or not
Usually performance is negligible, especially with GPU
But if you really want the best:
Add --security-opt seccomp=unconfined to the extra_docker_arguments
See details:
https://betterprogramming.pub/faster-python-in-docker-d1a71a9b9917
Can you send the full log? This is odd, it will by default use the python executable it (the agent) is running with.
Regardless you can specify the python executable to be used here:
https://github.com/allegroai/clearml-agent/blob/bd411a19843fbb1e063b131e830a4515233bdf04/docs/clearml.conf#L44
Hi ClumsyElephant70
So do you need both requirements.txt files combined?
How will the agent be able to reproduce both repos on the remote machine?
Sure thing
BTW: ReassuredTiger98 this is definitely an interesting use case, and I think you can actually write some code to solve it if you like.
Basically let's follow up on your setup:
Machine X: agent listening to queue A, B_machine_a *notice we have two agents here
Machine Y: agent listening to queue B_machine_b
Now we (the users) will push our jobs into queues A and B
Now we have a service that does the following:
` see if we have a job in queue B
check if machine Y is working...
and: "clearml_agent: ERROR: 'charmap' codec can't encode character '\u0303' in position 5717: character maps to <undefined>"
Ohh, that's the issue with LC_ALL missing in the docker itself (i.e. a Unicode character will break it)
Add locales into the container: in your clearml.conf add the following:
agent.extra_docker_shell_script: ["apt-get install -y locales",]
Let me know if that solves the issue (as you pointed out, it has nothing to do with importing package X)
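For reference, the error itself is easy to reproduce outside the agent. This is a sketch under the assumption that the failure comes from an output encoding that cannot represent U+0303 (a combining tilde); cp1252 is used here only as a stand-in for whatever limited encoding the container ends up with, and `can_encode` is a helper made up for the example.

```python
def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding`, False otherwise."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


combining_tilde = "\u0303"  # the character from the error message

# A limited single-byte code page has no slot for this character...
print(can_encode(combining_tilde, "cp1252"))
# ...while UTF-8 can encode any Unicode character
print(can_encode(combining_tilde, "utf-8"))
```

So the character in the log is fine by itself; the fix is about giving the container a locale whose encoding can represent it.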