DepressedChimpanzee34 something along the lines of:
from multiprocessing.pool import ThreadPool
p = ThreadPool()
def get_last_metric(t):
    return t.get_last_scalar_metrics()
task_scalars_list = p.map(get_last_metric, top_tasks)
p.close()
We parallelized the network connection, as I'm assuming the delay is in the fetching
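A self-contained sketch of the same thread-pool idea, runnable without a server. `FakeTask` is a hypothetical stand-in for a real Task (it only sleeps to imitate a network round trip), and the shape of the returned dict is illustrative only:

```python
from multiprocessing.pool import ThreadPool
import time


class FakeTask:
    """Hypothetical stand-in for a Task; sleeps to imitate a network round trip."""

    def __init__(self, task_id, value):
        self.id = task_id
        self._value = value

    def get_last_scalar_metrics(self):
        time.sleep(0.05)  # pretend this is the request to the server
        return {"loss": {"train": {"last": self._value}}}


def fetch_last_metrics(tasks, workers=8):
    # ThreadPool.map returns results in the same order as `tasks`
    with ThreadPool(workers) as pool:
        return pool.map(lambda t: t.get_last_scalar_metrics(), tasks)


top_tasks = [FakeTask(f"task-{i}", i / 10) for i in range(8)]
start = time.perf_counter()
task_scalars_list = fetch_last_metrics(top_tasks)
elapsed = time.perf_counter() - start
print(f"fetched {len(task_scalars_list)} results in {elapsed:.2f}s")
```

Threads are enough here because the work is waiting on I/O, not computing, so the requests overlap instead of running one after another.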
PompousParrot44
Check out the task.execute_remotely()
You can call it right after the task init, and it will enqueue your running Task, and leave the process (if you want).
https://github.com/allegroai/trains/blob/65a4aa7aa90fc867993cf0d5e36c214e6c044270/trains/task.py#L1437
Then by default it is the home folder (`~/.clearml`) that is running out of free space
DilapidatedDucks58 so is this more like a pipeline DAG that is built ?
I'm assuming this is more than just grouping ?
(by that I mean, accessing a Task's artifact does necessarily point to a "connection", no? Is it a single Task everyone is accessing, or a "type" of a Task ?)
Is this process fixed, i.e. for a certain project we have a flow: (1) execute a Task of type A, then (2) a Task of type B using the artifacts from Task A. This implies we might have multiple Tasks of types A/B but they are alw...
Is it possible to substitute these steps using containers instead?
I'm not sure I follow, could you expand ?
DefiantHippopotamus88 you are sending the curl to the wrong port, it should be 9090 on your setup (based on what I remember from the unified docker compose)
Like what would be the exact query, given an endpoint, for requests per sec?
You mean in Grafana ?
Please hit Ctrl-F5 to refresh the entire page, and see if it is still empty...
Woot woot
ChubbyLouse32 when you get it working please PR it, this is very very cool!
(I'll be happy to help)
Task.current_task().connect(training_args, name='huggingface args')
And you should be able to change them when launching remotely
SmallDeer34 btw: "set_parameters_as_dict" will replace all the arguments (and is one way) ...
Hurrah Hurrah
PricklyJellyfish35
Do you mean the original OmegaConf, before the overrides ? or the configuration files used to create the OmegaConf ?
GiganticTurtle0 a fix was just pushed to GitHub
pip install git+
Train Data Params/a = {}
Train Data Params/b = ...
Then maybe we could "hack" it so that if you edit it in the UI like so:
Train Data Params/a = {'new': 'value'}
Train Data Params/b = ...
You end up with
param = {'a': {'new': 'value'}, 'b' : ... }
What do you think?
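A rough sketch of what that "hack" could look like, assuming the UI hands back every value as a string under a `Section/key` name. `unflatten` is a hypothetical helper written for this illustration, not part of the SDK, and the value `"3"` for `b` is made up since the original elides it:

```python
import ast


def unflatten(flat, section):
    """Rebuild a dict from 'Section/key' entries whose values are strings."""
    prefix = section + "/"
    result = {}
    for name, raw in flat.items():
        if not name.startswith(prefix):
            continue  # entry belongs to another section
        key = name[len(prefix):]
        try:
            # the UI returns text, so try to parse Python literals like "{'new': 'value'}"
            result[key] = ast.literal_eval(raw)
        except (ValueError, SyntaxError):
            result[key] = raw  # not a literal, keep the plain string
    return result


edited_in_ui = {
    "Train Data Params/a": "{'new': 'value'}",
    "Train Data Params/b": "3",  # made-up value for illustration
}
param = unflatten(edited_in_ui, "Train Data Params")
print(param)
```

`ast.literal_eval` only accepts literals (dicts, lists, numbers, strings), so text typed into the UI is never executed as code.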
Okay. And 110 means 11.1 and not 11.0?
110 means 11.0, the odd thing is, it actually installed 11.1, and from the pytorch website this is exactly how they suggest to install with conda...
Let me know if forcing the CUDA version changes anything
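To make the "110 means 11.0" reading concrete, here is a tiny sketch. It assumes the compact code is simply the major version followed by a single minor digit; `cuda_code_to_version` is a hypothetical helper for illustration, not an agent function:

```python
def cuda_code_to_version(code):
    """Turn a compact code such as 110 or "102" into a dotted version string."""
    # assumption: the last digit is the minor version, the rest is the major
    major, minor = divmod(int(code), 10)
    return f"{major}.{minor}"


for code in (110, 111, "102"):
    print(code, "->", cuda_code_to_version(code))
```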
Okay, what you can do is the following:
assuming you want to launch task id aabb12
The actual slurm command will be:
trains-agent execute --full-monitoring --id aabb12
You can test it on your local machine as well.
Make sure the trains.conf is available in the slurm job
(use trains-agent --config-file to point to a globally shared one)
What do you think?
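The command from the steps above could be assembled in a small launcher script roughly like this. It is only a sketch: `agent_execute_command` and the `/shared/trains.conf` path are hypothetical, and how the resulting string is handed to slurm is left out:

```python
import shlex


def agent_execute_command(task_id, config_file=None):
    """Build the trains-agent command line for running one task inside a job."""
    cmd = ["trains-agent"]
    if config_file:
        # global option, placed before the sub-command
        cmd += ["--config-file", config_file]
    cmd += ["execute", "--full-monitoring", "--id", task_id]
    return shlex.join(cmd)  # requires Python 3.8+


print(agent_execute_command("aabb12"))
print(agent_execute_command("aabb12", "/shared/trains.conf"))  # hypothetical path
```

Printing the command first makes it easy to paste it into a local shell and test it before submitting the job.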
ZanyPig66 you are correct in your assumptions. What exactly do you have in the Task? If there is no git repo, the entire script should be under "uncommitted changes". What is your case?
Hi ScaryKoala63
Sure, add the following to your clearml.conf:
sdk.storage.cache.default_cache_manager_size = 400
I think you are correct, it seems like for some reason you hit the cache limit, and a previous entry was deleted
Hi BoredGoat1
from this warning: "TRAINS Monitor: GPU monitoring failed getting GPU reading, switching off GPU monitoring" it seems trains failed to load the NVIDIA shared library (.so) that does the GPU monitoring.
This is based on pynvml, and I think it is trying to access "libnvidia-ml.so.1"
Basically, if you can run nvidia-smi from inside the container, it should work.
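A quick way to check both things from inside the container, using only the standard library. This is a sketch that assumes the library name mentioned above is the one being looked up; `can_load_nvml` is a hypothetical helper:

```python
import ctypes
import shutil


def can_load_nvml(lib_name="libnvidia-ml.so.1"):
    """Return True if the shared library can be loaded by this process."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        # raised when the dynamic loader cannot find or open the library
        return False


print("nvidia-smi on PATH:", shutil.which("nvidia-smi") is not None)
print("libnvidia-ml.so.1 loadable:", can_load_nvml())
```

If both lines print False, the container was most likely started without the NVIDIA runtime, so the monitor has nothing to read from.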
So basically the APIClient is a pythonic interface to the RestAPI, so you can do the following
See if this one works:
from time import time
# `client` is an APIClient instance; `workers` comes from client.workers.get_all()
# stats from the last 60 seconds
for worker in workers:
    print(client.workers.get_stats(
        worker_ids=[worker.id],
        from_date=int(time() - 60),
        to_date=int(time()),
        interval=60,
    ))
It's just the print (__repr__) not showing the data:
for w in client.workers.get_all():
    print(w.data)
RattySeagull0 I think you are correct, Python 3.6 is the one installed inside the docker. Is it important to have 3.7 ? You might need another docker (or change the installation script and install Python 3.7 inside)
because step can be constructed with multiple
sub-components
but not all of them might be added to the UI graph
Just to make sure I fully understand: when we decorate with @sub_node, we want that to also appear in the UI graph (and have its own Task / metrics etc.)
correct?
Could you run your code from outside the git repository?
I have a theory, you never actually added the entry point file to the git repo, so the agent never actually installed it, and it just did nothing (it should have reported an error, I'll look into it)
WDYT?
the easiest way possible would be if we could just somehow run the task and let the LSF manage the environment
You mean let the LSF set the conda/venv ? or do you also mean to get the code-base, changes etc ?
ReassuredTiger98 I'm trying to debug what's going on, because it should have worked.
Regarding prints ...
` from clearml import Task
from time import sleep

def main():
    task = Task.init(project_name="test", task_name="test")
    d = {"a": "1"}
    print('uploading artifact')
    task.upload_artifact("myArtifact", d)
    print('done uploading artifact')
    # not sure if this helps, but it won't hurt while debugging
    sleep(3.0)

if __name__ == "__main__":
    main() `
AbruptWorm50 can you send the full image? (the X axis is missing from the graph)
I have to admit, I'm not sure...
Let me talk to backend guys, in theory you are correct the "initial secret" can be injected via the helm env var, but I'm not sure how that would work in this specific case