
Conversation

@veelion (Contributor) commented Nov 3, 2022:

This mode improves the throughput of the websocket server.

Test results:

  • hardware-1:
    Platinum 8358P CPU @ 2.60GHz, 15 cores, 80GB memory; 1× A5000 GPU with 24GB memory

  • hardware-2:
    Platinum 8369B CPU @ 2.90GHz, 32 cores, 120GB memory; 1× A100-SXM4-80GB GPU (80GB memory)

  • data:
    3000 wav files with durations ranging from 0.6 to 15 seconds

| hardware   | websocket_server     | concurrency | batch_size | RTF     | CER  |
|------------|----------------------|-------------|------------|---------|------|
| hardware-1 | libtorch(CPU)        | 30          | 1          | 0.01666 | 8.90 |
| hardware-1 | libtorch(GPU)        | 10          | 1          | 0.00831 | 8.90 |
| hardware-1 | libtorch(GPU+batch)  | 20          | 8          | 0.00339 | 9.61 |
| hardware-2 | libtorch(CPU)        | 48          | 1          | 0.00753 | 8.90 |
| hardware-2 | libtorch(GPU)        | 48          | 1          | 0.00234 | 8.90 |
| hardware-2 | libtorch(GPU+batch)  | 48          | 8          | 0.00110 | 9.61 |

On the same hardware, the GPU is 2 to 3 times faster than the CPU, and run_batch mode is a bit over 2 times faster than the non-batch GPU mode (e.g., on hardware-1, 0.00831 / 0.00339 ≈ 2.45), but the CER is slightly higher (9.61 vs. 8.90).
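
For intuition, here is a minimal sketch of the batching idea in libtorch C++. It is not the actual wenet implementation: the function name `RunBatch`, the assumption that features are already padded to equal length, and the single-tensor model interface are all illustrative.

```cpp
#include <torch/script.h>

#include <vector>

// Sketch: stack per-connection feature tensors into one batch and run a
// single forward pass, amortizing GPU launch overhead across requests.
torch::Tensor RunBatch(torch::jit::script::Module& model,
                       const std::vector<torch::Tensor>& feats) {
  torch::NoGradGuard no_grad;
  // Assumes each tensor is [num_frames, feat_dim] and padded to the same
  // length; variable lengths would need padding plus a lengths tensor.
  torch::Tensor batch = torch::stack(feats, /*dim=*/0);  // [B, T, D]
#ifdef USE_GPU
  batch = batch.to(at::kCUDA);
#endif
  return model.forward({batch}).toTensor();
}
```

The trade-off matches the table: one forward pass over batch_size=8 utterances raises throughput (lower RTF), while padding all utterances in a batch to the longest one can nudge the CER up slightly.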

veelion added 30 commits July 20, 2022 11:10
@WangGewu commented:

In the libtorch-gpu code, GPU memory is not explicitly released. As the call volume grows, could this lead to an out-of-memory problem?

    r_hyps_pad_sos_eos, ctc_scores_tensor).toTuple()->elements();
auto rescores = outputs[1].toTensor().to(at::kCPU);
#ifdef USE_GPU
// Return cached GPU memory to the driver after each rescoring call.
c10::cuda::CUDACachingAllocator::emptyCache();
#endif
@veelion (Contributor, Author) replied:
#1534 clears the GPU memory cache here, so the server can support much higher concurrency.
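
For readers following along, a hedged sketch of the pattern (a hypothetical wrapper, not the actual wenet code): emptyCache() releases only the allocator's cached, unreferenced blocks, so live tensors such as rescores are untouched; the cost is that subsequent GPU allocations pay cudaMalloc again.

```cpp
#ifdef USE_GPU
#include <c10/cuda/CUDACachingAllocator.h>
#endif
#include <torch/script.h>

#include <utility>
#include <vector>

// Hypothetical wrapper mirroring the diff above: run GPU rescoring, copy
// the scores to the CPU, then return cached GPU blocks to the driver so
// many concurrent sessions do not pile up cached memory.
torch::Tensor RescoreAndRelease(torch::jit::script::Module& model,
                                std::vector<torch::jit::IValue> inputs) {
  auto outputs = model.forward(std::move(inputs)).toTuple()->elements();
  auto rescores = outputs[1].toTensor().to(at::kCPU);
#ifdef USE_GPU
  // Frees only unused cached blocks; correctness is unaffected.
  c10::cuda::CUDACachingAllocator::emptyCache();
#endif
  return rescores;
}
```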
