INFO 02-25 15:57:54 [config.py:208] Replacing legacy 'type' key with 'rope_type'
INFO 02-25 15:58:00 [config.py:569] This model supports multiple tasks: {'classify', 'reward', 'score', 'embed', 'generate'}. Defaulting to 'generate'.
WARNING 02-25 15:58:00 [cuda.py:95] To see benefits of async output processing, enable CUDA graph. Since, enforce-eager is enabled, async output processor cannot be used
INFO 02-25 15:58:00 [llm_engine.py:234] Initializing a V0 LLM engine (v0.7.4.dev75+g4a8cfc75) with config: model='DeepSeek-V2-Lite-Chat', speculative_config=None, tokenizer='DeepSeek-V2-Lite-Chat', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, override_neuron_config=None, tokenizer_revision=None, trust_remote_code=True, dtype=torch.bfloat16, max_seq_len=8192, download_dir=None, load_format=auto, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='xgrammar'), observability_config=ObservabilityConfig(show_hidden_metrics=False, otlp_traces_endpoint=None, collect_model_forward_time=False, collect_model_execute_time=False), seed=0, served_model_name=DeepSeek-V2-Lite-Chat, num_scheduler_steps=1, multi_step_stream_outputs=True, enable_prefix_caching=False, chunked_prefill_enabled=False, use_async_output_proc=False, disable_mm_preprocessor_cache=False, mm_processor_kwargs=None, pooler_config=None, compilation_config={"splitting_ops":[],"compile_sizes":[],"cudagraph_capture_sizes":[],"max_capture_size":0}, use_cached_outputs=False,
INFO 02-25 15:58:01 [cuda.py:173] Using FlashMLA backend.
INFO 02-25 15:58:01 [parallel_state.py:948] rank 0 in world size 1 is assigned as DP rank 0, PP rank 0, TP rank 0
INFO 02-25 15:58:01 [model_runner.py:1110] Starting to load model DeepSeek-V2-Lite-Chat...
INFO 02-25 15:58:01 [cuda.py:173] Using FlashMLA backend.
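The engine configuration logged above maps onto a handful of `LLM` constructor arguments. The sketch below is an assumption about what the `build_flash_mla_dskv2()` helper called later in this notebook might look like; the keyword arguments are real vLLM options mirroring the logged config, but the helper body itself is not shown in this excerpt.

```python
# Hypothetical sketch of the build_flash_mla_dskv2() helper used later in this
# notebook; the kwargs mirror the engine config logged above (an assumption).
from vllm import LLM

def build_flash_mla_dskv2() -> LLM:
    # FlashMLA is selected automatically for this MLA model on a supported GPU
    # (see the "Using FlashMLA backend" lines above).
    return LLM(
        model="DeepSeek-V2-Lite-Chat",   # local path or HF repo id
        trust_remote_code=True,          # DeepSeek-V2 ships custom model code
        dtype="bfloat16",
        max_model_len=8192,
        tensor_parallel_size=1,
        enforce_eager=True,              # disables CUDA graphs, as in the log
        gpu_memory_utilization=0.9,
        enable_prefix_caching=False,
        seed=0,
    )
```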
Loading safetensors checkpoint shards: 0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 25% Completed | 1/4 [00:01<00:04, 1.37s/it]
Loading safetensors checkpoint shards: 50% Completed | 2/4 [00:02<00:02, 1.45s/it]
Loading safetensors checkpoint shards: 75% Completed | 3/4 [00:03<00:01, 1.26s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00, 1.33s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:05<00:00, 1.34s/it]
INFO 02-25 15:58:07 [model_runner.py:1117] Loading model weights took 31.1253 GB and 5.736558 seconds
WARNING 02-25 15:58:09 [fused_moe.py:849] Using default MoE config. Performance might be sub-optimal! Config file not found at /usr/local/lib/python3.10/site-packages/vllm/model_executor/layers/fused_moe/configs/E=64,N=1408,device_name=****.json
INFO 02-25 15:58:09 [worker.py:267] Memory profiling takes 1.76 seconds
INFO 02-25 15:58:09 [worker.py:267] the current vLLM instance can use total_gpu_memory (95.00GiB) x gpu_memory_utilization (0.90) = 85.50GiB
INFO 02-25 15:58:09 [worker.py:267] model weights take 31.13GiB; non_torch_memory takes 0.12GiB; PyTorch activation peak memory takes 0.96GiB; the rest of the memory reserved for KV Cache is 53.30GiB.
INFO 02-25 15:58:09 [executor_base.py:111] # cuda blocks: 25874, # CPU blocks: 1941
INFO 02-25 15:58:09 [executor_base.py:116] Maximum concurrency for 8192 tokens per request: 202.14x
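A quick sanity check of the numbers above, assuming the 64-token KV-cache block size used by the FlashMLA backend (the block size itself is not printed in this log):

```python
# Sanity-check the logged KV-cache budget and concurrency figures.
total_gpu_memory = 95.00            # GiB
gpu_memory_utilization = 0.90
weights, non_torch, activations = 31.13, 0.12, 0.96   # GiB, from the log

kv_cache_budget = total_gpu_memory * gpu_memory_utilization - weights - non_torch - activations
print(f"{kv_cache_budget:.2f} GiB")  # ~53.29 GiB, matching the logged 53.30 GiB up to rounding

num_gpu_blocks = 25874
block_size = 64                      # assumed FlashMLA block size (tokens per block)
max_seq_len = 8192
print(f"{num_gpu_blocks * block_size / max_seq_len:.2f}x")  # 202.14x, as logged
```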
INFO 02-25 15:58:11 [llm_engine.py:436] init engine (profile, create kv cache, warmup model) took 4.16 seconds
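The cell below calls `warmup_and_infer(build_flash_mla_dskv2(), messages_list)`, whose definition is not included in this excerpt. A minimal sketch of what such a helper might do, assuming it applies the model's chat template, runs a short warmup generation (the first progress bar) and then times the real run (the second progress bar); the function body and sampling settings are assumptions.

```python
# Hypothetical sketch of warmup_and_infer(); the real helper is defined
# elsewhere in the notebook. Assumed behavior: one warmup pass, then a timed run.
import time
from vllm import LLM, SamplingParams

def warmup_and_infer(llm: LLM, messages_list):
    tokenizer = llm.get_tokenizer()
    prompts = [
        tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
        for messages in messages_list
    ]

    # Warmup pass with a small token budget (first progress bar above).
    llm.generate(prompts, SamplingParams(max_tokens=32))

    # Timed generation (second progress bar above).
    start = time.perf_counter()
    outputs = llm.generate(prompts, SamplingParams(temperature=0.0, max_tokens=1024))
    generation_time = time.perf_counter() - start

    generated_text = outputs[0].outputs[0].text
    n_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
    return generated_text, n_tokens, generation_time
```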
/tmp/ipykernel_3769/871633.py:7: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
generated_text, n_tokens, generation_time = warmup_and_infer(build_flash_mla_dskv2(), messages_list)
Processed prompts: 100%|██████████| 1/1 [00:01<00:00, 1.44s/it, est. speed input: 9.02 toks/s, output: 23.59 toks/s]
/tmp/ipykernel_3769/871633.py:7: DeprecationWarning: The keyword arguments {'prompt_token_ids'} are deprecated and will be removed in a future update. Please use the 'prompts' parameter instead.
generated_text, n_tokens, generation_time = warmup_and_infer(build_flash_mla_dskv2(), messages_list)
Processed prompts: 100%|██████████| 1/1 [00:17<00:00, 17.13s/it, est. speed input: 0.99 toks/s, output: 30.06 toks/s]
Generate 515 tokens in 17.13 secs
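As a consistency check, 515 tokens over 17.13 s works out to the reported 30.06 tok/s. The DeprecationWarning above is triggered by passing a `prompt_token_ids` keyword to `generate()`; pre-tokenized chat prompts can instead be wrapped in `TokensPrompt` objects and passed through the `prompts` parameter, roughly as in this sketch (the helper name and structure are hypothetical):

```python
# Migrating off the deprecated prompt_token_ids= keyword: wrap pre-tokenized
# ids in TokensPrompt and pass them through the regular prompts argument.
from vllm import LLM, SamplingParams
from vllm.inputs import TokensPrompt

def generate_from_messages(llm: LLM, messages_list, sampling_params: SamplingParams):
    tokenizer = llm.get_tokenizer()
    token_id_lists = [
        tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=True)
        for messages in messages_list
    ]
    prompts = [TokensPrompt(prompt_token_ids=ids) for ids in token_id_lists]
    return llm.generate(prompts, sampling_params)
```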
Here is a simple implementation of the QuickSort algorithm in C++:
```cpp
#include <iostream>
#include <vector>

void swap(int* a, int* b) {
    int t = *a;
    *a = *b;
    *b = t;
}

int partition(std::vector<int>& arr, int low, int high) {
    int pivot = arr[high];
    int i = (low - 1);
    for (int j = low; j <= high - 1; j++) {
        if (arr[j] < pivot) {
            i++;
            swap(&arr[i], &arr[j]);
        }
    }
    swap(&arr[i + 1], &arr[high]);
    return (i + 1);
}

void quickSort(std::vector<int>& arr, int low, int high) {
    if (low < high) {
        int pi = partition(arr, low, high);
        quickSort(arr, low, pi - 1);
        quickSort(arr, pi + 1, high);
    }
}

void printArray(std::vector<int>& arr) {
    for (int i = 0; i < arr.size(); ++i)
        std::cout << arr[i] << " ";
    std::cout << "\n";
}

int main() {
    std::vector<int> arr = {10, 7, 8, 9, 1, 5};
    int n = arr.size();
    quickSort(arr, 0, n - 1);
    std::cout << "Sorted array: \n";
    printArray(arr);
    return 0;
}
```
This code sorts an array in ascending order using the QuickSort algorithm. The quickSort function is a recursive function that sorts the sub-array to the left of pi and the sub-array to the right of pi. The partition function rearranges the elements in the array so that all elements less than the pivot are to its left and all elements greater are to its right. The pivot is always the last element of the sub-array.