Hosting Your Own AI with Two-Way Voice Chat Is Easier Than You Think!

by HeraHaven AI, 10 min read, 2025/01/08

Too Long; Didn't Read

This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

The integration of LLMs with voice capabilities has created new opportunities for personalized customer interactions.


This guide walks you through setting up a local LLM server that supports two-way voice interaction using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we start, you will need the following installed:

  • Python : Version 3.9 or higher.
  • PyTorch : For running the models.
  • Transformers : Provides access to the Qwen model.
  • Accelerate : Required in some environments.
  • FFmpeg & pydub : For audio processing.
  • FastAPI : To build the web server.
  • Uvicorn : ASGI server to run FastAPI.
  • Bark : For text-to-speech synthesis.
  • Multipart & SciPy : To handle uploaded form data and write audio files.


FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.


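You can install the Python dependencies using pip: pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy. Bark is installed separately from its GitHub repository with pip install git+https://github.com/suno-ai/bark.git, because the bark package on PyPI is an unrelated project.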
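
Before moving on, it can help to confirm that the prerequisites listed above are actually visible to your Python interpreter. The following is a minimal sketch using only the standard library; the helper names missing_modules and ffmpeg_available are made up for this illustration, and python-multipart is left out of the list because its import name does not match its pip name.

```python
import importlib.util
import shutil

# Import names of the packages from the prerequisites list.
# python-multipart is omitted: its import name differs from its pip name.
REQUIRED_MODULES = [
    "torch", "transformers", "accelerate", "pydub",
    "fastapi", "uvicorn", "bark", "scipy",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_available():
    """Return True if an ffmpeg executable is on the PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("Missing Python packages:", missing_modules(REQUIRED_MODULES) or "none")
    print("FFmpeg found:", ffmpeg_available())
```

An empty list of missing packages and a found FFmpeg binary suggest the environment is ready for the steps that follow.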

Step 1: Setting Up the Environment

First, let's set up our Python environment and choose our PyTorch device:


    import torch

    # Use the GPU when a CUDA-capable device is present, otherwise fall back to the CPU
    device = 'cuda' if torch.cuda.is_available() else 'cpu'


This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.


If no such GPU is available, PyTorch will instead run on the CPU, which is much slower.


On newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but PyTorch's Metal implementation does not cover every operation.
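
The fallback order described above (CUDA first, then Metal, then CPU) can be written as a small helper. This is a sketch rather than part of the original code: pick_device is a hypothetical name, and it takes plain booleans so the ordering logic can be read and tested without PyTorch installed.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Prefer CUDA, then Apple's Metal backend (mps), then the CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# With PyTorch installed, the flags would come from:
#   import torch
#   device = pick_device(torch.cuda.is_available(),
#                        torch.backends.mps.is_available())
```

Because Metal support is incomplete, some operations may still fail on mps, in which case falling back to cpu is the safe option.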

Step 2: Loading the Model

Most open-source LLMs only support text input and text output. Since we want to build a voice-in, voice-out system, that would require two additional models: (1) one to convert speech to text before it is fed into our LLM and (2) one to convert the LLM's output back into speech.


By using a multimodal LLM like Qwen Audio, we can get away with a single model that turns speech input directly into a text response, and then only need a second model to convert the LLM's output into speech.


This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but also usually yields better results, since the input audio is sent straight to the LLM without an intermediate transcription step.


If you're running on a cloud GPU host like Runpod or Vast, you'll want to point the HuggingFace home and Bark cache directories at your volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models.
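
If you would rather set these cache locations from Python than from the shell, a sketch like the one below could sit at the very top of the server script. The helper name configure_cache_dirs is hypothetical; the paths are the ones from the paragraph above. The assumption here is that both libraries read these variables when they are imported, so the call has to happen before importing transformers or bark.

```python
import os


def configure_cache_dirs(hf_home="/workspace/hf", bark_cache="/workspace/bark"):
    """Point the Hugging Face and Bark caches at persistent storage.

    setdefault keeps any value that was already exported in the shell.
    """
    os.environ.setdefault("HF_HOME", hf_home)
    os.environ.setdefault("XDG_CACHE_HOME", bark_cache)
    return os.environ["HF_HOME"], os.environ["XDG_CACHE_HOME"]


# Call this before importing transformers or bark
configure_cache_dirs()
```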


    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)


We opted for the small 7B variant of the Qwen Audio model series here to keep our computational requirements down. However, Qwen may have released stronger and larger audio models by the time you read this article. You can browse all Qwen models on HuggingFace to double-check that you are using their latest model.


For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark Model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.


    from bark import SAMPLE_RATE, generate_audio, preload_models

    # Download (on first run) and load the Bark models into memory
    preload_models()


Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones may perform better, they come at a much higher cost. The TTS Arena keeps an up-to-date comparison.


With both Qwen Audio 7B and Bark loaded into memory, the approximate (V)RAM usage is 24GB, so make sure your hardware supports this. Otherwise, you can use a quantized version of the Qwen model to save on memory.
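
To get a feel for why quantization helps, the memory needed for model weights alone can be estimated as parameter count times bytes per parameter. The sketch below is back-of-the-envelope arithmetic only: it assumes a nominal 7 billion parameters (taken from the "7B" in the model name), and it ignores activations, any extra components of the model, and Bark itself, so it will not add up to the 24GB figure above.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 7e9  # assumption: the "7B" in the model name

for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_memory_gb(NOMINAL_PARAMS, bits):.1f} GB")
```

Halving the number of bits per weight halves this part of the footprint, which is where a quantized variant saves memory.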

Step 4: Setting Up the FastAPI Server

We'll create a FastAPI server with two routes that handle incoming audio or text input and return audio responses.


    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn

    app = FastAPI()

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        # TODO
        return

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        # TODO
        return

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)


This server accepts audio files via POST requests on the /voice endpoint and text form data on the /text endpoint.

Step 5: Processing the Audio Input

We'll use FFmpeg (through pydub) to process the incoming audio and prepare it for the Qwen model.


    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample to the target rate, downmix to mono, and force 16-bit samples
        # so the int16 conversion below is valid for any input format
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        # Scale 16-bit integers into the [-1.0, 1.0) float range
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        # pydub decodes the uploaded bytes with FFmpeg, whatever the container format
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array
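
The division by 32768.0 above is the usual way to map signed 16-bit PCM samples into the range from -1.0 to just below 1.0. The sketch below illustrates that same scaling with only the standard library: it builds a short 16 kHz mono test tone in memory and decodes it back into floats. The function names make_test_wav and pcm16_to_floats are made up for this example and are not part of the server code.

```python
import array
import io
import math
import sys
import wave


def make_test_wav(freq_hz: float = 440.0, seconds: float = 0.1, rate: int = 16000) -> bytes:
    """Build a mono 16-bit WAV file in memory containing a half-amplitude sine tone."""
    n = int(seconds * rate)
    samples = array.array(
        "h", (int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / rate)) for i in range(n))
    )
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 2 bytes = 16-bit samples
        w.setframerate(rate)
        w.writeframes(samples.tobytes())
    return buf.getvalue()


def pcm16_to_floats(wav_bytes: bytes) -> list:
    """Decode 16-bit PCM WAV bytes and scale the samples into [-1.0, 1.0)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        if w.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        frames = w.readframes(w.getnframes())
    samples = array.array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()
    return [s / 32768.0 for s in samples]


floats = pcm16_to_floats(make_test_wav())
print(len(floats), max(abs(x) for x in floats))
```

The bytes returned by make_test_wav can also be written to a file and uploaded to the /voice endpoint as a quick smoke test, although a pure tone is unlikely to produce a meaningful reply.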

Step 6: Generating Text Responses with Qwen

With the processed audio, we can generate a text response using the Qwen model. This function needs to handle both text and audio inputs.


The processor converts our input into the model's chat template (ChatML in Qwen's case).


    def generate_response(conversation):
        # Render the conversation into the model's chat template as plain text
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

        # Collect every audio element; "audio_url" holds the raw uploaded bytes here
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        generate_ids = model.generate(**inputs, max_length=256)
        # Drop the prompt tokens so only the newly generated reply is decoded
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response
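
To make the ChatML format mentioned above more concrete, here is a simplified, text-only rendering of it. This is an illustration only, not the template the processor actually applies: the real one is produced by processor.apply_chat_template and also handles list-style content and audio placeholders. The function name render_chatml is hypothetical.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render plain-text messages in ChatML: each turn is wrapped in
    <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for message in messages:
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn tells the model that it is its turn to speak
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


print(render_chatml([{"role": "user", "content": "Hey"}]))
```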


Feel free to experiment with generation parameters such as temperature in the model.generate call.
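
As a starting point for such experiments, the sketch below bundles a few commonly used generation arguments from Transformers (do_sample, temperature, top_p, max_new_tokens) into a dictionary. The helper name sampling_kwargs and the default values are assumptions chosen for illustration, not settings from the original code. Note that max_new_tokens limits only the newly generated tokens, whereas max_length also counts the prompt.

```python
def sampling_kwargs(temperature: float = 0.7, top_p: float = 0.9,
                    max_new_tokens: int = 256) -> dict:
    """Build keyword arguments for sampling-based generation."""
    if temperature <= 0:
        raise ValueError("temperature must be positive when sampling")
    if not 0 < top_p <= 1:
        raise ValueError("top_p must be in the interval (0, 1]")
    return {
        "do_sample": True,  # sample instead of greedy decoding
        "temperature": temperature,  # lower = more deterministic
        "top_p": top_p,  # nucleus sampling cutoff
        "max_new_tokens": max_new_tokens,  # counts generated tokens only
    }


# Intended use inside generate_response, in place of max_length=256:
#   generate_ids = model.generate(**inputs, **sampling_kwargs(temperature=0.5))
```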

Step 7: Converting Text to Speech with Bark

Finally, we'll convert the generated text response into speech.


    from scipy.io.wavfile import write as write_wav

    def text_to_speech(text):
        # Synthesize the reply with Bark and return it as an in-memory WAV file
        audio_array = generate_audio(text)
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)  # rewind so the response streams from the start
        return output_buffer
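
Bark is generally described as producing only short clips per call (on the order of 13 seconds), so longer replies may come out truncated. One common workaround is to split the text at sentence boundaries and synthesize each piece separately. The sketch below shows only the splitting part; split_into_chunks and the 200-character default are assumptions for illustration, not values from Bark's documentation.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept intact as its own chunk.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


print(split_into_chunks("One. Two. Three.", max_chars=9))
```

Each chunk could then be passed to generate_audio in turn and the resulting arrays joined (for example with np.concatenate) before writing the WAV file.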

Step 8: Integrating Everything in the API

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.


    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

You can also add a system message to the conversation to gain more control over the assistant's responses.
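
As a sketch of what adding a system message could look like, the helper below prepends one to a text conversation in the same structure the endpoints use. The name build_conversation and the example prompt are hypothetical.

```python
def build_conversation(user_text: str, system_prompt: str = "") -> list:
    """Build a conversation list, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        # Plain-string content is skipped by the audio scan in generate_response
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append(
        {"role": "user", "content": [{"type": "text", "text": user_text}]}
    )
    return conversation


conversation = build_conversation(
    "Hey", system_prompt="You are a friendly assistant. Keep answers short."
)
print(conversation)
```

Short answers are a reasonable thing to ask for in the system message, since the reply still has to be synthesized into speech afterwards.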

Step 9: Testing Things Out

We can use curl to call our server as follows:


    # Audio input (input.wav is a placeholder for your own recording)
    curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

    # Text input
    curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
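
The text request can also be sent from Python. The following is a minimal client sketch using only the standard library; the function names build_text_request and save_spoken_reply are made up for this example, and it assumes the server from the previous steps is running on localhost port 8000.

```python
import urllib.parse
import urllib.request


def build_text_request(text: str, base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build a form-encoded POST request for the /text endpoint."""
    body = urllib.parse.urlencode({"text": text}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/text",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def save_spoken_reply(text: str, out_path: str = "output.wav") -> None:
    """Send the text to the server and write the returned WAV audio to disk."""
    with urllib.request.urlopen(build_text_request(text)) as response:
        audio = response.read()
    with open(out_path, "wb") as f:
        f.write(audio)


# With the server running:
#   save_spoken_reply("Hey")
```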

Conclusion

By following these steps, you have set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you're looking for ways to monetize AI-powered language models, consider these potential applications:

Full code

    import torch
    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav
    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

    preload_models()

    app = FastAPI()

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample, downmix to mono, and force 16-bit samples before converting
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array

    def generate_response(conversation):
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

        # Collect every audio element; "audio_url" holds the raw uploaded bytes here
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        generate_ids = model.generate(**inputs, max_length=256)
        # Drop the prompt tokens so only the newly generated reply is decoded
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response

    def text_to_speech(text):
        audio_array = generate_audio(text)
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)
        return output_buffer

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)