Hosting Your Own AI with Two-Way Voice Chat Is Easier Than You Think!

by HeraHaven AI (@herahavenai), 10 min read, 2025/01/08

Too Long; Didn't Read

This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

The integration of LLMs with voice capabilities has created new opportunities in personalized customer interactions.

This guide will walk you through setting up a local LLM server that supports two-way voice interactions using Python, Transformers, Qwen2-Audio-7B-Instruct, and Bark.

Prerequisites

Before we begin, you will need to have the following installed:

  • Python: Version 3.9 or higher.
  • PyTorch: For running the models.
  • Transformers: Provides access to the Qwen model.
  • Accelerate: Required in some environments.
  • FFmpeg & pydub: For audio processing.
  • FastAPI: To create the web server.
  • Uvicorn: ASGI server to run FastAPI.
  • Bark: For text-to-speech synthesis.
  • Multipart & SciPy: To parse form uploads and write audio files.

FFmpeg can be installed via apt install ffmpeg on Linux or brew install ffmpeg on macOS.


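You can install the Python dependencies using pip: pip install torch transformers accelerate pydub fastapi uvicorn python-multipart scipy git+https://github.com/suno-ai/bark.git

Bark is installed from Suno's GitHub repository here because the package named bark on PyPI is a different, unrelated project.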
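
Before moving on, it can help to confirm that the dependencies are actually visible to your interpreter. The following is a minimal sketch using only the standard library; the helper names missing_modules and ffmpeg_on_path are made up for this illustration, and the module list is an assumption based on the packages named above.

```python
import importlib.util
import shutil

# Import names for the pip packages listed above. python-multipart is omitted
# here because its import name is not the same as its package name.
REQUIRED_MODULES = [
    "torch", "transformers", "accelerate", "pydub",
    "fastapi", "uvicorn", "bark", "scipy",
]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def ffmpeg_on_path():
    """Return True when an ffmpeg executable is visible on PATH."""
    return shutil.which("ffmpeg") is not None


if __name__ == "__main__":
    print("Missing Python packages:", missing_modules(REQUIRED_MODULES) or "none")
    print("FFmpeg found:", ffmpeg_on_path())
```

Running this as a script prints which packages still need installing, without importing any of the heavy libraries.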

Step 1: Setting Up the Environment

First, let's set up our Python environment and choose our PyTorch device:


    import torch

    device = 'cuda' if torch.cuda.is_available() else 'cpu'


This code checks whether a CUDA-compatible (Nvidia) GPU is available and sets the device accordingly.

If no such GPU is available, PyTorch will instead run on the CPU, which is much slower.

On newer Apple Silicon devices, the device can also be set to mps to run PyTorch on Metal, but the PyTorch Metal implementation is not comprehensive.
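
If you want the script to consider Apple's Metal backend as well, the selection logic could be sketched as follows. This is only an illustration: pick_device and detect_device are hypothetical helper names, and the fallback to "cpu" when PyTorch is not importable is an assumption made so the snippet runs anywhere.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Choose a device string, preferring CUDA, then Metal (mps), then CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe PyTorch for available backends and pick a device string."""
    try:
        import torch
    except ImportError:
        # Assumption for this sketch: without PyTorch we simply report "cpu".
        return "cpu"
    # Older PyTorch builds may not ship the mps backend module at all.
    mps_backend = getattr(torch.backends, "mps", None)
    mps_available = bool(mps_backend is not None and mps_backend.is_available())
    return pick_device(torch.cuda.is_available(), mps_available)


if __name__ == "__main__":
    print("Selected device:", detect_device())
```

Since Metal support is incomplete, some operations may still fail on mps, in which case falling back to "cpu" is the safer choice.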

Step 2: Loading the Model

Many open-source LLMs only support text input and text output. However, since we want to create a voice-in, voice-out system, this would require us to use two more models to (1) convert the speech into text before it is fed into our LLM and (2) convert the LLM output back into speech.

By using a multimodal LLM like Qwen Audio, we can get away with one model to process speech input into a text response, and then only need a second model to convert the LLM output back into speech.

This multimodal approach is not only more efficient in terms of processing time and (V)RAM consumption, but it also usually yields better results, since the input audio is sent straight to the LLM without an intermediate transcription step.
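
The difference between the two designs can be sketched with plain functions. Everything here is hypothetical: the type aliases and the stand-in stages exist only to show the shape of each pipeline, not how any real model is called.

```python
from typing import Callable

# Hypothetical stage signatures used only for this illustration.
SpeechToText = Callable[[bytes], str]
TextLLM = Callable[[str], str]
AudioLLM = Callable[[bytes], str]
TextToSpeech = Callable[[str], bytes]


def cascaded_pipeline(audio: bytes, stt: SpeechToText, llm: TextLLM, tts: TextToSpeech) -> bytes:
    """Three models: speech-to-text, a text-only LLM, then text-to-speech."""
    transcript = stt(audio)
    reply = llm(transcript)
    return tts(reply)


def multimodal_pipeline(audio: bytes, audio_llm: AudioLLM, tts: TextToSpeech) -> bytes:
    """Two models: an audio-capable LLM, then text-to-speech."""
    reply = audio_llm(audio)
    return tts(reply)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs without any model installed.
    fake_tts = lambda text: text.encode("utf-8")
    print(cascaded_pipeline(b"hi", lambda a: a.decode(), str.upper, fake_tts))
    print(multimodal_pipeline(b"hi", lambda a: a.decode().upper(), fake_tts))
```

The multimodal variant has one fewer model to load and one fewer place where information from the original audio can be lost.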


If you are running on a cloud GPU host like Runpod or Vast, you will want to point the HuggingFace home and Bark cache directories at your volume storage by running export HF_HOME=/workspace/hf and export XDG_CACHE_HOME=/workspace/bark before downloading the models.
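
The same configuration can also be applied from inside Python, as long as it happens before transformers or bark are imported. This is a sketch under that assumption; configure_cache_dirs is a hypothetical helper, and the hf and bark subdirectory names simply mirror the paths used above.

```python
import os
from pathlib import Path


def configure_cache_dirs(volume_root: str) -> dict:
    """Create cache directories under volume_root and export them as env vars."""
    root = Path(volume_root)
    targets = {
        "HF_HOME": root / "hf",            # HuggingFace cache location
        "XDG_CACHE_HOME": root / "bark",   # cache root used for the Bark download
    }
    for variable, path in targets.items():
        path.mkdir(parents=True, exist_ok=True)
        os.environ[variable] = str(path)
    return {variable: str(path) for variable, path in targets.items()}


if __name__ == "__main__":
    # Call this first, then import transformers and bark afterwards.
    print(configure_cache_dirs(os.path.join(os.getcwd(), "workspace")))
```

On a cloud host you would pass your mounted volume path, for example /workspace, instead of a directory under the current working directory.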


    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)


We have chosen the small 7B variant of the Qwen Audio model series here to reduce our compute requirements. However, Qwen may have released stronger and bigger audio models by the time you are reading this article. You can view all the Qwen models on HuggingFace to double-check that you are using their latest model.

For a production environment, you may want to use a fast inference engine like vLLM for much higher throughput.

Step 3: Loading the Bark model

Bark is a state-of-the-art open-source text-to-speech AI model that supports multiple languages as well as sound effects.


    from bark import SAMPLE_RATE, generate_audio, preload_models

    preload_models()


Besides Bark, you can also use other open-source or proprietary text-to-speech models. Keep in mind that while the proprietary ones may be more performant, they come at a much higher cost. The TTS arena keeps an up-to-date comparison.

With both Qwen Audio 7B and Bark loaded into memory, the approximate (V)RAM utilization is 24GB, so make sure that your hardware supports this. Otherwise, you may use a quantized version of the Qwen model to save on memory.
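
To get a rough feel for how much quantization can save, you can estimate the memory taken by the weights alone as parameters times bytes per parameter. The sketch below assumes a nominal 7 billion parameters (the real checkpoint may differ) and ignores activations, the generation cache, and the Bark models, so treat the numbers as a lower bound rather than a measurement.

```python
def weight_memory_gb(num_parameters: float, bits_per_parameter: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_parameters * bits_per_parameter / 8 / 1e9


# Assumption: nominal parameter count taken from the "7B" in the model name.
NOMINAL_PARAMETERS = 7e9

if __name__ == "__main__":
    for bits in (32, 16, 8, 4):
        size = weight_memory_gb(NOMINAL_PARAMETERS, bits)
        print(f"{bits:>2}-bit weights: ~{size:.1f} GB")
```

Under these assumptions, moving from 16-bit to 4-bit weights cuts the weight footprint to a quarter, which is where most of the savings from a quantized model come from.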

Step 4: Setting up the FastAPI server

We will create a FastAPI server with two routes that handle incoming audio or text inputs and return audio responses.


    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn

    app = FastAPI()

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        # TODO
        return

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        # TODO
        return

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)


This server accepts audio files via POST requests at the /voice endpoint and text via POST requests at the /text endpoint.

Step 5: Processing Audio Input

We will use ffmpeg (through pydub) to decode the incoming audio and prepare it for the Qwen model.


    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample, downmix to mono, and force 16-bit samples so that the
        # int16 conversion below is valid for any input format
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        # Scale 16-bit PCM values to floats in [-1.0, 1.0)
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        # pydub decodes the uploaded bytes through FFmpeg
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array
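
The division by 32768 is what maps 16-bit PCM samples onto floating-point values between -1 and 1. A standard-library sketch of that one step, independent of pydub and NumPy, might look like this; pcm16_to_floats is a hypothetical name and the input is assumed to be little-endian 16-bit mono PCM.

```python
import struct

INT16_SCALE = 32768.0  # 2 ** 15, the magnitude of the most negative int16 value


def pcm16_to_floats(pcm_bytes: bytes) -> list:
    """Convert little-endian 16-bit PCM bytes into floats in [-1.0, 1.0)."""
    sample_count = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % sample_count, pcm_bytes[: sample_count * 2])
    return [sample / INT16_SCALE for sample in samples]


if __name__ == "__main__":
    raw = struct.pack("<3h", -32768, 0, 16384)
    print(pcm16_to_floats(raw))
```

The most negative sample maps to exactly -1.0, while the most positive one (32767) lands just below 1.0, which is why the range is half-open.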

Step 6: Generating a Text Response with Qwen

With the processed audio, we can generate a text response using the Qwen model. This needs to handle both text and audio inputs.

The processor will convert our input to the model's chat template (ChatML in Qwen's case).


    def generate_response(conversation):
        # Render the conversation into the model's chat template as plain text
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

        # Collect every audio clip referenced in the conversation
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        # max_new_tokens limits only the reply, so a long prompt cannot use up the budget
        generate_ids = model.generate(**inputs, max_new_tokens=256)
        # Drop the prompt tokens and keep only the newly generated ones
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response
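
To make the chat template step more concrete, here is a simplified, text-only illustration of the ChatML layout. It is not the processor's actual template: render_chatml is a hypothetical helper, and the real template also handles audio placeholders and may add details this sketch leaves out.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render text-only messages in the basic ChatML layout."""
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>role ... <|im_end|> markers
        parts.append(f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # An open assistant turn tells the model it should answer next
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([{"role": "user", "content": "Hey"}]))
```

Seeing the layout spelled out also explains why add_generation_prompt=True matters: without the open assistant turn, the model has no cue that a reply is expected.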


Feel free to play around with generation parameters such as the temperature in the model.generate call.
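
As a rough intuition for what temperature does: the model's scores (logits) are divided by the temperature before being turned into probabilities, so low values sharpen the distribution and high values flatten it. The standalone sketch below illustrates that arithmetic with made-up logits; it is not how the library is implemented internally. Note that in Transformers, temperature only takes effect when sampling is enabled with do_sample=True.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities after scaling them by 1 / temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exponentials = [math.exp(value - peak) for value in scaled]
    total = sum(exponentials)
    return [value / total for value in exponentials]


if __name__ == "__main__":
    example_logits = [2.0, 1.0, 0.1]  # made-up scores for three candidate tokens
    for temperature in (0.5, 1.0, 2.0):
        probabilities = softmax_with_temperature(example_logits, temperature)
        print(temperature, [round(p, 3) for p in probabilities])
```

Lower temperatures make the assistant more predictable, while higher ones make its replies more varied.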

Step 7: Converting Text to Speech with Bark

Finally, we will convert the generated text response back to speech.


    from scipy.io.wavfile import write as write_wav

    def text_to_speech(text):
        audio_array = generate_audio(text)
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)
        return output_buffer
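
When given a floating-point array, SciPy writes a floating-point WAV file, which some audio players may not handle. If you run into that, one option is to convert the samples to 16-bit PCM before writing. The following is a standard-library sketch of that idea rather than part of the original server; floats_to_wav_bytes is a hypothetical helper and it assumes mono audio with samples roughly in the range -1 to 1.

```python
import struct
import wave
from io import BytesIO


def floats_to_wav_bytes(samples, sample_rate: int) -> BytesIO:
    """Write mono float samples as a 16-bit PCM WAV into an in-memory buffer."""
    # Clip to [-1, 1] so out-of-range values cannot overflow the int16 range
    clipped = [max(-1.0, min(1.0, float(sample))) for sample in samples]
    pcm = struct.pack(
        "<%dh" % len(clipped),
        *(int(round(sample * 32767)) for sample in clipped),
    )
    buffer = BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)      # mono
        wav_file.setsampwidth(2)      # 2 bytes = 16 bits per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    demo = floats_to_wav_bytes([0.0, 1.0, -1.0, 2.0], 24000)
    print(len(demo.getvalue()), "bytes written")
```

The returned buffer can be passed to StreamingResponse in the same way as the buffer produced by text_to_speech.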

Step 8: Integrating everything into the APIs

Update the endpoints to process the audio or text input, generate a response, and return the synthesized speech as a WAV file.


    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

You can also choose to add a system message to the conversations to gain more control over the assistant's responses.
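
One way to do that is to build the conversation through a small helper that prepends the system message. This is a sketch only: build_conversation and the example prompt are hypothetical, and it assumes the chat template accepts a plain-string system message.

```python
def build_conversation(user_content, system_prompt=None):
    """Return a conversation list, optionally starting with a system message."""
    conversation = []
    if system_prompt:
        # The system message comes first so it frames every later turn
        conversation.append({"role": "system", "content": system_prompt})
    conversation.append({"role": "user", "content": user_content})
    return conversation


if __name__ == "__main__":
    example = build_conversation(
        [{"type": "text", "text": "Hey"}],
        system_prompt="You are a friendly voice assistant. Keep answers short.",
    )
    print(example)
```

Because the system content is a plain string rather than a list, the audio-collection loop in generate_response skips it and only inspects the user turns.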

Step 9: Testing things out

We can use curl to ping our server as follows:


    # Audio input (input.wav is a placeholder for your own audio file)
    curl -X POST http://localhost:8000/voice --output output.wav -F "file=@input.wav"

    # Text input
    curl -X POST http://localhost:8000/text --output output.wav -H "Content-Type: application/x-www-form-urlencoded" -d "text=Hey"
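
If you would rather test the text endpoint from Python, the request can be reproduced with the standard library alone. This is a minimal sketch assuming the server is reachable at the base URL you pass in; build_text_request and save_spoken_reply are hypothetical helper names.

```python
from urllib import parse, request


def build_text_request(base_url: str, text: str) -> request.Request:
    """Build a form-encoded POST request for the /text endpoint."""
    body = parse.urlencode({"text": text}).encode("utf-8")
    return request.Request(
        url=f"{base_url}/text",
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def save_spoken_reply(base_url: str, text: str, output_path: str) -> None:
    """Send the text to the server and save the returned WAV audio to disk."""
    with request.urlopen(build_text_request(base_url, text)) as response:
        with open(output_path, "wb") as output_file:
            output_file.write(response.read())


# Example usage, with the server from this guide running locally:
# save_spoken_reply("http://localhost:8000", "Hey", "output.wav")
```

Generation can take a while on slower hardware, so expect the call to block until the full audio response has been produced.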

Conclusion

By following these steps, you have set up a simple local server capable of two-way voice interaction using state-of-the-art models. This setup can serve as a foundation for building more complex voice-enabled applications.

Applications

If you are exploring ways to monetize AI-powered language models, consider these potential applications:

Full code

    import torch
    from fastapi import FastAPI, UploadFile, Form
    from fastapi.responses import StreamingResponse
    import uvicorn
    from transformers import AutoProcessor, Qwen2AudioForConditionalGeneration
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav
    from pydub import AudioSegment
    from io import BytesIO
    import numpy as np

    device = 'cuda' if torch.cuda.is_available() else 'cpu'

    model_name = "Qwen/Qwen2-Audio-7B-Instruct"
    processor = AutoProcessor.from_pretrained(model_name)
    model = Qwen2AudioForConditionalGeneration.from_pretrained(model_name, device_map="auto").to(device)

    preload_models()

    app = FastAPI()

    def audiosegment_to_float32_array(audio_segment: AudioSegment, target_rate: int = 16000) -> np.ndarray:
        # Resample, downmix to mono, and force 16-bit samples so that the
        # int16 conversion below is valid for any input format
        audio_segment = audio_segment.set_frame_rate(target_rate).set_channels(1).set_sample_width(2)
        samples = np.array(audio_segment.get_array_of_samples(), dtype=np.int16)
        # Scale 16-bit PCM values to floats in [-1.0, 1.0)
        samples = samples.astype(np.float32) / 32768.0
        return samples

    def load_audio_as_array(audio_bytes: bytes) -> np.ndarray:
        # pydub decodes the uploaded bytes through FFmpeg
        audio_segment = AudioSegment.from_file(BytesIO(audio_bytes))
        float_array = audiosegment_to_float32_array(audio_segment, target_rate=16000)
        return float_array

    def generate_response(conversation):
        # Render the conversation into the model's chat template as plain text
        text = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)

        # Collect every audio clip referenced in the conversation
        audios = []
        for message in conversation:
            if isinstance(message["content"], list):
                for ele in message["content"]:
                    if ele["type"] == "audio":
                        audio_array = load_audio_as_array(ele["audio_url"])
                        audios.append(audio_array)

        if audios:
            inputs = processor(
                text=text,
                audios=audios,
                return_tensors="pt",
                padding=True
            ).to(device)
        else:
            inputs = processor(
                text=text,
                return_tensors="pt",
                padding=True
            ).to(device)

        # max_new_tokens limits only the reply, so a long prompt cannot use up the budget
        generate_ids = model.generate(**inputs, max_new_tokens=256)
        # Drop the prompt tokens and keep only the newly generated ones
        generate_ids = generate_ids[:, inputs.input_ids.size(1):]

        response = processor.batch_decode(
            generate_ids,
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return response

    def text_to_speech(text):
        audio_array = generate_audio(text)
        output_buffer = BytesIO()
        write_wav(output_buffer, SAMPLE_RATE, audio_array)
        output_buffer.seek(0)
        return output_buffer

    @app.post("/voice")
    async def voice_interaction(file: UploadFile):
        audio_bytes = await file.read()
        conversation = [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": audio_bytes
                    }
                ]
            }
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    @app.post("/text")
    async def text_interaction(text: str = Form(...)):
        conversation = [
            {"role": "user", "content": [{"type": "text", "text": text}]}
        ]
        response_text = generate_response(conversation)
        audio_output = text_to_speech(response_text)
        return StreamingResponse(audio_output, media_type="audio/wav")

    if __name__ == "__main__":
        uvicorn.run(app, host="0.0.0.0", port=8000)