AI Data Engineering Intern
I used cutting-edge algorithms (WizardLM, DEITA, etc.) and tools (distilabel, DataDreamer) to augment training data for large language models.
What I did
Cleaned and filtered text data for LLM training using NLP tools such as TextBlob and NLTK.
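A minimal sketch of that cleaning-and-filtering step, using only the standard library; in the actual pipeline TextBlob/NLTK handled tokenization and linguistic checks, and the thresholds and helper names below are illustrative assumptions:

```python
import re
import unicodedata

def clean_text(text: str) -> str:
    """Normalize unicode, strip HTML remnants, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = re.sub(r"<[^>]+>", " ", text)      # drop leftover HTML tags
    return re.sub(r"\s+", " ", text).strip()  # collapse whitespace

def filter_samples(samples, min_words=5, max_words=512):
    """Keep deduplicated samples whose length falls in a sane range."""
    seen, kept = set(), []
    for s in samples:
        s = clean_text(s)
        n = len(s.split())
        if min_words <= n <= max_words and s not in seen:
            seen.add(s)
            kept.append(s)
    return kept

raw = [
    "<p>LLMs  need   clean training data.</p>",
    "too short",                                # dropped by length filter
    "<p>LLMs need clean training data.</p>",    # duplicate after cleaning
]
print(filter_samples(raw))
```

Deduplicating after normalization (not before) is what catches the two near-identical samples above.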
Following papers such as WizardLM and DEITA, I ran instruction selection and diversification in parallel, expanding more than 10 well-known datasets with cutting-edge data-processing tools such as distilabel and DataDreamer.
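To show the flavor of that instruction diversification, here is a sketch of building an Evol-Instruct-style rewriting prompt in the spirit of WizardLM; the directive wordings and function names are my own illustrative stand-ins (the real work used distilabel/DataDreamer pipelines), and the resulting prompt would be sent to an LLM:

```python
import random

# In-depth and in-breadth evolution directives, in the spirit of
# WizardLM's Evol-Instruct; the exact wording here is illustrative.
DEPTH_OPS = [
    "Add one extra constraint or requirement to the instruction.",
    "Require multi-step reasoning to answer the instruction.",
    "Replace general concepts with more specific ones.",
]
BREADTH_OP = ("Create a brand-new instruction in the same domain "
              "with similar difficulty.")

def build_evol_prompt(instruction: str, mode: str, rng: random.Random) -> str:
    """Wrap an instruction in an evolution prompt for a generator LLM."""
    op = rng.choice(DEPTH_OPS) if mode == "depth" else BREADTH_OP
    return (f"Rewrite the instruction below.\n"
            f"Rule: {op}\n"
            f"#Instruction#: {instruction}\n"
            f"#Rewritten Instruction#:")

rng = random.Random(0)
prompt = build_evol_prompt("Explain binary search.", "depth", rng)
print(prompt)
```

Depth evolution makes individual instructions harder; breadth evolution grows coverage of the domain. Alternating the two is what expands a seed dataset.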
Quantitatively evaluated LLM data on complexity and quality, following the UltraFeedback approach.
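A sketch of that LLM-as-judge scoring step, assuming a judge model that replies with JSON; the prompt text, axes, and the stubbed judge reply below are illustrative, not UltraFeedback's exact rubric:

```python
import json

def build_rating_prompt(instruction: str, response: str) -> str:
    """Ask a judge model to rate one sample on complexity and quality."""
    return (
        "Rate the response to the instruction on two axes, each 1-5.\n"
        'Reply with JSON like {"complexity": n, "quality": n}.\n'
        f"Instruction: {instruction}\n"
        f"Response: {response}"
    )

def parse_rating(judge_output: str) -> dict:
    """Parse the judge's JSON reply, clamping every score into 1-5."""
    scores = json.loads(judge_output)
    return {k: max(1, min(5, int(v))) for k, v in scores.items()}

# A stubbed judge reply stands in for a real LLM call here;
# note the out-of-range quality score gets clamped to the scale.
rating = parse_rating('{"complexity": 4, "quality": 7}')
print(rating)
```

Clamping and strict JSON parsing matter in practice: judge models occasionally return malformed or out-of-scale scores, and a filtering pipeline should fail loudly rather than ingest them.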
Built prompts using CoT, few-shot in-context learning, and other techniques to generate diverse variations of questions in different styles and complete high-quality, task-specific generation.
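A minimal sketch of assembling such a prompt, combining few-shot examples with chain-of-thought reasoning; the example content and function name are illustrative:

```python
def build_fewshot_cot_prompt(examples, question):
    """Assemble a few-shot prompt whose examples carry CoT reasoning."""
    parts = []
    for q, reasoning, answer in examples:
        parts.append(
            f"Q: {q}\nLet's think step by step. {reasoning}\nA: {answer}"
        )
    # The trailing CoT cue nudges the model to reason before answering.
    parts.append(f"Q: {question}\nLet's think step by step.")
    return "\n\n".join(parts)

examples = [
    ("What is 3 + 4 * 2?",
     "Multiplication first: 4 * 2 = 8. Then 3 + 8 = 11.",
     "11"),
]
prompt = build_fewshot_cot_prompt(examples, "What is 2 + 5 * 3?")
print(prompt)
```

Swapping in example sets written in different styles is one simple way to steer the style of the generated questions.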
Augmented more than ten datasets, solving engineering problems such as large-scale parallel API calls and periodic checkpointing.
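The parallel-calls-plus-checkpointing pattern can be sketched as below; `fake_llm_call` is a stand-in for a real API request, and the worker count and checkpoint interval are illustrative:

```python
import json
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor, as_completed

def fake_llm_call(item):
    """Stand-in for a real (slow, network-bound) LLM API call."""
    return {"input": item, "output": item.upper()}

def augment(items, ckpt_path, workers=4, every=2):
    """Run calls in parallel, checkpointing results every `every` items."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as ex:
        futures = [ex.submit(fake_llm_call, it) for it in items]
        for i, fut in enumerate(as_completed(futures), 1):
            results.append(fut.result())
            if i % every == 0:                 # periodic checkpoint, so a
                with open(ckpt_path, "w") as f:  # crash loses little work
                    json.dump(results, f)
    with open(ckpt_path, "w") as f:            # final save
        json.dump(results, f)
    return results

path = os.path.join(tempfile.mkdtemp(), "ckpt.json")
out = augment(["a", "b", "c"], path)
print(len(out), os.path.exists(path))
```

Threads suit this workload because the calls are I/O-bound; results arrive out of order under `as_completed`, so downstream code should not rely on input ordering.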
What I gained
I gained an understanding of algorithm-related work (I now have a rough idea of what working in the industry is like :) and of the actual needs of companies. I learned about the structure and operating mechanisms of a company, and picked up some insight into how LLMs are applied in real businesses.
I learned a great deal about LLMs, including training, fine-tuning, and the underlying principles.
During this period my coding became more organized, and I worked extensively with the APIs of mainstream LLMs like GPT and Gemini, accumulating solid engineering experience. I solved a series of issues around parallel processing and periodic checkpointing, and can elaborate on these further.
PS: I found that I still prefer doing research; working in an office can be a bit painful :)