AI Data Engineering Intern

I worked on data augmentation for large language models, applying recent methods (WizardLM's Evol-Instruct, DEITA, etc.) and tools (distilabel, DataDreamer).

What I did

  • Cleaned and filtered text data for LLM training using NLP tools such as TextBlob and NLTK (first sketch after this list).

  • Following papers such as WizardLM and DEITA, ran instruction selection and diversification in parallel, expanding more than 10 well-known datasets with data processing tools such as distilabel and DataDreamer (second sketch below).

  • Quantitatively scored LLM training data on complexity and quality, following UltraFeedback (third sketch below).

  • Built prompts with chain-of-thought (CoT), few-shot in-context learning, and other techniques to diversify question styles and complete high-quality specified tasks (fourth sketch below).

  • Augmented more than ten datasets, solving engineering problems such as large-scale parallel API calls and periodic checkpointing (final sketch below).
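
To make the cleaning step concrete, here is a minimal sketch of that kind of filtering pass. The length thresholds and the TextBlob subjectivity heuristic are illustrative assumptions, not the exact rules I used:

```python
import re
import nltk
from textblob import TextBlob

# Tokenizer models needed by word_tokenize (newer NLTK also uses punkt_tab).
nltk.download("punkt", quiet=True)
nltk.download("punkt_tab", quiet=True)

def clean_and_filter(texts, min_tokens=16, max_tokens=2048):
    """Drop records that are too short/long, deduplicate, and strip noise."""
    seen = set()
    kept = []
    for text in texts:
        text = re.sub(r"\s+", " ", text).strip()        # collapse whitespace
        tokens = nltk.word_tokenize(text)
        if not (min_tokens <= len(tokens) <= max_tokens):
            continue                                    # length filter
        key = text.lower()
        if key in seen:                                 # exact dedup
            continue
        seen.add(key)
        # TextBlob gives quick heuristics (sentiment/subjectivity) that can
        # flag spam-like records; the 0.95 cutoff is an illustrative choice.
        if TextBlob(text).sentiment.subjectivity > 0.95:
            continue
        kept.append(text)
    return kept
```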
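
The diversification bullet follows WizardLM's Evol-Instruct idea: ask a strong model to rewrite an instruction into a harder but still answerable variant. A sketch, with the evolution prompt paraphrased rather than quoted from the paper, and the model name a placeholder:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrase of WizardLM's "in-depth evolving": make the instruction more
# complex without changing its topic or making it unanswerable.
EVOLVE_PROMPT = (
    "Rewrite the following instruction into a more complex version that "
    "requires deeper reasoning, without changing its topic or making it "
    "unanswerable. Return only the rewritten instruction.\n\n"
    "Instruction: {instruction}"
)

def evolve(instruction: str, model: str = "gpt-4o-mini") -> str:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": EVOLVE_PROMPT.format(instruction=instruction)}],
        temperature=0.7,
    )
    return resp.choices[0].message.content.strip()
```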
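
For the UltraFeedback-style scoring, a hedged sketch of an LLM-as-judge pass that rates each record on complexity and quality. The rubric wording and JSON output format here are my assumptions, not the actual prompts:

```python
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = (
    "Rate the following instruction-response pair on two axes, each 1-5: "
    "complexity of the instruction, and quality of the response.\n"
    'Answer as JSON: {{"complexity": <int>, "quality": <int>}}.\n\n'
    "Instruction: {instruction}\nResponse: {response}"
)

def judge(instruction: str, response: str, model: str = "gpt-4o-mini") -> dict:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(instruction=instruction,
                                                  response=response)}],
        temperature=0.0,
    )
    # Assumes the model returns bare JSON; real code needs sturdier parsing.
    return json.loads(resp.choices[0].message.content)
```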
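
Prompt construction with few-shot examples plus a chain-of-thought cue can be as simple as the template below; the worked example is a placeholder:

```python
# Assemble a few-shot, chain-of-thought prompt from worked examples.
FEW_SHOT_EXAMPLES = [
    {"q": "A train travels 60 km in 1.5 hours. What is its average speed?",
     "cot": "Speed = distance / time = 60 / 1.5 = 40.",
     "a": "40 km/h"},
]

def build_prompt(question: str) -> str:
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Q: {ex['q']}\nLet's think step by step. {ex['cot']}\nA: {ex['a']}"
        )
    parts.append(f"Q: {question}\nLet's think step by step.")
    return "\n\n".join(parts)
```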
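
Finally, the parallel-call and periodic-storage problem: a sketch using a thread pool plus append-only JSONL checkpointing, so an interrupted run can resume where it left off. `augment_fn` and the file path are placeholders:

```python
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

CHECKPOINT = "augmented.jsonl"  # hypothetical output path

def already_done() -> set:
    """Resume support: ids already written to the checkpoint file."""
    if not os.path.exists(CHECKPOINT):
        return set()
    with open(CHECKPOINT) as f:
        return {json.loads(line)["id"] for line in f}

def run(records, augment_fn, max_workers=16, flush_every=50):
    done = already_done()
    todo = [r for r in records if r["id"] not in done]
    buffer = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool, \
         open(CHECKPOINT, "a") as out:
        futures = {pool.submit(augment_fn, r): r for r in todo}
        for fut in as_completed(futures):
            rec = futures[fut]
            buffer.append({"id": rec["id"], "output": fut.result()})
            if len(buffer) >= flush_every:      # periodic storage
                out.writelines(json.dumps(b) + "\n" for b in buffer)
                out.flush()
                buffer.clear()
        out.writelines(json.dumps(b) + "\n" for b in buffer)  # final flush
```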

What I got

I gained an understanding of algorithm-related work (a rough idea of what working in industry is like :) and of companies' actual needs. I also learned how a company is structured and operates, and picked up some insight into how LLMs are applied in real business.

I picked up a lot of LLM-related knowledge, covering training, fine-tuning, and the underlying principles.

During this period, my coding became more disciplined, and I made extensive use of APIs from mainstream LLMs such as GPT and Gemini, accumulating a lot of engineering experience. I solved a series of problems around parallel processing and periodic saving (the checkpointing sketch above is the general shape), and I can go into more detail on this if needed.

PS: I found that I still prefer doing research; working in an office can be a bit painful xxx