NLP & LLM Data Scientist – Healthcare & Life Sciences
Job location
Remote, United States
Type
Full Time
Responsibilities:
Apply NLP techniques and open-source Large Language Models (LLMs) such as Llama 2, Mixtral, Qwen, and BERT to extract, process, and interpret unstructured medical data from diverse sources such as EHRs, medical notes, and laboratory reports.
Collaborate with clinical scientists and data scientists to create efficient NLP models for healthcare, demonstrating an understanding of both the technical and medical aspects of the data.
Conduct data cleaning, preprocessing, and validation to maintain the accuracy and reliability of insights gathered from NLP processes.
Validate data findings and present them to stakeholders clearly and effectively.
Requirements & Skills:
Master’s or Ph.D. degree in Computational Biology, Computer Science, Data Science, Computational Linguistics, Machine Learning, or a related analytical field.
Deep understanding of and direct experience (2+ years) in handling and interpreting either Electronic Health Records (EHR) and laboratory test results or genetic test results is required.
Proven experience (2+ years) in NLP, with strong knowledge of NLP techniques such as Named Entity Recognition (NER), text summarization, and topic modeling, and their applied use in healthcare.
Expert-level understanding of and practical experience (1+ years) with open-source Large Language Models (Llama 2/3, Mixtral, etc.), including prompt engineering, inference, and fine-tuning.
Proficient in Python and SQL, with strong experience in NLP libraries such as NLTK, spaCy, and Hugging Face Transformers, and deep learning libraries such as PyTorch and TensorFlow.
Familiarity with common data science and ML practices, e.g., version control systems, agile methodologies, and documentation.
Experience working in the AWS cloud environment and with large databases (e.g., Amazon Redshift).
Experience in managing the ML lifecycle using open-source tools (e.g., MLflow).
Detail-oriented with strong analytical and problem-solving abilities.
Excellent verbal and written communication skills, with the ability to present complex data to a non-technical audience.
Experience dealing with protected health information (PHI) and familiarity with healthcare-related data privacy laws such as HIPAA.
Familiarity with standard healthcare codes and terminologies such as ICD-10, CPT, LOINC, and SNOMED CT.
Experience with Retrieval-Augmented Generation (RAG) and vector stores for storing and querying large volumes of unstructured healthcare documents.