Shivam Sharma (शिवम् शर्मा)

I am a Data Scientist in the Physical and Computational Sciences Directorate at the Pacific Northwest National Laboratory. I work on advancing the application of Artificial Intelligence (AI) and machine learning, particularly in the context of large-scale scientific and environmental documents.

My research experience involves applying NLP techniques to extract information from unstructured scientific and environmental data for use with autoregressive models. I am also actively working on generative AI for conversational assistants over various real-world databases such as Livewire, ARM, NEPA Documents, etc.

I graduated from the New Jersey Institute of Technology with an M.Sc. in Data Science. My thesis, in Crisis Informatics, was written under the guidance of Dr. Cody Buntain.

Email  /  Google Scholar  /  Resume/CV  /  Master's Thesis

Research Highlights

My research interests include information extraction from unstructured data and the training and evaluation of Large Language Models on domain-specific data.

NEPATEC1.0: First Large-Scale Text Corpus of National Environmental Policy Act PDF Documents
Shivam Sharma*, Dan Nally, Mike Parker, Sai Munikoti, Sameera Horawalavithana

Paper Link / Dataset Link
  • Description: Led the development of a specialized text corpus of over 28,000 Environmental Impact Statements (EIS), enriched with structured metadata and named entities for enhanced information retrieval. This corpus supports the development of AI-driven tools designed to improve the efficiency of NEPA reviews and aid in environmental decision-making.
RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension
Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma*, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana

Paper Link
  • Description: Developed a benchmark dataset to evaluate comprehension of environmental review documents, comparing the effectiveness of Retrieval-Augmented Generation (RAG) pipelines against long-context modeling approaches for document retrieval and semantic understanding.
Foundation models of scientific knowledge for chemistry: Opportunities, challenges and lessons learned
Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma*, Sylvia Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, Svitlana Volkova
BigScience Workshop 2022
Paper Link
  • Description: Created a domain-specific dataset for the chemistry field, curating scientific publications to enable domain adaptation of GPT-style LLMs. Trained and fine-tuned models on this dataset to benchmark performance on domain-specific NLP tasks relative to general-purpose LLMs.
Combining neural, statistical and external features for fake news stance identification
Gaurav Bhatt, Aman Sharma, Shivam Sharma*, Ankush Nagpal, Balasubramanian Raman, Ankush Mittal
MSM 2018 (Oral Presentation)
Paper Link / Github Code Link
  • Description: Designed a hybrid feature fusion model for fake-news stance detection, integrating neural embeddings, statistical NLP features, and custom feature engineering within a deep neural network layer for improved stance classification accuracy.
  • Outcome: Improved on the state-of-the-art overall score for Fake-News Challenge 1 by 1.25%.
Research Experience
Data Scientist, Pacific Northwest National Laboratory
Jan 2023 - Present

  • Contributed to domain-specific LLM applications for scientific and environmental datasets, focusing on fine-tuning and domain embedding optimization.
  • Developing AI chatbots for database websites using LLMs on AWS Bedrock, enhancing natural language understanding for seamless data access.
  • Integrating seismic time-series data into Multi-Modal Models for information extraction and open-ended text generation tasks, enhancing contextual insights and data utilization.

Post-Masters Research Associate, Pacific Northwest National Laboratory
Oct 2021 - Jan 2023

  • Supported the implementation of NLP techniques for effective data curation and post-processing tasks.
  • Contributed to data extraction and processing from chemistry-domain scientific PDFs to support the training and evaluation of LLMs.

Research Assistant, New Jersey Institute of Technology
Oct 2019 - Sep 2020
Supervisor: Dr. Cody Buntain

  • Led research on training and evaluation of BERT-style LLMs to analyze and classify social media posts during crisis situations.

Undergraduate Researcher, Indian Institute of Technology
June 2017 - July 2017
Supervisor: Dr. R. Balasubramanian

  • Played a key role in researching and implementing NLP techniques for stance detection in fake news content, using semantic similarity and topic modeling approaches.

