Shivam Sharma

Shivam Sharma (शिवम् शर्मा)

I am a PhD student researching video misinformation detection at the Ubiquitous Knowledge Processing (UKP) Lab at TU Darmstadt, Germany, under the supervision of Prof. Iryna Gurevych. Previously, I worked at the Pacific Northwest National Laboratory (PNNL) in Richland, Washington, USA, where I contributed to projects on domain-specific LLMs and NLP for environmental and scientific data.

My broader interests lie at the intersection of Natural Language Processing and Vision-Language Modeling. I am passionate about building interpretable and transparent AI systems that help understand and mitigate misinformation in real-world multimodal settings.

I graduated from the New Jersey Institute of Technology with an M.Sc. in Data Science. My thesis, in Crisis Informatics, was written under the guidance of Prof. Cody Buntain.

Email / Google Scholar / Resume/CV / Master's Thesis

Research Highlights

My research has included information extraction from unstructured data and the training and evaluation of Large Language Models on domain-specific data.

NEPATEC1.0: First Large-Scale Text Corpus of National Environmental Policy Act PDF Documents
Shivam Sharma^*, Dan Nally, Mike Parker, Sai Munikoti, Sameera Horawalavithana

Paper Link / Dataset Link

Description: Led the development of a specialized text corpus of over 28,000 Environmental Impact Statements (EIS), enriched with structured metadata and named entities for enhanced information retrieval. This corpus supports the development of AI-driven tools designed to improve the efficiency of NEPA reviews and aid in environmental decision-making.

RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension
Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma^*, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana

Paper Link

Description: Developed a benchmark dataset to evaluate comprehension of environmental review documents, comparing the effectiveness of Retrieval-Augmented Generation (RAG) pipelines against long-context modeling approaches for document retrieval and semantic understanding.

Foundation models of scientific knowledge for chemistry: Opportunities, challenges and lessons learned
Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma^*, Sylvia Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, Svitlana Volkova
BigScience Workshop 2022
Paper Link

Description: Created a domain-specific dataset for the chemistry field, curating scientific publications to enable domain adaptation of GPT-style LLMs. Trained and fine-tuned models on this dataset to benchmark performance on domain-specific NLP tasks relative to general-purpose LLMs.

Combining neural, statistical and external features for fake news stance identification
Gaurav Bhatt, Aman Sharma, Shivam Sharma^*, Ankush Nagpal, Balasubramanian Raman, Ankush Mittal
MSM 2018
Paper Link / Github Code Link

Description: Designed a hybrid feature fusion model for fake-news stance detection, integrating neural embeddings, statistical NLP features, and custom feature engineering within a deep neural network layer for improved stance classification accuracy.

Outcome: Improved on the state of the art on Fake-News Challenge 1 by 1.25% in overall score.

Research Experience

Data Scientist, Pacific Northwest National Laboratory
Jan 2023 - Jan 2025

Contributed to domain-specific LLM applications for scientific and environmental datasets, focusing on fine-tuning and domain embedding optimization.
Developed AI chatbots for database websites using LLMs on AWS Bedrock, enhancing natural language understanding for seamless data access.
Integrated seismic time-series data into multimodal models for information extraction and open-ended text generation tasks, enhancing contextual insights and data utilization.

Post-Masters Research Associate, Pacific Northwest National Laboratory
Oct 2021 - Jan 2023

Supported the implementation of NLP techniques for effective data curation and post-processing tasks.
Contributed to data extraction and processing from chemistry-domain scientific PDFs to support the training and evaluation of LLMs.

Research Assistant, New Jersey Institute of Technology
Oct 2019 - Sep 2020
Supervisor: Dr. Cody Buntain

Led research on training and evaluation of BERT-style LLMs to analyze and classify social media posts during crisis situations.

Undergraduate Researcher, Indian Institute of Technology
Jun 2017 - Jul 2017
Supervisor: Dr. R. Balasubramanian

Played a key role in researching and implementing NLP techniques for stance detection in fake news content, using semantic similarity and topic modeling.
