Shivam Sharma (शिवम् शर्मा)
I am a Data Scientist at the Pacific Northwest National Laboratory's Physical and Computational Sciences Directorate. I work on advancing the application of Artificial Intelligence (AI) and machine learning, particularly in the context of large-scale scientific and environmental documents.
My research involves applying NLP techniques to extract information from unstructured scientific and environmental data for use with autoregressive models. I am also implementing generative AI to build conversational assistants for real-world databases such as Livewire, ARM, and NEPA documents.
I am a graduate of the New Jersey Institute of Technology, where I earned an M.Sc. in Data Science with a thesis in Crisis Informatics under the guidance of Dr. Cody Buntain.
Email / Google Scholar / Resume/CV / Master's Thesis

Research Highlights
My research interests include information extraction from unstructured data and the training and evaluation of large language models (LLMs) on domain-specific data.

NEPATEC1.0: First Large-Scale Text Corpus of National Environmental Policy Act PDF Documents
Shivam Sharma*, Dan Nally, Mike Parker, Sai Munikoti, Sameera Horawalavithana
Paper Link / Dataset Link
- Description: Led the development of a specialized text corpus of over 28,000 Environmental Impact Statements (EIS), enriched with structured metadata and named entities for enhanced information retrieval. This corpus supports the development of AI-driven tools designed to improve the efficiency of NEPA reviews and aid in environmental decision-making.

RAG vs. Long Context: Examining Frontier Large Language Models for Environmental Review Document Comprehension
Hung Phan, Anurag Acharya, Sarthak Chaturvedi, Shivam Sharma*, Mike Parker, Dan Nally, Ali Jannesari, Karl Pazdernik, Mahantesh Halappanavar, Sai Munikoti, Sameera Horawalavithana
Paper Link
- Description: Developed a benchmark dataset to evaluate comprehension of environmental review documents, comparing the effectiveness of Retrieval-Augmented Generation (RAG) pipelines against long-context modeling approaches for document retrieval and semantic understanding.

Foundation models of scientific knowledge for chemistry: Opportunities, challenges and lessons learned
Sameera Horawalavithana, Ellyn Ayton, Shivam Sharma*, Sylvia Howland, Megha Subramanian, Scott Vasquez, Robin Cosbey, Maria Glenski, Svitlana Volkova
BigScience Workshop 2022
Paper Link
- Description: Created a domain-specific dataset for the chemistry field, curating scientific publications to enable domain adaptation of GPT-style LLMs. Trained and fine-tuned models on this dataset to benchmark performance on domain-specific NLP tasks relative to general-purpose LLMs.

Combining neural, statistical and external features for fake news stance identification
Gaurav Bhatt, Aman Sharma, Shivam Sharma*, Ankush Nagpal, Balasubramanian Raman, Ankush Mittal
MSM 2018 (Oral Presentation)
Paper Link / Github Code Link
- Description: Designed a hybrid feature fusion model for fake-news stance detection, integrating neural embeddings, statistical NLP features, and custom feature engineering within a deep neural network layer for improved stance classification accuracy.
- Outcome: Improved on the state-of-the-art overall score for Fake News Challenge 1 by 1.25%.

Data Scientist, Pacific Northwest National Laboratory
Jan 2023 - Present
- Contributed to domain-specific LLM applications for scientific and environmental datasets, focusing on fine-tuning and domain embedding optimization.
- Developing AI chatbots for database websites using LLMs on AWS Bedrock, enhancing natural language understanding for seamless data access.
- Integrating seismic time-series data into Multi-Modal Models for information extraction and open-ended text generation tasks, enhancing contextual insights and data utilization.

Post-Masters Research Associate, Pacific Northwest National Laboratory
Oct 2021 - Jan 2023
- Supported the implementation of NLP techniques for effective data curation and post-processing tasks.
- Contributed to data extraction and processing from chemistry-domain scientific PDFs to support the training and evaluation of LLMs.

Research Assistant, New Jersey Institute of Technology
Oct 2019 - Sep 2020
Supervisor: Dr. Cody Buntain
- Led research on training and evaluation of BERT-style LLMs to analyze and classify social media posts during crisis situations.

Undergraduate Researcher, Indian Institute of Technology
June 2017 - July 2017
Supervisor: Dr. R. Balasubramanian
- Played a key role in researching and implementing NLP techniques for detecting stance in fake-news content, using semantic similarity and topic modeling approaches.