Development of a large language model for data extraction from unstructured text documents: a case study on production geophysical survey reports

UDK: 681.518:622.276
DOI: 10.24887/0028-2448-2025-9-108-111
Key words: Retrieval-Augmented Generation (RAG), Large Language Model (LLM), Production Geophysical Surveys (PGS), geophysics, artificial intelligence
Authors: B.M. Latypov (Ufa State Petroleum Technological University, RF, Ufa); E.V. Yudin (Gazprom Neft Company Group, RF, Saint Petersburg); R.A. Bondorov (Ufa State Petroleum Technological University, RF, Ufa); N.A. Zyryanov (St. Petersburg State University, RF, Saint Petersburg)

This paper presents the methodology and results of developing a prototype system for the automated extraction of structured information from unstructured textual reports of Production Geophysical Surveys (PGS) of oil wells. The core of the solution is the Qwen Large Language Model (LLM) architecture, enhanced with a Retrieval-Augmented Generation (RAG) mechanism that supplies the model with context from external knowledge bases. A comparative analysis of two baseline LLMs (Qwen2.5-7B-Instruct and ruGPT-3.5-13B) showed that Qwen held a significant advantage in both accuracy and processing speed. The key achievement of this work is the integration of the RAG approach, which raised the accuracy of classifying geological and technical complications across nine predefined classes from 45 % for the baseline Qwen model to 83 %. The developed software system executes a full processing pipeline: from text preprocessing (tokenization, normalization) and Named Entity Recognition to complication classification and the generation of structured data ready for integration into corporate information systems. The average processing time for a single report was 30 seconds. The proposed solution is designed to automate engineering analysis, support intervention planning, and improve the operational efficiency of oil field management by transforming unstructured textual data into structured, actionable information.
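The retrieval step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the knowledge-base snippets, the class names, the bag-of-words similarity, and the prompt wording are all assumptions made for illustration, and the call to the LLM itself is left out. The sketch only shows how passages retrieved from an external knowledge base could be placed in front of a report fragment before the model is asked to classify the complication.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge-base snippets. The actual knowledge base and the
# nine complication classes used in the paper are not listed in the abstract.
KNOWLEDGE_BASE = [
    "behind-casing crossflow: temperature anomaly outside the perforated interval",
    "casing leak: inflow detected above the perforation interval by flowmeter",
    "water breakthrough: high water cut in the lower perforated interval",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k snippets most similar to the query (assumed retriever)."""
    q = Counter(tokenize(query))
    ranked = sorted(docs, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(report, context):
    """Assemble a classification prompt: retrieved context first, then the report."""
    context_block = "\n".join(f"- {c}" for c in context)
    return (
        "Reference notes:\n"
        f"{context_block}\n\n"
        "Report fragment:\n"
        f"{report}\n\n"
        "Name the complication class described in the report fragment."
    )


report = "Flowmeter shows inflow above the perforation interval, casing leak suspected"
prompt = build_prompt(report, retrieve(report, KNOWLEDGE_BASE, k=1))
```

In a real system the bag-of-words similarity would more likely be replaced by dense embeddings, and `prompt` would be passed to the chosen model; both choices are outside what the abstract specifies.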



