AI is not a new topic for scientists – it has been used in various forms for decades. However, growing investment, interest, and application of AI in academic and industrial research have led to a deep learning revolution that is changing the landscape of scientific discovery.
Machine learning/deep learning (ML/DL) models are used to process and analyze large data sets, identify patterns, and build predictive models, which is crucial in the hard sciences and in the analysis of experimental data. Meanwhile, large language models (LLMs) offer advanced capabilities for natural language processing, content generation, and text analysis, which can significantly improve research in text-related fields. Examples include literature processing, abstract creation, and content generation from input data. GenAI can support the creation of synthetic data, generate reports, and predict research results, which significantly speeds up the research process and increases its efficiency.
GenAI introduces new dynamics into the research process, offering tools for contextual text analysis and support in solving research problems. Thanks to advanced algorithms, scientists can now use tools that not only process and analyze text, but also provide new conclusions and suggest new directions for research.
Applications of GenAI and large language models (LLMs)
Further development of applications and models based on GenAI, such as OpenAI's announced Strawberry, is moving towards:
- Increased efficiency and automation of scientific research by automating routine tasks and analyzing large data sets.
- Facilitating access to information and knowledge through natural language processing, advanced information search, and literature surveys.
- Increased accuracy and repeatability of results. Advanced anomaly detection algorithms and predictive models make it possible to identify errors and inconsistencies in the data, which increases the credibility of the results. In addition, automating the analysis can improve the repeatability of research by minimizing the impact of human error and subjectivity in data analysis.
- Support for interdisciplinary research. By integrating data from different fields and analyzing it in the context of complex problems, these tools can help create models and solutions that combine different areas of knowledge. The development of AI will also support communication between research teams by offering common platforms for sharing results and conclusions.
- Accelerating scientific discoveries by automating research processes and speeding up data analysis. Rapid hypothesis testing and the generation of new ideas from data analysis can lead to faster discovery of new phenomena and technologies, and can facilitate innovations that would be too time-consuming or expensive without AI support.
- Troubleshooting data issues such as missing data, incomplete data, or heterogeneous data sources. Advanced data imputation and normalization techniques can improve the quality of the analyzed data, which is crucial for obtaining reliable research results.
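The imputation and normalization techniques mentioned above can be illustrated with a minimal sketch. The sample data and column semantics are illustrative assumptions, not from the source; real pipelines would typically use dedicated libraries rather than these hand-rolled helpers.

```python
# Minimal sketch of two data-quality steps: mean imputation for missing
# values and min-max normalization. The readings list is a hypothetical
# sensor column with one gap (None).
from statistics import mean

def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    m = mean(observed)
    return [m if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale values linearly to the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

readings = [2.0, None, 4.0, 6.0]    # hypothetical column with a gap
filled = impute_mean(readings)      # [2.0, 4.0, 4.0, 6.0]
scaled = min_max_normalize(filled)  # [0.0, 0.5, 0.5, 1.0]
```

Keeping such steps as explicit, documented functions also serves the reproducibility concerns discussed later: other researchers can see exactly how the data were cleaned before analysis.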
Risks
The development of GenAI poses fundamental questions for scientists regarding the role and mission of science. In a world where more and more research tasks can be taken over by AI, it is crucial to consider what place humans occupy in this ecosystem. What role should scientists play when many research processes are beginning to be supported or completely implemented by intelligent systems? The answers to these questions will affect the future of science and the directions of its development, as well as the way in which human creativity and skills will be integrated with modern technologies.
The use of new tools and changes in the way scientists conduct research are creating new technical challenges, such as:
- Reproducibility issues, where other researchers cannot repeat experiments conducted using AI tools. For example, if a lab uses AI algorithms to analyze data, difficulties in sharing the source code or data may prevent other researchers from repeating the same analyses.
- LLMs may generate content that is inaccurate or untrue, known as "hallucinations". This poses a significant challenge, as such content can be misleading and affect the quality and integrity of research. Mechanisms must be developed to detect and correct such errors effectively, and to ensure that results generated by LLMs are thoroughly checked.
- AI models can introduce or amplify biases that exist in the training data. If the data used to train a model is biased, its outputs can be distorted by those biases, leading to incorrect or discriminatory results.
- GenAI-based models can generate results that are difficult to interpret without fully understanding how the model works, which can lead to erroneous conclusions or misapplications of research results.
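One practical step toward the reproducibility issue listed above is to pin every source of randomness in an AI-assisted analysis and publish the seed with the code and data. The seed value and the toy analysis function below are illustrative assumptions, a sketch rather than any specific lab's method.

```python
# Pin randomness so an AI-assisted analysis is repeatable run-to-run.
import random

SEED = 42  # assumed, arbitrary; publish it alongside code and data

def noisy_analysis(data, seed=SEED):
    """Toy stand-in for an analysis with a random component."""
    rng = random.Random(seed)       # local RNG: no hidden global state
    sample = rng.sample(data, k=3)  # e.g. a random subsample of the data
    return sorted(sample)

data = list(range(10))
run1 = noisy_analysis(data)
run2 = noisy_analysis(data)
assert run1 == run2                 # identical seed -> identical result
```

Using a local `random.Random` instance rather than the module-level functions keeps the analysis independent of any other code that touches the global random state, which is exactly the kind of hidden dependency that breaks replication.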
Ethical issues and issues related to intellectual property law are also becoming increasingly important.
- The use of advanced GenAI tools may raise questions regarding ethics and rules of use, including issues related to research integrity and transparency. Misuse of AI tools can lead to violations of research ethics.
- The use of LLMs in science involves new challenges regarding responsibility. Who is responsible for errors or misrepresentations caused by LLMs? What standards should be applied to ensure the integrity of scientific publications supported by LLMs? It is important to develop clear policies on responsibility for LLM-generated content, to ensure that research conducted with their support complies with high ethical standards, and to define standards and procedures that guarantee the authenticity of scientific publications.
- Problems related to privacy, data security and access to research results. This brings us to the question of privacy, data ownership, and data availability. Researchers need to be aware of who has access to data and how it is protected.
- Using AI can complicate issues related to intellectual property, especially when AI generates new ideas or results. It can be more difficult to determine who owns the results generated by AI and how to protect those results.
- Limited cooperation between AI and non-AI fields may lead to less rigorous adoption of AI across disciplines. For example, a lack of integration between AI and traditional biology research methods may limit the potential for innovation.
- Environmental costs: the high energy consumption needed to run the computing infrastructure. For example, training advanced AI models such as GPT-4 requires enormous computational resources and energy, contributing to an increased carbon footprint.
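The scale of the environmental cost mentioned above can be made concrete with a back-of-envelope estimate. All numbers below are illustrative assumptions, not measured figures for GPT-4 or any other real model; the point is the shape of the calculation, not the result.

```python
# Back-of-envelope estimate of training energy and CO2 footprint.
# Every input below is an assumed, illustrative value.
gpus = 1000                 # assumed accelerator count
power_kw_per_gpu = 0.4      # assumed average draw per accelerator (kW)
hours = 24 * 30             # assumed one month of continuous training
pue = 1.2                   # assumed data-centre power usage effectiveness
grid_kg_co2_per_kwh = 0.4   # assumed grid carbon intensity (kg CO2/kWh)

energy_kwh = gpus * power_kw_per_gpu * hours * pue
co2_tonnes = energy_kwh * grid_kg_co2_per_kwh / 1000

# Roughly 345,600 kWh and ~138 t of CO2 under these assumptions.
print(f"{energy_kwh:,.0f} kWh, ~{co2_tonnes:,.0f} t CO2")
```

Even with modest assumed inputs, the estimate lands in the hundreds of megawatt-hours, which is why the energy and carbon costs of large-scale training are treated as a genuine research-policy concern.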
The question also arises whether the use of LLMs improves the accessibility and democratization of science, or instead drives the polarization of research centers around the world.
- There are growing barriers to the effective implementation of open science principles due to “black box” AI systems and the closed commercial models that drive research. The complexity of AI models can make it difficult to fully understand and reproduce research results: if algorithms are too complex or insufficiently documented, scientists may have difficulty reproducing experiments, which is crucial for verifying results.
- LLM models are often opaque and difficult to interpret. The lack of transparency in how these models work can make it difficult for researchers to track information sources and understand how the models arrive at their conclusions. It is necessary to develop methods that allow for a better understanding of LLM decision-making processes and interpretation of their results.
- Using advanced AI tools may involve high costs associated with computing infrastructure and technical requirements. The possible high costs and need for specialized technical knowledge may be a barrier for some researchers and institutions.
It can be argued that the use of LLM creates a new model of scientific work:
- LLMs can affect the integrity of the peer review and publication process if used to write peer review reports or produce scientific content. Their use in these areas risks misinterpretation or manipulation of data, so control procedures must be developed to minimise these risks and ensure that the peer review and publication processes remain sound.
- It is important to consider how LLMs may influence future directions of research and scientific discovery. There is a concern that over-reliance on LLMs may limit innovation and creativity in science. It is therefore important to treat LLMs as tools that support, rather than replace, human creativity and insight in scientific research.
- Reduced capacity for critical analysis and independent thinking in research. Scientists may start to rely on AI tools in a way that limits their ability to analyze and interpret data on their own. This can erode not only analytical skills and critical thinking but also create a dependence on these tools in daily work.
- Changing incentives in the scientific ecosystem may increase pressure on researchers to focus on advanced AI techniques at the expense of more conventional methods, or to be “good at AI” instead of “good at science”.
Conclusions and recommendations
LLMs have the potential to revolutionize the way we process text and other data in a variety of scientific fields. Their ability to generate and edit scientific texts and answer scientific questions can greatly support the process of scientific discovery, increasing efficiency and accelerating progress in a variety of fields. What is important?
- Partnerships within the scientific community and with experts in AI ethics and security to develop best practices and standards for the use of LLMs. Joint efforts are needed to identify potential risks and develop mechanisms to ensure responsible and ethical use of LLMs in science.
- Greater transparency is required regarding the operation of LLM models, to enable a better understanding of their decisions and outcomes.
- The focus should be on adapting the LLM to specific contexts and disciplines to better respond to research needs and requirements.
- As LLMs become more advanced, special attention should be paid to ethics and privacy issues related to the use of data and the results generated by these models.
- Science requires responsibility for the knowledge it produces. LLMs, while useful, should not be treated as full-fledged scientists or authors; instead, they should be viewed as research support tools to be used with caution and responsibility.
- It is necessary to develop and implement best practices and standards for the use of LLMs in research. Collaboration with publishers, conference organizers, and other stakeholders is key to ensuring responsible use of this technology.
- Scientists need to closely monitor the development of LLMs and adapt their policies and practices in consultation with experts in AI ethics and security, to ensure that the use of LLMs does not undermine the rigor and replicability of scientific research.
- It is important to build an education ecosystem for scientists that supports the development of the competences they will need to face the changes ahead.


