Summary of the study: Adapting Open Science

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.


| Background and objectives of the study
The Adapting Open Science study was carried out as part of the "Successfully appropriating Open Science" project led by the Committee for Open Science. It was carried out by a multi-disciplinary and professional working group of the "Research Data" college. This project ran from May 2020 to December 2021 and was composed of three work streams:
• The design and organization of Open Science Legal Workshops (OSLA)
• Participation in the Electronic Lab Notebook Working Group (ELN WG)
• The Adapting Open Science study, which is the subject of the summary below.

The "Adapting Open Science" study
The "Adapting Open Science" study began with a field survey of research professionals in various disciplines to:
• better understand the practices associated with data and their evolution with Open Science,
• understand the factors that differentiate these practices (discipline, research approach, etc.),
• provide support adapted to the needs of different research communities.
The study aimed to answer two questions:
1. What factors should be taken into consideration to better understand the diversity of practices associated with data in research?
2. How can we support the evolution of data practices in relation to the incentives/obligations brought about by Open Science policies?

| Methodology
The study was based on mixed methods, with two initial qualitative phases including interviews (exploratory, observation of practices), focus groups, and a study day dedicated to the social sciences and humanities. Following this, a quantitative phase was carried out based on the design, dissemination and analysis of a questionnaire sent to research professors and staff in France (more than 400 responses). The study was finalized by combining the results of these phases (see note 5) and a design approach to facilitate the appropriation of the content (cf. Figure 1).
The research work was part of a collaborative approach between the different members of the working group, who came from various disciplines (biology, art history, history, health, Science & Technology Studies, STS) and professional research fields (archives, libraries, research, management and strategy, etc.). An Open Science approach (cf. Figure 2) was also tested in order to make the collected information available (in compliance with the GDPR), to facilitate the progress of the research (sharing of intermediate syntheses) and the reproducibility of the quantitative results (scripts, making data available).
Note 5: The results of the qualitative phases were obtained through grounded theory analysis. Quantitative results were obtained through univariate (flat and cross tabulation) and multivariate (MCA and HCPC) statistical analysis. For more information, see the methodological guide referenced in section 7.2 (in French).
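As an illustration of the flat and cross tabulations mentioned in the methodology, the sketch below counts answers to one question and then crosses two questions. It is a minimal example under stated assumptions, not the study's own scripts: the field names (`field`, `work_mode`), the function names and the four toy answers are hypothetical and do not come from the questionnaire data.

```python
from collections import Counter


def flat_tabulation(responses, question):
    """Percentage of respondents per answer to a single question."""
    counts = Counter(r[question] for r in responses)
    total = sum(counts.values())
    return {answer: 100 * n / total for answer, n in counts.items()}


def cross_tabulation(responses, row_question, col_question):
    """Counts of respondents for each pair of answers to two questions."""
    table = {}
    for r in responses:
        row = table.setdefault(r[row_question], Counter())
        row[r[col_question]] += 1
    return table


# Hypothetical toy answers (not taken from the study's questionnaire).
responses = [
    {"field": "SSH", "work_mode": "solitary"},
    {"field": "SSH", "work_mode": "collaborative"},
    {"field": "SSH", "work_mode": "solitary"},
    {"field": "STM", "work_mode": "collaborative"},
]

flat = flat_tabulation(responses, "field")
cross = cross_tabulation(responses, "field", "work_mode")
print(flat)
print({row: dict(cols) for row, cols in cross.items()})
```

A cross tabulation of this kind is the usual starting point before a multivariate method such as MCA, which works on the same categorical answers.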

Typology of practices and personas
Although disciplines are an important factor in differentiating data-related practices, the study shows that it is important to go beyond a purely disciplinary reading and to distinguish other differentiating factors.
The first exploratory interviews, supported by a review of the literature, led to the definition of "data-related practices" as all of the steps necessary to constitute data (see note 7) and to make it available (ranging from restricted sharing to open data).
In addition to the disciplinary fields (Sciences, Technology and Medicine, STM / Social Sciences and Humanities, SSH), another factor taken into consideration was the individual or collective nature of the research work. Combined with the multivariate analysis of the survey responses (multiple correspondence analysis), these factors made it possible to highlight 4 main types of practices (experimental, collaborative, computational, solitary).
• While the "discipline" axis strongly colors these typologies of practices (for example, the "experimental" profile is associated with people from the Earth and Life Sciences), other profiles such as "computational" bring together individuals from different disciplines (from computer science to linguistics) who share a common culture of data and often a knowledge of free and open source software.
• Furthermore, within the Social Sciences and Humanities, where the representation of a "solitary researcher" may still be dominant, a "collaborative" profile stands out, with individuals implementing collective practices at different stages of their research, or at least wishing to train in them.
• Finally, the "solitary" profile (not restricted to the SSH) includes individuals who conduct their research alone, without necessarily wanting to do so, because of their status or working conditions, for example in the case of PhD students.
Note 7: In STS, several studies looked at the constitution of data/databases and frictionless processes related to it. Data are considered as a construct that is subject to different stages, exchanges, use of tools, and processes, until the production of what is called "data", with the aim in particular of being shared, exchanged and having value as evidence. Other concepts at the heart of this study are, for example, those of data journeys, datafication or the "public of data" (Jaton and Vinck 2016; Gruson-Daniel and De Quatrebarbes 2019; Gitelman 2013; Bowker and Star 2000; Heaton and Millerand 2013).
Different personas were produced to give a better understanding of these profiles (types of practices), based on the results of an MCA (Multiple Correspondence Analysis) (see Figures 4 and 5). The personas are fictional characters, built from the answers to the questionnaire (the most representative of each class) and the results of the qualitative analysis phases. Each persona is presented in the form of a descriptive sheet and gives an overview of concrete situations encountered by a variety of research professionals (assistant professors, research engineers, researchers, etc.).

Navigating practices: approach, tools and status
In order to obtain a finer level of granularity in the analysis of these two general factors (discipline and collaborative/solitary nature), three additional criteria were explored:
1. research approach;
2. learning tools and modalities;
3. status and function in research.
For the "research approach" axis, we include various elements such as the research environment (laboratory, clinic, fieldwork, etc.), data origins (measurement instruments, archives, etc.), the relationship to data (particularly in the terms used to name them), the criteria associated with research quality, the added value of the research work, and the steps involved in the making of data. These criteria, based on the analysis of the questionnaire responses, have made it possible to characterize the various research approaches that influence the relationship with the data. For example, in the context of laboratory work, the added value of research is linked to experimentation, and one speaks more readily of measurements and values to qualify the data. In archival work or field studies, the terms corpus and materials are widely used, and the main added value is associated with the collection of rare data and theoretical work (see table below).
Figure 6: Summary table of different research approaches according to work environment (laboratory, field/archive and clinical).

In the "tools and learning methods" category, the focus is more specifically on the material context (even if digital) of practices through the use of a set of tools. It is a matter of considering the ways in which tools are discovered and learned, the appropriation of the digital work environment, the interest in more or less collaborative work habits, and the needs identified for support. This axis offers a distinction between different communities of practice and learning.
Finally, a last category is that of the status and functions allocated to individuals in research, for example the professional category or the status and seniority within research (doctoral student, civil servant, etc.).

Adapting Open Science according to …
For each axis, the criteria aim to provide a detailed understanding of the various types of relationship with data and their representation by research professionals. These criteria can also influence their apprehensions and/or motivations to give open access to or share data in an Open Science approach and, consequently, must be taken into consideration when deciding which assistance and support solution to adopt.

How can we support the evolution of data-related practices in relation to the incentives and obligations of Open Science public policies?
In order to answer this question, five orientation tracks for support measures have been determined on the basis of the lessons learned from the qualitative phases (interviews, observations of uses, seminar) and the results of the "Data and Open Science" questionnaires (see appendix).
• Orientation track 1: Understand the research approaches in detail;
• Orientation track 2: Understand the different practices of making data available;
• Orientation track 3: Know the learning modalities and the collaborative practices;
• Orientation track 4: Diversify the types of support;
• Orientation track 5: Take status and career issues into consideration.
Regarding incentives for Open Science related to research data, we include, for example, the application of FAIR principles for data (Findable, Accessible, Interoperable, Reusable), the implementation of data management plans (DMP), the encouragement of greater reproducibility of research work, the implementation of support and the deployment of infrastructures for making data available.
For each track, different themes have been distinguished, each associated with recommendations. The 20 recommendations aim to facilitate the evolution of practices associated with data and Open Science incentives while adapting to the various contexts of academic research.
Explanation: Today, the issues of reproducibility are an integral part of the discourse and incentives for Open Science. However, it is necessary to detach ourselves from the term reproducibility in order to address more broadly the question of "quality" in research. Indeed, the notion of "reproducibility" applies more specifically to research involving measurement instruments and the use of computational methods (verification of calculations based on access to source codes and original data). Other terms are more inclusive and address the issue of research quality more broadly across different research communities. For example, the principle of "transparency" is to be preferred in multidisciplinary research contexts. The concept of "explicability" is used in the context of SSH work that requires the constitution of corpora or the construction of databases. On the other hand, the notion of "replicability" can be used preferentially in the framework of experimental research when it is a question of reproducing an experiment. This implies considering access to methodological protocols (not exclusively to data and source codes). Several comments also pointed out the importance of associating the ethical principles and values of research (integrity, honesty, etc.) and its impacts (social, economic, technical, etc.) with the reflections on the question of quality in research.

Pay attention to the different forms of added value derived from research work
Key takeaway 4: Facilitating the availability of data implies taking into account, in a differentiated way, the investment of work required at different stages of the research, the added value created according to the research approach and the repercussions in terms of evaluation and career.
Explanation: When conducting research, different steps are necessary to obtain results that can be shared with the peer community. These steps generate a more or less important added value according to the time devoted to their realization or to the degree of recognition attributed to this work by the community. Different types of added value have been distinguished and then correlated to criteria related to the research process. For example:
• the collection of rare data or data requiring a significant amount of time is mainly associated with fieldwork or with archives and documentary collections in the Social Sciences and Humanities;
• the preparation of samples and the definition of experimental protocols are activities associated with laboratory research work;
• a clinical research framework is more strongly correlated with added value derived from the automation of workflow processes and modeling on a large quantity of data.
Paying attention to these different research approaches, as well as the forms of added value generated according to the contexts, is important in order to identify blockages in the provision of data.Some research approaches (technique improvement, automation, modeling) may encourage the provision of data, while other approaches may discourage it (rare data collection, time-consuming sample preparation).
experiment replicability, sharing the protocol is essential, as is making the source codes available for reproducing the analysis of specific data.
The term "making available" is used in the study to distinguish different practices including:
• sharing restricted to a targeted and known public (via email for example);
• putting the data online on a website or repository, with or without access control;
• opening the data on a repository with an open license (open data).

Distinguish between different limits to availability and levers for improvement
Key takeaway 8: Differentiate the reasons limiting the availability of data (too much time required, lack of habit, competitive advantage not to share) to provide appropriate responses.
Key takeaway 9: Encourage journal editorial boards to build on existing national policies regarding data and source codes associated with scientific publications.

Explanation:
The main reasons limiting the availability of data are lack of familiarity with these practices, the excessive time needed to make data available, and a desire to retain data (and information) in order to maintain a competitive advantage. Secondary reasons include questions about the risks of additional bureaucracy generated by making data available, as well as legal and ethical issues surrounding access to personal data. There is little awareness of the obligations to make data available, and these obligations mostly come from journal editorial boards or ethics committees. Making committees aware of the issues involved in making data available is a key element for taking these practices into account in the evaluation and recognition of research work, as their role in this process is important.

Highlighting data conservation and security issues
Key takeaway 10: Raise awareness of the distinction between data storage and archiving, which involve different services and different infrastructures, as well as of the need for a possible selection of data in order to differentiate data to be kept from data to be destroyed.
Key takeaway 11: Prioritize and/or highlight the security features and reliability elements offered by the research infrastructures made available for data storage.
Explanation: Data storage is mostly done on external media and professional computers. Nevertheless, in the Social Sciences and Humanities, the use of personal computers is frequent, especially for doctoral students, which does not facilitate the follow-up of data, their security or their reuse at the end of a project. The communities are particularly vigilant about data security (encrypted data, risk of hacking, etc.) and question the reliability of institutional infrastructures. Cloud solutions such as Google Drive or Dropbox are mostly used for file sharing. Moreover, at present, the difference between storage and archiving remains blurred for the research communities. Archiving services are rarely used, because storing data seems to research professionals to be sufficient to preserve their data.

Discovery and training in tools: an exchange between peers
Key takeaway 12: To facilitate the appropriation of new practices, take into consideration the specificities of community meetings and learning (laboratory life, study days and conferences, social networks, etc.).

Explanation:
In addition to discovering tools on one's own, the role of other people within research teams (team members or other teams) is essential to build up one's digital work environment. Habits are often formed as early as the first research internships during a master's degree, with training within the teams (internship supervisor, "laboratory" life for work on the "bench", etc.). In the Social Sciences and Humanities, seminars and informal times play an important role in discovering new tools and sharing practices. Social networks also represent spaces for exchanging and discovering practices, which are considered useful especially when different communities meet.

Seminar: « from the field to the 'making of data' in SSH »
As part of the survey (phase 2), a study day was dedicated to the study of "data making" practices in SSH and allowed three key issues to emerge:
• Common issues in "data making" practices,
• Reconfiguration of research groups,
• Environment and recognition of "data making" work.

Paying attention to interfaces
Key takeaway 13: Pay particular attention to data processing and analysis interfaces so that they do not become "black boxes" and "dead ends" (lack of interoperability, proprietary formats, etc.).
Key takeaway 14: Be vigilant about the new turnkey solutions that are being developed for data analysis and manipulation.
additional workload that would be generated by adding a new "data referent" function to the people already on the job, particularly research or study engineers.

Be vigilant about mediation issues within research teams
Key takeaway 19: Pay attention to the necessary translation and mediation issues that arise when managing and making data available within research communities. This involves finding "common denominators" among tools and documentation that are being used, as well as data and protocol standardization processes.
Explanation: For many, adapting to new data processing, analysis, and sharing practices is accompanied by new and/or complementary work processes and environments to be appropriated. This also reconfigures the working methods between different team members (IT departments, engineers, researchers, etc.) with a set of possible frictions.
The constitution of databases between different disciplinary or professional profiles, as well as their availability in data repositories (sharing or opening), crystallizes tensions (constitution of vocabularies, reduction of the complexity of a study, recognition of the people who participated in the creation of the database, etc.). Nevertheless, these new objects are also a way to build new practices adapted to the skills of each person.
Building the necessary dialogue and understanding between different people and skillsets (translation of specific vocabulary, encouraging exchanges through mediation processes, etc.) requires time and sometimes financial, material or organizational support.

Considering status and career issues
Key takeaway 20: Give greater consideration in the career development and evaluation of research professionals to the work of "data development" and making data available.

Explanation:
The work of "constituting data" and making data available often requires time, for example, collecting sparse data, formatting/cleaning data, adding documentation, adding metadata, and posting to repositories. It is important to recognize the time spent on these activities in the evolution of careers, especially in the case of people with a status and function that can lead to solitary work, a context in which these tasks are even more invisible. Indeed, while some researchers prefer to work alone and not to change their practices, by choice or by political positioning, others have a solitary and "non-sharing" approach imposed on them. This is the case, for example, for doctoral students who are interested in Open Science topics, but for whom, as for their supervisors, data sharing activities are not a priority. For post-doctoral researchers, in the same way, the search for a position often takes precedence over developing these practices, even if this may lead some to develop a visibility and networking strategy around these practices.

| Limitations
A first analysis of the results of the questionnaire showed an over-representation of research communities in the Social Sciences and Humanities (SSH). Following this, the results were weighted according to the current distribution of researchers in different disciplinary categories. The way the questionnaire was distributed certainly explains this over-representation. The questionnaire was shared on discussion lists and social networks followed by members of the Adapting Open Science working group. Several lists were associated with the Social Sciences and Humanities (history, sociology, economics, etc.) and the announcement circulated more widely in these communities. In view of the results, the questionnaire would benefit from being shared more widely within institutions in order to refine the results concerning disciplines that are currently under-represented and to confirm the relevance of the factors differentiating the practices highlighted.
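To make the weighting step concrete, the sketch below shows one common way to correct such an over-representation: post-stratification, where each respondent receives a weight equal to the population share of their category divided by its share in the sample. This is an illustrative assumption, since the summary does not state which weighting method the study used. The function names, the two categories and the 40%/60% population shares are hypothetical.

```python
from collections import Counter


def poststratification_weights(sample_groups, population_shares):
    """One weight per respondent: population share / sample share of their group."""
    counts = Counter(sample_groups)
    n = len(sample_groups)
    return [population_shares[g] / (counts[g] / n) for g in sample_groups]


def weighted_share(values, weights, target):
    """Weighted proportion of respondents whose value equals target."""
    total = sum(weights)
    return sum(w for v, w in zip(values, weights) if v == target) / total


# Hypothetical toy sample: SSH is over-represented (6 of 10 respondents)
# relative to an assumed population in which SSH accounts for 40%.
groups = ["SSH"] * 6 + ["STM"] * 4
population_shares = {"SSH": 0.4, "STM": 0.6}

weights = poststratification_weights(groups, population_shares)
ssh_share_after = weighted_share(groups, weights, "SSH")
print(round(ssh_share_after, 3))
```

With these weights, the weighted share of each category matches the assumed population share, so percentages computed on any other question are no longer dominated by the over-represented group. Weighting cannot, however, compensate for the small number of answers in under-represented disciplines, which is why wider distribution of the questionnaire remains useful.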

| Conclusion
This study aimed to examine the current practices associated with data in various research communities and to best support their evolution in a digital and public policy context favorable to Open Science. The objective was to present, at a fine level of granularity, elements explaining the diversity of research practices within what is called "Science", in order to better tailor and adapt Open Science measures to epistemic communities or practices. More than a simple disciplinary view, the typology of practices highlighted, and its illustration by personae (typical profiles), shows the importance of considering the solitary or collaborative nature of the work, which is part of diverse social, methodological and technical fabrics.
A better appropriation of new practices associated with data by the communities requires an in-depth understanding of different research approaches, as well as a look at the tools and devices used and the ways in which they are learned and discovered. Through the differentiating factors defined and the orientations and recommendations proposed, this study wishes to help those involved in Open Science policies and projects to engage in better dialogue with the research professionals they are called upon to support, as well as to diversify the types of assistance offered.
For the people concerned by these practices and subject to their evolution, the study wishes to encourage stepping back and reflexivity. It is a question of better understanding "our practices" and/or having a framework for explaining the practices of other colleagues. Far from wanting to decide or judge the quality of the norms to be applied within research teams or collectives, this study is rather about giving leads to adapt the modalities of interaction between research professionals, to understand the reasons for frictions or blockages regarding Open Science measures and their incentives, and to make available elements of argumentation and debate so that these changes in practices are an enlightened and desired act.
• The obligations to make information available are not well known and concern mainly obligations from an editorial or ethical committee (for example, in biomedical research).

Data Reuse and Limitations on Availability
• Nearly 50% of respondents report that they often or sometimes reuse data that has already been produced or published.
• More than 45% of respondents consider that their data would be potentially reusable.
• The main reasons limiting data availability are (see chart below):
  • lack of familiarity with these practices (63%);
  • too much time required (49%);
  • a desire to retain data to maintain a competitive advantage (48%).

Data storage
• The majority of data storage is done on external media (59%) and professional computers (57.5%).
• There is little use of archive services (7.5%) (see graph below).

Tools used associated with the data
• Majority use of spreadsheet software (Excel, Calc) (74.5%).
• More than 40% use solutions based on programming languages (R, Python).
• QGIS is one of the most frequently cited data analysis and visualization software packages (24%).
• The integrated database software/platforms (18%) most frequently cited are FileMaker, PostgreSQL and MySQL.
• Data repository platforms were used by only 12% of respondents.
• The most widely used operating system is Windows (62%), versus 26% for MacOS and 12% for Linux and other Unix systems.

Collaborative practices
• Shared note-taking tools are used by 40% of respondents.

Figure 1: Summary of the different methodological steps of the study "Adapting Open Science"

Figure 3: Presentation of the 4 personae according to types of practices associated with the data from the analysis of the questionnaire (MCA then HCPC).

Figure 7: Summary of factors differentiating data-related practices (research approaches, practices/tools/learning, status and function in research) and their characteristics.