
Data and Databases: Data Ethics

Introduction

When producing and processing humanities data, we very often deal with information about people. Unlike in, for example, physics or chemistry, the primary subjects of our research may be thinking, feeling beings, and that means we face significant ethical and moral considerations in how we go about our data projects. Data is a powerful thing – indeed, this is a large part of how and why modern data collection and structuring methods were developed in the eighteenth and nineteenth centuries, as states sought to operate and control resources more efficiently by better documenting their populations. As researchers, we have a responsibility to ensure that the information we hold about people is used and presented in a way that does not harm the people we are researching.

Some ethical considerations are defined and laid down in law; others are a matter for our own judgement. Both are important in our research practices, and both should be written into research plans when we develop database projects.

Learning outcomes

After completing this resource, learners should be able to:

  • Know when and how to anonymise personal data
  • Understand the relationship between copyrights and textual data
  • Discuss the ethical implications of categorisation and data structures
  • Outline the purpose of data processing ethics

Anonymisation and personal data

Data you hold on individual living people may be subject to a range of legal requirements. If you are holding data about people, it is important to familiarise yourself with the specific data protection rules of the country or countries in which you are operating and conducting your research. There may be particular steps that the law obliges you to take: for example, only holding data for certain amounts of time, obtaining particular consent for certain uses of the data, or observing restrictions on how you can use the data you have gathered. Data subjects – that is, people included in your dataset – may also have particular rights, for example to ask you for a copy of the data you hold about them or to ask to be removed from your data.

Whilst you should always check the legal situation where you are collecting personal data on living persons, as a general rule it is always safest to ensure:

  • That you have written consent for collecting and using the data
  • That you have made clear how long you intend to keep the data
  • That you have made clear what you intend to use the data for

If you do want to share and re-use data of this kind, you may need to use anonymisation (where no data are kept that allow the original person to be traced) or pseudonymisation (where data that allow the original person to be traced are kept separately, outside the main dataset). It’s important to recognise that this doesn’t just mean removing people’s names – other data like addresses or dates of birth can also be identifying data that need to be removed when anonymising a dataset.
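
As a minimal sketch of what pseudonymisation can look like in practice – the field names and records here are invented for illustration – direct identifiers are replaced with random codes in the working dataset, and the key table linking codes back to people is stored separately and securely:

```python
import csv
import uuid

# Invented example records: direct identifiers (name, full date of birth)
# alongside the research data we actually care about.
records = [
    {"name": "Ana Silva", "date_of_birth": "1991-04-02", "response": "agree"},
    {"name": "Tomas Novak", "date_of_birth": "1985-11-17", "response": "disagree"},
]

key_table = {}      # name -> pseudonym; store this separately and securely
pseudonymised = []  # the working dataset, free of direct identifiers

for record in records:
    if record["name"] not in key_table:
        key_table[record["name"]] = uuid.uuid4().hex[:8]
    pseudonymised.append({
        "id": key_table[record["name"]],
        # A full date of birth is itself identifying, so keep only the year
        # (or drop it entirely if the analysis does not need it).
        "birth_year": record["date_of_birth"][:4],
        "response": record["response"],
    })

# Only the pseudonymised table is written out for analysis or sharing;
# the key table stays outside the main dataset.
with open("dataset.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "birth_year", "response"])
    writer.writeheader()
    writer.writerows(pseudonymised)
```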

The most important concern regarding personal data of living or recently living people, even in cases where the people are theoretically anonymised, is the possibility of re-identifying individuals from that data. Datasets covering sensitive areas such as people’s medical, financial or other official records will often be unsuitable for open publication for this reason: even if people’s names are not attached to the dataset, large-scale data mining and cross-referencing against other data sources could allow the re-identification of individuals and leave them at risk. It is also important, as we noted above, that individuals providing data, whether through survey responses or through official documents about them, be given a proper understanding of how the data will be used.
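
To make the cross-referencing risk concrete, here is a toy illustration with invented data: neither dataset links names to sensitive answers on its own, but joining the two on apparently harmless shared fields (birth year and postcode) is enough to re-identify a respondent:

```python
# Two toy datasets with invented data. The "anonymised" survey has no names,
# and the public register has no survey answers, but the combination of
# birth year and postcode appears in both.
anonymised_survey = [
    {"birth_year": "1991", "postcode": "EH8 9YL", "answer": "condition X"},
]
public_register = [
    {"name": "Ana Silva", "birth_year": "1991", "postcode": "EH8 9YL"},
    {"name": "Tomas Novak", "birth_year": "1985", "postcode": "G12 8QQ"},
]

# Cross-referencing the two on the shared quasi-identifiers is enough to
# tie the sensitive answer back to a named individual.
for row in anonymised_survey:
    matches = [
        person["name"]
        for person in public_register
        if (person["birth_year"], person["postcode"])
        == (row["birth_year"], row["postcode"])
    ]
    if len(matches) == 1:
        print(f"Re-identified: {matches[0]} -> {row['answer']}")
```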

There are three primary pillars to this: you need consent from the person to use their information, you need to ensure that your methods respect their privacy as an individual, and you need to ensure that any possibility of further or secondary use is clear both to them and in your methods. When using a dataset you did not construct, keep this last point in mind: you should avoid using datasets that contain personal information where it is not explicitly clear that secondary use was intended.

In general, the use of personal data is the most clearly and strictly legally defined part of humanities and social science data ethics. It does not apply to all humanities data collection, but gathering survey or experimental data about people is common, especially in the social sciences, and as a result this is a crucial area for scholars to be aware of. It is not, however, the only ethical consideration we should have when collecting and storing data – the frameworks and models we use to store data are not necessarily ethically neutral either, and we need to consider these issues too when building our database projects.

Copyrights and textual data

Another legal area that you may need to be aware of when creating databases is the field of intellectual property rights, which is important when handling textual and image source material in your datasets. In particular, this can matter when analysing image or especially text datasets, where you may need to keep significant quantities of copyrighted material. As academics we have, in these cases, an ethical and legal responsibility to respect the rights of a creative work’s creator or creators.

Copyright refers to the rights given to the creator of a work: in most countries these usually expire 50 to 70 years after the death of the work’s creator, depending on the jurisdiction. They include both economic rights to reproduce and distribute the work, and moral rights such as the right to claim authorship and the right to object to distortions or modifications of the work. If you are putting texts, or text elements, into your database, it is useful to know whether the work is copyrighted and, if so, what restrictions this might place on how you publish and share your work.

Most countries have some form of fair use policy that allows parts of copyrighted works to be utilised for and quoted within academic research, but there are limits on this. In particular, you should consider the following:

  • Does my dataset contain complete texts of copyrighted works? If so, you are unlikely to be able to share the dataset, as this would be equivalent to distributing the work – a right that is reserved to its author or to whoever they have subsequently sold or given that right to.
  • Could complete texts of copyrighted works be reconstructed from my dataset? Even if you haven’t stored a text in a single, easy-to-read format, if the text could be reconstructed from your work it might qualify as an adaptation of the text and still fall under copyright (see the sketch after this list).
  • Is the work I am looking at under a public licence? Not all recent texts and materials are copyrighted, and many exist under Creative Commons or similar licences that allow others much more freedom to copy and redistribute the work. Note, however, that unless a text explicitly says otherwise, or its copyright term has expired, the default is that it is under copyright: there is rarely if ever a legal requirement for copyright to be registered; it exists by default.
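
As a minimal sketch of the reconstruction point above – the text and structures are invented for illustration – a dataset that stores each token with its position is effectively the text in another shape, whereas unordered word counts are not:

```python
from collections import Counter

# An invented one-line "text" standing in for a copyrighted work.
text = "the quick brown fox jumps over the lazy dog"
tokens = text.split()

# A dataset keeping each token with its position...
positional = list(enumerate(tokens))    # [(0, 'the'), (1, 'quick'), ...]
# ...versus one keeping only unordered word counts.
bag_of_words = Counter(tokens)          # {'the': 2, 'quick': 1, ...}

# The positional table can be turned straight back into the original work,
# so sharing it is effectively redistributing the text:
reconstructed = " ".join(word for _, word in sorted(positional))
assert reconstructed == text

# The bag of words cannot be: word order, and hence the text, is gone.
print(bag_of_words)
```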

The most likely impact if you are working with copyrighted material would be that you may need to avoid publicly distributing or sharing your database. Constructing a dataset or digital copy of a book or image set you have access to and analysing it yourself is generally not a problem: what you should be careful of is anything that might involve re-distributing the work or a modified version thereof.

If you are studying historical materials, the original works being studied are often out of copyright, but particular critical editions of texts or images of artefacts may not be – do not assume that simply because your work is on premodern topics, there are no intellectual property issues to consider.

Categorisation & structure ethics

How we categorise information is a question not usually subject to legal restrictions or definitions. It is nonetheless a very important question from an ethical perspective. The structure of a humanities database, and the categorisations used for its data, form an ontological (that is, information-structure) argument about the data being modelled. We create information structures because they are informative ways of helping us understand the world: but in doing so we reify (that is, make more real) the categories. If those categories leave out or misrepresent certain groups of people or types of relationship between entities, there can sometimes be ethical implications for our work.

We have seen parts of this issue before in this course when discussing scope and ensuring that the data you have represent the population or category you wish to study. This can have profound ethical implications if the data are not properly representative because of biases that reflect societal inequalities or cultural differences. For example, in an opinion survey, it is not only bad research practice but also frequently unethical to claim that a survey represents the views of the population as a whole if a significant part of the population – for example, a particular age category, people in certain jobs, or particular ethnic, religious or gender groups – was excluded or badly under-represented in your data. We have a moral obligation to the subjects of our research to represent them fairly, and that includes anyone whose views or data we claim to be representing in our work. We also have an obligation to the readers of our data to provide accurate information, especially as in many areas of academic research our work can be used to justify or discuss business, legal, or public policy actions that can in turn affect people’s lives.

As well as the issue of ensuring appropriate scope, the information structures and categories we use may be considered from an ethical perspective. For example, by positing a certain entity relationship in our data, we suggest it as a meaningful – and sometimes, more dangerously, as a complete – way of modelling the situation under discussion. This can have an ethical impact on the resulting model. For example, in a public data survey we might choose to treat people and their home location as entities with a one-to-one relationship: each person is assigned exactly one home place in our database. This may be useful to us in keeping our data simple – but we should consider whether this simplification might have an ethical impact in places where it does not completely capture reality, for example for university students who may live in one place for half of the year and travel to live nearer their university during term-time, or for seasonal workers who move and work according to agricultural patterns. This may not create an ethics problem: if we are simply surveying preferences for a business survey and want to check general trends in location, it is unlikely to be an issue. On the other hand, if we are researching the optimal allocation of resources for government services, this could be an ethical problem: by under-representing people who are mobile in some of the areas where they live for much of the year, we could end up suggesting that those areas have less need of services than is actually the case.
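
A minimal sketch of this modelling choice, with invented names and fields: the first structure forces exactly one home place per person, while the second allows the term-time and seasonal residence patterns described above to be recorded:

```python
from dataclasses import dataclass, field

# Invented, minimal structures for the two designs discussed above.

@dataclass
class PersonOneToOne:
    name: str
    home_place: str  # the one-to-one design: exactly one home per person

@dataclass
class Residence:
    place: str
    months: list[str]  # when this place is actually lived in

@dataclass
class PersonOneToMany:
    name: str
    residences: list[Residence] = field(default_factory=list)

# The one-to-many design can record a student's term-time address as well
# as their family home, which the one-to-one design would have to discard:
student = PersonOneToMany(
    "R. Okafor",
    [
        Residence("Galway", ["Jun", "Jul", "Aug", "Sep"]),
        Residence("Dublin", ["Oct", "Nov", "Dec", "Feb", "Mar", "Apr", "May"]),
    ],
)
```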

Categorisation ethics can be important even if the subjects of categorisation are deceased or indeed fictional. For example, suppose we have a dataset of information on historical Chinese generals which includes the ethnic identities of various figures. We should think carefully about how such information is encoded, documented, and presented. Accepting a categorisation from the sources without question and utilising it in our analyses encodes that worldview into our data, and thereby encodes any ethical problems with it. It may be, for example, that our body of source material tends to use particular non-standard classifications for some persons, or uses some ethnonyms as pejorative terms. There may be reasons for our data to reflect these conceptions, but where this is the case we need to be clear that we are doing so. Modern people and states often look to past literary and historical texts and figures to understand their place in the world today, and modern disputes over land ownership or cultural change often use appeals to history to justify certain positions: it is helpful to be aware of the modern political contexts of historical work in the humanities, to ensure that the way we present our findings and data cannot easily be misused to support social or political positions that we would not wish them to be used for.

Ethical issues with how our information is handled in this way are linked to some practical data issues, but are not quite the same as them. Data can be sufficiently accurate for a project’s broad academic aims but still be unethically framed in how it is presented in the dataset or in result write-ups. Ethical issues may lie at a tangent to the core of the study: one might correctly observe a pattern in properly collected data whilst still using outdated or harmful terms to frame some of that data, failing to take into account social and political sensitivities over the people and places involved. Whilst the problems of categorisation ethics are therefore closely linked to research design and rigorous analysis, they need separate consideration.

Data processing ethics

The ethics of data processing and analysis are closely linked to those of categorisation and structure.

A common catchphrase when discussing data analysis is “garbage in, garbage out” – that is, your analyses are only as good as the data you put into them, so whether the data are accurate and correctly scoped for what you are looking for absolutely matters. Analyses can also easily replicate and exacerbate any biases and problems in their input data or training data.

This last point is something that should be a particularly significant concern for humanities scholars.

People often have a stronger implicit belief in academic work that appears to conform to ‘scientific’ expectations, such as having a quantitative element. As analytical database work frequently makes the relationship between a humanities scholar’s argument and their original source material less immediately clear to a reader, it becomes all the more important to maintain a critical approach and identify potential methodological or data biases.

For example, imagine that an American company develops a piece of facial recognition software, and a researcher in Ethiopia decides to use the product to analyse a large body of historical images of Ethiopian church art, to identify which paintings in her dataset contain people and to see whether the number of people portrayed in paintings from different periods changes significantly. A problem may quickly arise if the American company trained its facial recognition software predominantly on white people’s faces: the software will then be very likely to fail to recognise many of the characters in the artwork, or to disproportionately recognise characters depicted with paler facial tones. Whether or not the original company intended this, its software has a bias in its training data and will go on to replicate that bias.

These sorts of problems can be even worse when such technology is used in areas like crime detection, where in America there are disproportionately higher arrest and incarceration rates for black citizens (even when other factors are taken into account: that is, a black person is statistically more likely to be arrested, to go to prison, and to receive a longer sentence for the same crime or actions than a white counterpart). This means that a training dataset based on current prison populations or crime statistics will tend to over-identify crime in black people and neighbourhoods: predictive analysis trained on the outputs of a biased system will simply inherit its biases.

Even outside the world of artificial intelligence and advanced predictive technology, simpler analyses can equally cause problems if not thought through properly. For example, in textual analyses, a word frequency analysis across a group of publications may need issues of representation to be taken into account. Imagine a situation where an academic wanted to see how prevalent a particular topic was across a group of written news outlets. If some of the news outlets cater to a particular ethnic group or region, they might use a dialect or phrasing not used in the more standard version of the language. If not accounted for, this could lead to an analyst failing to pick up their coverage of the topic, or even wrongly concluding that outlets in that region were uninterested in it (see the sketch below). Taking into account the diversity of people and situations from which your data are gathered is vital to analysing them not only properly, but also ethically.
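
A toy sketch of that dialect problem, with invented outlet names and an invented regional spelling: counting only the standard keyword makes the regional outlet look silent on the topic, whereas counting a list of known variants treats the outlets comparably:

```python
import re

# Invented outlets and an invented dialect spelling ("hoosing") standing in
# for regional variation in how a topic is discussed.
outlets = {
    "standard_outlet": "the housing crisis worsens as housing costs rise",
    "regional_outlet": "the hoosing shortage deepens as rents keep rising",
}

def count_topic(text: str, variants: set[str]) -> int:
    """Count how many tokens in the text belong to the topic's word list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(1 for token in tokens if token in variants)

# Counting only the standard keyword undercounts the regional outlet...
naive = {name: count_topic(text, {"housing"}) for name, text in outlets.items()}
# ...whereas counting known variants treats both outlets comparably.
aware = {name: count_topic(text, {"housing", "hoosing"}) for name, text in outlets.items()}

print(naive)  # {'standard_outlet': 2, 'regional_outlet': 0}
print(aware)  # {'standard_outlet': 2, 'regional_outlet': 1}
```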

As with the previous section, these problems are partly issues of doing rigorous academic work: badly trained AI or badly chosen analysis methods will produce poorly framed or inaccurate analytical results. They are nonetheless specifically ethical concerns as well, and even if issues of this sort do not invalidate a study’s overall findings, they could cause harm. It is therefore worth specifically reviewing your analysis plans against ethical guidelines and considering their impact on a range of groups and identities, separately from your considerations of the overall data-gathering process and how it links to your research question.

Conclusion

This discussion has only been a very brief introduction to some of the ethical concerns that need to be taken into consideration when preparing, categorising, storing and analysing data: it should be a starting point for some of the questions you might need to think about, but the applications will vary greatly depending on which data and data types you are storing and analysing. When planning a project, you should go through some of the issues raised here and prepare some notes on any possible data ethics issues in your research design.

Some of these ethical concerns, as we have seen, are encoded in law, especially when it comes to areas that can directly be shown to affect the rights or privacy of individuals. Understanding the legal situation around data protection if you are handling personal data, or around copyright if you are handling text or images as your source material, is very important to ensure your work is on a sound ethical footing.

Ethics, however, is not just a question of legality. As scholars in the social sciences and humanities we produce research about very real people and very important social constructions and concepts: as a result we have a responsibility to those people when conducting our work. The ways that we categorise, analyse, and present our data represent ideas about how it is possible to see the world and conceptualise things within it. We should ensure that when developing those ideas we take both a critical approach to our source material and a careful approach to representativeness within the data. We can, thereby, not only get better academic results, but also help ensure that the impact of our work has been considered from a range of perspectives. This diversity of perspectives, in turn, is a key part of spotting and avoiding ethical issues with how we use databases to represent the societies and human worlds we study.

Cite as

Emily Genatowski and James Baille (2024). Data and Databases: Data Ethics. Version 1.0.0. DARIAH-Campus. [Training module]. http://localhost:3000/id/DEarSzP5VJVT5Zc1iqY_A

Reuse conditions

Resources hosted on DARIAH-Campus are subject to the DARIAH-Campus Training Materials Reuse Charter

Full metadata

Title:
Data and Databases: Data Ethics
Authors:
Emily Genatowski, James Baille
Domain:
Social Sciences and Humanities
Language:
en
Published:
6/30/2024
Content type:
Training module
Licence:
CC BY 4.0
Sources:
DARIAH
Topics:
Data management
Version:
1.0.0