Measuring Risk and Utility of Anonymized Data Using Information Theory
Department of Computer Engineering and Mathematics
ABSTRACT

Before releasing anonymized microdata (individual data), it is essential to evaluate whether: i) their utility is high enough for their release to make sense; ii) the risk that the anonymized data result in disclosure of respondent identity or respondent attribute values is low enough. Utility and disclosure risk measures are used for the above evaluation; however, they normally lack a common theoretical framework allowing to trade off utility and risk in a consistent way. We explore in this paper the use of information-theoretic measures based on the notion of mutual information, in view of providing such a unified framework for information loss and disclosure risk.

Categories and Subject Descriptors

H.2.8 [Database Management]: Database Applications—statistical databases; H.1.1 [Models and Principles]: Systems and Information Theory; K.4.1 [Computers and Society]
1. INTRODUCTION

When a set of microdata (individual data) containing the values of some attributes for individuals or companies is to be released by a statistical office for research or general use, anonymization is more often than not a legal requirement [6]. Anonymization means processing the original microdata so as to obtain masked microdata that can be released in such a way that the respondent subjects to whom the microdata records refer cannot be re-identified (identity disclosure) and particular attribute values cannot be associated or disassociated with a particular respondent subject (attribute disclosure).

The attributes in a microdata set can be classified as:

• Identifiers. These are attributes that unambiguously identify the respondent. Examples are the passport number, social security number, name-surname, etc.

• Key attributes. These are attributes which identify the respondent with some degree of ambiguity. (Nonetheless, a combination of key attributes may provide unambiguous identification.) Examples are address, gender, age, etc.

• Confidential attributes. These are attributes which contain sensitive information on the respondent. Examples are salary, religion, political affiliation, health condition, etc.

In order to reduce the disclosure risk, anonymization distorts the data in some way (e.g. by perturbing them or reducing their detail); as distortion increases, disclosure risk can be expected to decrease. However, distortion should not be so severe that the anonymized data become useless, i.e., that all information contained in the data is lost. The problem of optimizing the trade-off between disclosure risk and information loss is known as statistical disclosure control (SDC).
The original contribution of this paper, developed in the following sections, is to explore the use of information-theoretic measures based on the notion of mutual information, in view of providing a unified framework embracing information loss and disclosure risk. If mutual information can be used to express information loss and/or disclosure risk, then the machinery of information theory can be used to optimize the trade-off between them.

Section 2 motivates the use of information-theoretic measures for information loss. Section 3 proposes information-theoretic measures of information loss based on mutual information. Section 4 proposes information-theoretic measures for disclosure risk. Based on the previous measures, Section 5 presents models to trade off information loss against disclosure risk. Section 6 lists conclusions and issues for future research.
2. MOTIVATION

Information loss measures in the SDC of microdata are usually based on the relative discrepancy between some statistics or models computed on the original data and on the masked data [4]. A critique of those measures is that, for continuous attributes, relative discrepancies are unbounded (e.g. a statistic whose value is 0 on the original data and 0.1 on the masked data has an infinite relative discrepancy (0.1 − 0)/0) and difficult to combine with disclosure risk; the latter is most often measured as a probability of re-identification and is thus bounded between 0 and 1 [14]. Unbounded loss measures based on relative discrepancies are very easy to understand, but rather difficult to trade off against bounded risk measures.

Probabilistic information loss measures yielding a figure within [0, 1] which can be readily compared to disclosure risk have been proposed [9]. Let θ be a population parameter (on the original data) and let θ̂ be the corresponding sample statistic (on the masked data). If the number n of records of the original data is large (> 100), then the standardized discrepancy

Z = (θ̂ − θ) / (Var(θ̂))^(1/2)

can be assumed to follow a N(0, 1) distribution. A probabilistic information loss measure pil(θ) for parameter θ is the probability that the absolute value of the discrepancy Z is less than or equal to the actual standardized discrepancy in the masked data:

pil(θ) = 2 · P(0 ≤ Z ≤ (θ̂ − θ)/(Var(θ̂))^(1/2)).
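As an illustration, the following minimal Python sketch (ours, not from [9]; the function and variable names are invented) computes pil from the original statistic, the masked statistic and an estimate of Var(θ̂):

    from scipy.stats import norm

    def pil(theta, theta_hat, var_theta_hat):
        # Standardized discrepancy between the masked and the original statistic
        z = abs(theta_hat - theta) / var_theta_hat ** 0.5
        # 2 * P(0 <= Z <= z) for Z ~ N(0, 1): a loss figure within [0, 1]
        return 2.0 * (norm.cdf(z) - 0.5)

    # Example: pil(theta=50.0, theta_hat=50.8, var_theta_hat=0.25) is about 0.89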
Probabilistic information loss measures have two advantages:

• They can be applied to the same usual statistics θ (means, variances, covariances, etc.) as measures based on relative discrepancies.

• They are bounded within [0, 1], so they easily compare to disclosure risk.

However, getting information loss and disclosure risk measures bounded in the same interval [0, 1] does not imply that their semantics are entirely comparable. Indeed, probabilistic information loss measures are actually distortions mapped to [0, 1], whereas disclosure risk is normally computed as a re-identification probability. If information loss and disclosure risk could be expressed with semantically similar measures, boundedness would be irrelevant: both measures could be meaningfully traded off against each other. Both relative-discrepancy and probabilistic loss measures lack an underlying theory allowing to optimize their trade-off against disclosure risk.
to remove all identifiers from them. Let X, Y be, respec-
tively, the key and confidential attributes in the original mi-crodata set. Let X be the key attributes in the masked
The mutual information I(X; Y ) between two random vari-
microdata set (as in k-anonymization [13], we assume that
ables X and Y measures the mutual dependence of the two
only key attributes are masked). If we focus on the dam-
variables and is measured in bits. Mutual information can
age inflicted to key attributes [11], a possible information
be expressed as a function of Shannon’s entropy:
loss measure is the expected distortion E(d(X, X )) whered(x, x ) is a distortion measure, e.g. d(x, x ) = ||x − x ||2.
I(X; Y ) = H(X) − H(X|Y ) = H(Y ) − H(Y |X)
A probably better option is to focus on how masking affects
the statistical dependence between the key and confidentialattributes. A possible measure for this is I(X; Y )−I(X ; Y ).
where H(X), H(Y ) are marginal entropies, H(X|Y ), H(Y |X)are conditional entropies and H(X, Y ) is the joint entropyof X and Y .
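The two Gaussian formulas can be checked numerically; the sketch below (ours, with invented toy covariances) evaluates both, in bits:

    import numpy as np
    from scipy.linalg import inv, sqrtm

    def mi_scalar(rho):
        # I(U;V) = -(1/2) log(1 - rho^2) for jointly Gaussian scalars
        return -0.5 * np.log2(1.0 - rho ** 2)

    def mi_vector(sigma_u, sigma_v, sigma_uv):
        # I(U;V) = -(1/2) log det(I - P^t P), P = sigma_u^(-1/2) sigma_uv sigma_v^(-1/2)
        p = inv(sqrtm(sigma_u)) @ sigma_uv @ inv(sqrtm(sigma_v))
        return -0.5 * np.log2(np.linalg.det(np.eye(p.shape[1]) - p.T @ p))

    # On 1x1 covariance matrices both formulas agree (about 1.2 bits for rho = 0.9):
    # mi_scalar(0.9); mi_vector(np.eye(1), np.eye(1), np.array([[0.9]]))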
We assume that the original microdata have been pre-processed to remove all identifiers from them. Let X, Y be, respectively, the key and the confidential attributes in the original microdata set. Let X′ be the key attributes in the masked microdata set (as in k-anonymization [13], we assume that only key attributes are masked). If we focus on the damage inflicted on the key attributes [11], a possible information loss measure is the expected distortion E(d(X, X′)), where d(x, x′) is a distortion measure, e.g. d(x, x′) = ||x − x′||².

A probably better option is to focus on how masking affects the statistical dependence between the key and the confidential attributes. A possible measure for this is I(X; Y) − I(X′; Y). The following lemma shows that this difference is indeed nonnegative.

Lemma 1. If X′ is a randomized function of X, but not of Y, it holds that I(X; Y) − I(X′; Y) ≥ 0.

Proof. By the chain rule of mutual information,

I(X′; Y) + I(X; Y|X′) = I((X, X′); Y) = I(X; Y) + I(X′; Y|X).

The hypothesis of the lemma implies that X′ and Y are conditionally independent given X, that is, I(X′; Y|X) = 0. Since I(X; Y|X′) ≥ 0, we have that I(X′; Y) ≤ I(X; Y). ✷
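Lemma 1 can be illustrated empirically. In the sketch below (ours, with an invented toy distribution), a binary key attribute is masked by random flips, which is a randomized function of X alone, and the plug-in estimate of I(X′; Y) stays below that of I(X; Y):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200_000
    y = rng.integers(0, 2, n)           # confidential attribute
    x = y ^ (rng.random(n) < 0.25)      # key attribute, correlated with y
    xp = x ^ (rng.random(n) < 0.20)     # masked key attribute: flips depend on x only

    def mi_binary(a, b):
        # Plug-in estimate of I(A; B) in bits for two binary samples
        joint = np.histogram2d(a, b, bins=2)[0] / len(a)
        pa, pb = joint.sum(1, keepdims=True), joint.sum(0, keepdims=True)
        nz = joint > 0
        return (joint[nz] * np.log2(joint[nz] / (pa @ pb)[nz])).sum()

    print(mi_binary(x, y), mi_binary(xp, y))   # about 0.19 vs. 0.07 bits: loss >= 0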
Let us now compare the mutual information with the more usual information loss measures based on the mean square error (MSE). The MSE E(d(X, X′)) = E(||X − X′||²) seems better adapted than I(X; X′) to measuring how well statistical properties are preserved:

• A zero MSE between X and X′, that is, E(d(X, X′)) = 0, implies that X′ = X, so that every statistical property of the data is preserved.

• However, I(X; Y) − I(X′; Y) = 0 only implies that X′ retains all the information that X conveyed about Y; it does not imply X′ = X.

Nonetheless, MSE and the mutual information are not that different, both belonging to the family of so-called Bregman divergences [10, 2].

The difference I(X; Y) − I(X′; Y) bears some resemblance to the relative discrepancy between correlation matrices proposed as an information loss measure in [4]. However, mutual information measures the general dependence between attributes, while the correlation measures only the linear dependence; thus the former seems superior [8]. It will be shown in the Appendix that, under some assumptions, preserving mutual information preserves the covariance matrix up to a linear transformation.
4. INFORMATION-THEORETIC MEASURES OF DISCLOSURE RISK

The mutual information I(X′; X) between the released and the original key attributes is a measure of identity disclosure (defined in the introduction above). Note that I(X′; X) was previously regarded as a possible information loss measure (Section 3).

The mutual information I(X′; Y) between the released key attributes and the confidential attributes is a measure of attribute disclosure (defined in the introduction above). Measuring risk as I(X′; Y) conforms to the t-closeness privacy property [7], which requires that the distance between the distribution of Y within the records sharing each combination of values of X′ and the distribution of Y in the overall dataset be no more than a threshold t.
5. TRADING OFF INFORMATION LOSS AND DISCLOSURE RISK

Several combinations of the above loss and risk measures can be used when trying to optimize the trade-off of information loss and disclosure risk. Two approaches are conceivable:

• Place an upper-bound constraint on the information loss D and minimize the disclosure risk R.

• Place an upper-bound constraint on the disclosure risk R and minimize the information loss D (the more natural approach in practice).

By combining the above two approaches with the various loss and risk measures sketched above, several optimization models can be stated.

Model 1. In this model, disclosure risk is minimized while keeping information loss below a certain upper bound. Disclosure risk is measured using mutual information and information loss is measured as the expected distortion. Hence:

minimize I(X; X′) over the random perturbations p_X′|X subject to E(d(X, X′)) ≤ D,

for a certain pre-specified maximum tolerable loss D.

Model 1 was related in [11] to the rate-distortion function optimization in information theory (see Figure 1): the risk R was assimilated to the rate and the loss D to the distortion. An optimal random perturbation p_X′|X(x′|x) for the key attributes was obtained. Let d = D/σ²_X be the normalized distortion. For the case of univariate Gaussian, real-valued X and Y, a closed form of the minimum was obtained. The idea is to compute X′ as X′ = (1 − d)X + dZ, where the noise Z is independent of X and distributed according to N(µ_X, ((1 − d)/d) σ²_X), which amounts to the random perturbation

p_X′|X(x′|x) = N((1 − d)x + dµ_X, d(1 − d)σ²_X).

[Figure 1: Risk-loss as Lagrangian rate-distortion optimization. Rate stands for risk in the privacy setting.]
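The closed form above is straightforward to implement and check. The sketch below (ours; variable names invented) masks a univariate Gaussian key attribute through p_X′|X and verifies that the mean square distortion is close to D = dσ²_X:

    import numpy as np

    rng = np.random.default_rng(1)
    mu_x, sigma2_x, d = 10.0, 4.0, 0.3      # normalized distortion d must lie in [0, 1]
    x = rng.normal(mu_x, sigma2_x ** 0.5, 100_000)

    # Draw X' from p(x'|x) = N((1 - d) x + d mu_x, d (1 - d) sigma_x^2)
    std = (d * (1 - d) * sigma2_x) ** 0.5
    xp = (1 - d) * x + d * mu_x + rng.normal(0.0, std, x.size)

    print(np.mean((x - xp) ** 2))   # close to D = d * sigma2_x = 1.2
    print(xp.var())                 # close to (1 - d) * sigma2_x: masked data shrink toward mu_x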
Model 2. If we maintain the same risk and loss measures, but take the more natural approach of minimizing the loss D for a maximum acceptable risk R, we obtain Model 2:

minimize E(d(X, X′)) over the random perturbations p_X′|X subject to I(X; X′) ≤ R.

This problem can be related to the distortion-rate function optimization in quantization. This again yields an optimal perturbation p_X′|X, which can be computed numerically, as in the sketch below.
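As an illustration of such a numerical computation (ours; the text above only notes the connection with distortion-rate optimization), a Blahut-Arimoto-style iteration yields an optimal perturbation for a finite alphabet and a fixed Lagrange multiplier beta trading rate for distortion:

    import numpy as np

    def blahut_arimoto(px, dist, beta, iters=200):
        # px[i] = p(x_i); dist[i, j] = d(x_i, x'_j); larger beta favours lower distortion
        cond = None
        q = np.full(dist.shape[1], 1.0 / dist.shape[1])   # marginal of X'
        for _ in range(iters):
            w = q * np.exp(-beta * dist)                  # unnormalized p(x'|x)
            cond = w / w.sum(axis=1, keepdims=True)       # normalize each row
            q = px @ cond                                 # re-estimate the marginal
        return cond                                       # optimal p(x'|x) for this beta

    px = np.array([0.5, 0.5])
    dist = np.array([[0.0, 1.0], [1.0, 0.0]])             # Hamming distortion
    print(blahut_arimoto(px, dist, beta=2.0))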
Model 3. Model 3 results from minimizing the disclosure risk while keeping information loss below a certain level, and using mutual information for measuring both the disclosure risk and the information loss:

minimize I(X; X′) over the random perturbations p_X′|X subject to I(X; Y) − I(X′; Y) ≤ D.

Model 4. Finally, Model 4 is obtained by maintaining the same measures, but minimizing the information loss while keeping the disclosure risk below an upper bound:

minimize I(X; Y) − I(X′; Y) over the random perturbations p_X′|X subject to I(X; X′) ≤ R.
Synthetic data generation, that is, the generation of random simulated data preserving some properties of the original data, can be viewed as a form of masking by perturbation [1]. If we want to generate synthetic key attributes X′ in such a way that the connection between key attributes and confidential attributes is minimally affected, we can use Model 4 to compute p_X′|X. Synthetic X′ can then be generated by drawing from p_X′|X for each original record, as sketched below.
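A minimal sketch of this generation step (ours), reusing the univariate Gaussian conditional obtained for Model 1 in place of the computed p_X′|X:

    import numpy as np

    rng = np.random.default_rng(2)

    def synthesize(x_orig, mu_x, sigma2_x, d):
        # Draw one synthetic X' ~ N((1 - d) x + d mu_x, d (1 - d) sigma_x^2) per record
        std = (d * (1 - d) * sigma2_x) ** 0.5
        return (1 - d) * x_orig + d * mu_x + rng.normal(0.0, std, x_orig.shape)

    x = rng.normal(10.0, 2.0, 10_000)                  # original key attribute
    x_syn = synthesize(x, mu_x=10.0, sigma2_x=4.0, d=0.3)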
6. CONCLUSIONS AND FUTURE RESEARCH

Information loss measures based on relative discrepancies are awkward to combine with risk measures in order to optimize the risk-loss trade-off. Probabilistic loss measures are a step forward, but lack a theoretical framework. We have explored here loss and risk measures based on information theory, namely on mutual information. Models for optimizing the information-theoretic risk-loss trade-off when perturbing data and when generating synthetic data have been presented. It has been shown that preserving mutual information offers guarantees on the preservation of statistical properties such as the covariance matrix (see the Appendix).

However, this paper is intended to be a starting point rather than a conclusive contribution. The information-theoretic measures and the models discussed above are just a first step. A number of issues for future research lie open ahead:

• Relate Model 2 with distortion-rate function optimization, a well-known problem in quantization. This should be done in a way analogous to the connection between Model 1 and rate-distortion function optimization established in [11].

• In the context of synthetic data generation, devise information-theoretic loss measures whose minimization is equivalent to preserving a given model.

• Whenever possible, find closed-form expressions for the optimal perturbations in the above models.

• If a closed-form expression is not possible, look for a convex optimization problem to be solved numerically.

• Investigate the connection of mutual information with information loss metrics other than MSE, like those of [5, 15, 3].

ACKNOWLEDGMENTS

This work was partly funded by the Spanish Government through projects CONSOLIDER INGENIO 2010 CSD2007-00004 "ARES" and TSI2007-65406-C03-01 "E-AEGIS". The first author is partially supported as an ICREA Acadèmia researcher by the Government of Catalonia; he holds the UNESCO Chair in Data Privacy, but his views do not necessarily reflect the position of UNESCO nor commit that organization.
REFERENCES

[1] J. M. Abowd and L. Vilhuber. How protective are synthetic data? In J. Domingo-Ferrer and Y. Saygin, editors, Privacy in Statistical Databases, volume 5262 of Lecture Notes in Computer Science, pages 239-246. Springer, 2008.

[2] L. M. Bregman. The relaxation method of finding the common point of convex sets and its application to the solution of problems in convex programming. USSR Computational Mathematics and Mathematical Physics, 7:200-217, 1967.

[3] J. Brickell and V. Shmatikov. The cost of privacy: destruction of data-mining utility in anonymized data publishing. In Proc. of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2008.

[4] J. Domingo-Ferrer and V. Torra. Disclosure protection methods and information loss for microdata. In P. Doyle, J. I. Lane, J. J. M. Theeuwes, and L. Zayatz, editors, Confidentiality, Disclosure and Data Access: Theory and Practical Applications for Statistical Agencies, pages 91-110, Amsterdam, 2001. North-Holland.

[5] V. S. Iyengar. Transforming data to satisfy privacy constraints. In Proc. of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 279-288, 2002.

[6] J. Lane, P. Heus, and T. Mulcahy. Data access in a cyber world: making use of cyberinfrastructure. Transactions on Data Privacy, 1(1):2-16, 2008.

[7] N. Li, T. Li, and S. Venkatasubramanian. t-closeness: privacy beyond k-anonymity and l-diversity. In Proceedings of the IEEE ICDE 2007, 2007.

[8] W. Li. Mutual information functions versus correlation functions. Journal of Statistical Physics, 60:823-837, 1990.

[9] J. M. Mateo-Sanz, J. Domingo-Ferrer, and F. Sebé. Probabilistic information loss measures in confidentiality protection of continuous microdata. Data Mining and Knowledge Discovery, 11(2):181-193, 2005.

[10] D. Rebollo-Monedero. Quantization and Transforms for Distributed Source Coding. PhD thesis, Stanford University, 2007.

[11] D. Rebollo-Monedero, J. Forné, and J. Domingo-Ferrer. From t-closeness to PRAM and noise addition via information theory. In Privacy in Statistical Databases - PSD 2008, volume 5262 of Lecture Notes in Computer Science, pages 100-112, Berlin Heidelberg, 2008.

[12] D. Rebollo-Monedero, S. Rane, A. Aaron, and B. Girod. High-rate quantization and transform coding with side information at the decoder. Signal Processing, 86(11):3160-3179, 2006.

[13] P. Samarati. Protecting respondents' identities in microdata release. IEEE Transactions on Knowledge and Data Engineering, 13(6):1010-1027, 2001.

[14] M. Trottini. Decision Models for Data Disclosure Limitation. PhD thesis, Carnegie Mellon University, 2003. http://www.niss.org/dgii/TR/Thesis-Trottini-final.pdf

[15] J. Xu, W. Wang, J. Pei, X. Wang, B. Shi, and A. W.-C. Fu. Utility-based anonymization using local recoding. In Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785-790, 2006.

APPENDIX: Mutual information vs. covariance preservation

We justify that preserving mutual information (that is, achieving D(R) = 0 in Model 4) preserves the covariance matrix. Let X and Y be zero-mean, jointly Gaussian random variables, R- and R^n-valued, respectively. Let X̂ = a^t Y be the best linear MSE estimate of X given Y, for a ∈ R^n.

On the one hand, X̂ is a sufficient statistic for X given Y, that is, I(X; X̂) = I(X; Y) (see [12]).

On the other hand, the covariance with the confidential attributes is preserved when X̂ is released in place of X, since a = Σ_Y^(−1) Σ_YX implies Cov(X̂, Y) = a^t Σ_Y = Σ_XY = Cov(X, Y).
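The Appendix claims can be checked numerically in the scalar case. The sketch below (ours, with invented toy covariances) verifies that X̂ yields the same Gaussian mutual information as Y and preserves the covariance with Y:

    import numpy as np

    s2x, s2y, sxy = 2.0, 1.0, 0.8
    a = sxy / s2y                          # X_hat = a * Y, best linear MSE estimate
    mi = lambda rho: -0.5 * np.log2(1.0 - rho ** 2)

    rho_xy = sxy / np.sqrt(s2x * s2y)
    rho_xxh = (a * sxy) / np.sqrt(s2x * a ** 2 * s2y)   # corr(X, X_hat)
    print(mi(rho_xy), mi(rho_xxh))         # equal: X_hat is a sufficient statistic
    print(a * s2y, sxy)                    # Cov(X_hat, Y) = Cov(X, Y)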