Open Access Research

Social network integration and analysis using a generalization and probabilistic approach for privacy preservation

Xuning Tang and Christopher C Yang*

Author Affiliations

College of Information Science and Technology, Drexel University, Philadelphia, USA

For all author emails, please log on.

Security Informatics 2012, 1:7  doi:10.1186/2190-8532-1-7


The electronic version of this article is the complete one and can be found online at: http://www.security-informatics.com/content/1/1/7


Received:23 February 2011
Accepted:30 May 2012
Published:12 July 2012

© 2012 Yang and Tang; licensee Springer.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in the recent years due to the popularity of online social media. With the advance of Web 2.0 and SNAM techniques, tools for aggregating, sharing, investigating, and visualizing social network data have been widely explored and developed. SNAM is effective in supporting intelligence and law enforcement force to identify suspects and extract communication patterns of terrorists or criminals. In our previous work, we have shown how social network analysis and visualization techniques are useful in discovering patterns of terrorist social networks. Attribute to the advance of SNAM techniques, relationships among social actors can be visualized through network structures explicitly and implicit patterns can be discovered automatically. Despite the advance of SNAM, the utility of a social network is highly affected by its d completeness. Missing edges or nodes in a social network will reduce the utility of the network. For example, SNAM techniques may not be able to detect groups of social actors if some of the relationships among these social actors are not available. Similarly, SNAM techniques may overestimate the distance between two social actors if some intermediate nodes or edges are missing. Unfortunately, it is common that an organization only have a partial social network due to its limited information sources. In public safety domain, each law enforcement unit has its own criminal social network constructed by the data available from the criminal intelligence and crime database but this network is only a part of the global criminal social network, which can be obtained by integrating criminal social networks from all law enforcement units. However, due to the privacy policy, law enforcement units are not allowed to share the sensitive information of their social network data. A naive and yet practical approach is anonymizing the social network data before publishing or sharing it. However, a modest privacy gains may reduce a substantial SNAM utility. It is a challenge to make a balance between privacy and utility in social network data sharing and integration. In order to share useful information among different organizations without violating the privacy policies and preserving sensitive information, we propose a generalization and probabilistic approach of social network integration in this paper. Particularly, we propose generalizing social networks to preserve privacy and integrating the probabilistic models of the shared information for SNAM. To preserve the identity of sensitive nodes in social network, a simple approach in the literature is removing all node identities. However, it only allows us to investigate of the structural properties of such anonymized social network, but the integration of multiple anonymized social networks will be impossible. To make a balance between privacy and utility, we introduce a social network integration framework which consists of three major steps: (i) constructing generalized sub-graph, (ii) creating generalized information for sharing, and (iii) social networks integration and analysis. We also propose two sub-graph generalization methods namely, edge betweenness based (EBB) and K-nearest neighbor (KNN). We evaluated the effectiveness of these algorithms on the Global Salafi Jihad terrorist social network.

Introduction

Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in the recent years due to the popularity of online social media. With the advance of Web 2.0 and SNAM techniques, tools for aggregating, sharing, investigating, and visualizing social network data have been widely explored and developed. SNAM is effective in supporting intelligence and law enforcement force to identify suspects and extract communication patterns of terrorists or criminals. In our previous work [1-3], we have shown how social network analysis and visualization techniques are useful in discovering patterns of terrorist social networks. Attribute to the advance of SNAM techniques, relationships among social actors can be visualized through network structures explicitly and implicit patterns can be discovered automatically.

Despite the advance of SNAM, the utility of a social network is highly affected by its d completeness. Missing edges or nodes in a social network will reduce the utility of the network. For example, SNAM techniques may not be able to detect groups of social actors if some of the relationships among these social actors are not available. Similarly, SNAM techniques may overestimate the distance between two social actors if some intermediate nodes or edges are missing. Unfortunately, it is common that an organization only have a partial social network due to its limited information sources. In public safety domain, each law enforcement unit has its own criminal social network constructed by the data available from the criminal intelligence and crime database but this network is only a part of the global criminal social network, which can be obtained by integrating criminal social networks from all law enforcement units. However, due to the privacy policy, law enforcement units are not allowed to share the sensitive information of their social network data. A naïve and yet practical approach is anonymizing the social network data before publishing or sharing it. However, a modest privacy gains may reduce a substantial SNAM utility. It is a challenge to make a balance between privacy and utility in social network data sharing and integration.

In order to share useful information among different organizations without violating the privacy policies and preserving sensitive information, we propose a generalization and probabilistic approach of social network integration in this paper. Particularly, we propose generalizing social networks to preserve privacy and integrating the probabilistic models of the shared information for SNAM. To preserve the identity of sensitive nodes in social network, a simple approach in the literature is removing all node identities. However, it only allows us to investigate of the structural properties of such anonymized social network, but the integration of multiple anonymized social networks will be impossible. To make a balance between privacy and utility, we introduce a social network integration framework which consists of three major steps: (i) constructing generalized sub-graph, (ii) creating generalized information for sharing, and (iii) social networks integration and analysis. We also propose two sub-graph generalization methods namely, edge betweenness based (EBB) and K-nearest neighbor (KNN). We evaluated the effectiveness of these algorithms on the Global Salafi Jihad terrorist social network.

This paper is organized as follows. In the next section, we review the existing works about privacy preservation of social network. Previous techniques are classified based on their assumption of attack models the definition of sensitive information, and the privacy preservation techniques. In section 3, we introduce the researchd framework. Social network generalization and integration techniques are introduced in section 4. The experiment design, results and discussions are presented in section 5. We conclude our work and introduce future work in section 6.

Related work

Sensitive information of social network

Given a social network, the definition of sensitive information depends on the specific applications. In the literature, the social network sensitive information can be classified into node properties, neighborhood graphs, edge properties, and network properties in general.

Node properties

In a social network, identity of nodes can be an important type of sensitive property [4-7]. A node with sensitive identity means that its identity is private and should not be released. On the other hands, a node with insensitive identity means that the identity of this node can be released with no harm. Another type of sensitive property of a node can be its degree centrality [8-12]. Given a node, the degree centrality equals to the total number of edges connecting to this node, which is the number of friend in a social network. In a directed graph, edges can be further divided into in-links and out-links. Releasing the degree centrality of a given node, attacker can find out the number of nodes associated to this node which may further release its identity.

Neighborhood graphs

Node neighborhood graph is a concept highly related to degree centrality but with some differences [12]. Given a node and its neighbors, how these neighbors connect with each other can be unique. Publishing the neighborhood graph of a node may release the identify of this node.

Edge properties

Besides the properties of network nodes, Zheleva and Getoor also studied some sensitive properties related to network edges[13]. Two types of information of an edge can be potential sensitive information. One is the existence of an edge between two given nodes. The other is the label of a given edge which represents the type of relationship.

Network properties

Social network data has a set of important properties which can be considered as sensitive information in some cases, such as diameter, radius, betweenness, closeness, clustering coefficient etc.

Social network privacy attack model

To have a better protection against privacy attack, it is important to understand different types of privacy attack models. In this section, we introduce two categories of attack, active and passive attacks [11,14].

Active attacks

Backstorm et al. [14] introduced the active attack model. An adversary can actively select an arbitrary set of target actors, creates a small number of new actors with edges connecting to these targeted users, and then creates a pattern of links among the new actors. By planting new actors and connection patterns in the anonymized social network sophisticatedly, the adversary is able to identify the new actors as well as the targeted actors if the generated connection patterns are uniquely stand out in the anonymized network. Theoretically, the creation of <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M1">View MathML</a>nodes in an n-node network will begin compromising the privacy of the arbitrary targeted nodes. Backstorm et al., [14] further divided the active attacks into walk-based attack and cut-based attack. Both of them employed the strategy of inserting nodes into the target network and then link these nodes with the target nodes. The difference between them is the theoretical number of nodes used in the attack.

Passive attacks

Backstorm et al. [14] also investigated the passive attack model, where adversaries do not create any new nodes or edges. Backstorm et al. pointed out that attacker with certain knowledge can easily differentiate the target nodes or edges from the others due to their unique structural information. Most current studies focus their research on preventing passive attacks, which includes: (1) node passive attack [8-11], where adversaries are supposed to take advantage of node’s degree centrality information to uncover node’s identity; (2) edge passive attack [13-15], where adversaries are supposed to know the existence of certain edges, leading to the disclosure of sensitive information by tracking the identify of other edges or nodes via known edges; (3) sub-graph passive attack [9,11,12], where adversaries are supposed to make use of sub-graph information known in advanced to identify sensitive information of node, such as node identity; (4) graph metrics passive attack [16], where the adversaries have certain background knowledge of the graph metrics, for example hub fingerprint, closeness centrality or betweenness centrality. With the knowledge of these graph metrics, it’s also possible that adversaries can uncover several sensitive information of the social network.

Privacy preservation models and algorithms

In the recent years, a number of approaches for preserving privacy of relational data have been studied extensively, which include k-anonymity [17,18], l-diversity[19], Personalized anonymity[20], and (α,k)-Anonymity[21]. One common objective of these algorithms is to ensure every node is indistinguishable to other (k-1) nodes after anonymization. Although these methods work well in relational table data, most of them cannot deal with social network data due to the complex structure of social network and various background and attack model employed by an adversary. In the recent years, a few research groups have investigated the privacy preservation of social network data. They preserve the data privacy mainly by three approaches: perturbation-based approach, generalization-based approach, and protocol-based approach. Different techniques correspond to different type of sensitive data as well as privacy requirement.

Perturbation-based technique

The perturbation-based technique perturbs a social network by adding, deleting or switching edges in a social network in order to increase the difficulty of identifying a node. Most of them are using greedy algorithm guided by an objective function to modify the social network step by step until the anonymized network satisfied some given conditions. Liu and Terzi proposed the K-degree Anonymous Algorithm to ensure that each network node is indistinguishable to other (K-1) nodes [10]. Starting from the original degree sequence d of input graph G, the algorithm constructs a new degree sequence <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M2">View MathML</a> which satisfies two conditions including: <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M3">View MathML</a> is k-anonymous and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M4">View MathML</a> is minimized. Zhou and Pei proposed the K-neighborhood Algorithm to make sure that node identity cannot be re-identified by an adversary with a confidence larger than 1/k, even though the adversary has background knowledge of the neighborhood graph [12]. The whole process is divided into two phases. First, the algorithm extracts the neighborhoods of all nodes in the network. To facilitate the comparisons among neighborhoods of different nodes, the researchers proposed a neighborhood component coding technique to represent the neighborhoods in a concise way. In the second step, the algorithm greedily organizes nodes into groups and anonymizes the neighborhoods of nodes in the same group. The greedy algorithm is guided by an anonymization cost which is measured by the similarity between the neighborhoods of two nodes. Ying and Wu proposed the Spectrum Preserving Algorithm which preserves the privacy by randomly perturbing edges in the network [16]. The whole process can be divided into three steps: at first, the eigenvalues of the input graph is computed; and then based on some proved theorems, the boundaries of eigenvalues are given; finally the algorithm perturbs the graph by adding, deleting or switching edges of the graph. If the eigenvalues of perturbed graph is within the given boundaries, the perturbation is accepted and continued for next perturbation. The algorithm terminates until the precondition is satisfied.

Generalization-based technique

The generalization-based technique preserves a social network by grouping certain number of nodes or edges together and then only release the general information of the groups of nodes or edges. Nodes within a group cannot be differentiated because they all share exactly the same properties of the group. In most cases, a generalization-based technique divides nodes according to some predefined loss functions. Hay et al. proposed a node splitting-based technique to achieve k-anonymity of the social network [9,11]. Starting from a single partition of a social network, the algorithm keeps on splitting the selected partition into two sub-groups until all predefined criteria are satisfied. Similarly, Campan and Truta introduced a node clustering-based approach to satisfy the k-anonymity requirement and minimize the information loss [8]. In their algorithm, clusters are created one at a time. To form a new cluster, a node in V with the maximum degree but not yet be allocated to any cluster is selected as a seed for the new cluster. Then the algorithm puts nodes to this currently processed cluster until it reaches the desired cardinality k. At each step, the current cluster grows with one node. The selected node should not be assigned to any cluster yet but it should be able to minimize the growth of information loss of the current clusters. Zheleva and Getoor proposed an edge clustering-based technique to hide the sensitive information on edges [13]. Their technique is divided into two phases. In the first phase, the technique provides a clustering of the nodes into m equivalence class (C1,C2,…,CM) such that each node is indistinguishable in its quasi-identifying attributes from K-1 other nodes. In the second step, this work presents several techniques to protect sensitive information of the social network and then compare their performance, which includes partial-edge removal, cluster-edge anonymization, and cluster-edge anonymization with constraints. Cormode and Srivastava proposed the safe groupings technique for a bipartite Graph G = (V,W,E), where V and W correspond to two types of objects [15]. In their work, a safe grouping of a bipartite graph partitions nodes into groups such that two nodes of the same group of V have no common neighbors in W and vice versa. A greedy algorithm is proposed to find K safe groups of V and L safe groups of W. For each node u, the algorithm attempts to assign u to the first group with fewer than n nodes. If it makes the grouping unsafe, the algorithm will try the second available group and so forth. If there is no group that meets the requirements, a new group will be created to contain this node. After getting K safe groups of V, the algorithm move forward to find L safe groups of W following a similar same process.

Protocol-based technique

The protocol-based technique is using the encryption approach rather than anonymizing the social network data. Social network data is encrypted by following a protocol before sharing with other parties. The protocol ensures that other parties are only able to obtain the insensitive information for their applications but the sensitive information is preserved. Frikken and Golle proposed the pieces assembling approach for private social network analysis [22].

Summary

In this section, we provide a summary of the literature by comparing the privacy preservation techniques and the preserving data in social network as shown in Table 1. In general, some privacy preservation techniques are developed for preserving specific information but are not applied to other information. The choice of the privacy preservation techniques also depends on the application of social network analysis.

Table 1. Classification of privacy preservation techniques based on sensitive information

Research problem

The existing works focus on preserving privacy of social network data for data publishing so that the global network structure can be analyzed. However, it has not considered how to integrate social network data from different sources so that social network analysis and mining can be conducted on the integrated data and yet the privacy of the shared data can be preserved. Individual published social network data only capture parts of the complete social network. Unless we can integrate multiple social networks and conduct SNAM on the integrated social network, the utility of the anonymized data is still limited. Given multiple law enforcement units and each of them has a criminal social network which captures a partial picture of a complete criminal social network, the objective of this work is preserving the privacy of the shared data from each law enforcement unit and conduct SNAM tasks on the integrated data. In this section, we first define formally the research problem, introduce what sensitive information to preserve, what insensitive information to share and what SNAM task can be conducted. Then, we proposed a research framework to address the research problem.

Problem definition

Given a set of network ℊ = {G1, G2,…,Gn} in a distributed setting where each organization i owns its piece of Gi, assuming the complete network <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M5">View MathML</a>) is unknown to each individual organization, the goal of this paper is to study how to anonymize each Gi into <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M6">View MathML</a> so that: 1) the sensitive identities of Gi can be protected; 2) <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M7">View MathML</a> can be shared with other organizations and the integrated anonymization graph <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M8">View MathML</a> can be used for SNAM task. Concretely, each network <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M9">View MathML</a> consists of both insensitive nodes and sensitive nodes. Node identities of those insensitive nodes are known to the public or the sharing parties while the node identities of those sensitive nodes are unknown to the public and needed to be protected. Our focus in this paper is to protect the node identities of those sensitive nodes. On the other hand, for SNAM purpose, some network properties, including topology, diameter and some other abstract features of the anonymized network, will be released and shared across organizations. Last but not least, it’s important to note that some network features cannot be preserved in our method, such as neighborhood information. Therefore, not all SNAM tasks can be achieved in our integrated anonymized network. In this paper, we only study how to preserve the usefulness of the integrated anonymized network regarding to distance-related analysis, such as computing the closeness of each node. To summarize, although some existing works have studied how to anonymize network for data publishing, the research problem that we study here is different. We not only anonymize a given network to protect its’ node identity but also focus on integrating anonymized networks to achieve better SNAM results

Framework of social network integration with privacy preservation

To further motivate our research framework, we assume organization P ( OP) has a social network GP and organization Q ( OQ) has another social network GQ, both GP and GQ are partial networks of a complete social network which is unknown to any organization. OP needs to conduct a Social Network Analysis and Mining (SNAM) but GP is incomplete due to its limited sources of information. As a result, it will be difficult or even impossible for OP to get accurate SNAM results. If there is no privacy concern between different organizations, one can integrate GP and GQ to generate an integrated G and obtain a better SNAM result. However, due to privacy concern, OQ cannot share GQ completely with OP, but only shares the insensitive information of GQ with OP according to the privacy policies. At the same time, OP does not need all data from OQ but only those that are critical for the SNAM tasks. For these reasons, to integrate social networks of different organizations without violating privacy policies, we only need to share information that is critical to the performance of SNAM and yet preserve the sensitive information.

Figure 1 demonstrates the general framework of social network integration for SNAM. In this framework, OQ employs sub-graph generalization techniques to create a generalized social network, GQ, from GQ without violating the privacy policy. The generalized social network only contains generalized information of GQ without releasing any sensitive information. For example, a generalized social network cannot release the exact identity of each nodes or exact shortest distance between any two nodes. On the other hand, generalized information can include diameter of a sub-graph, average number of adjacent nodes between two subgroups, degree of an insensitive node and other insensitive information. The generalized social network GQ will then be integrated with GP to support a social network analysis and mining task. Given the generalized information from GQ, it is expected to achieve better performance on SNAM task than conducting the analysis and mining on GP alone. There are two important sub-tasks in our proposed framework which we will address in the following sections:

thumbnailFigure 1 . General framework of social network integration for SNAM.

Task 1

Given a social network G with sensitive information, produce generalized social network G and determine the generalized information which can be released.

Task 2

Integrate a generalized social network with the local social network, and then utilize shared generalized information to achieve better SNAM results.

Notations

In Table 2, we define a set of notations for the proposed social network integration techniques.

Table 2. Notations and definitions

Methodology

Social network generalization

In task one, given a social network G with sensitive information, we employ clustering-based technique to produce a generalized social network G. We suppose G = ( V, E), where V is a set of nodes, E is a set of edges and | V| =  n, K of these nodes are insensitive nodes, and n-K of these nodes are sensitive nodes. We generate a generalized social network in two steps. In the first step, we decompose G into K sub-graphs Gi = (Vi,Ei), where V = Ui =1to KVi and each sub-graph contains one insensitive node. In the second step, each sub-graph will be transformed to a generalized node of the generalized graph G. Furthermore, two generalized nodes will be connected in G if and only if there is one or more edges connecting nodes from these two sub-graphs respectively.

In this section, we propose two graph partition algorithms, K-nearest neighbor ( KNN) method and Edge betweenness based (EBB) method, to generate a generalized social network G for sharing purpose. Both KNN and EBB methods are developed by following one common principle that the identity of insensitive nodes can be published safely while the identity of sensitive nodes cannot, so that, to produce a generalized social network, we need to divide the original network into several sub-graphs each of which represented by an insensitive nodes, and the final generalized network should be also represented by these insensitive nodes.

K-nearest neighbor (KNN) method

Given a social network G with K insensitive nodes <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M14','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M14">View MathML</a> KNN method divides G into K sub-graphs by assigning each node v to its nearest insensitive node. Let SPD(v, viC) be the distance of the shortest path between v and viC. Starting from the sensitive nodes adjacent to insensitive node, KNN method assign sensitive node, one node per time, to the closest sub-graph Gi where SPD(v, viC) is shorter than or equal to SPD(v, vjC) where j = 1, 2, .., K and j ≠ i. After dividing a social network into K sub-graphs, we collapse all nodes of a sub-graph into one generalized node, and represent this node with the identity of the insensitive node of this sub-graph. Finally, for each possible pair of generalized nodes, say Gi and Gj in the generalized graph G’, an edge will be created if and only if there is one or more edges between any two nodes in G from sub-graph Gi and Gj respectively.

Figure 2 presents a simple example to illustrate the idea of KNN and show how it works to produce generalized social network. Figure 2 (a) is the given social network which has seven nodes. Among them, v1 and v2 are insensitive nodes while the others are all sensitive nodes. By using KNN method, the given social network will be divided into two isolated social networks as shown in Figure 2 (b). Finally, one sub-graph is represented by v1 and another sub-graph is represented by v2, Figure 2 (c) demonstrates the final generalized social network where two generalized nodes are connected together because v4 and v5 are connected in G. The KNN subgraph generation algorithm is presented below:

length = 1;

V = V - {v1C, v2C, … vKC};

While V ≠ Ø

 For each vjV

 For each i = 1 to K

  IF(SPD(vj, viC) == length);

   Vi = Vi + vj;

   V =  Vvj;

  End For;

 End For;

 length++;

End While

For each (vi,vj) ∈ E

 IF(Subgraph( vi) == Subgraph( vj))

 //Subgraph( vi) is the subgraph such that viSubgraph( vi)

  Gk = Subgraph( vi)

  Ek = Ek + (vi,vj)

 ELSE

  Create an edge between Subgraph( vi) and Subgraph( vj) and add it to E’

End For

thumbnailFigure 2 . Illustrations of generating subgraphs.

Edge betweenness based (EBB) method

Instead of assigning sensitive nodes to the closest sub-graphs represented by insensitive nodes, the EBB method progressively remove edges with the highest betweenness and it also ensure that each separated sub-graph contains exactly one insensitive node. The betweenness of an edge is defined as the number of shortest paths between pairs of nodes that pass through it. If a network consists of a few of dense communities which are only loosely connected by some inter-community edges, these inter-community edges will have high betweenness, so that removing them will naturally break the social network into multiple communities. The EBB algorithm is presented as follows:

//EBB(G), Edge Betweenness Based method

Initialize e = {};

While(there are more than one insensitive node in graph G)

 Identify edge (vi,vj) in G which is not an element of e and has the highest betweenness;

 Remove (vi,vj) from G;

 IF(G is still connected after removing edges (vi,vj))

  EBB(G);

 ELSE IF (G is disconnected and split to two graph Gp and Gq)

  IF(No insensitive node in Gp) or (No insensitive node in Gq)

   Add (vi,vj) back to G;

   e = e + (vi,vj);

Go Back to Step 2;

  ELSE

   EBB(GP);

   EBB(Gq);

End While;

//Add edge between generalized node to form generalized graph

For each (vj,vj) ∈ E

 IF(Subgraph( vi) == Subgraph( vj))

  Gk = Subgraph( vi)

  Ek = Ek + (vi,vj)

 ELSE

  Create an edge between Subgraph( vi) and Subgraph( vj) and add it to E’

End For

Figure 3 shows an example of how EBB method works to produce generalized social network. Given a social network with nine nodes, v1 and v2 are insensitive nodes while all other nodes are sensitive nodes. Since edge (v1, v2) has the highest Betweenness and it is safe to be removed, EBB method delete this edge to form two separated sub-graphs each of them contains exactly one insensitive node, as shown in Figure 3 (b). Finally, the EBB method generalizes these two sub-graphs into two generalized nodes, and then connects them to form the generalized graph as shown in Figure 3 (c).

thumbnailFigure 3 . Illustrations of generalizing subgraph byEBB.

Generalized sub-graph information

Given a generalized social network Gi and its center viC, we select shareable network properties based on the information need and the privacy policy. In this paper, we treat node identity as sensitive information that we should protect, and consider distance between nodes to be useful information for SNAM task. Let va and vb be any two nodes in Gi and the length of the shortest path between va and vb be SPD(vp,vq,Gi). We define the longest length of the shortest paths between any two nodes in Gi, denoted by L_SPD(Gi), as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M15','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M15">View MathML</a>

(1)

We also define the shortest length of the shortest paths between any two nodes in Gi, denoted by S_SPD(Gi), as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M16','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M16">View MathML</a>

(2)

To reduce the risk of releasing sensitive information, instead of sharing exact information of shortest path, we propose to share the expected length between two nodes within a generalized social network. Formally speaking, the length of any shortest paths in Gi, α, must be smaller or equal to L_SPD(Gi) and larger or equal to S_SPD(Gi), where S_SPD(Gi) ≤ α ≤ L_SPD(Gi). We compute and share the probability of the length of the shortest path between any two nodes in Gi, denoted as Prob(SPD(Gi) = α), and 0 ≤ Prob(SPD(Gi) = α) ≤ 1

Similarly, let the length of the shortest path between va and viC, be SPD(va,viC,Gi). We define the longest length of the shortest paths between viC and other nodes within Gi, denoted by L_SPD(viC,Gi), as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M17">View MathML</a>

(3)

We also define the shortest length of the shortest paths between viC and other nodes within Gi, denoted by and S_SPD(viC,Gi), as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M18">View MathML</a>

(4)

Since the length of shortest paths between viC and any other nodes in Gi must be smaller or equal to L_SPD(viC, Gi) and larger or equal to S_SPD(viC, Gi), denoted as S_SPD(viC, Gi) ≤ β ≤ L_SPD(viC, Gi). We compute the probability of the length of the shortest path between any node and viC, Prob(SPD(viC,Gi) = β), where 0 ≤ Prob(SPD(Gi) = α) ≤ 1.

We also denote Num(Gi) as the number of nodes in Gi and Num(Gi,Gj) as the number of nodes in Gi that are adjacent to another subgraph Gj.

The generalized subgraph information for sharing includes: (i) L_SPD(Gi), (ii) S_SPD(Gi), (iii) Prob(SPD(Gi) = α), (iv) L_SPD(viC,Gi), (v) S_SPD(viC,Gi), (vi) Prob(SPD(viC,Gi) = β), (vii) Num(Gi), and (viii) Num(Gi,Gj).

Generalized graph integration and social network analysis

In section 4.2 and 4.3 we introduced how to divide a social network into sub-graphs, and then generalize these sub-graphs to nodes, then finally produce a generalized social network. We also discussed what kind of information will be shared along with the generalized social network. In this section, given a generalized social network G and the shareable information of the sub-graphs of G, we propose our own techniques to integrate social network and shared information to improve the performance of SNAM task.

Suppose organization Op has a social network Gp and organization OQ has another social network GQ, Op wants to integrate GQ with its own Gp to compute more accurate closeness centrality. We propose to achieve this goal without violating the privacy policies in three steps: (1) produce generalized social network <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M19">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M20">View MathML</a>; (2) integrate <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M21">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M22">View MathML</a> into GIntegrated; (3) estimate the distance between any two nodes of the integrated social network. Among these three steps, step one can be achieved by our proposed techniques in section 4.2 and 4.3. In step two, although the sub-graphs represented by a common insensitive node in <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M23">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M24">View MathML</a> are different and the connectivity between these insensitive nodes are also different, according to our proposed techniques, <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M25">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M26">View MathML</a> are represented by the same group of insensitive nodes since Gp and GQ share same insensitive nodes. As a result, we can combine <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M27">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M28">View MathML</a> into GIntegrated by taking union of their edges. In this section, we focus on the step 3 which estimate distances between any two nodes based on Gp, GIntegrated and shared information of sub-graphs of <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M29">View MathML</a>.

To re-estimate the distance between two nodes vi and vp of Gp by making use of GIntegrated and the shared information of sub-graphs of <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M30">View MathML</a>, we first identify the two closest insensitive nodes for vi and vj in Gp, and then use GIntegrated and the generalized information of <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M31">View MathML</a> to re-estimate their distances. Formally speaking, let the closest insensitive node to vi in Gp be <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M32">View MathML</a>, and the second closest insensitive node to vi in Gp be <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M33">View MathML</a>. We set the weights λA and λA’ as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M34">View MathML</a>

(5)

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M35">View MathML</a>

(6)

with <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M36">View MathML</a> and the weight of the closest insensitive node is higher.

Similarly, let the closest insensitive node to vj in GP be <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M37">View MathML</a>, and the second closest insensitive node to vj in GP be <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M38">View MathML</a>, we set the weights λB and λB’ as

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M39">View MathML</a>

(7)

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M40">View MathML</a>

(8)

with <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M41">View MathML</a>

In GIntegrated, <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M42">View MathML</a>, and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M43">View MathML</a> are the centers of generalized sub-graphs GA, GA, GB, and GB, respectively. We estimate the distance between vi and vj, d(vi,vj), by integrating the estimated distances of the four possible paths going through these insensitive nodes by a linear combination with weights equal to λα × λb.

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M44">View MathML</a>

(9)

D ( vi, vj) is the estimated distance between vi and vj on the path going through <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M45">View MathML</a> and <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M46">View MathML</a>, where a can be A or A’ and b can be B or B’.

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M47">View MathML</a>

(10)

where Gk is a generalized node on the shortest path between Gα and Gb in GIntegrated If a ≠ b which means viand vj are not in the same subgraph, then D( vi, vj) is estimated by <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M48">View MathML</a>, and E( Gk). Otherwise, if viand vj are in the same subgraph then a =  b. In this case, D ( vi, vj) is estimated by <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M49">View MathML</a>.<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M50">View MathML</a> corresponds to the expected length of the distance between vi and the sub-graph gatekeeper within Gα. Similarly, <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M51">View MathML</a> corresponds to the expected length of the distance between vj and the sub-graph gatekeeper within Gb. In addition, E ( Gk) is the expected length of the distance between any two nodes of sub-graph Gk that the shortest path between vi and vj is going through. If vi is not the same as <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M52">View MathML</a> is computed by E ( Gα) and the percentage of nodes in Ga that is adjacent to the sub-graph that is immediately following Ga in the shortest path between vi and vj in GIntegrated. If vi is the same as <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M53">View MathML</a> is equal to the expected length of the distance between the insensitive node, <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M54">View MathML</a>, to the other nodes in Ga. Computation of <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M55">View MathML</a> is done similarly.

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M56">View MathML</a>

(11)

where<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M57">View MathML</a> is the percentage of nodes in Gα as a gatekeeper which is adjacent to Gk and Gk is the sub-graph that immediately follows Gα in the shortest path between vi and vj in GIntegrated.

E ( Gk) represents the expected length of the distance between any two nodes of the sub-graph Gk, which is computed as:

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M58">View MathML</a>

(12)

D” (vi,vj) corresponds to the estimated distance between vi and vj when both vi and vj are nodes of the same sub-graph. In this case, if any of vi or vj is the same as <a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M59">View MathML</a>, D” (vi,vj) should equal to the expected length of the distance from the insensitive node to the other nodes in, Gα. Otherwise, D” (vi,vj) should equal to the expected length of the distance between two nodes of the sub-graph.

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M60">View MathML</a>

(13)

Experiment and discussion

Practically, there isn’t any intelligence unit has a complete terrorist social network but each of them has a partial terrorist social network. The objective of this work is to support these intelligence units to share their social networks while preserving the sensitive information. In this section, we investigated our proposed techniques on a real-world dataset of terrorists. We extracted several social networks from the terrorist dataset to simulate the real-world problem. Intensive experiment was conducted under different settings to evaluate our proposed techniques.

Dataset

In this work, we employed the Global Salafi Jihad terrorist social network, denoted as G, in our experiment. The Global Salafi Jihad terrorist social network consists of 366 nodes (terrorists) and 1,275 edges (connection between terrorists)[23]. These terrorists come from four major groups, including Central Staff of al Qaeda (CSQ), Core Arab (CA), Southeast Asia (SA), and Maghreb Arab (MA). We randomly sample α percent of nodes from the Global Salafi Jihad terrorist social network as insensitive nodes, that their identities are known by all organizations. Suppose there are two independent organizations OP and OQ, we simulate GP for OP by randomly removing β percent of edges from the Global Salafi Jihad terrorist social network. Similarly, we randomly remove β percent of edges from the Global Salafi Jihad terrorist social network to simulate GQ for OQ. As a result, both GP and GQ are partial graph of G. Moreover, GP are different from GQ in terms of their edges.

Evaluation

As discussed before, there is no generic approach for privacy preservation since sensitive information can be defined in various ways. Moreover, shareable useful information is also different in terms of different SNAM tasks. In this work, we treat node identity as sensitive information and consider distance between nodes as useful information that we want to maintain. To evaluate our proposed technique, we assume that the SNAM task conducted by GP is to compute closeness centrality for each node. If GP is close to G, then distances between any two nodes in GP should be roughly equal to their distance in G, leading to similar closeness centrality for each node. Otherwise, nodes in GP should have different closeness centrality in G. In this work, closeness centrality for a node in GP is computed as:

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M61">View MathML</a>

(14)

where n is the total number of nodes in GP.

Given a complete social network G and the integrated social network GIntegrated, the performance of our proposed technique is evaluated by the error function defined as:

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M62">View MathML</a>

(15)

Experiment

Figure 4 demonstrates the average closeness centrality of nodes of original graph (G), integrated graph using EBB method ( GIntegrated (EBB)), integrated graph using KNN method ( GIntegrated (KNN)) and incomplete graph ( GP). In Figure 4, the blue line represents the average closeness centrality computed from G, which is a gold standard, so that the closer to this blue line the better it is.

thumbnailFigure 4 . Average closeness centrality of complete graph (G), integrated graph using EBB method<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M65">View MathML</a>, integrated graph using KNN method<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M66">View MathML</a>and incomplete graphGP: (a) α = 0.05, (b) α = 0.15, (c) α = 0.25, (d) α = 0.35, (e) α = 0.45, (f) α = 0.55 (g) α = 0.65, (h) α = 0.75, (i) α = 0.85, (j) α = 0.95.

For each α from 0.05 to 0.95, we increased β (percentage of edges randomly removed from G) from 0.2 to 0.8. We observed that the performance of GP (GIntegrated (KNN)) and ( GIntegrated (EBB)) decreased consistently when more edges are removed from the complete graph, no matter what the value of α is. Although our proposed technique integrates networks and estimates the average closeness centrality, the performance will not be as good as the average closeness centrality computed from the actual graph G. When more edges are removed before integration ( β increase), the performance will degrade.

We further investigated the performance of our proposed technique by increasing the percentage of insensitive nodes from 0.05 (Figure 4(a)) to 0.95 (Figure 4(j)). Similar patterns are observed from 4(a) to 4(j). In terms of average closeness centrality, increasing or decreasing the percentage of insensitive nodes in network did not make substantial impact to the performance of our purposed technique. One plausible explanation is that: the average closeness centrality used in this experiment only reflects the performance of our approach in an abstract level. Some nodes in the integrated network may have higher closeness centrality than its original closeness centrality in the complete graph while some nodes may have lower closeness centrality in the integrated network than in the complete graph. As a result, when we consider the average closeness centrality, the differences may be offset by each other.

Figure 5 (a) presents the error ratio of (GIntegrated (KNN)) with different α and β. Similarly, Figure 5 (b) presents the error ratio of GIntegrated (EBB) with different α and β. We compute the errors in closeness centrality obtained from the networks with and without integration ( Error( Gintegrated) and Error( Gp)) using the error function defined in 5.2. and the error ratio is defined as:

<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M67">View MathML</a>

(16)

thumbnailFigure 5. (a) error ratio of<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M70">View MathML</a>with different settings ofβand α; (b) error ratio of<a onClick="popup('http://www.security-informatics.com/content/1/1/7/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://www.security-informatics.com/content/1/1/7/mathml/M71">View MathML</a>with different settings ofβand α.

Different from the average closeness centrality which we used as a measurement in Figure 4, the error function accumulates the closeness centrality difference for each individual node, so that the offset effect of average closeness will not occur. The experiment results of Figure 5 can be used to verify our explanation to the Figure 4 in the last paragraph.

The experiment results demonstrate that when α is high (means more insensitive nodes), the improvement of our proposed technique comparing to the partial graph is also higher. The highest improvement was achieved when α equals to 0.95. The improvement decreased slowly along with the decrease of α. This observation indicated that our explanation of Figure 4 is correct. With more insensitive nodes, the integrated network will be closer to the original network so that the improvement of our technique will be higher.

Last but not least, from both Figures 4 and 5, we do not observe any significant differences of the performance between using KNN or EBB to produce generalized network. However, as it is shown in section 4.2.2, the EBB algorithm is dominated by the step of calculating the edge betweenness which has time complexity O(N3). On the other hand, KNN is much more efficient which is only O(N). As a result, when the network size is huge, KNN is preferred. Moreover, in a fully connected network where several edges have the same betweenness weight, EBB will take longer to produce the generalized network. However, KNN also has its limitation. For example, KNN starts from each insensitive node to look for sensitive nodes in its neighborhood to form a sub-graph step by step. However, the search process is not fully simultaneous, but is controlled by a FOR loop. As a result, the sequence in the FOR loop is matter, especially for some nodes in the middle of two insensitive nodes. As a result, the division of sub-graph by using KNN is less natural than EBB method.

Conclusion

In this paper, we investigate the privacy preservation techniques for social network integration. We introduce a research framework which consists of three major steps. First of all, we propose the K-Nearest Neighborhood method and the Edge Betweenness Based method to decompose a social network into multiple sub-graphs. Secondly, we propose techniques to generalize a social network by sharing the probabilistic model of the generalized information. At third, we introduced the techniques of social network integration and distance estimation.

Using the Global Salafi Jihad terrorist social network as test bed, we thoroughly evaluated our proposed technique with different parameters and settings. The experiment results demonstrated that an organization can improve the accuracy of computing closeness centrality by sharing and integrating generalized information. Our proposed techniques were able to preserve the privacy as well as increase the utility of the shared social networks. We observed that KNN performed better than EBB but did not have substantial difference. Moreover, our proposed techniques were not sensitive to the number of insensitive node but relatively sensitive to the number of removed edges.

In the future, we will continue to examine our techniques in more datasets. We will explore other graph partition models and integration techniques to improve the performance of our technique. Moreover, we will also extend our work to maintain other useful information besides distance.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

CCY proposed the research framework. CCY and XT developed the algorithms together. XT implemented the algorithms and conducted experiments. CCY and XT together drafted the manuscript. All authors read and approved the final manuscript.

References

  1. Yang CC, Sageman M: Analysis of terrorist social networks with fractal views.

    J Inf Sci 2009, 35(3):299-320. Publisher Full Text OpenURL

  2. Yang CC, Ng T: “Terrorism and crime related weblog social network: Link, content analysis and information visualization,”.

    IEEE Int Conf Intell Secur Inform IEEE55-58. OpenURL

  3. Yang CC, Liu N, Sageman M: “Analyzing the terrorist social networks with visualization tools,”.

    IEEE Inte Conf Intell Secur Inform331-342. OpenURL

  4. Yang CC, Tang X: Social networks integration and privacy preservation using subgraph generalization. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics. ACM, ; 53-61. OpenURL

  5. Yang CC, Tang X: Information Integration for Terrorist or Criminal Social Networks.

    Ann Inform Syst 2010, 9:41-57. Publisher Full Text OpenURL

  6. Tang X, Yang CC: “Generalizing terrorist social networks with K-nearest neighbor and edge betweeness for social network integration and privacy preservation,”.

    IEEE Int Conf Intell Secur Inform IEEE49-54. OpenURL

  7. Yang CC, Thuraisingham B: Privacy-Preserved Social Network Integration and Analysis for Security Informatics.

    IEEE Intell Syst 2010, 25(3):88-90. OpenURL

  8. Campan A, Truta T: “A clustering approach for data and structural anonymity in social networks,”.

    Proceeding of the second ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD OpenURL

  9. Hay M, Miklau G, Jensen D, Weis P, Srivastava S: “Anonymizing social networks,”. University of Massachusetts Technical Report, ; 2007:07-19. OpenURL

  10. K. Liu, and E. Terzi, “Towards identity anonymization on graphs”, Proceedings of the: ACM SIGMOD international conference on Management of data. ACM New York, NY, USA; 2008:93-106. OpenURL

  11. Hay M, Miklau G, Jensen D, Towsley D, Weis P: Resisting structural re-identification in anonymized social networks.

    Proc VLDB Endowment Arch 2008, 1(1):102-114. OpenURL

  12. Zhou B, Pei J: “Preserving privacy in social networks against neighborhood attacks,”. In IEEE 24th International Conference on Data Engineering. ICDE, ; 2008:506-515. OpenURL

  13. Zheleva E, Getoor L: Preserving the privacy of sensitive relationships in graph data.

    Lecture Notes Comput Sci 2008, 4980:153. OpenURL

  14. Backstrom L, Dwork C, Kleinberg J: “Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography,”. In Proceedings of the 16th international conference on World Wide Web. ACM New York, NY, USA; 181-190. OpenURL

  15. Cormode G, Srivastava D, Yu T, Zhang Q: Anonymizing bipartite graph data using safe groupings.

    Proc VLDB Endowment Arch 2008, 1(1):833-844. OpenURL

  16. Ying X, Wu X:

    “Randomizing social networks: a spectrum preserving approach,” SIAM Conf. on Data Mining. . OpenURL

  17. Sweeney L: k-anonymity: A model for protecting privacy.

    Int J Uncertainty Fuzziness Knowledge Based Syst 2002, 10(5):557-570. Publisher Full Text OpenURL

  18. Samarati P: “Protecting respondents' identities in microdata release,”.

    IEEE Transac Knowledge Data Eng 2001, 1010-1027. OpenURL

  19. Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M: “l-diversity: Privacy beyond k-anonymity”.

    ACM Trans Knowledge Discov Data (TKDD) 2007, 1(1):3. Publisher Full Text OpenURL

  20. X. Xiao, and Y. Tao, “Personalized privacy preservation”, Proceedings of the: ACM SIGMOD international conference on Management of data. ACM New York, NY, USA; 2006:229-240. OpenURL

  21. Wong R, Li J, Fu A, Wang K: “(alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing,”. ACM Press, ; 754-759. OpenURL

  22. Frikken K, Golle P: “Private social network analysis: How to assemble pieces of a graph privately,”. ACM New York, NY, USA; :89-98. OpenURL

  23. Sageman M: Understanding Terror Networks. University of Pennsylvania Press, ; 2004. OpenURL