Abstract
Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in recent years due to the popularity of online social media. With the advance of Web 2.0 and SNAM techniques, tools for aggregating, sharing, investigating, and visualizing social network data have been widely explored and developed. SNAM is effective in helping intelligence and law enforcement forces identify suspects and extract the communication patterns of terrorists or criminals. In our previous work, we have shown how social network analysis and visualization techniques are useful in discovering patterns of terrorist social networks. Owing to the advance of SNAM techniques, relationships among social actors can be visualized explicitly through network structures, and implicit patterns can be discovered automatically. Despite the advance of SNAM, the utility of a social network is highly affected by its completeness. Missing edges or nodes reduce the utility of the network. For example, SNAM techniques may not be able to detect groups of social actors if some of the relationships among these actors are not available. Similarly, SNAM techniques may overestimate the distance between two social actors if some intermediate nodes or edges are missing. Unfortunately, it is common for an organization to have only a partial social network due to its limited information sources. In the public safety domain, each law enforcement unit has its own criminal social network, constructed from the data available in its criminal intelligence and crime databases, but this network is only a part of the global criminal social network, which can be obtained by integrating the criminal social networks of all law enforcement units. However, due to privacy policies, law enforcement units are not allowed to share the sensitive information in their social network data.
A naive yet practical approach is to anonymize the social network data before publishing or sharing it. However, a modest privacy gain may substantially reduce SNAM utility. It is a challenge to strike a balance between privacy and utility in social network data sharing and integration. In order to share useful information among different organizations without violating privacy policies and while preserving sensitive information, we propose a generalization and probabilistic approach to social network integration in this paper. In particular, we propose generalizing social networks to preserve privacy and integrating the probabilistic models of the shared information for SNAM. To protect the identity of sensitive nodes in a social network, a simple approach in the literature is to remove all node identities. However, this only allows us to investigate the structural properties of the anonymized social network, and it makes the integration of multiple anonymized social networks impossible. To strike a balance between privacy and utility, we introduce a social network integration framework that consists of three major steps: (i) constructing generalized subgraphs, (ii) creating generalized information for sharing, and (iii) integrating and analyzing social networks. We also propose two subgraph generalization methods, namely, edge betweenness based (EBB) and K-nearest neighbor (KNN). We evaluated the effectiveness of these algorithms on the Global Salafi Jihad terrorist social network.
Introduction
Social Network Analysis and Mining (SNAM) techniques have drawn significant attention in recent years due to the popularity of online social media. With the advance of Web 2.0 and SNAM techniques, tools for aggregating, sharing, investigating, and visualizing social network data have been widely explored and developed. SNAM is effective in helping intelligence and law enforcement forces identify suspects and extract the communication patterns of terrorists or criminals. In our previous work [1-3], we have shown how social network analysis and visualization techniques are useful in discovering patterns of terrorist social networks. Owing to the advance of SNAM techniques, relationships among social actors can be visualized explicitly through network structures, and implicit patterns can be discovered automatically.
Despite the advance of SNAM, the utility of a social network is highly affected by its completeness. Missing edges or nodes reduce the utility of the network. For example, SNAM techniques may not be able to detect groups of social actors if some of the relationships among these actors are not available. Similarly, SNAM techniques may overestimate the distance between two social actors if some intermediate nodes or edges are missing. Unfortunately, it is common for an organization to have only a partial social network due to its limited information sources. In the public safety domain, each law enforcement unit has its own criminal social network, constructed from the data available in its criminal intelligence and crime databases, but this network is only a part of the global criminal social network, which can be obtained by integrating the criminal social networks of all law enforcement units. However, due to privacy policies, law enforcement units are not allowed to share the sensitive information in their social network data. A naive yet practical approach is to anonymize the social network data before publishing or sharing it. However, a modest privacy gain may substantially reduce SNAM utility. It is a challenge to strike a balance between privacy and utility in social network data sharing and integration.
In order to share useful information among different organizations without violating privacy policies and while preserving sensitive information, we propose a generalization and probabilistic approach to social network integration in this paper. In particular, we propose generalizing social networks to preserve privacy and integrating the probabilistic models of the shared information for SNAM. To protect the identity of sensitive nodes in a social network, a simple approach in the literature is to remove all node identities. However, this only allows us to investigate the structural properties of the anonymized social network, and it makes the integration of multiple anonymized social networks impossible. To strike a balance between privacy and utility, we introduce a social network integration framework that consists of three major steps: (i) constructing generalized subgraphs, (ii) creating generalized information for sharing, and (iii) integrating and analyzing social networks. We also propose two subgraph generalization methods, namely, edge betweenness based (EBB) and K-nearest neighbor (KNN). We evaluated the effectiveness of these algorithms on the Global Salafi Jihad terrorist social network.
This paper is organized as follows. In the next section, we review existing work on privacy preservation of social networks. Previous techniques are classified based on their assumed attack models, their definition of sensitive information, and their privacy preservation techniques. In section 3, we introduce the research framework. Social network generalization and integration techniques are introduced in section 4. The experiment design, results, and discussion are presented in section 5. We conclude our work and introduce future work in section 6.
Related work
Sensitive information of social network
Given a social network, the definition of sensitive information depends on the specific application. In the literature, sensitive information in a social network is generally classified into node properties, neighborhood graphs, edge properties, and network properties.
Node properties
In a social network, the identity of a node can be an important type of sensitive property [4-7]. A node with a sensitive identity is one whose identity is private and should not be released. On the other hand, a node with an insensitive identity is one whose identity can be released without harm. Another type of sensitive node property is degree centrality [8-12]. The degree centrality of a node equals the total number of edges connected to it, which in a social network is its number of friends. In a directed graph, edges can be further divided into in-links and out-links. If the degree centrality of a node is released, an attacker can find out how many nodes are associated with it, which may in turn reveal its identity.
Neighborhood graphs
The neighborhood graph of a node is a concept closely related to degree centrality, but with some differences [12]. Given a node and its neighbors, the way these neighbors connect with each other can be unique. Publishing the neighborhood graph of a node may therefore reveal the identity of this node.
Edge properties
Besides the properties of network nodes, Zheleva and Getoor also studied sensitive properties related to network edges [13]. Two types of information about an edge are potentially sensitive. One is the existence of an edge between two given nodes. The other is the label of a given edge, which represents the type of relationship.
Network properties
Social network data has a set of important properties that can be considered sensitive information in some cases, such as diameter, radius, betweenness, closeness, and clustering coefficient.
Social network privacy attack model
To better protect against privacy attacks, it is important to understand the different types of privacy attack models. In this section, we introduce two categories of attacks: active and passive [11,14].
Active attacks
Backstrom et al. [14] introduced the active attack model. An adversary actively selects an arbitrary set of target actors, creates a small number of new actors with edges connecting to these targeted actors, and then creates a pattern of links among the new actors. By carefully planting new actors and connection patterns in the anonymized social network, the adversary is able to identify the new actors as well as the targeted actors if the generated connection patterns stand out uniquely in the anonymized network. Theoretically, the creation of a small number of nodes in an n-node network is enough to begin compromising the privacy of arbitrary targeted nodes. Backstrom et al. [14] further divided active attacks into the walk-based attack and the cut-based attack. Both employ the strategy of inserting nodes into the target network and then linking these nodes with the target nodes. The difference between them is the theoretical number of nodes used in the attack.
Passive attacks
Backstrom et al. [14] also investigated the passive attack model, in which adversaries do not create any new nodes or edges. They pointed out that an attacker with certain knowledge can easily differentiate the target nodes or edges from the others because of their unique structural information. Most current studies focus on preventing passive attacks, which include: (1) the node passive attack [8-11], where adversaries are assumed to exploit a node's degree centrality to uncover its identity; (2) the edge passive attack [13-15], where adversaries are assumed to know of the existence of certain edges, leading to the disclosure of sensitive information by tracking the identity of other edges or nodes via the known edges; (3) the subgraph passive attack [9,11,12], where adversaries are assumed to use subgraph information known in advance to identify sensitive information about a node, such as its identity; (4) the graph metrics passive attack [16], where adversaries have certain background knowledge of graph metrics, for example hub fingerprint, closeness centrality, or betweenness centrality. With knowledge of these graph metrics, adversaries may also be able to uncover sensitive information about the social network.
Privacy preservation models and algorithms
In recent years, a number of approaches for preserving the privacy of relational data have been studied extensively, including k-anonymity [17,18], l-diversity [19], personalized anonymity [20], and (α,k)-anonymity [21]. One common objective of these algorithms is to ensure that every node is indistinguishable from (k-1) other nodes after anonymization. Although these methods work well on relational table data, most of them cannot deal with social network data because of the complex structure of social networks and the variety of background knowledge and attack models an adversary may employ. More recently, a few research groups have investigated the privacy preservation of social network data. They preserve data privacy mainly by three approaches: the perturbation-based approach, the generalization-based approach, and the protocol-based approach. Different techniques correspond to different types of sensitive data as well as different privacy requirements.
Perturbation-based technique
The perturbation-based technique perturbs a social network by adding, deleting, or switching edges in order to increase the difficulty of identifying a node. Most such methods use a greedy algorithm guided by an objective function to modify the social network step by step until the anonymized network satisfies some given conditions. Liu and Terzi proposed the K-degree Anonymous Algorithm to ensure that each network node is indistinguishable from (K-1) other nodes [10]. Starting from the original degree sequence d of the input graph G, the algorithm constructs a new degree sequence d’ that satisfies two conditions: d’ is k-anonymous, and the change from d to d’ is minimized. Zhou and Pei proposed the K-neighborhood Algorithm to make sure that a node cannot be re-identified by an adversary with a confidence larger than 1/k, even if the adversary has background knowledge of the neighborhood graph [12]. The whole process is divided into two phases. First, the algorithm extracts the neighborhoods of all nodes in the network. To facilitate comparisons among the neighborhoods of different nodes, the researchers proposed a neighborhood component coding technique to represent the neighborhoods in a concise way. In the second phase, the algorithm greedily organizes nodes into groups and anonymizes the neighborhoods of nodes in the same group. The greedy algorithm is guided by an anonymization cost, which is measured by the similarity between the neighborhoods of two nodes. Ying and Wu proposed the Spectrum Preserving Algorithm, which preserves privacy by randomly perturbing edges in the network [16]. The whole process can be divided into three steps: first, the eigenvalues of the input graph are computed; then, based on proven theorems, bounds on the eigenvalues are derived; finally, the algorithm perturbs the graph by adding, deleting, or switching edges. If the eigenvalues of the perturbed graph are within the given bounds, the perturbation is accepted and the algorithm continues with the next perturbation. The algorithm terminates when the precondition is satisfied.
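The k-anonymity condition on a degree sequence mentioned above can be illustrated with a short check. This is a minimal sketch of the condition only, not of Liu and Terzi's algorithm; the function name `is_k_degree_anonymous` and the example sequences are hypothetical.

```python
from collections import Counter

def is_k_degree_anonymous(degrees, k):
    """Return True if every degree value in the sequence occurs at least k times."""
    counts = Counter(degrees)
    return all(count >= k for count in counts.values())

# Hypothetical degree sequences, for illustration only.
anonymous = is_k_degree_anonymous([3, 3, 2, 2, 2], 2)      # each degree is shared by >= 2 nodes
not_anonymous = is_k_degree_anonymous([4, 3, 3, 2, 2], 2)  # degree 4 is unique
```

A node whose degree is unique in the sequence can be singled out by an adversary who knows that degree, which is why the second sequence fails the check.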
Generalization-based technique
The generalization-based technique protects a social network by grouping a certain number of nodes or edges together and then releasing only the general information about the groups. Nodes within a group cannot be differentiated because they all share exactly the same group properties. In most cases, a generalization-based technique divides nodes according to some predefined loss function. Hay et al. proposed a node splitting-based technique to achieve k-anonymity of the social network [9,11]. Starting from a single partition of a social network, the algorithm keeps splitting the selected partition into two subgroups until all predefined criteria are satisfied. Similarly, Campan and Truta introduced a node clustering-based approach to satisfy the k-anonymity requirement and minimize the information loss [8]. In their algorithm, clusters are created one at a time. To form a new cluster, a node in V with the maximum degree that has not yet been allocated to any cluster is selected as the seed. The algorithm then adds nodes to the cluster currently being processed until it reaches the desired cardinality k. At each step, the current cluster grows by one node. The selected node must not yet be assigned to any cluster, and it should minimize the growth of information loss of the current cluster. Zheleva and Getoor proposed an edge clustering-based technique to hide sensitive information on edges [13]. Their technique is divided into two phases. In the first phase, the technique clusters the nodes into m equivalence classes (C_{1}, C_{2}, …, C_{m}) such that each node is indistinguishable in its quasi-identifying attributes from K-1 other nodes. In the second phase, the work presents several techniques to protect the sensitive information of the social network and compares their performance, including partial-edge removal, cluster-edge anonymization, and cluster-edge anonymization with constraints.
Cormode and Srivastava proposed the safe groupings technique for a bipartite graph G = (V,W,E), where V and W correspond to two types of objects [15]. In their work, a safe grouping of a bipartite graph partitions nodes into groups such that two nodes in the same group of V have no common neighbors in W, and vice versa. A greedy algorithm is proposed to find K safe groups of V and L safe groups of W. For each node u, the algorithm attempts to assign u to the first group with fewer than n nodes. If this makes the grouping unsafe, the algorithm tries the second available group, and so forth. If no group meets the requirements, a new group is created to contain this node. After obtaining K safe groups of V, the algorithm moves on to find L safe groups of W following a similar process.
Protocol-based technique
The protocol-based technique uses encryption rather than anonymization of the social network data. Social network data is encrypted according to a protocol before it is shared with other parties. The protocol ensures that other parties can obtain only the insensitive information needed for their applications while the sensitive information is preserved. Frikken and Golle proposed the pieces assembling approach for private social network analysis [22].
Summary
In this section, we summarize the literature by comparing the privacy preservation techniques and the data they preserve in a social network, as shown in Table 1. In general, some privacy preservation techniques are developed for preserving specific information and are not applicable to other information. The choice of privacy preservation technique also depends on the application of social network analysis.
Table 1. Classification of privacy preservation techniques based on sensitive information
Research problem
Existing work focuses on preserving the privacy of social network data for data publishing so that the global network structure can be analyzed. However, it has not considered how to integrate social network data from different sources so that social network analysis and mining can be conducted on the integrated data while the privacy of the shared data is preserved. Individually published social network data capture only parts of the complete social network. Unless we can integrate multiple social networks and conduct SNAM on the integrated social network, the utility of the anonymized data remains limited. Given multiple law enforcement units, each with a criminal social network that captures a partial picture of the complete criminal social network, the objective of this work is to preserve the privacy of the data shared by each law enforcement unit and to conduct SNAM tasks on the integrated data. In this section, we first formally define the research problem and introduce what sensitive information to preserve, what insensitive information to share, and what SNAM tasks can be conducted. Then, we propose a research framework to address the research problem.
Problem definition
Given a set of networks ℊ = {G_{1}, G_{2},…,G_{n}} in a distributed setting where each organization i owns its piece G_{i}, and assuming the complete network G is unknown to each individual organization, the goal of this paper is to study how to anonymize each G_{i} into G_{i}’ so that: 1) the sensitive identities of G_{i} are protected; 2) G_{i}’ can be shared with other organizations and the integrated anonymized graph can be used for SNAM tasks. Concretely, each network consists of both insensitive nodes and sensitive nodes. The identities of insensitive nodes are known to the public or the sharing parties, while the identities of sensitive nodes are unknown to the public and need to be protected. Our focus in this paper is to protect the identities of the sensitive nodes. On the other hand, for SNAM purposes, some network properties, including topology, diameter, and some other abstract features of the anonymized network, will be released and shared across organizations. Last but not least, it is important to note that some network features, such as neighborhood information, cannot be preserved by our method. Therefore, not all SNAM tasks can be carried out on our integrated anonymized network. In this paper, we only study how to preserve the usefulness of the integrated anonymized network for distance-related analysis, such as computing the closeness of each node. To summarize, although some existing work has studied how to anonymize a network for data publishing, the research problem that we study here is different. We not only anonymize a given network to protect its node identities but also focus on integrating anonymized networks to achieve better SNAM results.
Framework of social network integration with privacy preservation
To further motivate our research framework, we assume organization P (O_{P}) has a social network G_{P} and organization Q (O_{Q}) has another social network G_{Q}. Both G_{P} and G_{Q} are partial networks of a complete social network that is unknown to any organization. O_{P} needs to conduct Social Network Analysis and Mining (SNAM), but G_{P} is incomplete due to its limited sources of information. As a result, it will be difficult or even impossible for O_{P} to get accurate SNAM results. If there were no privacy concerns between organizations, one could integrate G_{P} and G_{Q} to generate an integrated G and obtain better SNAM results. However, due to privacy concerns, O_{Q} cannot share G_{Q} completely with O_{P}; it shares only the insensitive information of G_{Q} according to the privacy policies. At the same time, O_{P} does not need all the data from O_{Q}, only the data that are critical for the SNAM tasks. For these reasons, to integrate the social networks of different organizations without violating privacy policies, we only need to share information that is critical to the performance of SNAM while preserving the sensitive information.
Figure 1 demonstrates the general framework of social network integration for SNAM. In this framework, O_{Q} employs subgraph generalization techniques to create a generalized social network, G_{Q}’, from G_{Q} without violating the privacy policy. The generalized social network contains only generalized information about G_{Q} and releases no sensitive information. For example, a generalized social network cannot release the exact identity of each node or the exact shortest distance between any two nodes. On the other hand, generalized information can include the diameter of a subgraph, the average number of adjacent nodes between two subgroups, the degree of an insensitive node, and other insensitive information. The generalized social network G_{Q}’ is then integrated with G_{P} to support a social network analysis and mining task. Given the generalized information from G_{Q}, we expect to achieve better performance on the SNAM task than by conducting the analysis and mining on G_{P} alone. There are two important subtasks in our proposed framework, which we address in the following sections:
Figure 1. General framework of social network integration for SNAM.
Task 1
Given a social network G with sensitive information, produce a generalized social network G^{’} and determine the generalized information that can be released.
Task 2
Integrate a generalized social network with the local social network, and then utilize shared generalized information to achieve better SNAM results.
Notations
In Table 2, we define a set of notations for the proposed social network integration techniques.
Table 2. Notations and definitions
Methodology
Social network generalization
In task one, given a social network G with sensitive information, we employ a clustering-based technique to produce a generalized social network G^{’}. We suppose G = (V, E), where V is a set of nodes, E is a set of edges, and |V| = n; K of these nodes are insensitive nodes, and n − K of them are sensitive nodes. We generate a generalized social network in two steps. In the first step, we decompose G into K subgraphs G_{i} = (V_{i},E_{i}), where V = ∪_{i=1}^{K} V_{i} and each subgraph contains one insensitive node. In the second step, each subgraph is transformed into a generalized node of the generalized graph G^{’}. Furthermore, two generalized nodes are connected in G^{’} if and only if there are one or more edges connecting nodes from these two subgraphs.
In this section, we propose two graph partition algorithms, the K-nearest neighbor (KNN) method and the edge betweenness based (EBB) method, to generate a generalized social network G^{’} for sharing purposes. Both methods follow one common principle: the identities of insensitive nodes can be published safely while the identities of sensitive nodes cannot. To produce a generalized social network, we therefore divide the original network into several subgraphs, each represented by one insensitive node, and the final generalized network is also represented by these insensitive nodes.
K-nearest neighbor (KNN) method
Given a social network G with K insensitive nodes, the KNN method divides G into K subgraphs by assigning each node v to its nearest insensitive node. Let SP^{D}(v, v_{i}^{C}) be the length of the shortest path between v and v_{i}^{C}. Starting from the sensitive nodes adjacent to an insensitive node, the KNN method assigns sensitive nodes, one node at a time, to the closest subgraph G_{i}, for which SP^{D}(v, v_{i}^{C}) is shorter than or equal to SP^{D}(v, v_{j}^{C}) for all j = 1, 2, …, K with j ≠ i. After dividing a social network into K subgraphs, we collapse all nodes of a subgraph into one generalized node and represent this node with the identity of the insensitive node of the subgraph. Finally, for each possible pair of generalized nodes, say G_{i} and G_{j} in the generalized graph G’, an edge is created if and only if there are one or more edges in G between a node of subgraph G_{i} and a node of subgraph G_{j}.
Figure 2 presents a simple example to illustrate the idea of KNN and show how it produces a generalized social network. Figure 2 (a) is the given social network, which has seven nodes. Among them, v_{1} and v_{2} are insensitive nodes while the others are all sensitive nodes. Using the KNN method, the given social network is divided into two separate subgraphs, as shown in Figure 2 (b). Finally, one subgraph is represented by v_{1} and the other by v_{2}. Figure 2 (c) shows the final generalized social network, where the two generalized nodes are connected because v_{4} and v_{5} are connected in G. The KNN subgraph generation algorithm is presented below:
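The second step, collapsing subgraphs into generalized nodes, can be sketched as follows. This is a minimal illustration that assumes the decomposition is already available as a mapping from each node to the insensitive node representing its subgraph; the names `generalize` and `owner`, and the seven-node edge list, are hypothetical and not taken from the paper's figures.

```python
def generalize(edges, owner):
    """Collapse each subgraph into one generalized node.

    edges: iterable of (u, v) pairs from the original graph G.
    owner: dict mapping every node to the insensitive node of its subgraph.
    Returns the edge set E' of the generalized graph G', with each edge
    stored as a sorted pair of insensitive node identities.
    """
    generalized_edges = set()
    for u, v in edges:
        a, b = owner[u], owner[v]
        if a != b:  # the edge crosses two subgraphs
            generalized_edges.add(tuple(sorted((a, b))))
    return generalized_edges

# Hypothetical network: v1 and v2 are insensitive, the rest are sensitive.
edges = [("v1", "v3"), ("v1", "v4"), ("v3", "v4"), ("v4", "v5"),
         ("v2", "v5"), ("v2", "v6"), ("v2", "v7")]
owner = {"v1": "v1", "v3": "v1", "v4": "v1",
         "v2": "v2", "v5": "v2", "v6": "v2", "v7": "v2"}
generalized = generalize(edges, owner)
```

Only the edge between v4 and v5 crosses the two subgraphs here, so the generalized graph has a single edge between the generalized nodes v1 and v2.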
// KNN(G): assign every sensitive node to its nearest insensitive node
For each i = 1 to K
    V_{i} = {v_{i}^{C}}; E_{i} = Ø;
End For;
E’ = Ø;
length = 1;
V = V − {v_{1}^{C}, v_{2}^{C}, …, v_{K}^{C}};
While V ≠ Ø
    For each v_{j} ∈ V
        For each i = 1 to K
            IF (SP^{D}(v_{j}, v_{i}^{C}) == length)
                V_{i} = V_{i} + v_{j};
                V = V − v_{j};
                Break; // v_{j} joins exactly one subgraph
        End For;
    End For;
    length++;
End While
// Add edges between generalized nodes to form the generalized graph
For each (v_{i},v_{j}) ∈ E
    IF (Subgraph(v_{i}) == Subgraph(v_{j}))
        // Subgraph(v_{i}) is the subgraph G_{k} such that v_{i} ∈ V_{k}
        G_{k} = Subgraph(v_{i});
        E_{k} = E_{k} + (v_{i},v_{j});
    ELSE
        Create an edge between Subgraph(v_{i}) and Subgraph(v_{j}) and add it to E’;
End For
Figure 2. Illustrations of generating subgraphs.
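The KNN assignment described above can also be sketched in Python using breadth-first search on an unweighted graph. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the adjacency-dictionary representation, and the example network are hypothetical, and ties are broken in favor of the insensitive node listed first, which mirrors the order of the inner loop in the pseudocode.

```python
from collections import deque

def bfs_distances(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def knn_partition(adj, centers):
    """Map each reachable node to its nearest insensitive node (center)."""
    dists = [bfs_distances(adj, c) for c in centers]
    owner = {}
    for v in adj:
        best = None  # (distance, center) of the closest center seen so far
        for i, c in enumerate(centers):
            d = dists[i].get(v)
            if d is not None and (best is None or d < best[0]):
                best = (d, c)
        if best is not None:
            owner[v] = best[1]
    return owner

# Hypothetical network: v1 and v2 are insensitive, the rest are sensitive.
edges = [("v1", "v3"), ("v1", "v4"), ("v3", "v4"), ("v4", "v5"),
         ("v2", "v5"), ("v2", "v6"), ("v2", "v7")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

owner = knn_partition(adj, ["v1", "v2"])
```

In this example v3 and v4 are one hop from v1, while v5, v6, and v7 are one hop from v2, so the network splits into two subgraphs joined only by the edge between v4 and v5.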
Edge betweenness based (EBB) method
Instead of assigning sensitive nodes to the closest subgraphs represented by insensitive nodes, the EBB method progressively removes the edges with the highest betweenness while ensuring that each separated subgraph contains exactly one insensitive node. The betweenness of an edge is defined as the number of shortest paths between pairs of nodes that pass through it. If a network consists of a few dense communities that are only loosely connected by some inter-community edges, these inter-community edges will have high betweenness, so removing them naturally breaks the social network into multiple communities. The EBB algorithm is presented as follows:
// EBB(G): Edge Betweenness Based method
EBB(G):
    Initialize e = {}; // edges that must not be removed
    While (there is more than one insensitive node in graph G)
        Identify the edge (v_{i},v_{j}) in G which is not an element of e and has the highest betweenness;
        Remove (v_{i},v_{j}) from G;
        IF (G is disconnected and split into two graphs G_{p} and G_{q})
            IF (no insensitive node in G_{p}) or (no insensitive node in G_{q})
                Add (v_{i},v_{j}) back to G;
                e = e + (v_{i},v_{j});
            ELSE
                EBB(G_{p});
                EBB(G_{q});
                Return;
        // otherwise G is still connected: continue with the next edge
    End While;
// Add edges between generalized nodes to form the generalized graph
For each (v_{i},v_{j}) ∈ E
    IF (Subgraph(v_{i}) == Subgraph(v_{j}))
        G_{k} = Subgraph(v_{i});
        E_{k} = E_{k} + (v_{i},v_{j});
    ELSE
        Create an edge between Subgraph(v_{i}) and Subgraph(v_{j}) and add it to E’;
End For
Figure 3 shows an example of how the EBB method produces a generalized social network. Given a social network with nine nodes, v_{1} and v_{2} are insensitive nodes while all other nodes are sensitive nodes. Since edge (v_{1}, v_{2}) has the highest betweenness and is safe to remove, the EBB method deletes this edge to form two separate subgraphs, each of which contains exactly one insensitive node, as shown in Figure 3 (b). Finally, the EBB method generalizes these two subgraphs into two generalized nodes and then connects them to form the generalized graph, as shown in Figure 3 (c).
Figure 3. Illustrations of generalizing subgraphs by EBB.
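The EBB procedure can be sketched in Python as follows. This is a hedged illustration, not the authors' code: edge betweenness is computed with a Brandes-style accumulation for unweighted graphs, the recursion of the pseudocode is replaced by a loop over components that still hold more than one insensitive node, ties are broken by node name for reproducibility, and all function names and the six-node example are hypothetical.

```python
from collections import deque

def edge_betweenness(adj):
    """Edge betweenness of an unweighted, undirected graph (Brandes-style)."""
    score = {}
    for s in adj:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for w in adj[u]:
                if w not in dist:
                    dist[w], sigma[w], preds[w] = dist[u] + 1, 0.0, []
                    queue.append(w)
                if dist[w] == dist[u] + 1:  # u precedes w on a shortest path
                    sigma[w] += sigma[u]
                    preds[w].append(u)
        delta = {v: 0.0 for v in order}
        for w in reversed(order):
            for u in preds[w]:
                c = sigma[u] / sigma[w] * (1.0 + delta[w])
                e = frozenset((u, w))
                score[e] = score.get(e, 0.0) + c
                delta[u] += c
    return {e: b / 2.0 for e, b in score.items()}  # each pair was counted twice

def component(adj, start):
    """Set of nodes reachable from start."""
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return seen

def ebb_partition(adj, centers):
    """Remove high-betweenness edges until each component has one center."""
    adj = {u: set(ns) for u, ns in adj.items()}  # work on a copy
    centers, protected = set(centers), set()
    while True:
        target = None
        for c in sorted(centers):
            comp = component(adj, c)
            if len(comp & centers) > 1:
                target = comp
                break
        if target is None:
            break
        scores = edge_betweenness({u: adj[u] for u in target})
        best = max((e for e in scores if e not in protected),
                   key=lambda e: (scores[e], tuple(sorted(e))))
        u, w = tuple(best)
        adj[u].discard(w)
        adj[w].discard(u)
        part = component(adj, u)
        if w not in part:  # the removal split the component in two
            other = target - part
            if not (part & centers) or not (other & centers):
                adj[u].add(w)  # undo: one side would have no insensitive node
                adj[w].add(u)
                protected.add(best)
    return {v: c for c in centers for v in component(adj, c)}

# Hypothetical network: two triangles joined by the bridge (a, c).
edges = [("v1", "a"), ("v1", "b"), ("a", "b"), ("a", "c"),
         ("c", "d"), ("c", "v2"), ("d", "v2")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

owner = ebb_partition(adj, ["v1", "v2"])
```

In this example every shortest path between the two triangles uses the bridge (a, c), so it has the highest betweenness; removing it leaves one insensitive node on each side and the procedure stops.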
Generalized subgraph information
Given a generalized subgraph G_{i} and its center v_{i}^{C}, we select shareable network properties based on the information need and the privacy policy. In this paper, we treat node identity as sensitive information that we should protect, and we consider the distance between nodes to be useful information for SNAM tasks. Let v_{a} and v_{b} be any two nodes in G_{i}, and let the length of the shortest path between v_{a} and v_{b} be SP^{D}(v_{a},v_{b},G_{i}). We define the longest length of the shortest paths between any two nodes in G_{i}, denoted by L_SP^{D}(G_{i}), as
L_SP^{D}(G_{i}) = max_{v_{a},v_{b} ∈ V_{i}, v_{a} ≠ v_{b}} SP^{D}(v_{a},v_{b},G_{i})
We also define the shortest length of the shortest paths between any two nodes in G_{i}, denoted by S_SP^{D}(G_{i}), as
S_SP^{D}(G_{i}) = min_{v_{a},v_{b} ∈ V_{i}, v_{a} ≠ v_{b}} SP^{D}(v_{a},v_{b},G_{i})
To reduce the risk of releasing sensitive information, instead of sharing exact shortest-path information, we propose to share the probability distribution of the shortest-path length between two nodes within a generalized subgraph. Formally, the length α of any shortest path in G_{i} must be smaller than or equal to L_SP^{D}(G_{i}) and larger than or equal to S_SP^{D}(G_{i}), that is, S_SP^{D}(G_{i}) ≤ α ≤ L_SP^{D}(G_{i}). We compute and share the probability that the shortest path between any two nodes in G_{i} has length α, denoted as Prob(SP^{D}(G_{i}) = α), where 0 ≤ Prob(SP^{D}(G_{i}) = α) ≤ 1.
Similarly, let the length of the shortest path between v_{a} and v_{i}^{C}, be SP^{D}(v_{a},v_{i}^{C},G_{i}). We define the longest length of the shortest paths between v_{i}^{C} and other nodes within G_{i}, denoted by L_SP^{D}(v_{i}^{C},G_{i}), as
We also define the shortest length of the shortest paths between v_{i}^{C} and other nodes within G_{i}, denoted by S_SP^{D}(v_{i}^{C},G_{i}), as
S_SP^{D}(v_{i}^{C},G_{i}) = min_{v_{a} ∈ G_{i}, v_{a} ≠ v_{i}^{C}} SP^{D}(v_{a},v_{i}^{C},G_{i})
The length β of the shortest path between v_{i}^{C} and any other node in G_{i} must satisfy S_SP^{D}(v_{i}^{C}, G_{i}) ≤ β ≤ L_SP^{D}(v_{i}^{C}, G_{i}). We compute the probability of the length of the shortest path between any node and v_{i}^{C}, Prob(SP^{D}(v_{i}^{C},G_{i}) = β), where 0 ≤ Prob(SP^{D}(v_{i}^{C},G_{i}) = β) ≤ 1.
We also denote Num(G_{i}) as the number of nodes in G_{i} and Num(G_{i},G_{j}) as the number of nodes in G_{i} that are adjacent to another subgraph G_{j}.
The generalized subgraph information for sharing includes: (i) L_SP^{D}(G_{i}), (ii) S_SP^{D}(G_{i}), (iii) Prob(SP^{D}(G_{i}) = α), (iv) L_SP^{D}(v_{i}^{C},G_{i}), (v) S_SP^{D}(v_{i}^{C},G_{i}), (vi) Prob(SP^{D}(v_{i}^{C},G_{i}) = β), (vii) Num(G_{i}), and (viii) Num(G_{i},G_{j}).
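The eight shared quantities can be computed from a single subgraph with plain breadth-first search. The sketch below is only illustrative: the function names, the dictionary keys, the `outside_links` input format, and the four-node example are hypothetical, and the subgraph is assumed to be connected, unweighted, and to contain at least two nodes.

```python
from collections import Counter, deque


def bfs_lengths(adj, source):
    """Shortest-path lengths (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def distribution(lengths):
    """Map each observed length to its relative frequency."""
    counts = Counter(lengths)
    return {length: counts[length] / len(lengths) for length in sorted(counts)}


def generalized_info(adj, center, outside_links):
    """Collect items (i)-(viii) for one connected subgraph.

    adj: node -> set of neighbours inside the subgraph.
    outside_links: node -> set of labels of other subgraphs it is adjacent to
    (a hypothetical input format chosen for this sketch).
    """
    nodes = sorted(adj)
    pair_lengths = []
    for i, u in enumerate(nodes):
        dist = bfs_lengths(adj, u)
        pair_lengths.extend(dist[w] for w in nodes[i + 1:])  # each pair once
    from_center = bfs_lengths(adj, center)
    center_lengths = [from_center[w] for w in nodes if w != center]
    adjacent = Counter(label for labels in outside_links.values() for label in labels)
    return {
        "L_SP": max(pair_lengths),
        "S_SP": min(pair_lengths),
        "Prob_SP": distribution(pair_lengths),
        "L_SP_center": max(center_lengths),
        "S_SP_center": min(center_lengths),
        "Prob_SP_center": distribution(center_lengths),
        "Num": len(nodes),
        "Num_adjacent": dict(adjacent),
    }


# Toy subgraph: center c linked to x and y, and x linked to z; z touches subgraph "G_j".
subgraph = {"c": {"x", "y"}, "x": {"c", "z"}, "y": {"c"}, "z": {"x"}}
info = generalized_info(subgraph, "c", {"z": {"G_j"}})
```

Only aggregate lengths, frequencies, and counts appear in the result, so no node identity other than the center needs to leave the organization.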
Generalized graph integration and social network analysis
In Sections 4.2 and 4.3, we introduced how to divide a social network into subgraphs, generalize these subgraphs into nodes, and finally produce a generalized social network. We also discussed what information is shared along with the generalized social network. In this section, given a generalized social network G^{’} and the shareable information of the subgraphs of G^{’}, we propose techniques to integrate the social network and the shared information to improve the performance of the SNAM task.
Suppose organization O_{P} has a social network G_{P} and organization O_{Q} has another social network G_{Q}, and O_{P} wants to integrate G_{Q} with its own G_{P} to compute more accurate closeness centrality. We propose to achieve this goal without violating the privacy policies in three steps: (1) produce the generalized social networks G_{P}^{’} and G_{Q}^{’}; (2) integrate G_{P}^{’} and G_{Q}^{’} into G_{Integrated}; (3) estimate the distance between any two nodes of the integrated social network. Step one can be achieved by the techniques proposed in Sections 4.2 and 4.3. In step two, although the subgraphs represented by a common insensitive node in G_{P}^{’} and G_{Q}^{’} are different, and the connectivity between these insensitive nodes is also different, G_{P}^{’} and G_{Q}^{’} are represented by the same group of insensitive nodes since G_{P} and G_{Q} share the same insensitive nodes. As a result, we can combine G_{P}^{’} and G_{Q}^{’} into G_{Integrated} by taking the union of their edges. In this section, we focus on step three, which estimates the distance between any two nodes based on G_{P}, G_{Integrated}, and the shared information of the subgraphs of G_{Q}^{’}.
To re-estimate the distance between two nodes v_{i} and v_{j} of G_{P} by making use of G_{Integrated} and the shared information of the subgraphs of G_{Q}^{’}, we first identify the two closest insensitive nodes for each of v_{i} and v_{j} in G_{P}, and then use G_{Integrated} and the generalized information of G_{Q}^{’} to re-estimate their distance. Formally, let the closest insensitive node to v_{i} in G_{P} be v_{A}^{C}, and the second closest insensitive node to v_{i} in G_{P} be v_{A’}^{C}. We set the weights λ_{A} and λ_{A’} as
where the weight of the closest insensitive node is higher.
Similarly, let the closest insensitive node to v_{j} in G_{P} be v_{B}^{C}, and the second closest insensitive node to v_{j} in G_{P} be v_{B’}^{C}. We set the weights λ_{B} and λ_{B’} as
In G_{Integrated}, v_{A}^{C}, v_{A’}^{C}, v_{B}^{C}, and v_{B’}^{C} are the centers of the generalized subgraphs G_{A}, G_{A’}, G_{B}, and G_{B’}, respectively. We estimate the distance between v_{i} and v_{j}, d(v_{i},v_{j}), by combining the estimated distances of the four possible paths going through these insensitive nodes in a linear combination with weights equal to λ_{a} × λ_{b}.
D(v_{i}, v_{j}) is the estimated distance between v_{i} and v_{j} on the path going through v_{a}^{C} and v_{b}^{C}, where a can be A or A’ and b can be B or B’.
where G_{k} is a generalized node on the shortest path between G_{a} and G_{b} in G_{Integrated}. If a ≠ b, which means v_{i} and v_{j} are not in the same subgraph, then D(v_{i}, v_{j}) is estimated from three components: the expected length of the distance between v_{i} and the subgraph gatekeeper within G_{a}, the expected length of the distance between v_{j} and the subgraph gatekeeper within G_{b}, and E(G_{k}), the expected length of the distance between any two nodes of each subgraph G_{k} that the shortest path between v_{i} and v_{j} goes through. Otherwise, if v_{i} and v_{j} are in the same subgraph, then a = b, and D(v_{i}, v_{j}) is estimated by D”(v_{i},v_{j}). If v_{i} is not the same as v_{a}^{C}, the gatekeeper term for v_{i} is computed from E(G_{a}) and the percentage of nodes in G_{a} that are adjacent to the subgraph immediately following G_{a} on the shortest path between v_{i} and v_{j} in G_{Integrated}. If v_{i} is the same as v_{a}^{C}, the term is equal to the expected length of the distance between the insensitive node v_{a}^{C} and the other nodes in G_{a}. The gatekeeper term for v_{j} is computed similarly.
where the percentage of nodes in G_{a} that act as gatekeepers adjacent to G_{k} follows from the shared counts as Num(G_{a},G_{k}) / Num(G_{a}), and G_{k} is the subgraph that immediately follows G_{a} on the shortest path between v_{i} and v_{j} in G_{Integrated}.
E(G_{k}) represents the expected length of the distance between any two nodes of the subgraph G_{k}, which is computed as:
E(G_{k}) = Σ_{α = S_SP^{D}(G_{k})}^{L_SP^{D}(G_{k})} α × Prob(SP^{D}(G_{k}) = α)
D”(v_{i},v_{j}) corresponds to the estimated distance between v_{i} and v_{j} when both are nodes of the same subgraph. In this case, if either v_{i} or v_{j} is the same as v_{a}^{C}, D”(v_{i},v_{j}) should equal the expected length of the distance from the insensitive node to the other nodes in G_{a}. Otherwise, D”(v_{i},v_{j}) should equal the expected length of the distance between two nodes of the subgraph.
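The final combination step can be sketched in a few lines. This is a hedged illustration only: `expected_length` takes the expectation of a shared length distribution, `combine_paths` forms the linear combination of the four path estimates with weights λ_{a} × λ_{b}, and `closeness_weights` uses inverse-distance normalization, which is an assumption of this sketch (the text only requires that the closest insensitive node receives the higher weight). All function names and the numeric inputs are hypothetical, and the per-path estimates D are taken as given rather than derived.

```python
def expected_length(prob):
    """Expected shortest-path length from a shared distribution {length: probability}."""
    return sum(length * p for length, p in prob.items())


def closeness_weights(d_closest, d_second):
    """Weights for the closest and second-closest insensitive nodes.

    ASSUMPTION: inverse-distance normalisation, so the two weights sum to 1
    and the closer node never receives the smaller weight.
    """
    total = d_closest + d_second
    return d_second / total, d_closest / total


def combine_paths(weights_i, weights_j, path_estimates):
    """Linear combination of the four path estimates with weights lambda_a * lambda_b."""
    return sum(
        weights_i[a] * weights_j[b] * path_estimates[(a, b)]
        for a in weights_i
        for b in weights_j
    )


# Hypothetical inputs: v_i is 1 hop from center A and 3 hops from center A'.
lam_a, lam_a2 = closeness_weights(1, 3)
weights_i = {"A": lam_a, "A'": lam_a2}
weights_j = {"B": 0.5, "B'": 0.5}
# Hypothetical per-path estimates D(v_i, v_j) for each (a, b) pair.
estimates = {("A", "B"): 4, ("A", "B'"): 6, ("A'", "B"): 8, ("A'", "B'"): 10}
d_ij = combine_paths(weights_i, weights_j, estimates)
```

Because the weights on each side sum to one, the four products λ_{a} × λ_{b} also sum to one, so the result stays within the range of the four path estimates.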
Experiment and discussion
In practice, no intelligence unit has a complete terrorist social network; each has only a partial one. The objective of this work is to support these intelligence units in sharing their social networks while preserving sensitive information. In this section, we investigate our proposed techniques on a real-world dataset of terrorists. We extracted several social networks from the terrorist dataset to simulate the real-world problem. Extensive experiments were conducted under different settings to evaluate our proposed techniques.
Dataset
In this work, we employed the Global Salafi Jihad terrorist social network, denoted as G, in our experiment. The Global Salafi Jihad terrorist social network consists of 366 nodes (terrorists) and 1,275 edges (connections between terrorists) [23]. These terrorists come from four major groups: Central Staff of al Qaeda (CSQ), Core Arab (CA), Southeast Asia (SA), and Maghreb Arab (MA). We randomly sample α percent of the nodes of the Global Salafi Jihad terrorist social network as insensitive nodes, whose identities are known to all organizations. Suppose there are two independent organizations O_{P} and O_{Q}. We simulate G_{P} for O_{P} by randomly removing β percent of the edges from the Global Salafi Jihad terrorist social network. Similarly, we randomly remove β percent of the edges from the Global Salafi Jihad terrorist social network to simulate G_{Q} for O_{Q}. As a result, both G_{P} and G_{Q} are partial graphs of G, and G_{P} differs from G_{Q} in its edges.
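The simulation setup can be mimicked on any edge list. The sketch below is a minimal illustration, assuming α and β are given as fractions between 0 and 1; the function names, the fixed seed, and the 20-node ring used in place of the real dataset are all hypothetical.

```python
import random


def sample_insensitive(nodes, alpha, rng):
    """Pick a fraction alpha of the nodes as insensitive (identities shared)."""
    pool = sorted(nodes)
    return set(rng.sample(pool, round(alpha * len(pool))))


def remove_edges(edges, beta, rng):
    """Return a copy of the edge set with a fraction beta removed at random."""
    pool = sorted(edges)
    dropped = set(rng.sample(pool, round(beta * len(pool))))
    return set(pool) - dropped


# Toy stand-in for G: a ring of 20 nodes (the real dataset is not reproduced here).
nodes = range(20)
edges = {(i, (i + 1) % 20) for i in nodes}
rng = random.Random(0)  # fixed seed, an arbitrary choice for repeatability
insensitive = sample_insensitive(nodes, alpha=0.25, rng=rng)
g_p = remove_edges(edges, beta=0.4, rng=rng)  # partial network held by O_P
g_q = remove_edges(edges, beta=0.4, rng=rng)  # independent partial network held by O_Q
```

Drawing the two edge removals independently is what makes the two partial networks complementary: an edge missing from one may still be present in the other.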
Evaluation
As discussed before, there is no generic approach to privacy preservation since sensitive information can be defined in various ways. Moreover, the shareable useful information differs across SNAM tasks. In this work, we treat node identity as sensitive information and consider the distance between nodes as the useful information that we want to maintain. To evaluate our proposed technique, we assume that the SNAM task conducted on G_{P} is to compute the closeness centrality of each node. If G_{P} is close to G, then the distance between any two nodes in G_{P} should be roughly equal to their distance in G, leading to a similar closeness centrality for each node. Otherwise, nodes in G_{P} will have closeness centrality values that differ from those in G. In this work, closeness centrality for a node in G_{P} is computed as:
where n is the total number of nodes in G_{P}.
Given a complete social network G and the integrated social network G_{Integrated}, the performance of our proposed technique is evaluated by the error function defined as:
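As a rough illustration of this evaluation, the sketch below assumes the common definition of closeness centrality, (n - 1) divided by the sum of shortest-path distances, and an error function that sums the absolute per-node differences in closeness. Both forms are assumptions of this sketch rather than the definitions used in the experiments, and the function names and the three-node example are hypothetical.

```python
def closeness(dist, node):
    """Closeness of node given dist[node][other] = shortest-path length.

    ASSUMPTION: (n - 1) / sum of distances, with every other node reachable.
    """
    lengths = [d for other, d in dist[node].items() if other != node]
    return len(lengths) / sum(lengths)


def total_error(dist_reference, dist_estimate):
    """ASSUMPTION: error = sum over nodes of the absolute closeness difference."""
    return sum(
        abs(closeness(dist_reference, v) - closeness(dist_estimate, v))
        for v in dist_reference
    )


# Path a - b - c as the reference; the estimate overstates the a-c distance.
reference = {"a": {"b": 1, "c": 2}, "b": {"a": 1, "c": 1}, "c": {"a": 2, "b": 1}}
estimate = {"a": {"b": 1, "c": 3}, "b": {"a": 1, "c": 1}, "c": {"a": 3, "b": 1}}
err = total_error(reference, estimate)
```

Summing absolute differences per node means that overestimates at some nodes cannot cancel underestimates at others, unlike a comparison of average closeness values.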
Experiment
Figure 4 shows the average closeness centrality of the nodes of the original graph (G), the integrated graph using the EBB method (G_{Integrated} (EBB)), the integrated graph using the KNN method (G_{Integrated} (KNN)), and the incomplete graph (G_{P}). In Figure 4, the blue line represents the average closeness centrality computed from G, which serves as the gold standard: the closer a curve is to this blue line, the better.
Figure 4. Average closeness centrality of the complete graph (G), the integrated graph using the EBB method, the integrated graph using the KNN method, and the incomplete graph G_{P}: (a) α = 0.05, (b) α = 0.15, (c) α = 0.25, (d) α = 0.35, (e) α = 0.45, (f) α = 0.55, (g) α = 0.65, (h) α = 0.75, (i) α = 0.85, (j) α = 0.95.
For each α from 0.05 to 0.95, we increased β (the percentage of edges randomly removed from G) from 0.2 to 0.8. We observed that the performance of G_{P}, G_{Integrated} (KNN), and G_{Integrated} (EBB) decreased consistently as more edges were removed from the complete graph, regardless of the value of α. Although our proposed technique integrates networks and estimates the average closeness centrality, the result will not be as good as the average closeness centrality computed from the actual graph G. When more edges are removed before integration (β increases), the performance degrades.
We further investigated the performance of our proposed technique by increasing the percentage of insensitive nodes from 0.05 (Figure 4(a)) to 0.95 (Figure 4(j)). Similar patterns are observed from Figure 4(a) to 4(j). In terms of average closeness centrality, increasing or decreasing the percentage of insensitive nodes in the network did not have a substantial impact on the performance of our proposed technique. One plausible explanation is that the average closeness centrality used in this experiment only reflects the performance of our approach at an aggregate level. Some nodes in the integrated network may have a higher closeness centrality than in the complete graph, while other nodes may have a lower one. As a result, when we consider the average closeness centrality, these differences may offset each other.
Figure 5 (a) presents the error ratio of G_{Integrated} (KNN) with different α and β. Similarly, Figure 5 (b) presents the error ratio of G_{Integrated} (EBB) with different α and β. We compute the errors in closeness centrality obtained from the networks with and without integration (Error(G_{Integrated}) and Error(G_{P})) using the error function defined in Section 5.2, and the error ratio is defined as:
Figure 5. (a) Error ratio of G_{Integrated} (KNN) with different settings of β and α; (b) error ratio of G_{Integrated} (EBB) with different settings of β and α.
Unlike the average closeness centrality used as the measurement in Figure 4, the error function accumulates the closeness centrality difference for each individual node, so the offset effect of averaging does not occur. The experiment results in Figure 5 can therefore be used to verify our explanation of Figure 4 given above.
The experiment results demonstrate that when α is high (that is, there are more insensitive nodes), the improvement of our proposed technique over the partial graph is also higher. The highest improvement was achieved when α equals 0.95, and the improvement decreased slowly as α decreased. This observation supports our explanation of Figure 4. With more insensitive nodes, the integrated network is closer to the original network, so the improvement of our technique is higher.
Last but not least, in both Figures 4 and 5, we do not observe any significant difference in performance between using KNN and EBB to produce the generalized network. However, as shown in Section 4.2.2, the EBB algorithm is dominated by the step of calculating edge betweenness, which has time complexity O(N^{3}). KNN, on the other hand, is much more efficient at only O(N). As a result, when the network is very large, KNN is preferred. Moreover, in a fully connected network where several edges have the same betweenness, EBB will take longer to produce the generalized network. KNN also has its limitations, however. KNN starts from each insensitive node and looks for sensitive nodes in its neighborhood to form a subgraph step by step. The search process is not fully simultaneous but is controlled by a FOR loop, so the order of iteration in the loop matters, especially for nodes that lie between two insensitive nodes. As a result, the division into subgraphs produced by KNN is less natural than that of the EBB method.
Conclusion
In this paper, we investigated privacy preservation techniques for social network integration. We introduced a research framework that consists of three major steps. First, we proposed the K-Nearest Neighbor (KNN) method and the Edge Betweenness Based (EBB) method to decompose a social network into multiple subgraphs. Second, we proposed techniques to generalize a social network by sharing the probabilistic model of the generalized information. Third, we introduced techniques for social network integration and distance estimation.
Using the Global Salafi Jihad terrorist social network as a test bed, we evaluated our proposed technique with different parameters and settings. The experiment results demonstrated that an organization can improve the accuracy of computing closeness centrality by sharing and integrating generalized information. Our proposed techniques were able to preserve privacy as well as increase the utility of the shared social networks. We observed that KNN performed slightly better than EBB, but the difference was not substantial. Moreover, our proposed techniques were not sensitive to the number of insensitive nodes but were relatively sensitive to the number of removed edges.
In the future, we will continue to examine our techniques on more datasets. We will explore other graph partition models and integration techniques to improve the performance of our technique. Moreover, we will extend our work to maintain other useful information besides distance.
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
CCY proposed the research framework. CCY and XT developed the algorithms together. XT implemented the algorithms and conducted experiments. CCY and XT together drafted the manuscript. All authors read and approved the final manuscript.
References

Yang CC, Sageman M: Analysis of terrorist social networks with fractal views.
J Inf Sci 2009, 35(3):299-320.

Yang CC, Ng T: Terrorism and crime related weblog social network: Link, content analysis and information visualization.

Yang CC, Liu N, Sageman M: Analyzing the terrorist social networks with visualization tools.

Yang CC, Tang X: Social networks integration and privacy preservation using subgraph generalization. In Proceedings of the ACM SIGKDD Workshop on CyberSecurity and Intelligence Informatics. ACM; 53-61.

Yang CC, Tang X: Information Integration for Terrorist or Criminal Social Networks.
Ann Inform Syst 2010, 9:41-57.

Tang X, Yang CC: Generalizing terrorist social networks with K-nearest neighbor and edge betweeness for social network integration and privacy preservation.

Yang CC, Thuraisingham B: Privacy-Preserved Social Network Integration and Analysis for Security Informatics.

Campan A, Truta T: A clustering approach for data and structural anonymity in social networks.
In Proceedings of the Second ACM SIGKDD International Workshop on Privacy, Security, and Trust in KDD.

Hay M, Miklau G, Jensen D, Weis P, Srivastava S: Anonymizing social networks. University of Massachusetts Technical Report; 2007:07-19.

Liu K, Terzi E: Towards identity anonymization on graphs. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM New York, NY, USA; 2008:93-106.

Hay M, Miklau G, Jensen D, Towsley D, Weis P: Resisting structural re-identification in anonymized social networks.

Zhou B, Pei J: Preserving privacy in social networks against neighborhood attacks. In IEEE 24th International Conference on Data Engineering. ICDE; 2008:506-515.

Zheleva E, Getoor L: Preserving the privacy of sensitive relationships in graph data.

Backstrom L, Dwork C, Kleinberg J: Wherefore art thou r3579x?: anonymized social networks, hidden patterns, and structural steganography. In Proceedings of the 16th International Conference on World Wide Web. ACM New York, NY, USA; 181-190.

Cormode G, Srivastava D, Yu T, Zhang Q: Anonymizing bipartite graph data using safe groupings.

Randomizing social networks: a spectrum preserving approach. SIAM Conf. on Data Mining.

Sweeney L: k-anonymity: A model for protecting privacy.
Int J Uncertainty Fuzziness Knowledge Based Syst 2002, 10(5):557-570.

Samarati P: Protecting respondents' identities in microdata release.

Machanavajjhala A, Kifer D, Gehrke J, Venkitasubramaniam M: l-diversity: Privacy beyond k-anonymity.
ACM Trans Knowledge Discov Data (TKDD) 2007, 1(1):3.

Xiao X, Tao Y: Personalized privacy preservation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM New York, NY, USA; 2006:229-240.

Wong R, Li J, Fu A, Wang K: (alpha, k)-anonymity: an enhanced k-anonymity model for privacy preserving data publishing. ACM Press; 754-759.

Frikken K, Golle P: Private social network analysis: How to assemble pieces of a graph privately. ACM New York, NY, USA; 89-98.

Sageman M: Understanding Terror Networks. University of Pennsylvania Press; 2004.