Final: Question 2

Please use the Enron dataset you imported for the previous problem. For this question, you will use the aggregation framework to figure out pairs of people that tend to communicate a lot. To do this, you will need to unwind the To list for each message.

This problem is a little tricky because a recipient may appear more than once in the To list for a message. You will need to fix that in a stage of the aggregation before doing your grouping and counting of (sender, recipient) pairs.

Which pair of people have the greatest number of messages in the dataset?

Solution:   Query to run as per the question is below as per my understanding:

db.messages.aggregate([
	{$project: {
		from: "$headers.From",
		to: "$headers.To"
	}},
	{$unwind: "$to"},
	{$project: {
		pair: {
			from: "$from",
			to: "$to"
		},
		count: {$add: [1]}
	}},
	{$group: {
		_id: "$pair",
		count: {$sum: 1}
	}},
	{$sort: {
		count: -1
	}},
	{$limit: 2},
	{$skip: 1}
])

Results that I found is below:

{ "_id" : { "from" : "susan.mara@enron.com", "to" : "richard.shapiro@enron.com" }, "count" : 974 }
final_exam_question_and-answer-2

Let me know if you guys find extra so that we can discuss it.

Thanks

2 COMMENTS

LEAVE A REPLY

Please enter your comment!
Please enter your name here