How to remove duplicate values inside a list in MongoDB

Well, you can do this using the aggregation framework as follows:

collection.aggregate([
    { "$project": {
        "name": 1,
        "code": 1,
        "abbreviation": 1,
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] }
    }}
])

The $setUnion operator is a "set" operator, and since a "set" can only contain unique values, the duplicates are discarded.
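
For example, given a document whose "bill_codes" contains repeats, the projection returns only the distinct values. Here is a minimal pymongo sketch of the same pipeline; the "test" database and "bills" collection names are just for illustration:

from pymongo import MongoClient

collection = MongoClient()["test"]["bills"]
collection.insert_one({
    "name": "A", "code": "X", "abbreviation": "AX",
    "bill_codes": [123, 456, 123, 456]
})

result = list(collection.aggregate([
    { "$project": {
        "name": 1,
        "code": 1,
        "abbreviation": 1,
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] }
    }}
]))
print(result[0]["bill_codes"])  # [123, 456] -- "set" order is not guaranteed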

If you are still using a MongoDB version older than 2.6 then you would have to do this operation with $unwind and $addToSet instead:

collection.aggregate([
    { "$unwind": "$bill_codes" },
    { "$group": {
        "_id": "$_id",
        "name": { "$first": "$name" },
        "code": { "$first": "$code" },
        "abbreviation": { "$first": "$abbreviation" },
        "bill_codes": { "$addToSet": "$bill_codes" }
    }}
])

It's not as efficient, and $addToSet makes no guarantee about the order of the resulting array, but these operators have been supported since version 2.2.

Of course, if you actually want to modify your collection documents permanently, then you can expand on this and process the updates for each document accordingly. You can retrieve a "cursor" from .aggregate(), but basically you follow this shell example:

db.collection.aggregate([
    { "$project": {
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] },
        // compare the de-duplicated size with the original size
        "same": { "$eq": [
            { "$size": "$bill_codes" },
            { "$size": { "$setUnion": [ "$bill_codes", [] ] } }
        ]}
    }},
    // keep only documents that actually had duplicates
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})
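
If you are driving this from Python rather than the shell, the same pattern translates directly to pymongo. A sketch, assuming pymongo 3.x (where update_one replaces the shell's update) and a collection handle as above:

for doc in collection.aggregate([
    { "$project": {
        "bill_codes": { "$setUnion": [ "$bill_codes", [] ] },
        "same": { "$eq": [
            { "$size": "$bill_codes" },
            { "$size": { "$setUnion": [ "$bill_codes", [] ] } }
        ]}
    }},
    { "$match": { "same": False } }
]):
    collection.update_one(
        { "_id": doc["_id"] },
        { "$set": { "bill_codes": doc["bill_codes"] } }
    )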

A bit more involved for earlier versions:

db.collection.aggregate([
    // de-normalize the array entries
    { "$unwind": "$bill_codes" },
    // group per document and distinct value, counting occurrences
    { "$group": {
        "_id": {
            "_id": "$_id",
            "bill_code": "$bill_codes"
        },
        "origSize": { "$sum": 1 }
    }},
    // re-group per document, rebuilding the array from the distinct values
    { "$group": {
        "_id": "$_id._id",
        "bill_codes": { "$push": "$_id.bill_code" },
        "origSize": { "$sum": "$origSize" },
        "newSize": { "$sum": 1 }
    }},
    // flag documents where the de-duplicated size differs from the original
    { "$project": {
        "bill_codes": 1,
        "same": { "$eq": [ "$origSize", "$newSize" ] }
    }},
    { "$match": { "same": false } }
]).forEach(function(doc) {
    db.collection.update(
        { "_id": doc._id },
        { "$set": { "bill_codes": doc.bill_codes } }
    )
})

The added stages compare the length of the "de-duplicated" array with the length of the original array, and return only those documents that actually had "duplicates" removed, so only those documents are processed for updates.
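
When many documents need fixing, issuing one update per document costs one round trip each. If you are on pymongo 3.x you can batch the same updates with bulk_write; a sketch, where "pipeline" is a placeholder for either aggregation pipeline above:

from pymongo import UpdateOne

ops = []
for doc in collection.aggregate(pipeline):
    ops.append(UpdateOne(
        { "_id": doc["_id"] },
        { "$set": { "bill_codes": doc["bill_codes"] } }
    ))
    if len(ops) == 1000:  # flush in batches to keep memory bounded
        collection.bulk_write(ops)
        ops = []

if ops:  # flush whatever is left
    collection.bulk_write(ops)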


A "for Python" note is probably worth adding here as well. If you don't care about "identifying" the documents that contain duplicate array entries and are prepared to "blast" the whole collection with updates, then just use Python's set() in the client code to remove the duplicates:

for doc in collection.find():
    # set() drops the duplicates, but does not preserve the original order
    collection.update(
        { "_id": doc["_id"] },
        { "$set": { "bill_codes": list(set(doc["bill_codes"])) } }
    )

So that's quite simple, and it comes down to which is the greater evil: the cost of finding the documents with duplicates, or the cost of updating every document whether it needs it or not.
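
If you would rather avoid the aggregation pass but still skip the needless writes, a middle ground is to check on the client side. A sketch; this still reads every document, but only writes back the ones that actually changed:

for doc in collection.find():
    deduped = list(set(doc["bill_codes"]))
    if len(deduped) != len(doc["bill_codes"]):
        collection.update(
            { "_id": doc["_id"] },
            { "$set": { "bill_codes": deduped } }
        )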

This at least covers the techniques.


You can use a forEach loop with some JavaScript:

db.bill.find().forEach(function(entry) {
    var arr = entry.bill_codes;
    // keep only the first occurrence of each element
    var uniqueArray = arr.filter(function(elem, pos) {
        return arr.indexOf(elem) == pos;
    });
    entry.bill_codes = uniqueArray;
    db.bill.save(entry);
})
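
Note that unlike set() in the Python example, the filter/indexOf approach keeps the array elements in their original order. If order matters in Python too, a "seen" set does the same job; a sketch against the same pymongo collection handle as before:

for doc in collection.find():
    seen = set()
    unique = []
    for code in doc["bill_codes"]:
        if code not in seen:  # keep only the first occurrence
            seen.add(code)
            unique.append(code)
    collection.update(
        { "_id": doc["_id"] },
        { "$set": { "bill_codes": unique } }
    )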