
Duplicate ERA content found by Google Search Console #3372

@jefferya

Description

Related to #3289

When the sitemap filter is applied to the Google Search Console report "Duplicate without user-selected canonical", three items appear where Google considers the content similar to another item in the sitemap. Inspecting these URLs with the Google Search Console URL inspection tool shows that the "User-declared canonical" and the "Google-selected canonical" are very similar. An e-mail was sent to the erahelp team for advice (Jan 31; ref. #3289 (comment)).

The following week, Google Search Console reported additional items. These items seem to be older (i.e., not added in the last week).

Question: is there a way to test for duplicates more efficiently than Google?

Attempt 1: use the checksum column of the Active Storage active_storage_blobs table to verify that each attachment is unique, thereby finding any duplicate items.
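As a first sanity check, one could compare the total number of blobs with the number of distinct checksums; the gap gives a rough upper bound on duplicated file content. This is a sketch only, not part of the console session recorded below:

# Sketch (assumption, not run in the original session): if no file content were
# duplicated, the number of distinct checksums would equal the total blob count.
ActiveRecord::Base.connection.execute("SELECT count(*) AS total_blobs, count(DISTINCT checksum) AS distinct_checksums FROM active_storage_blobs")[0]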

irb(main):016:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_attachments")[0]
=> {"count"=>351758}
irb(main):017:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_blobs")[0]
=> {"count"=>388052}

irb(main):018:0> ActiveRecord::Base.connection.execute("select count(*) from theses")[0]
=> {"count"=>34452}
irb(main):019:0> ActiveRecord::Base.connection.execute("select count(*) from items")[0]
=> {"count"=>44781}

The number of blobs and attachments seems high relative to the number of items and theses. Could this be related to #3248?
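One possible contributor to the inflated blob count (an assumption, not verified in this session) is orphaned blobs left behind by direct uploads or purged attachments. A sketch of a query that would count blobs with no attachment row at all:

# Sketch (not run here): blobs without any attachment row would inflate the
# blob count relative to the number of items and theses.
ActiveRecord::Base.connection.execute("SELECT count(*) FROM active_storage_blobs AS b WHERE NOT EXISTS (SELECT 1 FROM active_storage_attachments AS a WHERE a.blob_id = b.id)")[0]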

Let's test: each unique file (checksum) should appear only once in active_storage_blobs, right?

irb(main):014:0> ActiveRecord::Base.connection.execute("SELECT a_blob.checksum, count(*) FROM active_storage_blobs as a_blob GROUP BY a_blob.checksum HAVING count(a_blob.checksum)>1").count()
=> 26520

Why are there so many blobs with the same checksum? Let's restrict the duplicate count to blobs attached to Item or Thesis records.

ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd5fcc98 status=PGRES_TUPLES_OK ntuples=374 nfields=1 cmd_tuples=374>

Why the decrease in numbers (from 26,520 duplicated checksums to 374)?

irb(main):065:0>  results = ActiveRecord::Base.connection.execute("SELECT record_type FROM  active_storage_attachments group by record_type")
=> #<PG::Result:0x00007f97dd922fd0 status=PGRES_TUPLES_OK ntuples=7 nfields=1 cmd_tuples=7>
irb(main):067:0> results.values
=>
[["ActiveStorage::Blob"],
 ["ActiveStorage::VariantRecord"],
 ["Community"],
 ["DraftItem"],
 ["DraftThesis"],
 ["Item"],
 ["Thesis"]]

Maybe the decrease is due to the filter on record_type? The answer seems to be "yes", based on the query below: removing the record_type filter brings the duplicate count back up.

ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97de6f34e8 status=PGRES_TUPLES_OK ntuples=34829 nfields=1 cmd_tuples=34829>
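A follow-up sketch (not part of the original session) that would break the duplicated checksums down by record_type, showing how much of the duplication comes from drafts and variant records:

# Sketch (assumption, not run here): which record types hold attachments that
# point at a duplicated checksum?
ActiveRecord::Base.connection.execute("SELECT attach.record_type, count(*) FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id = blob.id AND blob.checksum IN (SELECT b.checksum FROM active_storage_blobs AS b GROUP BY b.checksum HAVING count(b.checksum) > 1) GROUP BY attach.record_type ORDER BY count(*) DESC").values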

Using the list of duplicated checksums, let's find all the record_ids that have an attachment pointing to a duplicated checksum (the inner query is restricted to Item or Thesis record types with attachment name = "files"). This output will also return draft items if they are attached to a duplicated checksum.

irb(main):055:0>  results = ActiveRecord::Base.connection.execute("SELECT * FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97e5553780 status=PGRES_TUPLES_OK ntuples=1118 nfields=16 cmd_tuples=1118>
 {"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "name"=>"files",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "record_type"=>"Item",
  "fileset_uuid"=>"819ebccf-476e-43e7-8712-17dcf5c0ddf8",
  "record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
  "blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"},
 {"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "name"=>"files",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "record_type"=>"DraftItem",
  "fileset_uuid"=>nil,
  "record_id"=>"f4244872-c911-46ee-a56c-f013accec73d",
  "blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
  "key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"},
 {"id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
  "name"=>"files",
  "created_at"=>"2020-12-23T17:44:09.337Z",
  "record_type"=>"DraftItem",
  "fileset_uuid"=>nil,
  "record_id"=>"ad7db058-22e2-44f7-a71f-53f2097cbd4d",
  "blob_id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
  "key"=>"lujere8kg5fpoad2cpslxec22q7o",
  "filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
  "metadata"=>"{\"identified\":true,\"analyzed\":true}",
  "byte_size"=>63687,
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "service_name"=>"local"}]

Let's filter out the DraftItem and DraftThesis records.

irb(main):058:0>  results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND (a.record_type = 'Item' OR a.record_type='Thesis') and a.name='files' and b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97dd710710 status=PGRES_TUPLES_OK ntuples=835 nfields=16 cmd_tuples=835>
irb(main):059:0> results.values
...
 {"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
  "record_type"=>"Item",
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "created_at"=>"2020-08-07T16:41:30.790Z",
  "filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
 {"record_id"=>"20121c56-ed6e-4b1c-986e-7014401e83c5",
  "record_type"=>"Item",
  "checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
  "created_at"=>"2020-12-23T17:44:09.337Z",
  "filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
  "content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}

Let's write this to a CSV file

require 'csv' # make sure the CSV library is loaded in the console

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
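A small variant (an assumption, not what was actually run) that writes a header row first so the spreadsheet columns are labelled; PG::Result#fields returns the column names of the query:

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  csv << results.fields # header row with the column names
  results.each do |row|
    csv << row.values
  end
end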

Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):

irb(main):050:0> results = ActiveRecord::Base.connection.execute("SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd75b5d0 status=PGRES_TUPLES_OK ntuples=12 nfields=2 cmd_tuples=12>
irb(main):051:0> results.as_json
=>
[{"checksum"=>"Zl2l9EGyAmAksQ5tS+B/iA==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
 {"checksum"=>"CgOhJouY1r4jDj0HQAXm2w==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
 {"checksum"=>"1rjlByqkT2HV70S4rQsqTw==", "record_id"=>"ebef783e-da83-478b-8467-d5cc99709272"},
 {"checksum"=>"HljKqB6+a4z7yjfqllx4kQ==", "record_id"=>"6e08f261-6dd1-4498-a4be-c3cfb1e46393"},
 {"checksum"=>"qtt8obM5FdBOZKgmhECXFg==", "record_id"=>"9b515e88-e783-44b4-89ef-7ea4d3dc02fc"},
 {"checksum"=>"MxCGonF7giYkVKN4dx+rKg==", "record_id"=>"4653e6af-12d3-4767-be78-2713dec4d1f6"},
 {"checksum"=>"BY3gLzkSPTF18A1JX67kBg==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
 {"checksum"=>"xzbqWsiVx+4hVQiY64SAVA==", "record_id"=>"25520184-11ad-4465-a039-18695e42a92d"},
 {"checksum"=>"aL/ITdTrXnTksq0IU5ABcA==", "record_id"=>"fae6573a-934c-4b60-93e6-38a7ea82651c"},
 {"checksum"=>"KBSFkcK7+prvJwAqpJTAaQ==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
 {"checksum"=>"36r3b8SWaEEHSDmUDWYUZw==", "record_id"=>"fc272f44-17bb-4980-bb31-3b86097abab0"},
 {"checksum"=>"wXGvEO+P2YiZJuaB6CPwYA==", "record_id"=>"0b800874-9bc7-49e1-9f08-aeba43cfde77"}]

Are these duplicate attachments intentional?
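To judge whether they are intentional, one could spot-check a flagged record in the console and list its attachments by filename and checksum. This is a sketch that assumes Item and Thesis use has_many_attached :files (consistent with the attach.name='files' filter above):

# Sketch (not run in the original session): inspect the first flagged record.
record_id = results[0]["record_id"]
record = Item.find_by(id: record_id) || Thesis.find_by(id: record_id)
record.files.map { |f| [f.blob.filename.to_s, f.blob.checksum] }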

Let's output these nicely, in a format similar to the duplicate-records query above.

results = ActiveRecord::Base.connection.execute("SELECT  record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id and (b.checksum,a.record_id) IN (SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1)")

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end

Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193
