Description
Related to #3289
When the sitemap filter "Duplicate without user-selected canonical" is applied in Google Search Console, three items are reported where Google thinks the content is similar to another item in the sitemap. Upon inspecting these URLs in Google Search Console, the "User-declared canonical" and the "Google-selected canonical" appear very similar. An e-mail was sent to the erahelp team for advice (Jan 31; ref. #3289 (comment)).
The next week, Google Search Console reported additional items. These items seem to be older (i.e., not added in the last week).
Question: is there a way to test for duplicates more efficiently than Google?
Attempt 1: use the checksum column of the Active Storage active_storage_blobs table to verify each attachment is unique, and thereby find any duplicate items.
irb(main):016:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_attachments")[0]
=> {"count"=>351758}
irb(main):017:0> ActiveRecord::Base.connection.execute("select count(*) from active_storage_blobs")[0]
=> {"count"=>388052}
irb(main):018:0> ActiveRecord::Base.connection.execute("select count(*) from theses")[0]
=> {"count"=>34452}
irb(main):019:0> ActiveRecord::Base.connection.execute("select count(*) from items")[0]
=> {"count"=>44781}
The number of blobs and attachments seems high relative to the number of items and theses. Could this be related to #3248?
Let's test: each row in active_storage_blobs should have a unique checksum, one per unique file, right?
irb(main):014:0> ActiveRecord::Base.connection.execute("SELECT a_blob.checksum, count(*) FROM active_storage_blobs as a_blob GROUP BY a_blob.checksum HAVING count(a_blob.checksum)>1").count()
=> 26520
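Active Storage records each blob's checksum as the base64-encoded MD5 digest of the file's bytes, so byte-identical files always share a checksum regardless of filename, which is what makes this comparison viable. A minimal sketch (the `blob_checksum` helper name is ours, not Active Storage's):

```ruby
require "digest"

# Active Storage stores the base64-encoded MD5 of a blob's bytes, so
# identical content always yields an identical checksum.
def blob_checksum(data)
  Digest::MD5.base64digest(data)
end

a = blob_checksum("same bytes")
b = blob_checksum("same bytes")
c = blob_checksum("different bytes")

puts a == b  # true: duplicate content shares a checksum
puts a == c  # false
```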
Why are there so many blobs with the same checksum? Let's filter the duplicate-checksum count by the attachment's record_type:
ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd5fcc98 status=PGRES_TUPLES_OK ntuples=374 nfields=1 cmd_tuples=374>
Why the decrease in numbers?
irb(main):065:0> results = ActiveRecord::Base.connection.execute("SELECT record_type FROM active_storage_attachments group by record_type")
=> #<PG::Result:0x00007f97dd922fd0 status=PGRES_TUPLES_OK ntuples=7 nfields=1 cmd_tuples=7>
irb(main):067:0> results.values
=>
[["ActiveStorage::Blob"],
["ActiveStorage::VariantRecord"],
["Community"],
["DraftItem"],
["DraftThesis"],
["Item"],
["Thesis"]]
Maybe it is due to the filter on record_type? The answer seems to be "yes", based on the unfiltered query below:
ActiveRecord::Base.connection.execute("SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id GROUP BY blob.checksum HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97de6f34e8 status=PGRES_TUPLES_OK ntuples=34829 nfields=1 cmd_tuples=34829>
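So the larger, unfiltered count includes checksums shared with drafts, variants, and other record types, not just Item/Thesis duplicates. A small sketch over hypothetical rows shows why filtering on record_type shrinks the duplicate count:

```ruby
# Hypothetical attachment rows; the record_type values mirror the
# types listed above.
rows = [
  { record_type: "Item",                         checksum: "abc" },
  { record_type: "DraftItem",                    checksum: "abc" },
  { record_type: "ActiveStorage::VariantRecord", checksum: "def" },
  { record_type: "Item",                         checksum: "def" },
]

# Unfiltered: both checksums look duplicated.
all_dups = rows.group_by { |r| r[:checksum] }
               .select { |_, group| group.size > 1 }
               .keys

# Filtered to Item/Thesis only: no checksum repeats.
item_rows = rows.select { |r| %w[Item Thesis].include?(r[:record_type]) }
item_dups = item_rows.group_by { |r| r[:checksum] }
                     .select { |_, group| group.size > 1 }
                     .keys

p all_dups   # => ["abc", "def"]
p item_dups  # => []
```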
From the list of duplicated checksums, let's find all the record_ids with attachments pointing to a duplicated checksum (Item or Thesis record types with attachment name = "files"). This output will also include DraftItems if they are attached to a duplicated checksum.
irb(main):055:0> results = ActiveRecord::Base.connection.execute("SELECT * FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97e5553780 status=PGRES_TUPLES_OK ntuples=1118 nfields=16 cmd_tuples=1118>
{"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"name"=>"files",
"created_at"=>"2020-08-07T16:41:30.790Z",
"record_type"=>"Item",
"fileset_uuid"=>"819ebccf-476e-43e7-8712-17dcf5c0ddf8",
"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
"blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"},
{"id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"name"=>"files",
"created_at"=>"2020-08-07T16:41:30.790Z",
"record_type"=>"DraftItem",
"fileset_uuid"=>nil,
"record_id"=>"f4244872-c911-46ee-a56c-f013accec73d",
"blob_id"=>"76e3991d-e5dd-4e83-9356-08ee8e49b013",
"key"=>"7qEQQyAXtHZJPSVwi25PaMcc",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"},
{"id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
"name"=>"files",
"created_at"=>"2020-12-23T17:44:09.337Z",
"record_type"=>"DraftItem",
"fileset_uuid"=>nil,
"record_id"=>"ad7db058-22e2-44f7-a71f-53f2097cbd4d",
"blob_id"=>"57a6f5c1-7ad0-4fd9-8ed5-bd8b211d8f39",
"key"=>"lujere8kg5fpoad2cpslxec22q7o",
"filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document",
"metadata"=>"{\"identified\":true,\"analyzed\":true}",
"byte_size"=>63687,
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"service_name"=>"local"}]
Let's filter out the DraftItem and DraftThesis records:
irb(main):058:0> results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id AND (a.record_type = 'Item' OR a.record_type='Thesis') and a.name='files' and b.checksum IN ( SELECT blob.checksum FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum HAVING count(blob.checksum) > 1) ORDER BY b.checksum" )
=> #<PG::Result:0x00007f97dd710710 status=PGRES_TUPLES_OK ntuples=835 nfields=16 cmd_tuples=835>
irb(main):059:0> results.values
...
{"record_id"=>"bff082da-6ba2-4005-a625-bb9645ed50eb",
"record_type"=>"Item",
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"created_at"=>"2020-08-07T16:41:30.790Z",
"filename"=>"Moving Beyond Ideology Contemporary Recreation and the Neoliberal Discourses.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"},
{"record_id"=>"20121c56-ed6e-4b1c-986e-7014401e83c5",
"record_type"=>"Item",
"checksum"=>"zylBLkb9vux0Qk+qThBC9A==",
"created_at"=>"2020-12-23T17:44:09.337Z",
"filename"=>"Moving Beyond Ideology_PostPrintVersion_2020-05-15.docx",
"content_type"=>"application/vnd.openxmlformats-officedocument.wordprocessingml.document"}
Let's write this to a CSV file
require 'csv'  # CSV is not autoloaded in the Rails console

CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
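The CSV above is written without a header row; if a header would help in the shared sheet, one can be taken from the row hashes' keys. A sketch (the sample `results` rows here are hypothetical stand-ins for the PG::Result rows):

```ruby
require "csv"

# Hypothetical stand-in for the query results; PG::Result yields rows as
# hashes keyed by column name, which is all this relies on.
results = [
  { "record_id" => "r1", "checksum" => "abc", "filename" => "a.docx" },
  { "record_id" => "r2", "checksum" => "abc", "filename" => "b.docx" },
]

csv_text = CSV.generate do |csv|
  csv << results.first.keys              # header row from the column names
  results.each { |row| csv << row.values }
end

puts csv_text
```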
Let's check if there are records (Item & Thesis) with multiple attachments with the same checksum (i.e., a file attached to a record multiple times):
irb(main):050:0> results = ActiveRecord::Base.connection.execute("SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1")
=> #<PG::Result:0x00007f97dd75b5d0 status=PGRES_TUPLES_OK ntuples=12 nfields=2 cmd_tuples=12>
irb(main):051:0> results.as_json
=>
[{"checksum"=>"Zl2l9EGyAmAksQ5tS+B/iA==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
{"checksum"=>"CgOhJouY1r4jDj0HQAXm2w==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
{"checksum"=>"1rjlByqkT2HV70S4rQsqTw==", "record_id"=>"ebef783e-da83-478b-8467-d5cc99709272"},
{"checksum"=>"HljKqB6+a4z7yjfqllx4kQ==", "record_id"=>"6e08f261-6dd1-4498-a4be-c3cfb1e46393"},
{"checksum"=>"qtt8obM5FdBOZKgmhECXFg==", "record_id"=>"9b515e88-e783-44b4-89ef-7ea4d3dc02fc"},
{"checksum"=>"MxCGonF7giYkVKN4dx+rKg==", "record_id"=>"4653e6af-12d3-4767-be78-2713dec4d1f6"},
{"checksum"=>"BY3gLzkSPTF18A1JX67kBg==", "record_id"=>"e02cab87-1cab-413c-8bd8-c50764896a5e"},
{"checksum"=>"xzbqWsiVx+4hVQiY64SAVA==", "record_id"=>"25520184-11ad-4465-a039-18695e42a92d"},
{"checksum"=>"aL/ITdTrXnTksq0IU5ABcA==", "record_id"=>"fae6573a-934c-4b60-93e6-38a7ea82651c"},
{"checksum"=>"KBSFkcK7+prvJwAqpJTAaQ==", "record_id"=>"18ecea79-c21a-44be-be60-fe9fbccdb036"},
{"checksum"=>"36r3b8SWaEEHSDmUDWYUZw==", "record_id"=>"fc272f44-17bb-4980-bb31-3b86097abab0"},
{"checksum"=>"wXGvEO+P2YiZJuaB6CPwYA==", "record_id"=>"0b800874-9bc7-49e1-9f08-aeba43cfde77"}]
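Grouping on the (checksum, record_id) pair is what restricts the query to duplicates within a single record, as opposed to the cross-record duplicates found earlier. A sketch over hypothetical rows:

```ruby
# Hypothetical attachments: the same file attached twice to record r1,
# plus the same file attached once to a different record r2.
attachments = [
  { record_id: "r1", checksum: "abc" },
  { record_id: "r1", checksum: "abc" },  # duplicate within r1
  { record_id: "r2", checksum: "abc" },  # same file, different record
]

# Equivalent of GROUP BY checksum, record_id HAVING count(...) > 1:
within_record_dups =
  attachments.group_by { |a| [a[:checksum], a[:record_id]] }
             .select { |_, group| group.size > 1 }
             .keys

p within_record_dups  # => [["abc", "r1"]]
```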
Are these intentional?
Let's output these nicely, in a format similar to the duplicate-records query above:
results = ActiveRecord::Base.connection.execute("SELECT record_id, record_type, checksum, a.created_at, b.created_at, filename, content_type FROM active_storage_attachments AS a, active_storage_blobs AS b WHERE a.blob_id=b.id and (b.checksum,a.record_id) IN (SELECT blob.checksum, attach.record_id FROM active_storage_attachments AS attach, active_storage_blobs AS blob WHERE attach.blob_id=blob.id AND (attach.record_type = 'Item' OR attach.record_type='Thesis') and attach.name='files' GROUP BY blob.checksum,attach.record_id HAVING count(blob.checksum) > 1)")
CSV.open('/era_tmp/delete_me.csv', 'wb') do |csv|
  results.each do |row|
    csv << row.values
  end
end
Google sheet shared: https://docs.google.com/spreadsheets/d/1khOWEk2XusG98vafWBgzACmbM-a5TR7K4Xzy1VcZl6M/edit#gid=1219983193