Duplicate files, Deduper in Google Drive

My Google Drive holds a lot of files.

https://drive.google.com/drive/my-drive

[Image: storage drive.google.com drive my-drive.png]


I mostly find videos with the Drive cse app and copy them to a Team Drive with the G folder copy app. Team Drives are in the process of being renamed to shared drives.


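Once the copy is complete, I move the folder from the Team Drive to My Drive.
I go through a shared drive because a copy can be interrupted by the Google service quotas, and when that happens I can sign in with another member account of the same shared drive and resume the copy. See the copy app's documentation for details.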
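
To make the hand-off idea concrete, here is a minimal Python 3 sketch. It is not the copy app's code: `QuotaExceeded`, `copy_with_resume`, and `make_limited_copier` are hypothetical names, and the quota is simulated by a fake copier that fails after a fixed number of calls. It only illustrates that recording finished items lets a later run, possibly under another account, skip them.

```python
class QuotaExceeded(Exception):
    """Hypothetical stand-in for the error a quota-limited service raises."""


def copy_with_resume(file_ids, copy_one, done):
    """Copy each id not already in `done`; stop cleanly when quota runs out.

    Returns True when every file has been copied, False if interrupted.
    """
    for file_id in file_ids:
        if file_id in done:
            continue  # copied in an earlier run
        try:
            copy_one(file_id)
        except QuotaExceeded:
            return False  # progress so far stays recorded in `done`
        done.add(file_id)
    return True


def make_limited_copier(limit, log):
    """Build a fake copier that fails after `limit` calls, like a quota."""
    calls = {"n": 0}

    def copy_one(file_id):
        if calls["n"] >= limit:
            raise QuotaExceeded(file_id)
        calls["n"] += 1
        log.append(file_id)

    return copy_one


if __name__ == "__main__":
    ids = ["a", "b", "c", "d", "e"]
    done, log = set(), []
    # The first account hits its simulated quota after two copies.
    print(copy_with_resume(ids, make_limited_copier(2, log), done), sorted(done))
    # prints: False ['a', 'b']
    # A second account resumes and copies only what is left.
    print(copy_with_resume(ids, make_limited_copier(10, log), done), log)
    # prints: True ['a', 'b', 'c', 'd', 'e']
```

In a real setup the `done` set would have to be stored somewhere both accounts can read, which this sketch leaves out.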





Quotas for Google Services
Apps Script | Google Developers

https://developers.google.com/apps-script/guides/services/quotas
Apps Script services impose daily quotas and hard limitations on some features. If you exceed a quota or limitation, your script throws an exception and execution terminates.

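Before copying, I can check whether I already have a file by searching for its name in the Drive sheet app, but that is tedious.

Google Drive has a feature that lists the files you own sorted by size.

Just open the Google Drive site in a web browser and click Storage. See the image above for where Storage is located.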


Clear Google Drive space & increase storage

https://support.google.com/drive/answer/6374270


Google Drive
Clear space in Google Drive by deleting large files that you don't need. To sort your files by file size:

1. Use a computer to see your files listed from largest to smallest.
2. Put files you don't want in your trash, then permanently delete them. Learn how to delete files.
3. Within 24 hours, the items you deleted will show in the available space in your Google Drive account.


Recording the files in Google Drive in a spreadsheet, sorted by size, would make duplicates easy to spot, so I plan to build that. It will also record the md5Checksum field.
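
As a rough sketch of that plan in Python 3, the snippet below sorts file metadata by size and produces CSV text that a spreadsheet can import. It assumes the metadata has already been fetched as dicts using the Drive API v2 field names `id`, `title`, `fileSize`, and `md5Checksum`; `to_rows` and `to_csv` are hypothetical helper names and the sample files are made up. Files without a `fileSize` (for example Google Docs) are skipped.

```python
import csv
import io


def to_rows(files):
    """Return [title, size, md5Checksum, id] rows, largest file first."""
    with_size = [f for f in files if "fileSize" in f]
    # Sort by size descending, then by checksum, so identical files
    # end up on adjacent rows.
    with_size.sort(key=lambda f: (-int(f["fileSize"]), f.get("md5Checksum", "")))
    return [[f["title"], int(f["fileSize"]), f.get("md5Checksum", ""), f["id"]]
            for f in with_size]


def to_csv(rows):
    """Render the rows as CSV text with a header line."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["title", "fileSize", "md5Checksum", "id"])
    writer.writerows(rows)
    return out.getvalue()


if __name__ == "__main__":
    sample = [  # made-up metadata; v2 returns fileSize as a string
        {"id": "1", "title": "a.mp4", "fileSize": "100", "md5Checksum": "aaa"},
        {"id": "2", "title": "notes"},
        {"id": "3", "title": "b.mp4", "fileSize": "2500", "md5Checksum": "bbb"},
        {"id": "4", "title": "a-copy.mp4", "fileSize": "100", "md5Checksum": "aaa"},
    ]
    print(to_csv(to_rows(sample)))
```

With the rows ordered this way, two files of equal size and equal md5Checksum sit next to each other, which is what makes duplicates easy to see in the sheet.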


Google Drive APIs REST v2, Files

Resource representations

The metadata for a file.
https://developers.google.com/drive/api/v2/reference/files





eojji.com App

  
https://www.eojji.com/app



Search for files by name

Drive sheet app
drive.eojji.com/sheet



Drive Deduper
  
https://github.com/gsuitedevs/drive-utils/tree/master/deduper

Finds duplicate files in Drive based on md5 checksum and offers to trash them.
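
The idea can be tried on sample data without touching Drive. The following Python 3 sketch groups metadata dicts by `md5Checksum` and adds up the space taken by every copy after the first. `group_duplicates` and `wasted_bytes` are hypothetical names, the sample entries are made up, and `quotaBytesUsed` is assumed to arrive as a string, as in the Drive API v2 file resource.

```python
from collections import defaultdict


def group_duplicates(files):
    """Return lists of files that share an md5Checksum (two or more each)."""
    by_md5 = defaultdict(list)
    for f in files:
        if "md5Checksum" in f:  # files without a checksum cannot be compared
            by_md5[f["md5Checksum"]].append(f)
    return [group for group in by_md5.values() if len(group) > 1]


def wasted_bytes(groups):
    """Bytes used by every copy after the first one in each group."""
    return sum(int(f["quotaBytesUsed"]) for g in groups for f in g[1:])


if __name__ == "__main__":
    sample = [
        {"id": "1", "title": "clip.mp4", "md5Checksum": "aaa", "quotaBytesUsed": "1000"},
        {"id": "2", "title": "clip (1).mp4", "md5Checksum": "aaa", "quotaBytesUsed": "1000"},
        {"id": "3", "title": "other.mp4", "md5Checksum": "bbb", "quotaBytesUsed": "500"},
        {"id": "4", "title": "notes"},
    ]
    groups = group_duplicates(sample)
    print([[f["id"] for f in g] for g in groups])  # prints: [['1', '2']]
    print(wasted_bytes(groups))  # prints: 1000
```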


Install

sh install.sh



Run

sh run.sh



dedup.py

https://github.com/gsuitedevs/drive-utils/blob/master/deduper/dedup.py

# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""Dedupes identical files in Google Drive."""

import os
from collections import defaultdict
from apiclient import discovery
from apiclient.http import BatchHttpRequest
from httplib2 import Http
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client import tools

ONE_GIG = float(1073741824) # so we can do non-integer arithmetic

def auth(http):
  """Authorize an http client, asking the user if required.

  Args:
    http: an httplib2.Http instance to authorize.
  """
  storage = Storage(os.path.expanduser('~/.drive-deduper.dat'))
  credentials = storage.get()
  if credentials is None or credentials.invalid:
    flow = flow_from_clientsecrets(
        'client_secrets.json',
        scope='https://www.googleapis.com/auth/drive')
    flags = tools.argparser.parse_args(args=[])
    credentials = tools.run_flow(flow, storage, flags)
  credentials.authorize(http)


def create_client():
  """Creates an authorized Drive api client.

  Returns:
    Authorized drive client.
  """
  http = Http()
  auth(http)
  return discovery.build('drive', 'v2', http=http)


def fetch_all_metadata(client):
  """Fetches all the files.

  Args:
    client: Authorized drive api client.
  Returns:
    List of file metadata dicts that include an md5Checksum.
  """
  results = []
  page = ended = None
  while not ended:
    resp = client.files().list(pageToken=page, maxResults=100,
        q='trashed=false',
        fields='nextPageToken,items(id,md5Checksum,title,alternateLink,quotaBytesUsed)'
        ).execute()
    page = resp.get('nextPageToken')
    ended = page is None
    for item in resp['items']:
      if 'md5Checksum' in item:
        results.append(item)
    print 'Fetched: {}'.format(len(results))
  return results


def find_dupes(files):
  """Find the duplicates."""
  index = defaultdict(list)
  dupes = []
  for f in files:
    index[f['md5Checksum']].append(f)
  for k, v in index.items():
    if len(v) > 1:
      dupes.append(v)
  return dupes


def main():
  """Main entrypoint."""
  client = create_client()
  files = fetch_all_metadata(client)
  dupes = find_dupes(files)
  print '{} duplicates found. '.format(len(dupes))
  if len(dupes) == 0:
    print 'We are done.'
    return
  print 'Please check them.'
  total = 0
  for dupeset in dupes:
    print '--'
    for dupe in dupeset:
      print dupe['alternateLink'], dupe['title']
    for dupe in dupeset[1:]:
      total += int(dupe['quotaBytesUsed'])
  print '--'
  print '{} Gigabytes wasted.'.format(total / ONE_GIG)
  conf = raw_input('Great. Now trash the extras? (y/n) ')
  if conf.strip() == 'y':
    print 'Trashing.'
    batch = BatchHttpRequest()
    for dupeset in dupes:
      for dupe in dupeset[1:]:
        batch.add(client.files().trash(fileId=dupe['id']))
        if len(batch._order) == 1000: # batch maxes out at 1k
          batch.execute()
          batch = BatchHttpRequest()
    batch.execute()
    print 'We are done. Check the trash for your files.'
  else:
    print 'Not touching anything.'


if __name__ == '__main__':
  main()
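
The listing above is Python 2 (print statements, raw_input), and its trash loop reads the private attribute `batch._order` to tell when a batch holds 1000 requests. One alternative, sketched below in Python 3, is to split the file IDs into chunks of at most 1000 up front and build one batch per chunk. `chunked` is a hypothetical helper, not part of the Drive client library, and the sketch assumes the full list of IDs fits in memory.

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    ids = [str(i) for i in range(2500)]  # made-up file ids
    print([len(chunk) for chunk in chunked(ids, 1000)])
    # prints: [1000, 1000, 500]
```

Each chunk would then get its own batch request, and because `chunked` never yields an empty slice, no empty batch is built when there is nothing left to trash.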

Comments

Anonymous said…
Find Duplicate files
https://drive.eojji.com/find-duplicate
