Duplicate files, Deduper in Google Drive
There are a lot of files in my Google Drive.
https://drive.google.com/drive/my-drive
I usually find videos with the Drive cse app and copy them into a team drive with the G folder copy app. (Team Drives are in the process of being renamed to shared drives.)
When the copy is finished, I move the folder from the shared drive into My Drive.
The reason for going through a shared drive is that a copy can be interrupted by Google service quotas; when that happens, another member account of the same shared drive can sign in and resume the copy. See the copy app's documentation for details.
Quotas for Google Services
Apps Script | Google Developers
https://developers.google.com/apps-script/guides/services/quotas
Apps Script services impose daily quotas and hard limitations on some features. If you exceed a quota or limitation, your script throws an exception and execution terminates.
Before copying, I could search by file name in the Drive sheet app to check whether I already have the file, but that is tedious.
Google Drive has a built-in way to view your files sorted by size.
Simply open the Google Drive site in a web browser and click Storage.
Clear Google Drive space & increase storage
https://support.google.com/drive/answer/6374270
Clear space in Google Drive by deleting large files that you don't need. To sort your files by file size:
1. Use a computer to see your files listed from largest to smallest.
2. Put files you don't want in your trash, then permanently delete them.
3. Within 24 hours, the items you deleted will show in the available space in your Google Drive account.
Logging the files in my Google Drive to a spreadsheet, sorted by size, would make duplicates easy to spot, so I'm thinking of building that. The md5Checksum field will be recorded as well.
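As a rough sketch of that idea (the function below is hypothetical; it assumes metadata already fetched from the Drive v2 files.list call, whose items carry title, quotaBytesUsed, and md5Checksum, as in dedup.py further down), sorting and shaping rows for a spreadsheet might look like:

```python
def rows_for_sheet(items):
    """Sort Drive v2 file metadata by quotaBytesUsed, largest first,
    and shape each file into a spreadsheet row:
    [title, size in bytes, md5Checksum]."""
    ordered = sorted(items,
                     key=lambda f: int(f.get('quotaBytesUsed', 0)),
                     reverse=True)
    return [[f.get('title', ''),
             int(f.get('quotaBytesUsed', 0)),
             f.get('md5Checksum', '')]
            for f in ordered]
```

Since identical files have identical sizes and checksums, duplicates end up on adjacent rows once the sheet is sorted this way.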
Google Drive APIs REST v2, Files
Resource representations
The metadata for a file.
https://developers.google.com/drive/api/v2/reference/files
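One useful property of md5Checksum in the Files resource: it is the MD5 of the file's content, so a locally computed digest can confirm that a downloaded copy is identical. A minimal helper using Python's standard hashlib (the chunked read is so large videos don't need to fit in memory):

```python
import hashlib


def local_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a local file, reading 1 MiB at a
    time; the result can be compared against Drive's md5Checksum."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```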
https://www.eojji.com/app
Search for files by name
Drive sheet app
drive.eojji.com/sheet
Drive Deduper
https://github.com/gsuitedevs/drive-utils/tree/master/deduper
Finds duplicate files in Drive based on md5 checksum and offers to trash them.
Install
sh install.sh
Run
sh run.sh
dedup.py
https://github.com/gsuitedevs/drive-utils/blob/master/deduper/dedup.py
# Copyright 2018 Google Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Dedupes identical files in Google Drive."""
import os
from collections import defaultdict

from apiclient import discovery
from apiclient.http import BatchHttpRequest
from httplib2 import Http
from oauth2client.client import flow_from_clientsecrets
from oauth2client.file import Storage
from oauth2client import tools

ONE_GIG = float(1073741824)  # so we can do non-integer arithmetic


def auth(http):
    """Authorize an http client, asking the user if required.

    Args:
      http: an httplib2.Http instance to authorize.
    """
    storage = Storage(os.path.expanduser('~/.drive-deduper.dat'))
    credentials = storage.get()
    if credentials is None or credentials.invalid:
        flow = flow_from_clientsecrets(
            'client_secrets.json',
            scope='https://www.googleapis.com/auth/drive')
        flags = tools.argparser.parse_args(args=[])
        credentials = tools.run_flow(flow, storage, flags)
    credentials.authorize(http)


def create_client():
    """Creates an authorized Drive api client.

    Returns:
      Authorized drive client.
    """
    http = Http()
    auth(http)
    return discovery.build('drive', 'v2', http=http)


def fetch_all_metadata(client):
    """Fetches all the files.

    Args:
      client: Authorized drive api client.
    """
    results = []
    page = ended = None
    while not ended:
        resp = client.files().list(
            pageToken=page, maxResults=100,
            q='trashed=false',
            fields='nextPageToken,items(id,md5Checksum,title,alternateLink,quotaBytesUsed)'
        ).execute()
        page = resp.get('nextPageToken')
        ended = page is None
        for item in resp['items']:
            if 'md5Checksum' in item:
                results.append(item)
        print 'Fetched: {}'.format(len(results))
    return results


def find_dupes(files):
    """Find the duplicates."""
    index = defaultdict(list)
    dupes = []
    for f in files:
        index[f['md5Checksum']].append(f)
    for k, v in index.items():
        if len(v) > 1:
            dupes.append(v)
    return dupes


def main():
    """Main entrypoint."""
    client = create_client()
    files = fetch_all_metadata(client)
    dupes = find_dupes(files)
    print '{} duplicates found. '.format(len(dupes))
    if len(dupes) == 0:
        print 'We are done.'
        return
    print 'Please check them.'
    total = 0
    for dupeset in dupes:
        print '--'
        for dupe in dupeset:
            print dupe['alternateLink'], dupe['title']
        for dupe in dupeset[1:]:
            total += int(dupe['quotaBytesUsed'])
    print '--'
    print '{} Gigabytes wasted.'.format(total / ONE_GIG)
    conf = raw_input('Great. Now trash the extras? (y/n) ')
    if conf.strip() == 'y':
        print 'Trashing.'
        batch = BatchHttpRequest()
        for dupeset in dupes:
            for dupe in dupeset[1:]:
                batch.add(client.files().trash(fileId=dupe['id']))
                if len(batch._order) == 1000:  # batch maxes out at 1k
                    batch.execute()
                    batch = BatchHttpRequest()
        batch.execute()
        print 'We are done. Check the trash for your files.'
    else:
        print 'Not touching anything.'


if __name__ == '__main__':
    main()
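The core of dedup.py above is the grouping step: files that share an md5Checksum form a duplicate set, and everything past the first member of each set is a candidate for the trash. Stripped of the API calls, that logic can be exercised on plain dicts:

```python
from collections import defaultdict


def group_by_md5(files):
    """Group file metadata by md5Checksum and keep only groups with
    more than one member, mirroring find_dupes() above."""
    index = defaultdict(list)
    for f in files:
        index[f['md5Checksum']].append(f)
    return [group for group in index.values() if len(group) > 1]


def wasted_bytes(dupes):
    """Bytes reclaimable by trashing all but one copy in each set,
    the same sum dedup.py reports in gigabytes."""
    return sum(int(d['quotaBytesUsed'])
               for group in dupes for d in group[1:])
```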
https://drive.eojji.com/find-duplicate