How can I find duplicate files that only have different names?

FSlint can find duplicate files. But suppose I have 10,000 songs or images and want to find only those files that are identical but have different names? Right now I get a list with hundreds of dupes (in different folders). I want the names to be consistent, so I would like to see only identical files with different names, rather than identical files that share the same name.

Can FSlint do this with some advanced parameters (or is there a different program that can)?

Here is another, more flexible and easier-to-use solution!

Copy the script below and paste it to /usr/local/bin/dupe-check (or any other location and file name of your choice; note that you need root privileges to write to that particular location).
Make it executable by running:

    sudo chmod +x /usr/local/bin/dupe-check

As /usr/local/bin is in every user's PATH, everybody can now run the script directly without specifying its location.
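
To give a quick example (this invocation is only an illustration, not part of the original instructions), you can now call the script from anywhere, e.g.:

    cd /tmp
    dupe-check --version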

First, you should have a look at the help page of my script:

    $ dupe-check --help
    usage: dupe-check [-h] [-s COMMAND] [-r MAXDEPTH] [-e | -d] [-0] [-v | -q | -Q]
                      [-g] [-p] [-V]
                      [directory]

    Check for duplicate files

    positional arguments:
      directory             the directory to examine recursively (default '.')

    optional arguments:
      -h, --help            show this help message and exit
      -s COMMAND, --hashsum COMMAND
                            external system command to generate hashes
                            (default 'sha256sum')
      -r MAXDEPTH, --recursion-depth MAXDEPTH
                            the number of subdirectory levels to process:
                            0=only current directory, 1=max. 1st subdirectory
                            level, ... (default: infinite)
      -e, --equal-names     only list duplicates with equal file names
      -d, --different-names
                            only list duplicates with different file names
      -0, --no-zero         do not list 0-byte files
      -v, --verbose         print hash and name of each examined file
      -q, --quiet           suppress status output on stderr
      -Q, --list-only       only list the duplicate files, no summary etc.
      -g, --no-groups       do not group equal duplicates
      -p, --path-only       only print the full path in the results list,
                            otherwise format output like this:
                            `'FILENAME' (FULL_PATH)´
      -V, --version         show program's version number and exit

As you can see, to get a list of all files in the current directory (and all of its subdirectories) that have duplicate content but different file names, you need the -d flag plus any valid combination of the formatting options.
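
For instance (a hypothetical invocation; the ~/Music path is only an example), combining -d with the path-only and no-zero options could look like this:

    # list content duplicates that carry different names,
    # print bare paths and ignore empty files
    dupe-check -d -p -0 ~/Music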

We still assume the same test environment. Files with similar names (differing only in a number) have identical content:

    .
    ├── dir1
    │   ├── uname1
    │   └── uname3
    ├── grps
    ├── lsbrelease
    ├── lsbrelease2
    ├── uname1
    └── uname2

So we simply run:

    $ dupe-check
    Checked 7 files in total, 6 of them are duplicates by content.
    Here's a list of all duplicate files:

    'lsbrelease' (./lsbrelease)
    'lsbrelease2' (./lsbrelease2)

    'uname1' (./dir1/uname1)
    'uname1' (./uname1)
    'uname2' (./uname2)
    'uname3' (./dir1/uname3)

And here is the script:

    #! /usr/bin/env python3

    VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO = 0, 4, 1
    RELEASE_DATE, AUTHOR = "2016-02-11", "ByteCommander"

    import sys
    import os
    import shutil
    import subprocess
    import argparse


    # Allows normal output on stdout to coexist with a single,
    # self-updating status line on stderr.
    class Printer:
        def __init__(self, normal=sys.stdout, stat=sys.stderr):
            self.__normal = normal
            self.__stat = stat
            self.__prev_msg = ""
            self.__first = True
            self.__max_width = shutil.get_terminal_size().columns

        def __call__(self, msg, stat=False):
            if not stat:
                if not self.__first:
                    print("\r" + " " * len(self.__prev_msg) + "\r",
                          end="", file=self.__stat)
                print(msg, file=self.__normal)
                print(self.__prev_msg, end="", flush=True, file=self.__stat)
            else:
                if len(msg) > self.__max_width:
                    msg = msg[:self.__max_width-3] + "..."
                if not msg:
                    print("\r" + " " * len(self.__prev_msg) + "\r",
                          end="", flush=True, file=self.__stat)
                elif self.__first:
                    print(msg, end="", flush=True, file=self.__stat)
                    self.__first = False
                else:
                    print("\r" + " " * len(self.__prev_msg) + "\r",
                          end="", file=self.__stat)
                    print("\r" + msg, end="", flush=True, file=self.__stat)
                self.__prev_msg = msg


    # Yields (directory, filenames) tuples, descending at most maxdepth
    # subdirectory levels (a negative value means unlimited recursion).
    def file_walker(top, maxdepth=None):
        dirs, files = [], []
        for name in os.listdir(top):
            (dirs if os.path.isdir(os.path.join(top, name)) else files).append(name)
        yield top, files
        if maxdepth != 0:
            for name in dirs:
                for x in file_walker(os.path.join(top, name), maxdepth-1):
                    yield x


    printx = Printer()

    argparser = argparse.ArgumentParser(description="Check for duplicate files")
    argparser.add_argument("directory", action="store", default=".", nargs="?",
                           help="the directory to examine recursively "
                                "(default '%(default)s')")
    argparser.add_argument("-s", "--hashsum", action="store", default="sha256sum",
                           metavar="COMMAND", help="external system command to "
                           "generate hashes (default '%(default)s')")
    argparser.add_argument("-r", "--recursion-depth", action="store", type=int,
                           default=-1, metavar="MAXDEPTH",
                           help="the number of subdirectory levels to process: "
                                "0=only current directory, 1=max. 1st subdirectory "
                                "level, ... (default: infinite)")
    arggroupn = argparser.add_mutually_exclusive_group()
    arggroupn.add_argument("-e", "--equal-names", action="store_const",
                           const="e", dest="name_filter",
                           help="only list duplicates with equal file names")
    arggroupn.add_argument("-d", "--different-names", action="store_const",
                           const="d", dest="name_filter",
                           help="only list duplicates with different file names")
    argparser.add_argument("-0", "--no-zero", action="store_true", default=False,
                           help="do not list 0-byte files")
    arggroupo = argparser.add_mutually_exclusive_group()
    arggroupo.add_argument("-v", "--verbose", action="store_const",
                           const=0, dest="output_level",
                           help="print hash and name of each examined file")
    arggroupo.add_argument("-q", "--quiet", action="store_const",
                           const=2, dest="output_level",
                           help="suppress status output on stderr")
    arggroupo.add_argument("-Q", "--list-only", action="store_const",
                           const=3, dest="output_level",
                           help="only list the duplicate files, no summary etc.")
    argparser.add_argument("-g", "--no-groups", action="store_true", default=False,
                           help="do not group equal duplicates")
    argparser.add_argument("-p", "--path-only", action="store_true", default=False,
                           help="only print the full path in the results list, "
                                "otherwise format output like this: "
                                "`'FILENAME' (FULL_PATH)´")
    argparser.add_argument("-V", "--version", action="version",
                           version="%(prog)s {}.{}.{} ({} by {})".format(
                               VERSION_MAJOR, VERSION_MINOR, VERSION_MICRO,
                               RELEASE_DATE, AUTHOR))
    argparser.set_defaults(name_filter="a", output_level=1)
    args = argparser.parse_args()

    # Walk the tree and group all files by their content hash.
    hashes = {}
    dupe_counter = 0
    file_counter = 0
    try:
        for root, filenames in file_walker(args.directory, args.recursion_depth):
            if args.output_level <= 1:
                printx("--> {} files ({} duplicates) processed - '{}'".format(
                    file_counter, dupe_counter, root), stat=True)
            for filename in filenames:
                path = os.path.join(root, filename)
                file_counter += 1
                filehash = subprocess.check_output(
                    [args.hashsum, path], universal_newlines=True).split()[0]
                if args.output_level == 0:
                    printx(" ".join((filehash, path)))
                if filehash in hashes:
                    dupe_counter += 1 if len(hashes[filehash]) > 1 else 2
                    hashes[filehash].append((filename, path))
                    if args.output_level <= 1:
                        printx("--> {} files ({} duplicates) processed - '{}'"
                               .format(file_counter, dupe_counter, root), stat=True)
                else:
                    hashes[filehash] = [(filename, path)]
    except FileNotFoundError:
        printx("ERROR: Directory not found!")
        exit(1)
    except KeyboardInterrupt:
        printx("USER ABORTED SEARCH!")
        printx("Results so far:")

    if args.output_level <= 1:
        printx("", stat=True)
    if args.output_level == 0:
        printx("")
    if args.output_level <= 2:
        printx("Checked {} files in total, {} of them are duplicates by content."
               .format(file_counter, dupe_counter))
    if dupe_counter == 0:
        exit(0)
    elif args.output_level <= 2:
        printx("Here's a list of all duplicate{} files{}:".format(
            " non-zero-byte" if args.no_zero else "",
            " with different names" if args.name_filter == "d" else
            " with equal names" if args.name_filter == "e" else ""))

    # Print the groups of files that share a content hash, applying the
    # requested name filter.
    first_group = True
    for filehash in hashes:
        if len(hashes[filehash]) > 1:
            # skip groups of empty files if -0 was given
            if args.no_zero and os.path.getsize(hashes[filehash][0][1]) == 0:
                continue
            first_group = False
            if args.name_filter == "a":
                filtered = hashes[filehash]
            else:
                filenames = {}
                for filename, path in hashes[filehash]:
                    if filename in filenames:
                        filenames[filename].append(path)
                    else:
                        filenames[filename] = [path]
                filtered = [(filename, path) for filename in filenames if (
                                args.name_filter == "e" and len(filenames[filename]) > 1 or
                                args.name_filter == "d" and len(filenames[filename]) == 1)
                            for path in filenames[filename]]
            if len(filtered) == 0:
                continue
            if (not args.no_groups) and (args.output_level <= 2 or not first_group):
                printx("")
            for filename, path in sorted(filtered):
                if args.path_only:
                    printx(path)
                else:
                    printx("'{}' ({})".format(filename, path))
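
As a further illustration (the directory and the particular option values here are hypothetical, not taken from the answer above), the hash command, recursion depth and verbosity can be adjusted as well:

    # use md5sum instead of sha256sum, descend only one subdirectory level,
    # and suppress the progress line on stderr
    dupe-check -s md5sum -r 1 -q ~/Pictures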

If you are fine with the command listing all duplicate files, regardless of whether their file names are equal or different, you can use the following one-liner:

    find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-
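
For readability, here is the same pipeline spread over several lines with a comment per stage (the 64 in uniq -w64 is the length of a SHA-256 hex digest, and cut -b 67- skips those 64 characters plus the two separator characters that sha256sum prints before the file name):

    find . -type f -exec sha256sum {} \; |   # hash every regular file
        sort |                               # bring identical hashes next to each other
        uniq -w64 --all-repeated=separate |  # keep only lines whose first 64 chars repeat, blank line between groups
        cut -b 67-                           # drop the hash and separators, leaving just the path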

For the example run, I use the following directory structure. Files with similar names (differing only in a number) have identical content:

    .
    ├── dir1
    │   ├── uname1
    │   └── uname3
    ├── grps
    ├── lsbrelease
    ├── lsbrelease2
    ├── uname1
    └── uname2

Now let's watch our command do some magic:

    $ find . -type f -exec sha256sum {} \; | sort | uniq -w64 --all-repeated=separate | cut -b 67-
    ./lsbrelease
    ./lsbrelease2

    ./dir1/uname1
    ./dir1/uname3
    ./uname1
    ./uname2

Each group separated by a blank line consists of files with identical content. Files that are not duplicated anywhere are not listed.
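
If, as in the question, you only care about a single media collection, you can narrow down the find part accordingly (the path and name pattern below are just an example, not part of the original answer):

    # only hash MP3 files below ~/Music
    find ~/Music -type f -name '*.mp3' -exec sha256sum {} \; \
        | sort | uniq -w64 --all-repeated=separate | cut -b 67-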

Byte Commander's excellent script works, but it didn't give me quite the behaviour I needed (listing every duplicate that has at least one copy under a different name). I made the following change, and now it does exactly what I need (and has saved me a lot of time)! I changed the -d branch of the name-filter condition (line 160 in the original script) to:

    args.name_filter == "d" and len(filenames[filename]) >= 1 and len(filenames[filename]) != len(hashes[filehash]))