Identify duplicate lines in a file without deleting them?

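I keep my references in a text file as a long list of entries, each with two (or more) fields.

The first column is the reference's URL; the second column is the title, which may vary a bit depending on how the entry was made. The same goes for a third field, which may or may not be present.

I want to identify, but not remove, entries that have an identical first field (the reference URL). I know about sort -k1,1 -u, but that automatically (non-interactively) removes everything except the first hit. Is there a way to just let me know, so that I can choose which to keep?

In the extract below of three lines that have the same first field (http://unix.stackexchange.com/questions/49569/), I would like to keep line 2 because it has additional tags (sort, CLI), and delete lines #1 and #3:

 http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
 http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
 http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

Is there a program to help identify such "duplicates"? Then I could clean up manually by deleting lines #1 and #3 myself.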

If I understand your question, I think you need something like this:

 for dup in $(sort -k1,1 -u file.txt | cut -d' ' -f1); do grep -n -- "$dup" file.txt; done

Or:

 for dup in $(cut -d " " -f1 file.txt | sort | uniq -d); do grep -n -- "$dup" file.txt; done

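where file.txt is the file containing the data you are interested in.

The output shows each matching line prefixed with its line number. The first loop goes through every distinct first field, so all entries are listed, grouped by URL; the second loop lists only the entries whose first field occurs two or more times.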

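This is a classic problem that can be solved with the uniq command. uniq detects duplicate consecutive lines and can print only the lines that are not repeated (-u, --unique) or only the repeated ones (-d, --repeated).

Since the ordering of duplicate lines is not important to you, you should sort the file first. Then use uniq to print unique lines only: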

 sort yourfile.txt | uniq -u

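There is also a -c (--count) option that prefixes each line with its number of occurrences; combined with -d it shows how many times each duplicate occurs. See the manual page of uniq for details.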
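The reason for sorting first is that uniq only compares adjacent lines. As a rough illustration (a Python sketch, not how uniq itself is implemented), itertools.groupby has the same adjacent-only behaviour. The helper names uniq_c and uniq_d and the sample keys are made up for this example:

```python
from itertools import groupby

def uniq_c(lines):
    """Count runs of identical adjacent lines, roughly like `uniq -c`."""
    return [(len(list(run)), line) for line, run in groupby(lines)]

def uniq_d(lines):
    """Keep one copy of each line repeated within a run, roughly like `uniq -d`."""
    return [line for count, line in uniq_c(lines) if count > 1]

keys = ["b", "a", "b", "c", "a", "b"]   # hypothetical first fields, unsorted

print(uniq_d(keys))           # [] because no two equal keys are adjacent
print(uniq_d(sorted(keys)))   # ['a', 'b'] once sorting makes repeats adjacent
```

On unsorted input the repeats are simply not seen, which is why the pipelines here put sort in front of uniq.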

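If you really do not care about the parts after the first field, you can use the following command to find duplicate keys and print the line number of each occurrence (append another | sort -n to have the output sorted by line number):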

  cut -d ' ' -f1 .bash_history | nl | sort -k2 | uniq -s8 -D

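Since you want to see the duplicate lines (using the first field as the key), you cannot use uniq directly. What makes automation difficult is that the title parts vary, and a program cannot decide automatically which title should be considered the final one.

Here is an AWK script (save it to script.awk) that takes your text file as input and prints all duplicate lines so you can decide which to delete (awk -f script.awk yourfile.txt). It uses arrays of arrays, so it needs GNU awk 4.0 or later.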

 #!/usr/bin/awk -f
 {
     # Store the line ($0) grouped per URL ($1) with line number (NR) as key
     lines[$1][NR] = $0;
 }
 END {
     for (url in lines) {
         # find lines that have the URL occur multiple times
         if (length(lines[url]) > 1) {
             for (lineno in lines[url]) {
                 # Print duplicate line for decision purposes
                 print lines[url][lineno];
                 # Alternative: print line number and line
                 #print lineno, lines[url][lineno];
             }
         }
     }
 }
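For comparison, the same group-by-URL idea can be sketched in Python. This is only an illustration: the function name and the sample entries are assumptions, not part of the answer above. One difference is that a Python dict keeps the groups in first-seen order, whereas the order of awk's for-in loop is unspecified.

```python
def duplicate_groups(lines):
    """Group lines by first field; return only the groups with 2+ lines."""
    groups = {}                       # first field -> {line number: line}
    for lineno, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:                # ignore blank lines
            continue
        groups.setdefault(fields[0], {})[lineno] = line.rstrip("\n")
    return {url: hits for url, hits in groups.items() if len(hits) > 1}

# Hypothetical entries in the format described in the question.
sample = [
    "http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field\n",
    "http://example.com/single only one entry for this URL\n",
    "http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI\n",
]

for url, hits in duplicate_groups(sample).items():
    print(url)
    for lineno, line in hits.items():
        print(f"  {lineno}: {line}")
```

Printing the line number next to each entry makes it easier to delete the unwanted lines by hand afterwards.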

If I read this correctly, all you need is something like

 awk '{print $1}' file | sort | uniq -c | while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done

That will print the line number of each line that contains a dupe, followed by the line itself. For example, using this file:

 foo bar baz
 http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
 bar foo baz
 http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
 baz foo bar
 http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

it will produce this output:

 2:http://unix.stackexchange.com/questions/49569/ unique-lines-based-on-the-first-field
 4:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field sort, CLI
 6:http://unix.stackexchange.com/questions/49569/ Unique lines based on the first field

To print only the line numbers, you could do

 awk '{print $1}' file | sort | uniq -c | while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 1

and to print only the lines:

 awk '{print $1}' file | sort | uniq -c | while read num dupe; do [[ $num -gt 1 ]] && grep -n -- "$dupe" file; done | cut -d: -f 2-

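Explanation:

The awk script just prints the first whitespace-separated field of each line. Use $N to print the Nth field. sort sorts the result and uniq -c counts the occurrences of each line.

This is then passed to the while loop, which saves the number of occurrences as $num and the line as $dupe. If $num is greater than 1 (so the line is duplicated at least once), it searches the file for that line, using -n to print the line number. The -- tells grep that what follows is not a command-line option, which is useful when $dupe can start with -.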
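One caveat with the grep step is that "$dupe" is treated as a pattern and matched anywhere in the line, so a URL that is a prefix of another URL, or that appears inside another entry's title, also produces hits. The following Python sketch follows the same stages (extract the first field, count, filter) but compares the first field exactly. The function name and the sample data are assumptions for illustration:

```python
from collections import Counter

def numbered_dupes(lines):
    """Return 'N:line' for each line whose first field occurs more than once."""
    # Stage 1: first whitespace-separated field of every line ('' if blank).
    keys = [(line.split() or [""])[0] for line in lines]
    # Stage 2: count occurrences, like `sort | uniq -c`.
    counts = Counter(keys)
    # Stage 3: keep lines whose key is repeated, with 1-based line numbers.
    return [
        f"{n}:{line.rstrip()}"
        for n, (key, line) in enumerate(zip(keys, lines), start=1)
        if key and counts[key] > 1
    ]

# Hypothetical data: the first key is a prefix of the second one.
sample = [
    "http://example.com/q/1 first title\n",
    "http://example.com/q/10 unrelated entry\n",
    "http://example.com/q/1 second title\n",
]
print("\n".join(numbered_dupes(sample)))
```

With this sample only lines 1 and 3 are reported; a plain substring search for the first key would also hit line 2.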

No doubt the most verbose one in the list, and it could probably be shorter:

 #!/usr/bin/python3
 import collections

 file = "file.txt"

 def find_duplicates(file):
     with open(file, "r") as sourcefile:
         data = sourcefile.readlines()
     splitlines = [
         (index, data[index].split(" ")) for index in range(0, len(data))
         ]
     lineheaders = [item[1][0] for item in splitlines]
     dups = [x for x, y in collections.Counter(lineheaders).items() if y > 1]
     dupsdata = []
     for item in dups:
         occurrences = [
             splitlines_item[0] for splitlines_item in splitlines
             if splitlines_item[1][0] == item
             ]
         corresponding_lines = [
             "["+str(index)+"] "+data[index] for index in occurrences
             ]
         dupsdata.append((occurrences, corresponding_lines))

     # printing output
     print("found duplicates:\n"+"-"*17)
     for index in range(0, len(dups)):
         print(dups[index], dupsdata[index][0])
         lines = [item for item in dupsdata[index][1]]
         for line in lines:
             print(line, end = "")

 find_duplicates(file)

Given a text file like:

 monkey banana
 dog bone
 monkey banana peanut
 cat mice
 dog cowmeat

it produces output like:

 found duplicates:
 -----------------
 dog [1, 4]
 [1] dog bone
 [4] dog cowmeat
 monkey [0, 2]
 [0] monkey banana
 [2] monkey banana peanut

Once you have picked the lines to remove (the indexes are zero-based, as in the output above):

 removelist = [2,1]

 def remove_duplicates(file, removelist):
     removelist = sorted(removelist, reverse=True)
     with open(file, "r") as sourcefile:
         data = sourcefile.readlines()
     for index in removelist:
         data.pop(index)
     with open(file, "wt") as sourcefile:
         for line in data:
             sourcefile.write(line)

 remove_duplicates(file, removelist)

See the following sorted file.txt:

 addons.mozilla.org/en-US/firefox/addon/click-to-play-per-element/ ::: C2P per-element
 addons.mozilla.org/en-us/firefox/addon/prospector-oneLiner/ ::: OneLiner
 askubuntu.com/q/21033 ::: What is the difference between gksudo and gksu?
 askubuntu.com/q/21148 ::: openoffice calc sheet tabs (also askubuntu.com/q/138623)
 askubuntu.com/q/50540 ::: What is Ubuntu's Definition of a "Registered Application"?
 askubuntu.com/q/53762 ::: How to use lm-sensors?
 askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors
 stackoverflow.com/q/4594319 ::: bash - shell replace cr\lf by comma
 stackoverflow.com/q/4594319 ::: shell replace cr\lf by comma
 wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence
 wiki.ubuntu.com/ClipboardPersistence ::: ClipboardPersistence - Ubuntu Wiki
 www.youtube.com/watch?v=1olY5Qzmbk8 ::: Create new mime types in Ubuntu
 www.youtube.com/watch?v=2hu9JrdSXB8 ::: Change mouse cursor
 www.youtube.com/watch?v=Yxfa2fXJ1Wc ::: Mouse cursor size

Because the list is short, I can see (after sorting) that there are three sets of duplicates.

Then, for example, I can choose to keep:

 askubuntu.com/q/53762 ::: How to use lm-sensors?

rather than

 askubuntu.com/q/53762 ::: how-to-use-to-use-lm-sensors

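But for a longer list this will be difficult. Based on the two answers, one suggesting uniq and the other suggesting cut, I find that this command gives me the output I would like (it works because file.txt is already sorted; uniq only compares adjacent lines):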

 $ cut -d " " -f1 file.txt | uniq -d
 askubuntu.com/q/53762
 stackoverflow.com/q/4594319
 wiki.ubuntu.com/ClipboardPersistence
 $

Here is how I solved it:

file_with_duplicates:

 1,a,c
 2,a,d
 3,a,e <--duplicate
 4,a,t
 5,b,k <--duplicate
 6,b,l
 7,b,s
 8,b,j
 1,b,l
 3,a,d <--duplicate
 5,b,l <--duplicate

The file sorted and deduplicated on columns 1 and 2:

 sort -t',' -k1,1 -k2,2 -u file_with_duplicates

The file only sorted on columns 1 and 2:

 sort -t',' -k1,1 -k2,2 file_with_duplicates

Show only the difference:

 diff <(sort -t',' -k1,1 -k2,2 -u file_with_duplicates) <(sort -t',' -k1,1 -k2,2 file_with_duplicates)
 3a4
 > 3,a,d
 6a8
 > 5,b,l
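The diff shows the lines that sort -u dropped. A small Python sketch of the same idea, assuming comma-separated fields and a key made of the first two columns (the function name and its parameters are hypothetical): it reports every line whose key has already been seen earlier in the file, without needing to sort.

```python
def surplus_lines(lines, key_columns=(0, 1), sep=","):
    """Return the lines whose key (the selected columns) was already seen."""
    seen = set()
    surplus = []
    for line in lines:
        fields = line.rstrip("\n").split(sep)
        # Build the comparison key from the chosen columns only.
        key = tuple(fields[i] for i in key_columns if i < len(fields))
        if key in seen:
            surplus.append(line.rstrip("\n"))
        else:
            seen.add(key)
    return surplus

# The sample rows from file_with_duplicates, without the annotations.
data = ["1,a,c", "2,a,d", "3,a,e", "4,a,t", "5,b,k", "6,b,l",
        "7,b,s", "8,b,j", "1,b,l", "3,a,d", "5,b,l"]

print(surplus_lines(data))   # ['3,a,d', '5,b,l']
```

Like the diff, this only names the later occurrence of each key; the earlier line with the same key still has to be looked up before deciding which one to keep.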
