I don’t have much time to explain this today, but if you want to see how to use the sed command on a macOS (Mac OS X) system to match newline characters in the search pattern and replace them with something else in the replacement pattern, this example might point you in the right direction.
The problem
My problem is that I have a bunch of files with dozens to hundreds of paragraphs that are broken across several lines, like this:
Lorem ipsum dolor sit amet,
consectetur adipiscing elit,
sed do eiusmod tempor incididunt
ut labore et dolore magna aliqua.
(Those are very short sentences and paragraphs for this example.)
What I want are continuous paragraphs with no unnecessary line breaks, so I want to use sed to create output like this:
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.
The solution
To solve the problem I first put this sed command in a file named sed.cmds:
s/([a-zA-Z,`])\n([a-zA-Z`])/\1 \2/g
When I then tried to run the command like this:
sed -E -f sed.cmds Input.txt > Output.txt
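the command didn’t work: sed reads its input one line at a time and strips the trailing newline before it applies the script, so the \n in my pattern never had anything to match. After a lot of searching I finally found this Stack Overflow thread, and in short, the solution is to run this sed command instead: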
sed -e ':a' -e 'N' -e '$!ba' -E -f sed.cmds Input.txt > Output.txt
When I run that sed command with my sed.cmds file, it successfully finds the newline characters in the sed input stream with the \n pattern, and then I replace the newline character with a blank space in the replacement pattern.
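As a rough illustration of why the extra options matter, here is a minimal sketch that uses a simple s/\n/ /g substitution in place of the sed.cmds file. The sample text and the variable names (input, plain, joined) are assumptions made up for this example. The first run has no loop, so sed never sees a newline; the second run uses the :a, N, and $!ba commands to collect the whole input before the substitution runs.

```shell
# Hypothetical sample input: one paragraph broken across three lines.
input=$(printf 'Lorem ipsum dolor sit amet,\nconsectetur adipiscing elit,\nsed do eiusmod tempor\n')

# Without the loop, sed handles one line at a time with its newline removed,
# so \n never matches and the text comes out unchanged.
plain=$(printf '%s\n' "$input" | sed -e 's/\n/ /g')

# :a    defines a label named "a"
# N     appends the next input line to the pattern space, keeping the newline between them
# $!ba  branches back to "a" unless the current line is the last line of input
# When the loop finishes, the whole input is in the pattern space and \n can match.
joined=$(printf '%s\n' "$input" | sed -e ':a' -e 'N' -e '$!ba' -e 's/\n/ /g')

printf '%s\n' "$joined"
```

Because the loop holds the entire input in memory, this approach is probably best kept to files of modest size.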
Using the search pattern in the replacement pattern
One other note: The \1 and \2 in the replacement pattern let me reuse the two parts of the search pattern that I “capture.” Here’s a quick look at how they relate:
\1    ([a-zA-Z,`])
\2    ([a-zA-Z`])
The regex inside each pair of () parentheses is a capture group, and \1 and \2 are backreferences that you can use in the replacement pattern to refer to whatever the first and second groups matched.
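To see the capture groups at work, here is a small sketch that passes the same substitution inline with -e instead of reading it from sed.cmds. The sample text and the variable names (input, result) are assumptions for illustration only. The -E option turns on extended regular expressions, which is what lets the plain ( and ) act as capture groups.

```shell
# Hypothetical sample input: two short paragraphs separated by a blank line.
input=$(printf 'one,\ntwo\n\nthree\nfour\n')

# -E enables extended regular expressions, so ( ) form capture groups without backslashes.
# \1 and \2 put back the characters matched on either side of the newline,
# and the newline between them is replaced with a single space.
result=$(printf '%s\n' "$input" |
  sed -e ':a' -e 'N' -e '$!ba' -E -e 's/([a-zA-Z,`])\n([a-zA-Z`])/\1 \2/g')

printf '%s\n' "$result"
```

In this sketch the blank line between the paragraphs is left alone, because a newline that is followed by another newline does not match the second capture group.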
2023 Update: Working with LaTeX
As a brief update, I can confirm that this command still works: I am currently using it with LaTeX documents and the pandoc command:
# [1] sed.cmds file
# use this to convert LaTeX sentences
s/([a-zA-Z0-9,“’‘} ])\n([a-zA-Z0-9“‘{ ])/\1 \2/g

# [2] the pandoc+sed command i use
$ pandoc HOFs.tex --to=plain | sed -e ':a' -e 'N' -e '$!ba' -E -f sed.cmds
In this example, the pandoc command converts an input LaTeX document into plain text, and then sed converts multi-line paragraphs like this:
four score and
seven years ago
into a single paragraph like this:
four score and seven years ago
which is what I need today.
That’s all (for now)
I haven’t looked into all of those sed command line options to see which ones are truly needed and which ones aren’t, but again, at the moment I can confirm that this works on macOS 10.12.1 (Sierra), properly finding the newline characters in sed’s input stream.