Advanced usage #
Ignoring (parts of) code in your project with Semgrep #
Semgrep identifies programming languages based on their file extensions rather than content analysis.
Use the --scan-unknown-extensions flag and the --lang flag to specify the language you want Semgrep to use when scanning files with non-standard extensions. For example:
semgrep --config /path/to/your/config --lang python --scan-unknown-extensions /path/to/your/file.xyz
In this example, Semgrep will scan the /path/to/your/file.xyz file as a Python file, even though the .xyz extension is not a standard Python file extension.
See also the Allow user to specify file extensions for languages #3090 GitHub issue to work around restrictions if you want to use Semgrep against your specific language, even if the file extension is not standard.
Files/directories #
- By default, Semgrep follows the default .semgrepignore file.
- If present, Semgrep will look at the repository’s .gitignore file.
- In case of a conflict between the two files, the .semgrepignore file takes precedence. This means that if the .gitignore file includes a file and the .semgrepignore file excludes it, Semgrep will not analyze the file.
Before starting a scan, it is recommended that you review the files and directories in your project directory.
Note that certain paths may be excluded by default. If you want to change the default exclusion behavior,
such as including third-party libraries or unit tests in the scan, you can create a custom .semgrepignore
file.
Excluding code sections #
To prevent Semgrep from flagging a specific piece of code, add a comment on the first line of the pattern match or on the line immediately preceding it (e.g., // nosemgrep: rule-id). It is crucial to have a space between // and nosemgrep.
As a best practice, remember to:
- Exclude only particular findings in your comments rather than disabling all rules with a generic // nosemgrep comment.
- Explain why you disabled a rule or justify your risk acceptance decision.
- If you encounter a false positive and want to ignore a Semgrep rule, provide feedback to either the Semgrep development team or your internal development team responsible for the specific rule. This will help improve the accuracy of the rule and reduce the chances of future false positives.
For more information on how to use nosemgrep
to ignore code blocks for a particular rule, refer to the
Semgrep documentation on ignoring code.
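As a minimal sketch of the practices above in Python, where the comment marker is # rather than //: the suppression names one rule and is preceded by a justification. The rule ID insecure-hash-md5 and the cache_key function are hypothetical placeholders chosen for illustration, not names from the Semgrep Registry.

```python
import hashlib


def cache_key(payload: bytes) -> str:
    """Derive a non-security cache key from a payload."""
    # MD5 is used here only to bucket cache entries, never for authentication
    # or integrity checks, so the finding is an accepted risk.
    # The rule ID below is a hypothetical placeholder for illustration.
    # nosemgrep: insecure-hash-md5
    return hashlib.md5(payload).hexdigest()
```

Keeping the justification next to the suppression lets a reviewer judge the risk acceptance without searching the commit history.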
Writing custom rules #
While Semgrep offers a library of pre-built rules, creating custom rules can significantly enhance your security testing by tailoring it to your specific codebase and requirements. However, creating effective Semgrep rules can be challenging without proper guidance and understanding. This section will give you the essential knowledge and skills to create high-quality Semgrep rules. You will learn about the rule language’s syntax and how to develop effective patterns, handle edge cases, and create powerful custom Semgrep rules. This will aid in detecting potential security vulnerabilities early on, ultimately improving your testing process.
Example custom rule #
As a starting point for creating a custom rule, use the following schema to create the custom_rule.yaml
file.
rules:
  - id: rule-id
    languages: [go]
    message: Some message
    severity: ERROR # INFO / WARNING / ERROR
    pattern: test(...)
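Before running a rule, it can help to check that it has the fields the schema above uses. The following is only a sketch under stated assumptions: it treats a rule as a plain Python dictionary, it takes the required and pattern fields from the rule syntax diagram later in this guide, and it covers ordinary search rules only (taint-mode rules use other fields). The helper name check_rule is hypothetical and is not part of Semgrep.

```python
REQUIRED_FIELDS = ("id", "message", "severity", "languages")
PATTERN_FIELDS = ("pattern", "patterns", "pattern-either", "pattern-regex")


def check_rule(rule: dict) -> list:
    """Return a list of problems found in a search-mode rule dictionary."""
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in rule]
    present = [name for name in PATTERN_FIELDS if name in rule]
    if len(present) != 1:
        problems.append(f"expected exactly one pattern field, found {present}")
    return problems


# The example rule from above, written as a dictionary.
example_rule = {
    "id": "rule-id",
    "languages": ["go"],
    "message": "Some message",
    "severity": "ERROR",
    "pattern": "test(...)",
}
print(check_rule(example_rule))  # prints []
```

Semgrep performs its own, stricter validation when it loads a rule file; this check only illustrates which fields the schema expects.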
Running custom rules #
- To run the above-mentioned rule as a single file, use the following command:
semgrep --config custom_rule.yaml
- To run a set of rules in a directory:
semgrep --config path/
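If you want to drive these commands from a script, a thin wrapper is usually enough. The sketch below assumes semgrep is installed and on PATH, adds the --json flag so the output can be parsed, and reads findings from the results key of that JSON. The function names are hypothetical.

```python
import json
import subprocess
from typing import List


def build_semgrep_command(config: str, target: str = ".") -> List[str]:
    """Build a Semgrep invocation for a rule file or a directory of rules."""
    # --json asks Semgrep for machine-readable output on stdout.
    return ["semgrep", "--config", config, "--json", target]


def run_semgrep(config: str, target: str = ".") -> list:
    """Run Semgrep and return the list of findings (requires semgrep on PATH)."""
    completed = subprocess.run(
        build_semgrep_command(config, target),
        capture_output=True,
        text=True,
        check=False,  # inspect the output ourselves instead of raising
    )
    return json.loads(completed.stdout).get("results", [])
```

For example, run_semgrep("custom_rule.yaml") runs the single rule file, and run_semgrep("path/") runs every rule in that directory.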
ABCs of writing custom rules #
To start writing custom Semgrep rules, it is crucial to understand a few key concepts and tools:
- Familiarize yourself with Semgrep syntax: Begin by exploring the official Learn Semgrep Syntax page, which provides a comprehensive guide on the fundamentals of Semgrep rule writing.
- Refer to language-specific pattern examples: Consult the Semgrep Pattern Examples by Language for examples tailored to specific programming languages.
- Use the Semgrep Playground: The Semgrep Playground is a convenient online tool for writing and testing rules. However, it is essential to consider the following points when using the Playground:
  - Be cautious of privacy concerns: The Semgrep Playground allows users to experiment with code without downloading or installing software on their local machine. While this platform is helpful for testing and debugging rules, it may expose sensitive information such as passwords, API keys, or other secrets contained in the code you submit for scanning. Always use a local development environment with proper security and privacy controls for sensitive code.
  - Employ the simple mode: The Semgrep Playground’s simple mode makes it easy to combine rule patterns.
  - Use the Share button: Share your rule and test code with others using the Share button.
  - Add tests to your test code: Incorporate tests (e.g., # ruleid: <id>) into your test code to evaluate your rule’s effectiveness while working in the Semgrep Playground (see example).
  - Note the limitations with comments: Be aware that the Semgrep Playground does not retain comments when sharing a link or “forking” a rule (Ctrl+S). Refer to this GitHub issue for more information.
Building blocks #
Ellipses (...) #
Purpose: The ellipsis (...) is used to match zero or more arguments, statements, parameters, and so on, allowing for greater flexibility in pattern matching.
Here is an example rule for Python:
rules:
  - id: rule-id
    languages: [python]
    message: Some message
    severity: INFO
    pattern: requests.get(..., verify=False, ...)
Here, the ellipsis before and after the verify=False argument allows the pattern to match any number of arguments before and after the verify parameter. This ensures that the pattern can match function calls with various argument combinations, as long as the verify=False argument is present.
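In the following code, the pattern matches only the requests.get calls that pass verify=False:
requests.get(verify=False, url=URL)         # match
requests.post(verify=False, url=URL)        # no match: not requests.get
requests.get(URL, verify=False, timeout=3)  # match
requests.head()                             # no match
requests.get(URL)                           # no match: verify=False is absent
requests.get(URL, verify=False)             # match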
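To make the matching behavior concrete, the sketch below approximates the pattern with Python's standard ast module. It is an illustration of what the ellipsis allows, not how Semgrep works internally, and the function name find_unverified_get is hypothetical.

```python
import ast


def find_unverified_get(source: str) -> list:
    """Return line numbers of requests.get(...) calls that pass verify=False."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        is_requests_get = (
            isinstance(func, ast.Attribute)
            and func.attr == "get"
            and isinstance(func.value, ast.Name)
            and func.value.id == "requests"
        )
        # Like the ellipses, this accepts any other arguments around verify=False.
        has_verify_false = any(
            kw.arg == "verify"
            and isinstance(kw.value, ast.Constant)
            and kw.value.value is False
            for kw in node.keywords
        )
        if is_requests_get and has_verify_false:
            hits.append(node.lineno)
    return sorted(hits)


SAMPLE = """\
requests.get(verify=False, url=URL)
requests.post(verify=False, url=URL)
requests.get(URL, verify=False, timeout=3)
requests.head()
requests.get(URL)
requests.get(URL, verify=False)
"""
print(find_unverified_get(SAMPLE))  # prints [1, 3, 6]
```

The one-line Semgrep pattern expresses the same check without any of this traversal code, which is the main appeal of the ellipsis.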
In the second example, the ellipsis is used to create a pattern that matches an if statement followed by an unnecessary else block after a return statement within the if block.
Below is the unnecessary-if-else-pattern rule for Python:
rules:
  - id: unnecessary-if-else-pattern
    languages: [python]
    message: Unnecessary else after return
    severity: INFO
    pattern: |
      if ...:
        return ...
      else:
        ...
Now, let’s break down the pattern components:
- if ...: This part of the pattern matches any if statement, regardless of the condition being tested. The ellipsis within the if statement is a wildcard that matches any expression or code structure used as the condition. This flexibility ensures that the pattern can detect a wide range of if statements with various conditions.
- return ...: Within the matched if block, the return statement is followed by an ellipsis. This wildcard matches any expression or value being returned. This allows the pattern to detect return statements with different values or expressions, such as return True, return False, return x, or return calculate_result().
- ... within the else block: The ellipsis in the else block is a wildcard that matches any number of statements.
This pattern matches the following code snippet:
if a > b:
    return True
else:
    print("a is not greater than b")
By including the ellipsis (...
) in your Semgrep rules, you can create more flexible and comprehensive patterns that account
for variations in code structure.
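For context on what a finding from the unnecessary-if-else-pattern rule asks the developer to change, here is a small before-and-after sketch. Both function names are hypothetical.

```python
def is_greater_flagged(a: int, b: int) -> bool:
    # Shape the rule reports: the else block is redundant because
    # the if branch always returns.
    if a > b:
        return True
    else:
        return False


def is_greater_fixed(a: int, b: int) -> bool:
    # Equivalent code without the else block.
    if a > b:
        return True
    return False
```

Both functions behave identically; the second simply removes one level of nesting.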
Metavariables #
Purpose: Metavariables are used to match and track values across a specific code scope. They are denoted by a dollar sign followed by capital letters (e.g., $X, $Y, $COND).
Here is an example pattern in Golang:
pattern: $X.($TYPE)
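In the following code, the pattern matches the two type assertions and binds $X as noted in the comments; the last two lines do not match:
msg, ok := m.(*MsgDonate) // $X = m
p := val.(types.Pool)     // $X = val
x := val                  // no match
msg, ok = m               // no match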
Metavariables can also be interpolated into the output message of a Semgrep rule. For instance, consider the following rule:
rules:
  - id: metavariable-example-rule
    patterns:
      - pattern: func $X(...) { ... }
    message: Found $X function
    languages: [golang]
    severity: WARNING
For the following code:
func test123(input string) {
    fmt.Println("test")
}
This returns the Found test123 function message in the Semgrep output, as follows:
$ semgrep -f rule.yml
# (...)
metavariable-example-rule
Found test123 function
1┆ func test123(input string) {
2┆ fmt.Println("test")
3┆ }
Metavariables help create more dynamic and versatile Semgrep rules by capturing values that can be used for further pattern matching or validation.
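The message interpolation shown above can be pictured as a simple substitution of each bound metavariable. The following sketch is only a mental model under that assumption; render_message is a hypothetical helper, not a Semgrep API.

```python
import re


def render_message(template: str, bindings: dict) -> str:
    """Replace each bound $METAVARIABLE in the template with its captured text."""

    def substitute(match: re.Match) -> str:
        name = match.group(0)
        # Leave metavariables without a binding untouched.
        return bindings.get(name, name)

    return re.sub(r"\$[A-Z_][A-Z_0-9]*", substitute, template)


print(render_message("Found $X function", {"$X": "test123"}))
# prints: Found test123 function
```

A metavariable that the pattern never binds stays in the message as literal text, so make sure every metavariable used in a message also appears in the pattern.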
Leveraging metavariables #
Metavariables can be used in a variety of ways to enhance Semgrep rules, making them more dynamic and adaptable when analyzing code. Some common use cases include:
- Matching variable names: Metavariables can be used to match variable names in the code, allowing the rule to be flexible and applicable to various situations. For example:
    pattern: $X := $Y
  This pattern would match assignments like a := b or result := calculation().
- Capturing function calls: Metavariables can be employed to capture function calls and their arguments. This can be useful for detecting potentially unsafe or deprecated functions. For example:
    pattern: $FUNC($ARG)
  This pattern would match single-argument function calls like dangerousFunc(input). To also match calls with any number of arguments, such as deprecatedFunc(arg1, arg2), use $FUNC(...).
- Matching control structures: Metavariables can help identify specific control structures, such as loops or conditionals, with a particular focus on the expressions used within these structures. For example:
    pattern: for $INDEX := $INIT; $COND; $UPDATE { ... }
  This pattern would match for-loops like for i := 0; i < 10; i++ { ... }.
- Comparing code patterns: Metavariables can be used to compare different parts of the code to ensure consistency or prevent potential bugs. For example, you can detect cases where the same assignment is made in both branches of an if-else statement:
    pattern: if $COND { $X = $Y } else { $X = $Y }
  This pattern would match code like:
    if someCondition {
        x = y
    } else {
        x = y
    }
- Identifying patterns across multiple lines: Metavariables can be employed to match and track values across multiple lines of code, making it possible to detect patterns that span several statements. For example:
    pattern: |
      $VAR1 := $EXPR1
      $VAR2 := $VAR1
  This pattern would match code like the following:
    a := b + c
    d := a
In conclusion, metavariables offer a powerful way to create dynamic and adaptable Semgrep rules. They help capture and track values across code scopes, enabling the identification of complex patterns and providing informative output messages for developers and security professionals.
Nested metavariables #
Purpose: Nested metavariables allow you to match a pattern with a metavariable that also contains another metavariable meeting certain conditions.
Here is an example rule:
rules:
  - id: metavariable-pattern-nest
    languages: [python]
    message: subtraction in foo(bar(...))
    patterns:
      - pattern: foo($X, ...)
      # First metavariable-pattern
      - metavariable-pattern:
          metavariable: $X
          patterns:
            - pattern: bar($Y)
            # Nested metavariable pattern
            - metavariable-pattern:
                metavariable: $Y
                patterns:
                  - pattern: ... - ...
    severity: WARNING
This rule matches the following Python code:
foo(bar(1-2))
foo(bar(bar(1-2)))
Nested metavariables allow for more complex and precise pattern matching in Semgrep rules by allowing you to define relationships between multiple metavariables.
Using metavariable-pattern for polyglot file scanning #
Purpose: To match patterns across different languages within a single file (e.g., JavaScript embedded in HTML).
Example: Find all instances of JavaScript’s eval function used in an HTML file ( example).
rules:
  - id: metavariable-pattern-nest
    languages: [html]
    message: eval in JS
    patterns:
      - pattern: <script ...>$Y</script>
      - metavariable-pattern:
          metavariable: $Y
          language: javascript
          patterns:
            - pattern: eval(...)
    severity: WARNING
This rule matches the following HTML code:
<script>
  console.log('test123');
  eval(1+1);
</script>
Using metavariable-pattern
allows for cross-language pattern matching in polyglot files, enabling you to identify
specific code patterns within mixed-language files.
Using metavariable-pattern + pattern-either #
Purpose: To specify multiple alternative patterns that can match a metavariable.
Example: Flag instances where a variable declaration uses one of several specific types ( example / trailofbits.go.string-to-int-signedness-cast.string-to-int-signedness-cast rule).
rules:
  - id: metavariable-pattern-multiple-or
    languages: [go]
    message: xyz
    patterns:
      - pattern: var $A $TYPE = ...
      - metavariable-pattern:
          metavariable: $TYPE
          pattern-either:
            - pattern: uint8
            - pattern: uint16
            - pattern: uint32
            - pattern: int8
            - pattern: int16
            - pattern: int32
    severity: WARNING
In the following Go code, the rule matches every declaration except the last one, whose type (string) is not in the list:
var a uint8 = 255
var b uint16 = 65535
var c uint32 = 4294967295
var d int8 = -128
var e int16 = -32768
var f int32 = -2147483648
var g string = "xyz" // no match
Combining metavariable-pattern with pattern-either allows you to create Semgrep rules that match a metavariable if it meets any of the specified conditions.
Metavariable-pattern + patterns #
Purpose: Use metavariable-pattern and patterns to flag instances where a metavariable $X meets all conditions (patterns) ( example / lxml-in-pandas rule).
Here is an example rule:
rules:
  - id: metavariable-pattern-and-patterns
    languages:
      - go
    message: xyz1
    patterns:
      - pattern: var $A $TYPE = $Z
      - metavariable-pattern:
          metavariable: $Z
          patterns:
            - pattern-not: |
                -128
            - pattern-not: |
                -32768
    severity: WARNING
In the following Go code, the rule matches only the declarations whose value is neither -128 nor -32768:
var b uint16 = 65535      // match
var d int8 = -128         // no match
var c uint32 = 4294967295 // match
var e int16 = -32768      // no match
Constant propagation #
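Constant propagation is the analysis Semgrep uses to track the constant values that variables hold. It lets a rule match instances where a metavariable holds a specific value or satisfies a specific relation, even when that value reaches the matched code through a variable.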
Matching instances where a metavariable holds a specific value #
Purpose: To match instances where a metavariable holds a specific value or relation, use the metavariable-comparison key.
Example: Match cases where the variable $X is greater than 1337 ( example).
rules:
  - id: metavariable-comparison
    languages: [python]
    message: $X is higher than 1337
    patterns:
      - pattern: function($X)
      - metavariable-comparison: # Match when $X > 1337
          metavariable: $X
          comparison: $X > 1337
    severity: WARNING
This rule matches the following Python code:
n = 1339
function(n) # Match (n > 1337)
function(1338) # Match (constant > 1337)
function(123)
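The same expectations can be recorded as a Semgrep test file, using the # ruleid: annotation mentioned in the Playground tips together with its counterpart # ok: for lines that must not match. This is a sketch: the stand-in definition of function is an assumption added only so the file also runs as ordinary Python.

```python
def function(value):
    """Stand-in so the test file also runs as ordinary Python."""
    return value


n = 1339
# ruleid: metavariable-comparison
function(n)

# ruleid: metavariable-comparison
function(1338)

# ok: metavariable-comparison
function(123)
```

Each annotation applies to the line that follows it, so a change in the rule's behavior on any of these calls shows up as a failed test.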
Comparing specific metavariables #
Purpose: Compare specific metavariables.
Example: Match functions where the first argument is lower than the second one ( example).
rules:
  - id: metavariable-comparison-rule
    patterns:
      - pattern: f($A, $B)
      - metavariable-comparison:
          comparison: int($A) < int($B)
          metavariable: $A
    message: $A < $B
    languages: [python]
    severity: WARNING
In the following Python code, the rule matches only the calls whose first argument is lower than the second:
f(1,2)      # match
f(2,3)      # match
f(4,3)      # no match
f(12312,1)  # no match
Deep expression operator #
Purpose: To match deeply nested expressions in the code. The deep expression operator is useful when you want to identify specific patterns that are buried within complex structures like conditional statements, loops, or function calls. Using the deep expression operator, you can create rules that target specific code patterns regardless of how deep they are in the code structure.
The deep expression operator is written <... P ...>, where P is the pattern to find. It matches any expression that contains P as a sub-expression, at any level of nesting.
Example: Matching a function call nested within an if statement ( example).
Suppose you want to match any instance of a specific function call (e.g., user.is_admin()) within an if statement, regardless of how deeply nested it is.
rules:
- id: deep-expression-example
  pattern: |
    if <... user.is_admin() ...>:
      print(...)
  message: if statement with is_admin() check
  languages: [python]
  severity: WARNING
This rule matches the following Python code:
if user.authenticated() and user.is_admin() and user.has_group(gid):
    print("hello")
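The "any level of nesting" behavior can be approximated with Python's ast module, which makes the idea easy to experiment with. This sketch assumes Python 3.9 or later (for ast.unparse), checks only the condition of each if statement, and ignores both the call's arguments and the print(...) body that the rule above also requires. The function name is hypothetical, and this is not how Semgrep is implemented.

```python
import ast


def if_tests_containing_call(source: str, dotted_call: str) -> list:
    """Return line numbers of if statements whose condition calls dotted_call."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.If):
            continue
        # ast.walk visits every sub-expression of the condition, at any depth.
        for inner in ast.walk(node.test):
            if isinstance(inner, ast.Call) and ast.unparse(inner.func) == dotted_call:
                hits.append(node.lineno)
                break
    return sorted(hits)


SAMPLE = """\
if user.authenticated() and user.is_admin() and user.has_group(gid):
    print("hello")
if user.authenticated():
    print("hello")
"""
print(if_tests_containing_call(SAMPLE, "user.is_admin"))  # prints [1]
```

Only the first if statement is reported, because its condition contains the call somewhere inside the chained and expression.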
Understanding pattern-inside and pattern-not-inside #
Using pattern-inside #
By using pattern-inside, you can create rules that match patterns only when they appear within a certain code construct, such as a function or class definition, a loop, or a conditional block.
Here’s an example of how you might use pattern-inside to detect cases where a sensitive function is called within a loop:
rules:
- id: sensitive_function_in_loop
  languages:
    - python
  message: "Sensitive function called inside a loop"
  severity: WARNING
  patterns:
    - pattern-inside: |
        for ... in ...:
          ...
    - pattern: |
        sensitive_function(...)
In this example, the pattern-inside operator is used to match any for loop in Python, and the second pattern matches calls to sensitive_function(). The rule will trigger only if both patterns are matched, meaning that the sensitive_function is called inside a loop.
Here’s an example of Python code that would trigger the sensitive_function_in_loop rule. Only the call inside the for loop in main() is reported; the call in second() is not:
def sensitive_function(data):
    # Process sensitive data
    pass

def main():
    data_list = ['data1', 'data2', 'data3']

    for data in data_list:
        # Call to sensitive_function is inside a loop
        sensitive_function(data)

def second(data):
    sensitive_function(data)
Using pattern-not-inside #
pattern-not-inside is the opposite of pattern-inside and is used to match a pattern only when it does not appear within a specified context. This operator helps you to exclude certain parts of the code from your analysis, further refining your rules and reducing false positives.
For instance, you can use pattern-not-inside to detect calls to the print_debug() function when they occur outside an if debug: block:
rules:
- id: print_debug_outside_debug_block
  languages:
    - python
  message: "print_debug() should be called inside a 'if debug:' block"
  severity: WARNING
  patterns:
    - pattern-not-inside: |
        if debug:
          ...
    - pattern: |
        print_debug(...)
Here is a Python code example demonstrating the use of this rule. Only the print_debug() call in incorrect_usage() is reported:
debug = True

def print_debug(msg):
    print("DEBUG:", msg)

def correct_usage():
    if debug:
        print_debug("This is a debug message inside a 'if debug:' block")

def incorrect_usage():
    print_debug("This is a debug message outside a 'if debug:' block")

def main():
    correct_usage()
    incorrect_usage()
Combining pattern-inside and pattern-not-inside #
In some cases, you might want to create rules that use both pattern-inside and pattern-not-inside operators to capture instances where a specific pattern is found within a particular context but not within another.
Example: Detecting print() calls in functions but not in main().
Suppose you want to enforce a rule where print() calls are allowed only within the main() function and not in any other functions. You can create a rule that combines pattern-inside and pattern-not-inside operators to achieve this.
rules:
- id: print_calls_outside_main
  languages:
    - python
  message: "print() calls should only be inside the main() function"
  severity: WARNING
  patterns:
    - pattern-inside: |
        def $X(...):
          ...
    - pattern-not-inside: |
        def main(...):
          ...
    - pattern: |
        print(...)
In this example, the pattern-inside operator matches any function definition, while the pattern-not-inside operator ensures that the main() function is excluded. The final pattern matches calls to the print() function. The rule will trigger only when a print() call is found inside a function other than main().
Here’s an example of Python code that triggers the print_calls_outside_main rule. The print() calls in sample_function() and other_function() are reported; the one in main() is not:
def sample_function():
    # print() call inside a function other than main()
    print("This is a sample function")

def main():
    print("This is the main function")
    sample_function()

def other_function():
    some_function()
    print("XYZ")
Taint mode #
Taint mode is a powerful feature in Semgrep that can track the flow of data from one location to another. By using taint mode, you can:
- Track data flow across multiple variables: Taint mode enables you to trace how data moves across different variables, functions, and components, so you can easily identify insecure flow paths (e.g., situations where a specific sanitizer is not used).
- Find injection vulnerabilities: Taint mode is particularly useful for identifying injection vulnerabilities such as SQL injection, command injection, and XSS attacks.
- Write simple and resilient Semgrep rules: Taint mode simplifies the process of writing Semgrep rules that are resilient
to certain code patterns nested in
if
statements, loops, and other structures.
To use taint mode, you need to set mode: taint and specify the pattern-sources and pattern-sinks fields in your custom Semgrep rule.
See this example:
rules:
  - id: taint-tracking-example1
    mode: taint
    pattern-sources:
      - pattern: getData()
    pattern-sinks:
      - pattern: printToUser(...)
    message: data flows from getData to printToUser
    languages: [python]
    severity: WARNING
Optionally, you can use additional fields in your Semgrep rule to further refine your taint analysis:
- pattern-propagators: This field allows you to specify functions or methods that propagate tainted data ( example). You can also refer to sanitizers by side-effect for more information.
- pattern-sanitizers: This field allows you to specify functions or methods that sanitize tainted data. For more information, see the taint mode documentation.
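To see what the taint-tracking-example1 rule is meant to catch, consider the following target code. It is a sketch under stated assumptions: the function bodies are stand-ins, sanitize is a hypothetical sanitizer, and the comments describe the findings one would expect rather than verified scanner output. The sanitized flow stops being reported only once the rule lists sanitize(...) under pattern-sanitizers, which the example rule above does not do.

```python
def getData() -> str:
    """Source: stand-in for a function returning untrusted input."""
    return "<script>alert(1)</script>"


def sanitize(value: str) -> str:
    """Hypothetical sanitizer; the rule would need a pattern-sanitizers entry for it."""
    return value.replace("<", "&lt;").replace(">", "&gt;")


def printToUser(message: str) -> str:
    """Sink: stand-in for code that renders output to a user."""
    print(message)
    return message


def unsafe_flow() -> str:
    data = getData()
    greeting = "Hello " + data  # the taint is carried through the concatenation
    return printToUser(greeting)  # expected finding: source reaches sink


def sanitized_flow() -> str:
    data = sanitize(getData())
    # No finding expected here once sanitize(...) is declared as a sanitizer.
    return printToUser("Hello " + data)
```

Note that the rule never mentions the intermediate variables data and greeting; following them is the work that taint mode does for you.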
Combining patterns #
When writing Semgrep rules, you may encounter situations where a single pattern (e.g., pattern: evil_function(...)
)
isn’t sufficient to capture the behavior you want to detect. In these cases, you can use one of the following to combine
patterns:
- patterns: This method combines multiple patterns with a logical AND (&&). In other words, all patterns must match for the rule to trigger. This is useful when you want to detect code snippets that satisfy multiple conditions simultaneously.
- pattern-either: This method combines multiple patterns with a logical OR (||). In other words, if any of the patterns match, the rule triggers. This is useful when you want to detect code snippets satisfying at least one specified condition.
  Suppose you want to detect calls to two insecure functions, insecure_function_1() and insecure_function_2(). You can use the pattern-either operator to achieve this.
    rules:
    - id: insecure_function_calls
      languages:
        - python
      message: "Call to an insecure function detected"
      severity: WARNING
      patterns:
        - pattern-either:
            - pattern: |
                insecure_function_1(...)
            - pattern: |
                insecure_function_2(...)
  In this example, the pattern-either operator is used to match calls to either insecure_function_1() or insecure_function_2(). The rule will trigger if any of these patterns are matched.
  Here’s an example of Python code that triggers the insecure_function_calls rule:
    def insecure_function_1():
        print("Insecure function 1 called")

    def insecure_function_2():
        print("Insecure function 2 called")

    def main():
        # Call to insecure_function_1() triggers the rule
        insecure_function_1()

        # Call to insecure_function_2() also triggers the rule
        insecure_function_2()
- pattern-regex: This matches code with a PCRE-compatible pattern in multiline mode. In other words, it matches code using a regular expression pattern.
Rule syntax diagram #
The following diagram will help you understand the relationship between the relevant fields in the rule. While writing a rule, you can use the advanced mode in the Semgrep Playground to test and refine it. The playground highlights any errors in your rules, providing immediate feedback.
flowchart TB
  Fields{Rule Fields} ---->|Only one is allowed| Required{Required}
  click Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20syntax-,Rule%20syntax,-TIP"
  Required ==> id
  click id "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Description-,id,-string"
  Required ==> message
  click message "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=no%2Dunused%2Dvariable-,message,-string"
  Required ==> severity
  click severity "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20messages.-,severity,-string"
  Required ==> languages((languages))
  click languages "https://semgrep.dev/docs/writing-rules/rule-syntax/#language-extensions-and-tags"
  Required ===>|Only one is required| Pattern_Fields{Pattern Fields}
  click Pattern_Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=pattern*,in%20multiline%20mode"
  click Required "https://semgrep.dev/docs/writing-rules/rule-syntax/#required"
  Pattern_Fields ==> pattern
  click pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern"
  Pattern_Fields ==> pattern-regex[pattern-regex]
  click pattern-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-regex"
  Pattern_Fields ==> pattern-either((pattern-either))
  click pattern-either "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-either"
  Pattern_Fields ==> patterns((patterns))
  click patterns "https://semgrep.dev/docs/writing-rules/rule-syntax/#patterns"
  pattern-either -.-> pattern-regex
  pattern-either -.-> pattern
  pattern-either -.-> pattern-inside
  click pattern-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-inside"
  pattern-either <-.-> patterns
  patterns -.-> pattern-inside
  patterns <-..-> metavariable-pattern{metavariable-pattern}
  click metavariable-pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-pattern"
  metavariable-pattern --> metavariable2[metavariable]
  metavariable-pattern -.-> language
  metavariable-pattern -.-> pattern
  metavariable-pattern -.-> pattern-either
  metavariable-pattern -.-> pattern-regex
  patterns -.-> metavariable-regex{metavariable-regex}
  click metavariable-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-regex"
  metavariable-regex --> metavariable
  metavariable-regex --> regex
  patterns -.-> metavariable-comparison{metavariable-comparison}
  click metavariable-comparison "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-comparison"
  metavariable-comparison --> metavariable3[metavariable]
  metavariable-comparison --> comparison
  metavariable-comparison -.-> base
  metavariable-comparison -.-> strip
  patterns -.-> pattern
  patterns -.-> pattern-not
  click pattern-not "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not"
  patterns -.-> pattern-not-inside
  click pattern-not-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-inside"
  patterns -.-> pattern-not-regex
  click pattern-not-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-regex"
  Fields -.-> Optional{Optional}
  Optional -.-> options(options)
  Optional -.-> fix(fix)
  Optional -.-> metadata(metadata)
  Optional -.-> paths(paths)
  click Optional "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional"
  click options "https://semgrep.dev/docs/writing-rules/rule-syntax/#options"
  click fix "https://semgrep.dev/docs/writing-rules/rule-syntax/#fix"
  click metadata "https://semgrep.dev/docs/writing-rules/rule-syntax/#metadata"
  click paths "https://semgrep.dev/docs/writing-rules/rule-syntax/#paths"
Example #1:
Looking at the chart, you can see that the pattern-either and pattern-not fields are not directly connected. However, you can combine them using the patterns field, which performs a logical AND operation on all the patterns included.
Example #2:
For instance, if you want to use pattern-either to combine multiple patterns with a logical OR and exclude a specific pattern using pattern-not, you can do so by including both of them under the same patterns field.
The resulting combination matches code that satisfies at least one of the patterns listed in the pattern-either field, unless that code also matches the pattern specified in pattern-not.
See the example exclude-when-using-secure-option rule.
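The boolean logic behind this combination can be modeled with sets of matched line numbers. The sketch below is only a mental model: the combine helper, the patterns a(...), b(...), c(...), and a(x), and the line numbers are all hypothetical illustration data.

```python
def combine(either_matches: list, not_matches: set) -> set:
    """Model patterns: [pattern-either: [...], pattern-not: ...] as set operations."""
    # pattern-either: a line qualifies if any alternative matched it (logical OR).
    any_alternative = set().union(*either_matches)
    # patterns: every entry must hold, so pattern-not removes its matches (AND NOT).
    return any_alternative - not_matches


# Hypothetical line numbers matched by each pattern in a scanned file.
matches_a = {1, 2}   # a(...)
matches_b = {3}      # b(...)
matches_c = {4}      # c(...)
matches_a_x = {2}    # a(x)
print(sorted(combine([matches_a, matches_b, matches_c], matches_a_x)))  # prints [1, 3, 4]
```

Line 2 drops out because it matches the excluded pattern, even though it also matches one of the pattern-either alternatives.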
Generic pattern matching #
It is possible to match generic patterns in unsupported languages/contexts. Use the generic language for configuration files, XML, etc., and combine it with the specific extension through the paths: include field to reduce false positives.
For example, see the nsc-allows-plaintext-traffic rule, which scans the Android manifest XML file for potential misconfiguration:
rules:
  - id: nsc-allows-plaintext-traffic
    languages: [generic]
    patterns:
      - pattern: |
          <base-config ... cleartextTrafficPermitted="true" ... >
      - pattern-not-inside: |
          <!-- ... -->
      - pattern-not-inside: >
          <network-security-config ... InsecureBaseConfiguration ... > ... ...
          ... ... ... ... ... ... ... ... </network-security-config>
    severity: INFO
    paths:
      include:
        - "*.xml"
Metadata #
Metadata fields are a feature in Semgrep that allow you to attach additional information to your rules. By including metadata fields in your rules, you can give developers more context and guidance on addressing potential issues. This information can include details such as the rule’s severity level, recommended fixes, or the author’s contact information. By including metadata, you can make your rules more informative and actionable for developers who encounter them. This can help them prioritize and fix issues more efficiently, ultimately improving the overall security of your codebase.
In addition to providing context and guidance to developers, there are several other reasons why an organization might want to use Semgrep metadata:
- Standardization. Using metadata fields consistently across all of your organization’s Semgrep rules ensures that developers see the same types of information and recommendations no matter which rules they encounter. This can help standardize the security review process and simplify prioritizing and addressing issues.
  - Example: By including fields required by the security category in the Semgrep Registry, developers will prioritize findings with high confidence and high impact metadata.
- Collaboration. Including author information in your Semgrep rules can make it easier for other organization members to collaborate on security issues.
  - Example: Suppose someone has a question or needs more information about a particular rule. In that case, they can contact the author directly for clarification.
- Compliance. Suppose your organization needs to comply with specific security regulations or standards. In this case, you could include a compliance metadata field in your Semgrep rules, indicating which regulation or standard the rule relates to. This helps ensure that your codebase complies with all relevant requirements.
You can create any metadata field, as demonstrated in the hooray-taint-mode rule.
We recommend including the following metadata fields required by the security category in the Semgrep Registry:
- cwe: A Common Weakness Enumeration identifier that classifies the security issue.
- confidence: An assessment of the rule’s accuracy, represented as high, medium, or low.
- likelihood: An estimation of the probability that the detected issue will be exploited, represented as high, medium, or low.
- impact: A measure of the potential damage caused by exploiting the detected issue, represented as high, medium, or low.
- subcategory: A more specific classification of the rule, falling under one of the following categories: vuln, audit, or guardrail.
By including these metadata fields, you provide valuable context and help users better understand the security implications of the issues detected by your rule.
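To keep these fields consistent across a rule set, a small check can be run over the rules before they are published. The following is a minimal sketch, not part of Semgrep: the `check_metadata` helper and the `example_rule` dictionary are hypothetical, and the rule is assumed to be already loaded into a plain Python dictionary (for example, by a YAML parser of your choice). Values are compared case-insensitively, and `subcategory` may be a single value or a list.

```python
# Hypothetical helper (not part of Semgrep): checks that a rule, already loaded
# into a plain dict, carries the metadata fields recommended above.
LEVELS = {"high", "medium", "low"}
ALLOWED_VALUES = {
    "confidence": LEVELS,
    "likelihood": LEVELS,
    "impact": LEVELS,
    "subcategory": {"vuln", "audit", "guardrail"},
}
REQUIRED_FIELDS = ("cwe", "confidence", "likelihood", "impact", "subcategory")


def check_metadata(rule):
    """Return a list of problems; an empty list means the rule passes."""
    metadata = rule.get("metadata") or {}
    problems = []
    for field in REQUIRED_FIELDS:
        if field not in metadata:
            problems.append(f"missing metadata field: {field}")
            continue
        allowed = ALLOWED_VALUES.get(field)
        if allowed is None:
            continue  # free-form field such as cwe
        value = metadata[field]
        # Accept a single value or a list of values, compared case-insensitively.
        values = value if isinstance(value, list) else [value]
        for item in values:
            if str(item).lower() not in allowed:
                problems.append(f"unexpected value for {field}: {item!r}")
    return problems


example_rule = {
    "id": "example-rule",  # hypothetical rule used only for this illustration
    "metadata": {
        "cwe": "CWE-78: Improper Neutralization of Special Elements used in an OS Command ('OS Command Injection')",
        "confidence": "HIGH",
        "likelihood": "MEDIUM",
        "impact": "HIGH",
        "subcategory": ["vuln"],
    },
}

print(check_metadata(example_rule))  # prints []
print(check_metadata({"id": "bare-rule"}))  # lists all five missing fields
```

A check like this could run in CI next to the rule tests, so that a rule without the agreed metadata is caught before it reaches developers.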
Various tips #
Matching an array with a non-string element #
This Semgrep rule aims to detect JavaScript or TypeScript arrays that contain at least one non-string element. See this array-with-a-non-string-element example.

    rules:
      - id: array-with-a-non-string-element
        languages: [js]
        message: array with element that is not a string
        severity: WARNING
        patterns:
          - metavariable-pattern:
              metavariable: $A
              patterns:
                # Quoted so that YAML passes the string literal "..." to Semgrep
                - pattern-not: '"..."'
          # Quoted so that YAML does not parse the pattern as a flow sequence
          - pattern: '[..., $A, ...]'

“Removing” negative pattern from pattern-either #
This Semgrep rule aims to detect Python code snippets where a function `a(...)`, `b(...)`, or `c(...)` is called,
but it should not match the case where function `a()` is called with the argument `x`.
See this pattern-not-with-pattern-either example.

    rules:
    - id: pattern-not-in-pattern-either
      patterns:
        - pattern-either:
            - pattern: a(...)
            - pattern: b(...)
            - pattern: c(...)
        - pattern-not: a(x)
      message: pattern either with one negative pattern
      languages: [python]
      severity: WARNING

Maintaining good quality of Semgrep rules #
Before publishing a new rule or updating an existing one, it is crucial to ensure that it meets specific standards and is effective. To help with this, we’ve created a Development Practices checklist in our Contributing to Trail of Bits Semgrep Rules document that you can follow to make sure your custom rule is ready for publication.
Help with writing custom rules #
Warning: Be careful about asking for external assistance for writing rules or sharing rule output that may be specific to a sensitive and/or private codebase. Doing so could inadvertently disclose the identity of the code owner, portions of the code, or particular bugs.
When running into issues while working on custom rules, several resources are available to help you. Two of the most valuable resources are the following:
- The Semgrep Community Slack is a great place to ask for help with custom rule development. The channel is staffed by knowledgeable developers familiar with Semgrep’s architecture and syntax. They are usually quick to respond to questions. They can guide you in structuring your rules and in debugging any issues that arise. Additionally, the Slack channel is a great place to connect with other developers working on similar projects, allowing you to learn from others’ experiences and share your insights.
- Use Semgrep GitHub issues to report bugs, suggest new features, and ask for help with specific issues.
Thoroughly testing Semgrep rules for optimal performance #
Creating comprehensive tests for your Semgrep rules is essential to ensure they perform as expected and cover a wide range of test cases. By thoroughly testing the rules against various code samples, you can confirm that they accurately identify intended vulnerabilities, potential errors, or coding standard violations. This ultimately leads to more reliable and effective security and code quality analysis.
Designing comprehensive test cases #
A well-rounded test suite for a custom Semgrep rule should cover multiple aspects of the rule’s functionality.
When designing test cases, consider the following:
- Create a file containing code samples: Create a file containing code with the same name as the rule.
  For example, if your rule filename is `unsafe-exec.yml`, create a corresponding `unsafe-exec.py` file with sample code.
- Incorporate a diverse range of code samples: Adhere to the following guidelines when adding code samples to the test file:
  - Include at least one true positive comment (e.g., `// ruleid: id-of-your-rule`).
  - Include at least one true negative comment (e.g., `// ok: id-of-your-rule`).
  - Start with simple, descriptive examples that are easy to understand.
  - Progress to more advanced, complex examples, such as those involving nested structures (e.g., inside an `if` statement) or deep expressions.
  - Include edge cases that may challenge the rule’s accuracy or efficiency, such as large input values, complex code structures, or unusual data types.
  - Test the rule against different language features and constructs, including loops, conditionals, classes, and functions.
  - Intentionally create code samples that should not trigger the rule, and ensure that the rule does not produce false positives in these cases.
- Ensure all tests pass: Run the `semgrep --test` command to verify that all test cases pass.
- Evaluate the rule against real-world code: Test the rule against actual code from your projects, open-source repositories, or other codebases to assess its effectiveness in real-life scenarios.
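As a sketch of these guidelines, a test file for the pattern-not-in-pattern-either rule shown earlier might look like the following. The file name, the stub functions, and the variables are assumptions made for illustration, and the rule is assumed to be saved next to it as `pattern-not-in-pattern-either.yaml`. Because the target is Python, the annotations use `#` comments. The annotations reflect the expected behavior of the rule as written and should be confirmed by running `semgrep --test`.

```python
# pattern-not-in-pattern-either.py
# Hypothetical test target for the pattern-not-in-pattern-either rule.
# The stub functions exist only so the file also runs as plain Python.

def a(*args):
    return ("a", args)

def b(*args):
    return ("b", args)

def c(*args):
    return ("c", args)

x = 1
y = 2

# Simple true positives: every call except a(x) should be reported.
# ruleid: pattern-not-in-pattern-either
b(x)
# ruleid: pattern-not-in-pattern-either
c()
# ruleid: pattern-not-in-pattern-either
a(y)

# True negative: a() called with the argument x is excluded by pattern-not.
# ok: pattern-not-in-pattern-either
a(x)

# Nested case: the call sits inside an if statement.
if y > x:
    # ruleid: pattern-not-in-pattern-either
    result = b(y)

# Edge case: the extra argument means pattern-not: a(x) should not exclude this call.
# ruleid: pattern-not-in-pattern-either
extra = a(x, y)
```

Each annotation applies to the line that follows it, so explanatory comments are kept above the annotation rather than between the annotation and the code.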
Testing custom rules in CI #
GitHub Actions #
The following workflow can be used to test custom Semgrep rules in GitHub Actions:

    name: Test Semgrep rules

    on: [push, pull_request]

    jobs:
      semgrep-test:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - uses: actions/setup-python@v4
            with:
              python-version: "3.11"
              cache: "pip"
          - run: python -m pip install -r requirements.txt
          - run: semgrep --test --test-ignore-todo ./path/to/rules/

Make sure to include `semgrep` in your `requirements.txt` file (or the `poetry` or `pipenv` equivalent) to speed up workflow runs by caching the dependency. Note that we include `--test-ignore-todo` here so that CI runs do not fail on TODO tests, which are a valuable form of documentation for future rule improvements.
Autofix feature #
The autofix feature can automatically correct identified vulnerabilities, potential errors, or coding standard violations.
There are many benefits to using the autofix feature:
- Training every developer on all the best practices for large code bases is not feasible. Autofixes can help fill in the gaps and provide guidance as needed.
- Autofixes maintain developer focus by removing monotonous changes, allowing them to concentrate on more complex tasks.
- Adding autofixes allows developers to be educated and trained on new best practices as they are introduced into the codebase.
- Autofixes can provide on-demand fixes and are much more actionable and educational than simple lint warnings.
- Developers who are not made aware of a deprecation won’t know to avoid the deprecated component or what to use instead. Autofixes can help make these transitions smoother.
Creating a Semgrep rule with the autofix feature #
Follow these steps to develop a rule with the autofix feature (see the ioutil-readdir-deprecated rule with the autofix feature implemented):
Add the `fix` key to a rule, specifying the replacement pattern for the identified vulnerability.

Here is an example rule with the autofix feature:

    rules:
      - id: ioutil-readdir-deprecated
        languages: [golang]
        message: ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
        severity: WARNING
        pattern: ioutil.ReadDir($X)
        fix: os.ReadDir($X)

For the following Golang code:

    package main

    import (
        "fmt"
        "io/ioutil"
        "log"
        "os"
    )

    func main() {
        // ruleid: ioutil-readdir-deprecated
        files, err := ioutil.ReadDir(".")
        if err != nil {
            log.Fatal(err)
        }

        for _, file := range files {
            fmt.Println(file.Name())
        }
    }

Run the rule using the standard command to confirm that the rule is detecting the intended issue:

    $ semgrep -f rule.yaml
    # (...)
    Findings:

      readdir.go
         ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

             ▶▶┆ Autofix ▶ os.ReadDir(".")
             11┆ files, err := ioutil.ReadDir(".")
    # (...)

Run the rule with the `--dryrun` and the `--autofix` options to preview the behavior of the autofix feature on the code without making any changes to the analyzed code:

    $ semgrep -f rule.yaml --dryrun --autofix
    # (...)
    Findings:

      readdir.go
         ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

             ▶▶┆ Autofix ▶ os.ReadDir(".")
             11┆ files, err := os.ReadDir(".")
    # (...)

Create a new test file for the autofix by adding the `.fixed` suffix in front of the file extension (e.g., `readdir.go` -> `readdir.fixed.go`). This file should contain the expected output after the autofix is applied.

Content of the `readdir.fixed.go` file:

    package main

    import (
        "fmt"
        "io/ioutil"
        "log"
        "os"
    )

    func main() {
        // ruleid: ioutil-readdir-deprecated
        files, err := os.ReadDir(".")
        if err != nil {
            log.Fatal(err)
        }

        for _, file := range files {
            fmt.Println(file.Name())
        }
    }

Run the test to confirm that the autofix is working as expected:

    $ semgrep --test
    1/1: ✓ All tests passed
    1/1: ✓ All fix tests passed

Now you are ready to apply autofix to the analyzed file with the `--autofix` option.

    $ semgrep -f rule.yaml --autofix
    # (...)
    Findings:

      readdir.go
         ioutil-readdir-deprecated
            ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

             ▶▶┆ Autofix ▶ os.ReadDir(".")
             11┆ files, err := ioutil.ReadDir(".")
    # (...)

By following these steps, you can create a custom Semgrep rule with an effective autofix feature that identifies issues and provides a solution to fix them.
Regular expression-based autofix #
The `fix` field presented above allows you to specify a simple string replacement, while the `fix-regex` field enables more complex regular expression-based replacements. For more information, refer to the official documentation on Autofix with regular expression replacement.
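To build intuition for a regular expression-based fix before writing one, the replacement can be tried out on a sample of matched text. The sketch below uses Python's `re.sub` as a stand-in and is not Semgrep's implementation. The `apply_fix_regex` helper is hypothetical, and the `regex`, `replacement`, and `count` parameter names are assumed to mirror the keys described in the official documentation, so check them there before relying on them.

```python
import re


def apply_fix_regex(matched_text, regex, replacement, count=0):
    """Approximate a regex-based fix: rewrite only the text that the rule matched."""
    # count=0 replaces every occurrence, which is re.sub's default behavior.
    return re.sub(regex, replacement, matched_text, count=count)


# Hypothetical finding: a call that should receive an explicit timeout argument.
matched = "requests.get(url)"

# The capture group keeps everything before the closing parenthesis,
# and the replacement appends the new argument.
fixed = apply_fix_regex(matched, r"(.*)\)", r"\1, timeout=30)", count=1)
print(fixed)  # requests.get(url, timeout=30)
```

A plain `fix` is usually easier to read and review, so a regular expression-based fix is best reserved for cases where the replacement depends on part of the matched text that a metavariable does not capture.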
Optimizing Semgrep rules #
Improve rule performance and minimize false positives through repeatable processes.
Optimizing your Semgrep rules is crucial for maintaining high performance and minimizing false positives. This section explains how to create efficient and accurate Semgrep rules.
Analyze time summary: To include a time summary with the results, use the `--time` flag. This will provide the following information:
- Total time / Config time / Core time
- Semgrep-core time
- Total CPU time
- File parse time
- Rule parse time
- Matching time
- Slowest five analyzed files
- Slowest five rules to match
Narrow down findings to specific file paths: Assess whether findings should be limited to specific file paths (e.g., Dockerfiles).
You can apply particular rules to certain paths using the `paths` keyword. For example, the avoid-apt-get-upgrade rule targets only Dockerfiles:

    paths:
      include:
        - "*dockerfile*"
        - "*Dockerfile*"

Use `pattern-inside` and `pattern-not-inside`: The `pattern-inside` and `pattern-not-inside` clauses allow you to specify a context in which a pattern should or should not be matched, respectively.

Consider a scenario where you want to identify calls to `insecure_function()` within a loop, followed by a specific statement, such as a call to `log_data()`, but only when the log level is set to `DEBUG`.

Initially, you can achieve this by using one `pattern` statement:

    rules:
    - id: insecure_function_in_loop_followed_by_debug_log
      languages: [python]
      message: |
        Insecure function called within a loop
        followed by log_data() with log level DEBUG
      severity: WARNING
      pattern: |
        for ... in ...:
          ...
          insecure_function(...)
          ...
          log_data("DEBUG", ...)

Here’s an example of Python code that triggers the `insecure_function_in_loop_followed_by_debug_log` rule:

    def insecure_function():
        print("Insecure function called")

    def log_data(log_level, msg):
        if log_level == "DEBUG":
            print("DEBUG:", msg)

    def main():
        data_list = ['data1', 'data2', 'data3']

        for data in data_list:
            # Call to insecure_function() within a loop,
            # followed by log_data() with log level DEBUG triggers the rule
            insecure_function()
            other_function()
            function1337()
            log_data("DEBUG", "Insecure function called with data: " + data)

Running the `insecure_function_in_loop_followed_by_debug_log` rule may not provide the clearest output, as it displays the entire `for` loop:

    $ semgrep -f insecure_function_in_loop_followed_by_debug_log.yml
    # (...)
        insecure_function_in_loop_followed_by_debug_log
           Insecure function called within a loop
           followed by log_data() with log level DEBUG

            11┆ for data in data_list:
            12┆     # Call to insecure_function() within a loop,
            13┆     # followed by log_data() with log level DEBUG triggers the rule
            14┆     insecure_function()
            15┆     other_function()
            16┆     function1337()
            17┆     log_data("DEBUG", "Insecure function called with data: " + data)

For such findings, only the calls to `insecure_function()` might be of critical importance. To improve the output, you can use the following clauses instead:

`patterns`: This clause combines the following sub-patterns with a logical AND operator, meaning all sub-patterns must match:

a. `pattern-inside`: This clause matches any `for` loop in the Python code, establishing the context for the subsequent patterns. It sets a condition that must be met for the rule to trigger, acting as the first part of a logical AND operation.

b. `pattern`: This sub-pattern matches calls to any function followed by a call to `log_data("DEBUG", ...)`. The rule potentially triggers if this `pattern` and the previous `pattern-inside` match.

c. `focus-metavariable`: This operator focuses the finding on the line of code matched by `$FUNC`.

d. `metavariable-pattern`: This sub-pattern restricts `$FUNC` to functions called `insecure_function`.

Here is a fixed version of the `insecure_function_in_loop_followed_by_debug_log` rule:

    rules:
    - id: insecure_function_in_loop_followed_by_debug_log_fixed
      languages: [python]
      message: |
        Insecure function called within a loop
        followed by log_data() with log level DEBUG
      severity: WARNING
      patterns:
        - pattern-inside: |
            for ... in ...:
              ...
        - pattern: |
            $FUNC(...)
            ...
            log_data("DEBUG", ...)
        - focus-metavariable: $FUNC
        - metavariable-pattern:
            metavariable: $FUNC
            pattern: insecure_function

Running the `insecure_function_in_loop_followed_by_debug_log_fixed` Semgrep rule will produce a more concise and focused output:

    $ semgrep -f insecure_function_in_loop_followed_by_debug_log_fixed.yml
    # (...)
        insecure_function_in_loop_followed_by_debug_log_fixed
           Insecure function called within a loop
           followed by log_data() with log level DEBUG

            13┆ insecure_function()

Minimize the use of ellipses `...`: While ellipses are a powerful tool for matching a wide range of code snippets, they can lead to performance issues and false positives when overused. Limit the use of ellipses to situations necessary for accurate pattern matching.

Determine the necessity of metavariables: Before using a metavariable in your rule, determine if it is truly necessary. Metavariables can be useful for capturing and comparing values, but if a metavariable is unnecessary for your rule to function correctly, consider removing it.

For example, consider the following Semgrep rule that uses a metavariable `$X`:

    rules:
      - id: unnecessary_metavariable_example
        languages: [python]
        message: The variable is assigned the value 123
        pattern: $X = 123
        severity: WARNING

This rule matches any variable assignment with the value `123`. However, the metavariable `$X` might be unnecessary if you don’t need to capture the variable name. In this case, you can use the `...` operator instead, which matches any expression:

    rules:
      - id: without_metavariable_example
        languages: [python]
        message: A variable is assigned the value 123
        pattern: ... = 123
        severity: WARNING

By replacing the `$X` metavariable with the `...` operator, you can reduce the complexity and improve the performance of your rule without losing the intended functionality. This approach should be used when the metavariable is not essential for the rule’s purpose or subsequent comparisons or checks.

Test your rules with real-world code: To ensure the effectiveness of your rules, test them with real-world code samples. This lets you identify potential issues and false positives before deploying your rules in a production environment.