Splunk Enterprise — Index Data Search

D.M.
12 min read · May 31, 2022


Table of Contents

  1. Preface
  2. Splunk Job Inspector
  3. Splunk SPL Commenting
  4. Splunk Architecture
  5. Splunk Streaming Commands
  6. Splunk Breakers and Segmentation
  7. Splunk Makeresults
  8. Splunk Fieldsummary
  9. Informational Functions
  10. Debrief

1. Preface

I made this blog to provide common Q&A information to anyone interested in using Splunk, and it also works well as a reference. Please visit Splunk for the official learning courses.

Splunk Q&A Study Guide:

Part 1

Part 2

Part 3

Splunk Enterprise — Q&A — Fields

Splunk Enterprise — Reports/Alerts

2. Splunk Job Inspector

The Search Job Inspector is a tool that lets us analyze and troubleshoot the overall performance of a search and understand how much impact the various knowledge objects and components of the search have on performance.

The Job Inspector also helps us learn how to write better searches by showing where Splunk spent the majority of its time.

When a search is run, the Job dropdown menu has an Inspect Job option that opens the Search Job Inspector, where we can start analyzing and troubleshooting.

The Job Inspector is available for any existing search job; search jobs that have expired no longer offer the Job Inspector as an option.

The Search Job Inspector has three components: a header (result count, events scanned, and time to execute), Execution costs, and Search job properties.

Execution costs provides details on the cost to retrieve results, such as:

  • command.search.index: time to search the index for the location of the rawdata files to read
  • command.search.filter: time to filter out events that do not match
  • command.search.rawdata: time to read events from the raw data files

Execution costs has columns for duration, component, invocations, input count, and output count.

Search job properties provides detailed information about the search job itself, such as the overall disk usage the search consumed in bytes and the earliest and latest times for the search.

Use it during troubleshooting to determine how long the different parts of a search took to run.

3. Splunk SPL Commenting

SPL commenting is used when running complicated searches. Adding comments to SPL searches in Splunk lets users troubleshoot their SPL by slowly commenting out pieces of the search, and it also lets them leave notes for themselves and for other users they share the search with. A comment is wrapped in three backtick symbols, ``` comment```, is highlighted in gray, and has no limit or restrictions.

Comments can be used to comment out commands in order to troubleshoot why the commands are not providing the desired results.
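
As a rough sketch of both uses (the index, sourcetype, and field names here are only placeholders), a comment can annotate part of a search or temporarily disable a command while troubleshooting:

index=web sourcetype=access_combined ```only looking at web access events```
| stats count by clientip
```| sort - count  disabled while troubleshooting the stats results```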

4. Splunk Architecture

Searches are run from a search head and sent out to an indexer or a group of indexers, also known as search peers. The peers then execute the search and return the results to the search head, where they are processed.

Splunk deployments often include multiple indexes created by users with an administrator role. Each index represents a logical grouping of data, such as one index containing sales data and another containing game server data.

Indexed data is held in a group of data buckets; when events are indexed in Splunk, they are placed into data buckets.

Hot buckets store data as it arrives; they are then rolled into read-only warm buckets, then cold buckets, and eventually become frozen buckets for deletion or archiving.
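
To see this on a running system, the dbinspect command lists the buckets behind an index along with their current state; a small sketch, using the _internal index purely as an example:

| dbinspect index=_internal
| stats count by state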

Each bucket contains a compressed journal and time-series index files. The journal is where Splunk stores our raw event data, and it is composed of many smaller compressed slices, each about 128 KB in size.

The time-series index files, or TSIDX files, are index keys to our journal file and tell Splunk which slices of data to open in order to find the events we're looking for in a search.

To create TSIDX files, Splunk takes raw events, finds each of the unique terms that exist in your data, tokenizes those terms, and writes them out to a lexicon that exists inside each TSIDX file.

Each lexicon has a unique posting list array and that array gives Splunk the location of the events we’re looking for inside the journal

The bucket stores its raw data inside a GZIP file and creates TSIDX files as a reference to that raw data

When Splunk executes a search it uses the TSIDX files to know where it should retrieve data from a bucket

When buckets roll over from hot to warm, Splunk creates a bloom filter based on the TSIDX files.

A bloom filter is a data structure used to quickly eliminate data that doesn't match the search, so Splunk doesn't need to read non-matching data, including TSIDX files, from disk. A bucket's bloom filter lets Splunk rule out buckets that don't contain the information in the query.

To build a bucket's bloom filter, each term from a TSIDX file's lexicon is run through a set of hashing algorithms, and the resulting hashes set bits in the bloom filter to 0 or 1. When a search is run, it generates its own bloom filter based on the search terms.

The filter computed from the search is compared against a bucket's bloom filter, which is faster than reading all of the TSIDX files; this way Splunk can skip entire files it doesn't need to read.

5. Splunk Streaming Commands

SPL commands that perform statistical analysis, filtering, and formatting are applied through transforming and streaming commands.

Transforming commands operate on the entire result set of data. Because they have to wait for all the results to arrive, they are executed on the search head. Once complete, a transforming command changes event data into results ordered in a table. Examples: stats, timechart, chart, top, and rare.

Streaming commands come in two types: centralized and distributable.

Centralized streaming commands, like transforming commands, execute on the search head. They apply a transformation to each event returned by a search, and the results depend on the order the events come in. You may sometimes see them described as stateful streaming commands.

Centralized streaming command examples: transaction, streamstats

Distributable streaming commands can execute without waiting for the entire result set of event data, and the order of incoming events doesn't matter to them. For example, an eval command operates on each event individually, so it doesn't matter whether that event came first or last. Because of this, a distributable streaming command can be executed, or distributed, across multiple indexers, which improves processing time. However, if it is preceded by any commands that run on the search head, a distributable streaming command must run there as well.

Distributable streaming command examples: rename, eval, fields, and regex

The search head parses a search into distributed and centralized parts. The distributed parts of the search are sent to the indexers, where Splunk uses bloom filters, TSIDX files, and search terms to find the events being searched for.

The distributable commands are executed on the indexers, and those results are sent back to the search head; the search head then applies centralized streaming and transforming commands and displays the results.
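
As a rough illustration of that flow (the index, sourcetype, and field names are placeholders), the distributable eval below can run on the indexers, while the transforming stats command has to run on the search head:

index=web sourcetype=access_combined
| eval response_kb = bytes / 1024 ```distributable streaming, runs on the search peers```
| stats avg(response_kb) AS avg_kb BY host ```transforming, runs on the search head```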

6. Splunk Breakers and Segmentation

Splunk tokenizes search terms at search time; the process is known as segmentation. During a search it compares the search tokens to event tokens, looking for a match between the two, and when it finds one it reads the matching events from disk.

Segmentation happens in two stages, splitting events up by finding characters that we call major breakers and minor breakers.

Major breakers are used to isolate words, phrases, terms, and numerical data during the first pass of segmentation; they include spaces, new lines, carriage returns, tabs, brackets, exclamation points, and commas.

The second pass of segmentation finds minor breakers within the results of the first pass to break them up further; minor breakers include forward slashes, colons, periods, hyphens, and dollar signs.

The point of segmentation is to quickly return a set of tokens. At index time these tokens become part of the TSIDX files' lexicons and are used to build bloom filters when buckets roll from hot to warm; at search time Splunk uses the tokens to build a search-time bloom filter and match these terms in TSIDX files.

Lispy expressions show how segmentation affects performance and why some searches work faster than others.

The expressions used to build a bloom filter and check against TSIDX files are called lispy expressions, and they can be found in the search log. Look through search.log for the phrase "base lispy" to see the actual lispy expression Splunk created from our search query. Using major and minor breakers, Splunk has tokenized, separated, and reordered our search string into a lispy expression; Splunk then uses the lispy expression to build bloom filters and locate terms in TSIDX files.

Lispy expressions use a variant of prefix notation, a mathematical notation used to write equations; in prefix notation, operators like AND, OR, and NOT appear before the operands or search terms.

index=security AND failed AND user=root

| timechart count span=1h

| stats avg(count) as HourlyAverage

search:

index=web clientip=76.169.7.252

lispy:

[AND 169 252 7 76 index::web]

The lispy expression would end up returning all the events that include these numbers, not just the IP address; the same thing happens with phrases:

index=security "failed password"

[AND failed index::security password]

TERM directive

index=web clientip=TERM(76.169.7.252)

[AND 76.169.7.252 index::web]

Terms must be bound by major breakers in the raw data. You can use the TERM directive when searching for field values, but it will not work with aliased fields; for example, src_ip would need to be changed to c_ip to use the non-normalized field. This could be an issue if all data was normalized to adhere to the Splunk CIM, causing Splunk to read more events than needed to complete searches.

You can also add the TERM directive before the field value in order to return the IP as a token, keeping the normalized fields in the search while still limiting the events read from disk.
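
A sketch of that approach (the index and field names are placeholders) pairs the raw token with the normalized field:

index=web TERM(76.169.7.252) src_ip=76.169.7.252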

Using NOT to exclude terms can work in lispy expressions.

This search using NOT password would return tokens successfully, but negating terms that include minor breakers is not helpful:

index=security NOT password

Remember to tokenize values that contain breakers, such as IP addresses, in order to produce a refined, specific search so it doesn't just exclude the individual numbers.
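
One way to sketch this (the index and IP address are placeholders) is to negate the whole address as a single token with TERM, so the individual octets are not excluded on their own:

index=security NOT TERM(10.0.0.5)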

A wildcard in the middle or at the end of a string can be acceptable, but a wildcard at the beginning of a search term will not create tokens, and you cannot use wildcards to match breakers.

n*io will not find net.io; instead it would need to be searched as n*.io.

Wildcards in the middle of a string can be costly and inconsistent.

Whenever possible, we should restrict our search to unique values and specific terms and avoid terms like true, false, yes, and no, as they can bring back too many results.

Administrators on a Splunk deployment can choose to create additional index-time field extractions, but these increase the size of TSIDX files and are extracted for all sourcetypes, so they should only be created when absolutely necessary.

Greater-than and less-than operators will not tokenize the field values, because only = designates an exact term.

Splunk can interpret searches that rely on lookups into lispy expressions.

If values in lookup fields contain major breakers, they will not be included.

7. Splunk Makeresults

There are commands and functions that help with troubleshooting searches.

The | makeresults command is used to generate events in memory containing the _time field.

  • generally used with one or more eval commands
  • must be the first command in a search, following a leading | pipe

If you have static data that needs to be added at search time and is not in your index, you can use the makeresults command. It allows you to create an event containing whatever you'd like: a string, a numeric field, or a random number using the random function. It can be used to create some fake data and get practice using specific commands.

Makeresults is very helpful for creating some fake data, practicing commands such as rex on that data, and then generating a set of results.
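
A minimal sketch of that practice workflow (the field names and values are made up):

| makeresults
| eval src_ip="76.169.7.252", action="failed"
| rex field=src_ip "^(?<first_octet>\d+)\."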

8. Splunk Fieldsummary

The fieldsummary command calculates a variety of summary statistics for all or a subset of fields and displays the summary information as a results table. It can be followed by a couple of optional arguments, shown in brackets in its syntax. It must be preceded by a search, and fieldsummary performs common statistical calculations on all of the fields it is working with and their values, unless otherwise specified by providing a field list at the very end of the command.

maxvals specifies the maximum number of unique values to display for each field (optional; defaults to 100)

It is another option instead of using the stats command and its functions to compute statistics; the benefit of fieldsummary is that you don't have to specify each stats function. As mentioned, it must be preceded by a search, and we pipe the results of that search to the fieldsummary command. We can pipe to fieldsummary alone, or pass one or more fields in the field list along with the optional maxvals argument set to a number value.

The default for maxvals is 100, so fieldsummary will display a maximum of 100 unique values for each field.
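
For example, a sketch along these lines (the index and field names are placeholders) summarizes just two fields and caps each at ten listed values:

index=web
| fieldsummary maxvals=10 status clientip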

fieldsummary command output fields

  • field: field name
  • count: number of events with that field
  • distinct_count: number of unique values in the field
  • is_exact: boolean (0 or 1) indicating whether distinct_count is exact
  • max: maximum value (if the field is numeric)
  • mean: average value (if the field is numeric)
  • min: minimum value (if the field is numeric)
  • numeric_count: count of numeric values in the field
  • stdev: standard deviation (if the field is numeric)
  • values: distinct values of the field and the count of each value

All of the fields that appear in the fields sidebar will be displayed under the field column, with additional columns for count, distinct_count, is_exact, max, mean, min, numeric_count, stdev (standard deviation), and values.

The count column is equivalent to running stats count with that field passed in as an argument to the count function; it tells the number of events that contain that field.

distinct_count displays the number of unique values of that field; it does not count duplicates.

is_exact will be 1 if distinct_count is an exact count and 0 if the number of unique values exceeded maxvals, in which case distinct_count is an estimate.

unique values will be listed in the values column

This will return summary rows for only these two fields:

| fieldsummary user src_ip

Another example filters to just the bcg_ip and scr_ip fields, looking at a maximum of ten values (maxvals=10) for those two fields.

Fieldsummary can also be used after a stats command, where it will only summarize the fields produced by the stats command in that table (such as a sales status field).
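
A sketch of that pattern (the index, sourcetype, and field names are placeholders):

index=sales sourcetype=vendor_sales
| stats count by sale_status
| fieldsummary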

9. Informational Functions

If you need to get information about a value in your data, you can use an informational function

  • isbool()
  • isint()
  • isnotnull()
  • isnum()
  • isstr()
  • typeof()

Apart from typeof(), these functions return a boolean value of true or false.

the typeof() function returns a string representation stating the type of the value supplied
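
A small makeresults sketch (the field names and values are made up) shows both kinds of output:

| makeresults
| eval sample="42", converted=tonumber(sample)
| eval sample_type=typeof(sample), converted_type=typeof(converted), converted_is_num=isnum(converted)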

isnull() evaluates the field passed in as an argument and returns TRUE or FALSE depending on whether its value is null.

In the if() function, if the first argument, a boolean expression, returns true, the second argument is used; otherwise the third argument is used.

The typeof function will return the data type of the field that is passed in as an argument, so possible results will be number, string, or boolean

Create a new field using the eval command called TypeOfData and set it to an if function. The if function takes three arguments: the first argument is a boolean expression, here the isnull function checking whether the action field exists within each event. If the field does not exist, the TypeOfData field takes the value Null; if the expression returns false, we display the data type of the action field.

| eval TypeOfData = if(isnull(action), "Null", typeof(action))

10. Debrief

I hope this helped answer some general starter questions for anyone just learning Splunk. I really enjoyed doing this and will be making more notes in the future.
